To iterate over the rows of a PySpark DataFrame and apply a UDF, follow the steps below.
First, import the necessary libraries and modules:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
Next, create a SparkSession:
spark = SparkSession.builder.appName("UDF Example").getOrCreate()
Then, define a UDF (User Defined Function). Note that udf() defaults to a StringType return type, so pass the return type explicitly:
square_udf = udf(lambda x: x * x, IntegerType())
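As a side note, the same UDF can also be written in decorator form, which reads better once the logic outgrows a lambda. A minimal sketch (the function name square is just illustrative):
@udf(returnType=IntegerType())
def square(x):
    # Runs once per value on the executors
    return x * x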
Now, create a sample DataFrame:
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])
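You can verify the contents with df.show(), which for the data above prints:
df.show()
# +-------+---+
# |   Name|Age|
# +-------+---+
# |  Alice| 25|
# |    Bob| 30|
# |Charlie| 35|
# +-------+---+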
Next, apply the UDF and iterate over the rows. A UDF operates on Column expressions inside Spark, not on plain Python values, so it cannot be called on row.Age directly; apply it with withColumn first, then collect the result and loop over it:
df_squared = df.withColumn("Squared_Age", square_udf(df["Age"]))
for row in df_squared.collect():
    print(f"Name: {row.Name}, Age: {row.Age}, Squared Age: {row.Squared_Age}")
Finally, stop the SparkSession:
spark.stop()
The complete example code is as follows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Create a SparkSession
spark = SparkSession.builder.appName("UDF Example").getOrCreate()

# Define the UDF with an explicit return type
square_udf = udf(lambda x: x * x, IntegerType())

# Create a sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Apply the UDF column-wise, then iterate over the collected rows
df_squared = df.withColumn("Squared_Age", square_udf(df["Age"]))
for row in df_squared.collect():
    print(f"Name: {row.Name}, Age: {row.Age}, Squared Age: {row.Squared_Age}")

# Stop the SparkSession
spark.stop()
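If the per-row Python UDF becomes a bottleneck, a vectorized pandas_udf (requires the pandas and pyarrow packages) is usually faster because it processes whole batches of values at once. A minimal sketch of the same squaring logic, assuming Spark 3.x type-hint style:
from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf("int")
def square_vec(s: pd.Series) -> pd.Series:
    # Operates on a whole batch of values instead of one row at a time
    return s * s

df_squared = df.withColumn("Squared_Age", square_vec(df["Age"]))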
With this, you can iterate over the rows of a PySpark DataFrame and apply a UDF.