AwsGlue任务生成了很多小文件 _程序开发

AwsGlue任务生成了很多小文件

创始人

2024-09-25 16:01:52

0次

使用“合并文件”的技术，将多个小文件合并成一个大文件，以减少存储成本，并提高查询的性能。

示例代码如下：

from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.types import StringType, DoubleType, BooleanType

# 定义UDF将小文件合并成一个大文件
def merge_files(partition):
    # 通过monotonically_increasing_id函数为每一个行添加一个唯一的ID号码
    index = int(partition.first()["id"])
    filename = "/tmp/big_file_" + str(index) + ".csv"
    # 将小文件读取进来，合并成一个大文件
    with open(filename, "w") as f:
        for row in partition:
            f.write(",".join([str(x) for x in row]))
            f.write("\n")

# 定义结构和创建DF
schema = StructType([
  StructField("column1", IntegerType(), True),
  StructField("column2", StringType(), True),
  StructField("column3", DoubleType(), True),
  StructField("column4", BooleanType(), True)
])

input_df = spark.read.csv("s3://my_bucket/small_files/", header=True, schema=schema)

# 在DF上面调用UDF
output_df = input_df.repartition(10).sort("id").foreachPartition(merge_files)

这里我们使用了pyspark内置的函数 monotonically_increasing_id 给每一行数据添加了唯一的ID号码，并设定每个分区有10个文件。在执行 foreachPartition 操作之后，将调用我们自定义的UDF，将每个分区中的数据合并成一个大文件。最后，使用 spark.read.csv 函数从S3中读取这些大文件进行后续的数据分析处理。

上一篇：AWSGlue任务内存不足。

下一篇：AWSGlue任务未遵守超时时间并无法停止。

AwsGlue任务生成了很多小文件

相关内容

热门资讯