
Rename AWS Glue Job Output File

Glue is an Amazon-provided and managed ETL platform that uses open-source Apache Spark behind the scenes. When you write a DynamicFrame to S3 using the write_dynamic_frame() method, it internally calls the Spark methods to save the files. Since Spark uses the Hadoop output file format, the output files carry the prefix part-00 in their names.

In a use case where you need to write the output of your ETL job to a single file with a custom name, you may refer to the following code, which renames the file in S3 using the boto3 APIs:

BUCKET_NAME = "<bucket-name>"
PREFIX = "<prefix-name>"

from awsglue.dynamicframe import DynamicFrame

datasource0 = glueContext.create_dynamic_frame.from_catalog(database="default", table_name="table_x")

# Coalesce to a single partition so Spark writes exactly one output file
dataF = datasource0.toDF().coalesce(1)
DyF = DynamicFrame.fromDF(dataF, glueContext, "DyF")

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame=DyF,
    connection_type="s3",
    connection_options={"path": "s3://" + BUCKET_NAME + "/" + PREFIX},
    format="json",
    transformation_ctx="datasink2",
)

import boto3
client = boto3.client('s3')

response = client.list_objects(
    Bucket=BUCKET_NAME,
    Prefix=PREFIX,
)
name = response["Contents"][0]["Key"]

# CopySource expects the "bucket/key" form; the listed key already includes the prefix
client.copy_object(Bucket=BUCKET_NAME, CopySource=BUCKET_NAME + "/" + name, Key=PREFIX + "/new_name")
client.delete_object(Bucket=BUCKET_NAME, Key=name)
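The copy-and-delete step above can be factored into a small, testable helper that keeps all the string handling in one place. The sketch below is my own illustration (the name build_rename_ops is hypothetical, not a Glue or boto3 API); it derives the CopySource string and the target key from a listed part-file key, so the boto3 calls can be driven from its return value:

```python
def build_rename_ops(bucket, part_key, new_name):
    """Given the key of a Spark part-file, return the keyword arguments
    for the S3 copy_object and delete_object calls that rename it."""
    # Keep the part-file's directory (its prefix) and swap only the file name.
    prefix, _, _ = part_key.rpartition("/")
    new_key = f"{prefix}/{new_name}" if prefix else new_name
    return {
        "copy": {
            "Bucket": bucket,
            "CopySource": f"{bucket}/{part_key}",  # copy_object expects "bucket/key"
            "Key": new_key,
        },
        "delete": {"Bucket": bucket, "Key": part_key},
    }

# Example with a typical Spark output key:
ops = build_rename_ops("my-bucket", "out/part-00000-abc.json", "result.json")
print(ops["copy"]["Key"])  # out/result.json
```

With this in place, the rename becomes client.copy_object(**ops["copy"]) followed by client.delete_object(**ops["delete"]).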

© 2024 Ujjwal Bhardwaj. All Rights Reserved.