Rename AWS Glue Job Output File

Glue is an Amazon-provided and managed ETL platform that uses open source Apache Spark behind the scenes. When you write a DynamicFrame to S3 using the write_dynamic_frame() method, it internally calls the Spark methods to save the files. Since Spark uses the Hadoop output format, the output files carry the prefix part-00 in their names.
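As a quick illustration of this naming convention (a sketch, not the actual Spark internals), Hadoop-style outputs are numbered per partition with a zero-padded index, so coalescing to one partition leaves a single part-00000 file:

```python
def part_file_name(partition_index, extension="json"):
    # Hadoop-style naming: "part-" followed by a 5-digit,
    # zero-padded partition index (helper name is illustrative).
    return f"part-{partition_index:05d}.{extension}"

# One output file per partition; coalesce(1) yields just part-00000.
print([part_file_name(i) for i in range(3)])
# → ['part-00000.json', 'part-00001.json', 'part-00002.json']
```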

In a use case where you need to write the output of your ETL job to a single file with a custom name, you may refer to the following code, which renames the file in S3 using the boto3 APIs.

BUCKET_NAME = "<bucket-name>"  # replace with your bucket name
PREFIX = "<prefix-name>"       # replace with your output prefix

datasource0 = glueContext.create_dynamic_frame.from_catalog(database="default", table_name="table_x")
dataF = datasource0.toDF().coalesce(1)

from awsglue.dynamicframe import DynamicFrame
DyF = DynamicFrame.fromDF(dataF, glueContext, "DyF")

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame=DyF,
    connection_type="s3",
    connection_options={"path": "s3://" + BUCKET_NAME + "/" + PREFIX},
    format="json",
    transformation_ctx="datasink2",
)

import boto3
client = boto3.client('s3')

response = client.list_objects(
    Bucket=BUCKET_NAME,
    Prefix=PREFIX,
)
# Key of the single part file written above (includes the prefix).
name = response["Contents"][0]["Key"]

# S3 has no rename operation, so copy the object to the new key
# and then delete the original part file.
client.copy_object(
    Bucket=BUCKET_NAME,
    CopySource={"Bucket": BUCKET_NAME, "Key": name},
    Key=PREFIX + "/new_name",
)
client.delete_object(Bucket=BUCKET_NAME, Key=name)
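The copy-then-delete "rename" can also be factored into a small helper that builds the keyword arguments for both boto3 calls, which makes the key handling easy to unit test without touching S3 (the function and names here are illustrative, not part of the Glue or boto3 APIs):

```python
def rename_params(bucket, old_key, prefix, new_name):
    # Build arguments for a copy-then-delete "rename" in S3.
    # boto3's copy_object accepts CopySource as a {"Bucket", "Key"} dict.
    return {
        "copy": {
            "Bucket": bucket,
            "CopySource": {"Bucket": bucket, "Key": old_key},
            "Key": prefix + "/" + new_name,
        },
        "delete": {"Bucket": bucket, "Key": old_key},
    }

params = rename_params("my-bucket", "out/part-00000", "out", "result.json")
# client.copy_object(**params["copy"]); client.delete_object(**params["delete"])
print(params["copy"]["Key"])  # → out/result.json
```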

© 2019 | Ujjwal Bhardwaj. All Rights Reserved.