Rename AWS Glue Job Output File
Glue is an Amazon-managed ETL platform that uses open-source Apache Spark behind the scenes. When you write a DynamicFrame to S3 using the write_dynamic_frame()
method, it internally calls the Spark methods to save the files. Since Spark follows the Hadoop output file naming convention, the output files appear with the prefix part-00
in their names.
In a use case where you need to write the output of your ETL job to a single file with a custom name, you may refer to the following code, which renames the file in S3 using the boto3 APIs:
import boto3
from awsglue.dynamicframe import DynamicFrame

BUCKET_NAME = "<bucket-name>"
PREFIX = "<prefix-name>"

# Read the source table and coalesce to a single partition so that
# exactly one output file is produced
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="default", table_name="table_x")
dataF = datasource0.toDF().coalesce(1)
DyF = DynamicFrame.fromDF(dataF, glueContext, "DyF")

# Write the single part file to s3://<bucket-name>/<prefix-name>/
datasink2 = glueContext.write_dynamic_frame.from_options(frame = DyF, connection_type = "s3", connection_options = {"path": "s3://" + BUCKET_NAME + "/" + PREFIX}, format = "json", transformation_ctx = "datasink2")

# S3 has no rename operation, so copy the object to the new key
# and then delete the original part file
client = boto3.client("s3")
response = client.list_objects_v2(Bucket=BUCKET_NAME, Prefix=PREFIX)
name = response["Contents"][0]["Key"]
client.copy_object(Bucket=BUCKET_NAME, CopySource=BUCKET_NAME + "/" + name, Key=PREFIX + "/new_name")
client.delete_object(Bucket=BUCKET_NAME, Key=name)
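Note that taking the first key under the prefix is fragile: the listing can also contain the "folder" placeholder object or a _SUCCESS marker, and keys come back in lexicographic order. A safer approach is to select the part file explicitly before copying. Here is a minimal sketch; the helper name find_part_key is hypothetical and not part of boto3:

```python
def find_part_key(keys):
    # Return the first key whose file name (the segment after the
    # last "/") starts with "part-", or None if there is no part file.
    for key in keys:
        if key.rsplit("/", 1)[-1].startswith("part-"):
            return key
    return None

# Usage with the listing from the main script:
#   keys = [obj["Key"] for obj in response["Contents"]]
#   name = find_part_key(keys)
```

This skips markers such as "<prefix-name>/" and "<prefix-name>/_SUCCESS" and only matches the actual Spark output file.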