Migrate Relational Databases to Amazon S3 using AWS Glue

AWS Glue is a fully managed ETL service from Amazon that makes it easy to extract and migrate data from one source to another while applying transformations along the way. AWS Glue is composed of multiple microservices that work well together and can also be integrated individually with other services.

Among these microservices are Glue Connections, which are used to connect to and access certain types of source and target data stores. Using these connections, a Glue ETL job can extract data from a data source or write to it, depending on the use case.
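
A connection can also be defined programmatically with the AWS SDK for Python (Boto3) instead of the console. The following is only a minimal sketch: the connection name, RDS endpoint, credentials, subnet, security group, and availability zone are placeholders that you would replace with your own values:

    import boto3
    
    glue = boto3.client('glue', region_name='ap-south-1')
    
    # A minimal sketch of a JDBC connection to a MySQL RDS instance.
    # The endpoint, credentials, subnet and security group are placeholders.
    glue.create_connection(
        ConnectionInput={
            'Name': 'my-rds-mysql-connection',
            'ConnectionType': 'JDBC',
            'ConnectionProperties': {
                'JDBC_CONNECTION_URL': 'jdbc:mysql://<your-rds-endpoint>:3306/mydb',
                'USERNAME': '<your-username>',
                'PASSWORD': '<your-password>'
            },
            'PhysicalConnectionRequirements': {
                'SubnetId': '<your-subnet-id>',
                'SecurityGroupIdList': ['<your-security-group-id>'],
                'AvailabilityZone': 'ap-south-1a'
            }
        }
    )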

The AWS Glue Data Catalog is a central repository that stores structural and operational metadata for all your data assets. This metadata is extracted by Glue Crawlers, which connect to a data store through a Glue connection, crawl the data for its meta information, and extract the schema and other statistics.
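
The crawler, too, can be created and started with Boto3 rather than from the console. Here is a minimal sketch in which the crawler name, IAM role ARN, catalog database, and connection name are placeholders:

    import boto3
    
    glue = boto3.client('glue', region_name='ap-south-1')
    
    # A minimal sketch of a crawler that reads through the JDBC connection
    # above and writes table definitions into a Data Catalog database.
    glue.create_crawler(
        Name='my-rds-crawler',
        Role='<your-glue-service-role-arn>',
        DatabaseName='<your-glue-database-name>',
        Targets={
            'JdbcTargets': [
                {
                    'ConnectionName': 'my-rds-mysql-connection',
                    'Path': 'mydb/%'  # database name followed by the % character
                }
            ]
        }
    )
    
    # Run the crawler to populate the Data Catalog
    glue.start_crawler(Name='my-rds-crawler')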

In this blog post, I describe how we can leverage these Glue microservices to easily migrate data from a relational database to Amazon S3. In the example, we connect AWS Glue to an RDS instance for data migration.

  1. Add a Glue connection with the connection type set to Amazon RDS and the database engine set to MySQL, preferably in the same region as the data store, and then set up access to your data source.
  2. Add a Crawler with a "JDBC" data store and select the connection created in step 1. In the crawler's include path, add the database name followed by the percent (%) character. For instance, provide "mydb/%" in the include path for a database named "mydb".
  3. When asked to "Configure the crawler's output", add a new database. This database will contain the table definitions required to extract the data of all the tables in the source database.
  4. Create a Glue ETL job that runs "A new script to be authored by you" and specify the connection created in step 1.
  5. Replace the pre-generated script with the script below and update the following variables:
  • aws_region: Region code of the region in which your Glue service runs (e.g., ap-south-1)

  • s3_path: The S3 path (ending with a trailing slash) under which you would like your data to be migrated

  • glue_database: Glue Data Catalog database created by the crawler that crawled the data source

  • target_format: The format in which you want your data to be written (csv, parquet, etc.)

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import boto3
    
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    
    aws_region = "<your-aws-region-code>"
    s3_path = "<your-s3-prefix>"
    glue_database = "<your-glue-database-name>"
    target_format = "csv"
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    client = boto3.client(service_name='glue', region_name=aws_region)
    
    # Retrieve the table definitions created by the Glue Crawler
    responseGetTables = client.get_tables(DatabaseName=glue_database)
    
    tableList = responseGetTables['TableList']
    tables = []
    for tableDict in tableList:
      tables.append(tableDict['Name'])
    
    # Read each table from the Data Catalog and write it to S3 in the target format
    for table in tables:
      datasource = glueContext.create_dynamic_frame.from_catalog(database = glue_database, table_name = table)
      datasink = glueContext.write_dynamic_frame.from_options(frame = datasource, connection_type = "s3", connection_options = {"path": s3_path + table}, format = target_format)
    
    job.commit()

This ETL script uses the AWS SDK for Python (Boto3) to retrieve information about the tables created by the Glue Crawler. It then loops through the list of tables, creates a DynamicFrame from each one, and writes it to S3 in the specified format.
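
If you prefer to start the job programmatically rather than from the console, a minimal sketch with Boto3 looks like this (the job name is a placeholder):

    import boto3
    
    glue = boto3.client('glue', region_name='ap-south-1')
    
    # Kick off the migration job created in step 4 and print its run ID
    response = glue.start_job_run(JobName='<your-glue-job-name>')
    print(response['JobRunId'])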

If you have questions or suggestions, please leave a comment below.


© 2024 Ujjwal Bhardwaj. All Rights Reserved.