Data volumes are growing by the minute, and storing that data well is paramount for any analytics or business intelligence effort. Enterprises are now building data lakes to consolidate their federated data in a single location.
A well-structured ETL (Extract, Transform, and Load) process is key functionality for a data lake.
AWS provides various services for data transfer, and AWS Glue is its prime ETL offering. AWS Glue Studio adds a graphical user interface to ease the ETL process.
In this document, we demonstrate how to integrate MongoDB Atlas with AWS Glue. We walk through a practical guide for loading data from S3 to MongoDB Atlas using the AWS Glue Crawler, Mapping, and Data Catalog services.
This approach can be extended to any of the AWS Glue source connectors, such as CSV, XLS, text, RDBMS, and streaming data.
This article demonstrates the capabilities of the MongoDB Atlas and AWS Glue Studio integration.
MongoDB Atlas is an all-purpose database with features such as the document model, geospatial and time series support, hybrid deployments, and multi-cloud services. It has evolved into a "developer data platform," intended to reduce developers' workload in building applications and managing the database environment. It also provides a free tier for testing out application and database features.
AWS Glue Studio is a new graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on AWS Glue’s Apache Spark-based serverless ETL engine. You can inspect the schema and data results in each step of the job.
With AWS Glue Studio, we can create scripts that integrate with all of these data sources. In this module, we used the MongoDB Spark connector to connect to MongoDB Atlas, because as of this writing there is no direct MongoDB Atlas connector available in AWS Glue Studio.
Reference: Link
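To make this concrete, below is a minimal sketch of what such a Glue PySpark job script can look like. It is a sketch under stated assumptions, not the article's linked code: the job parameter names (S3_PATH, MONGODB_URI, DATABASE, COLLECTION) are placeholders we chose, and the MongoDB Spark connector JAR (v10.x) is assumed to be attached to the job as a dependent JAR.

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve the parameters configured on the Glue job. All names besides
# JOB_NAME are placeholders chosen for this sketch.
args = getResolvedOptions(
    sys.argv, ["JOB_NAME", "S3_PATH", "MONGODB_URI", "DATABASE", "COLLECTION"]
)

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Read the sample JSON file from S3 into a Spark DataFrame.
df = spark.read.json(args["S3_PATH"])

# Write the DataFrame to MongoDB Atlas through the MongoDB Spark
# connector. "mongodb" is the v10.x format name; 3.x connectors use
# "mongo" with slightly different option keys.
(
    df.write.format("mongodb")
    .option("connection.uri", args["MONGODB_URI"])
    .option("database", args["DATABASE"])
    .option("collection", args["COLLECTION"])
    .mode("append")
    .save()
)
```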
Set up the VPC, Subnet, NAT Gateway, and VPC Endpoints
Log in to the AWS console and search for VPC.
Click on VPC, then click "Create VPC."
Select "VPC and more" and choose the parameters as shown.
Click the "Security groups" option in the left-side menu.
Ensure the security group rules allow access to the database ports; a programmatic sketch follows these steps.
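For reference, the same rule can be added programmatically. This is a hypothetical boto3 sketch, not a required step: the region, group ID, and CIDR are placeholders, and 27017 is the standard MongoDB port (an Atlas private endpoint may expose different ports, so check the connection string Atlas provides).

```python
import boto3

# Placeholder region; use the region that hosts your VPC.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Allow inbound TCP on 27017, the standard MongoDB port, from within
# the VPC CIDR. The group ID and CIDR below are placeholders.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 27017,
            "ToPort": 27017,
            "IpRanges": [{"CidrIp": "10.0.0.0/16", "Description": "VPC CIDR"}],
        }
    ],
)
```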
Please follow the link to set up a free cluster in MongoDB Atlas.
Configure the database for network security and access.
Reference: Link
Set up AWS PrivateLink to secure the connection between MongoDB Atlas and the AWS VPC
Select the "Network Access" option from the side menu
Click on "Private Endpoint" and select "Dedicated Cluster."
Select "Add Private Endpoint" and follow the instructions to create the endpoint. Note that a few AWS CLI commands must be executed as part of this creation; a boto3 equivalent of the endpoint-creation call is sketched after these steps.
Ensure the private endpoint is created successfully before moving on to the next step.
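The CLI commands Atlas asks you to run boil down to creating an interface VPC endpoint against the service name Atlas generates. A minimal boto3 equivalent, with placeholder region and IDs, might look like this:

```python
import boto3

# Placeholder region; use the region of your VPC and Atlas cluster.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Create the interface endpoint for the service name shown in the
# Atlas "Add Private Endpoint" dialog. All IDs here are placeholders.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
)

# Atlas needs the resulting endpoint ID to finish pairing the endpoint.
print(response["VpcEndpoint"]["VpcEndpointId"])
```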
Upload the sample airport.json file to the S3 bucket
Click on "Jobs" from the menu and select the "PySpark script" option.
Click "Create Job"
Copy the code from the link and paste it into the script editor.
Configure the job parameters.
Save the job and click "Run"
Ensure the job ran successfully; you can refer to the logs for troubleshooting.
Validate that the S3 data has been created as documents in MongoDB Atlas, as in the sketch below.
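One quick way to validate the load is a small pymongo script run from a machine that can reach the cluster. The connection string, database, and collection names below are placeholders; use the same values the Glue job wrote to.

```python
from pymongo import MongoClient

# Placeholder connection string; reuse the one from the Glue job.
client = MongoClient("mongodb+srv://user:[email protected]")
collection = client["travel"]["airports"]  # placeholder names

# Count the loaded documents and inspect one of them.
print(collection.count_documents({}))
print(collection.find_one())
```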
We hope this technical guide helped you migrate data into a MongoDB Atlas cluster using AWS Glue.
This solution can be extended to any AWS Glue-compatible data source, and the data can likewise be migrated to MongoDB Atlas.
For any assistance, please reach out to [email protected]