Babusrinivasan76/atlasgluestudiointegration

To demonstrate the integration between MongoDB Atlas and AWS Glue Studio

ETL with MongoDB Atlas and AWS Glue Studio (through PySpark scripts)

Introduction

Data volumes grow by the minute, and storing them well is paramount for any analytics or business intelligence effort. Enterprises are now building data lakes to consolidate their federated data in a single location.

A well-structured ETL (Extract, Transform, and Load) process is key to keeping a data lake usable.

AWS provides various services for data transfer, with AWS Glue as its primary ETL offering. AWS Glue Studio adds a graphical user interface that eases the ETL process.

In this document, we demonstrate how to integrate MongoDB Atlas with the AWS Glue services: a practical guide to loading data from S3 into MongoDB Atlas using the AWS Glue Crawler, Mapping, and Data Catalog services.

The same approach extends to any AWS Glue source connector, such as CSV, XLS, text, RDBMS, or streaming sources.

MongoDB Atlas is a general-purpose database with features such as the document model, geospatial and time-series data, hybrid deployment, and multi-cloud services. It has evolved into a "Developer Data Platform", intended to reduce developers' workload in building and managing the database environment. It also provides a free tier for testing out application and database features.

AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually compose data transformation workflows, run them seamlessly on AWS Glue's Apache Spark-based serverless ETL engine, and inspect the schema and data results at each step of the job.

Integration Features

With AWS Glue Studio, we can create scripts that integrate with any data source. As of now, AWS Glue Studio offers no direct connector to MongoDB Atlas, so in this module we use MongoDB's Spark connector to make the connection.
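The generated PySpark script therefore has to pass the Atlas connection details explicitly. A minimal sketch of the options Glue's `mongodb` connection type expects; the host, database, collection, and credentials are placeholders, and some Glue versions spell the first key `connection.uri`:

```python
# Hypothetical connection options for Glue's "mongodb" connection type.
# The Atlas host, database, collection, and credentials are placeholders.
atlas_connection_options = {
    "uri": "mongodb+srv://cluster0.example.mongodb.net",  # some Glue versions use "connection.uri"
    "database": "travel",
    "collection": "airports",
    "username": "atlasuser",
    "password": "<atlas-password>",
    "ssl": "true",
    "ssl.domain_match": "false",
}
```

These options are what the job in the final step hands to `write_dynamic_frame.from_options`.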

Steps for Integration

1. Set up the AWS network components

Reference: Link

Set up the VPC, subnets, NAT gateway, and VPC endpoints.

Log in to the AWS console and search for VPC. Open the VPC service and click "Create VPC". Select "VPC and more" and choose the parameters as shown.

Click "Create".
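For repeatable setups, the console steps above can also be scripted. A sketch with boto3, in which the `10.0.0.0/16` CIDR and the subnet layout are placeholder values you would adapt:

```python
import ipaddress

def plan_subnets(vpc_cidr="10.0.0.0/16", prefix=20, count=4):
    """Carve `count` equal subnets out of the VPC CIDR (mirrors "VPC and more")."""
    net = ipaddress.ip_network(vpc_cidr)
    return [str(s) for s in list(net.subnets(new_prefix=prefix))[:count]]

def create_etl_vpc(cidr="10.0.0.0/16", region="us-east-1"):
    """Sketch: create the VPC with the DNS attributes Glue connections require."""
    import boto3  # imported lazily so the sketch can be read without the AWS SDK
    ec2 = boto3.client("ec2", region_name=region)
    vpc_id = ec2.create_vpc(CidrBlock=cidr)["Vpc"]["VpcId"]
    ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsSupport={"Value": True})
    ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={"Value": True})
    for cidr_block in plan_subnets(cidr):
        ec2.create_subnet(VpcId=vpc_id, CidrBlock=cidr_block)
    return vpc_id
```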

2. Set up the security group

Click on the "Security groups" option in the left-side menu.

Ensure the security group rules allow access to the database ports.
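As a sketch, the rule below opens the default MongoDB port (27017) to traffic from inside the VPC; the CIDR and security group ID are placeholders:

```python
def mongodb_ingress_rule(cidr="10.0.0.0/16", port=27017):
    """Ingress permission opening the MongoDB port to traffic from the given CIDR."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr, "Description": "MongoDB Atlas access"}],
    }

def open_database_port(group_id, cidr="10.0.0.0/16"):
    """Sketch: authorize the rule on an existing security group."""
    import boto3  # imported lazily so the sketch can be read without the AWS SDK
    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id, IpPermissions=[mongodb_ingress_rule(cidr)]
    )
```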

3. Set up the MongoDB Atlas cluster

Please follow the link to set up a free cluster in MongoDB Atlas.

Configure the database for network security and access.

4. Set up PrivateLink

Reference: Link

Set up the PrivateLink service to secure the connection between MongoDB Atlas and the AWS VPC.

Select the "Network Access" option from the side menu

Click on "Private Endpoint" and select "Dedicated Cluster".

Select "Add Private Endpoint" and follow the instructions to create the endpoint. Note that some AWS CLI commands must be executed as part of this creation.

Ensure the private endpoint is created successfully before moving to the next step.
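The CLI commands Atlas displays can equally be issued through boto3. A sketch, where the service name is the `com.amazonaws.vpce…` value copied from the Atlas dialog and all IDs are placeholders:

```python
def is_atlas_endpoint_service(service_name):
    """Loose sanity check on the endpoint service name Atlas displays
    (assumed format: com.amazonaws.vpce.<region>.vpce-svc-<id>)."""
    return service_name.startswith("com.amazonaws.vpce.") and ".vpce-svc-" in service_name

def create_atlas_endpoint(vpc_id, service_name, subnet_ids, security_group_ids):
    """Sketch: create the interface endpoint that backs the Atlas private link."""
    import boto3  # imported lazily so the sketch can be read without the AWS SDK
    assert is_atlas_endpoint_service(service_name)
    ec2 = boto3.client("ec2")
    resp = ec2.create_vpc_endpoint(
        VpcId=vpc_id,
        ServiceName=service_name,
        VpcEndpointType="Interface",
        SubnetIds=subnet_ids,
        SecurityGroupIds=security_group_ids,
    )
    return resp["VpcEndpoint"]["VpcEndpointId"]
```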

5. Upload the sample JSON file to an S3 bucket

Upload the sample airport.json file to the S3 bucket
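A sketch of the upload with boto3; the bucket name and key are placeholders, and the helper simply checks that every line of the file parses as a JSON document before uploading, which is the layout the Glue crawler classifies most easily:

```python
import json

def validate_json_lines(payload: bytes):
    """Parse each non-empty line as JSON; return the number of documents found."""
    docs = [json.loads(line) for line in payload.splitlines() if line.strip()]
    return len(docs)

def upload_sample(path="airport.json", bucket="my-glue-demo-bucket", key="input/airport.json"):
    """Sketch: validate the sample file, then upload it to S3."""
    import boto3  # imported lazily so the sketch can be read without the AWS SDK
    with open(path, "rb") as fh:
        body = fh.read()
    validate_json_lines(body)  # raises if any line is not valid JSON
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
```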

6. Set up the AWS Glue Data Catalog

Create the connection to the S3 bucket

Validate the connection by clicking the "Test Connection"

Create a crawler that runs on demand to pull the data from S3 into the Glue database.

Run the crawler, and after it completes successfully, validate the database and table.
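The crawler can also be defined programmatically. A sketch assuming a hypothetical IAM role, catalog database, and bucket path; omitting the `Schedule` key is what makes the crawler run on demand:

```python
def crawler_config(name, role_arn, database, s3_path):
    """On-demand crawler definition; no "Schedule" key means run on demand."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def create_and_run_crawler(cfg):
    """Sketch: register the crawler in Glue and start it."""
    import boto3  # imported lazily so the sketch can be read without the AWS SDK
    glue = boto3.client("glue")
    glue.create_crawler(**cfg)
    glue.start_crawler(Name=cfg["Name"])
```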

7. Create and run a Glue Studio job

Click on "Jobs" from the menu and select "PySpark Scripts".

Click "Create Job"

Copy the code from the link and paste it.

Configure the parameters

Save the job and click "Run"

Ensure the job ran successfully. You can refer to the logs for troubleshooting.
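The job script follows the shape Glue Studio generates: read the crawled table from the Data Catalog, apply a mapping, and write through the `mongodb` connection type. A sketch with hypothetical catalog, cluster, and credential names; the `RUNNING_IN_GLUE` guard is our own flag so the Glue-runtime-only imports do not execute elsewhere:

```python
import os

def apply_mapping_spec(columns):
    """Build the (source, type, target, type) tuples ApplyMapping expects,
    keeping each column name and type unchanged."""
    return [(name, typ, name, typ) for name, typ in columns]

if os.environ.get("RUNNING_IN_GLUE"):  # hypothetical flag: body runs only inside Glue
    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the table the crawler created; database and table names are placeholders.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="airportdb", table_name="airport_json"
    )
    mapped = ApplyMapping.apply(
        frame=dyf,
        mappings=apply_mapping_spec([("code", "string"), ("name", "string")]),
    )

    # Write to Atlas over the Spark connector; host and credentials are placeholders.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="mongodb",
        connection_options={
            "uri": "mongodb+srv://cluster0.example.mongodb.net",
            "database": "travel",
            "collection": "airports",
            "username": "atlasuser",
            "password": os.environ["ATLAS_PASSWORD"],
        },
    )
    job.commit()
```

Passing the password through a job parameter or environment variable, as sketched here, keeps credentials out of the script itself.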

8.Validate the Data in MongoDB Atlas

Validate that the S3 data has been created as documents in MongoDB Atlas.
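A quick check with pymongo, assuming the same hypothetical cluster, database, and collection names used in the job:

```python
def atlas_uri(host, username, password):
    """Hypothetical SRV connection string for the Atlas cluster."""
    return f"mongodb+srv://{username}:{password}@{host}/?retryWrites=true&w=majority"

def count_loaded_documents(host, username, password,
                           database="travel", collection="airports"):
    """Count the documents the Glue job wrote into the target collection."""
    from pymongo import MongoClient  # imported lazily; requires the pymongo package
    client = MongoClient(atlas_uri(host, username, password))
    return client[database][collection].count_documents({})
```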

Summary

We hope this technical guide helped you migrate data into a MongoDB Atlas cluster using the AWS Glue services.

This solution can be extended to any AWS Glue-compatible data source, which can be migrated to MongoDB Atlas in the same way.

For any assistance, please reach out to [email protected]
