Skip to content
#

aws-emr

Here are 57 public repositories matching this topic...

aws-data-pipeline

A batch processing data pipeline, using AWS resources (S3, EMR, Redshift, EC2, IAM), provisioned via Terraform, and orchestrated from locally hosted Airflow containers. The end product is a Superset dashboard and a Postgres database, hosted on an EC2 instance at this address (powered down):

  • Updated May 14, 2022
  • Python

Daily Incremental load ETL pipeline for Ecommerce company using AWS Lambda and AWS EMR cluster, Deployed using Apache airflow in a docker container.

  • Updated Mar 17, 2023
  • Python

We Build an ETL pipeline using Airflow that accomplishes the following: Downloads data from an AWS S3 bucket, Runs a Spark/Spark SQL job on the downloaded data producing a cleaned-up dataset of delivery deadline missing orders and then Upload the cleaned-up dataset back to the same S3 bucket in a folder primed for higher level analytics

  • Updated Feb 25, 2023
  • Python

Improve this page

Add a description, image, and links to the aws-emr topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the aws-emr topic, visit your repo's landing page and select "manage topics."

Learn more