pnraj/Projects

Data Engineering Projects

Project Description

This repository contains a collection of resources and code for aspiring data engineers. It aims to provide a solid foundation and practical guidance for anyone pursuing a career in the field of data engineering.

PROJECTS

API TO RDS USING LAMBDA WITH SLACK ERROR MONITORING

Project Workflow

  • Using AWS Lambda, data is fetched from an API endpoint, processed, and loaded into AWS RDS at 15-second intervals.
  • Two Lambda functions make up the pipeline. The first Lambda is invoked by an AWS Step Functions state machine, which is in turn triggered every minute by a CloudWatch / EventBridge rule until the rule is disabled.
  • The second Lambda function fetches the API response and loads it into AWS RDS.
  • The AWS Step Functions workflow is defined in ASL (Amazon States Language), which has a JSON-based structure.
  • If any error or database connection problem occurs, a notification is sent to a Slack channel using slack_sdk.
  • All connections between AWS services are governed by IAM roles and policies.
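The second Lambda in the pipeline above might look roughly like this minimal sketch. The API URL, table name, and record fields (`id`, `price`, `ts`) are placeholders, not the project's actual schema, and the third-party imports (`slack_sdk`, `pymysql`) are loaded lazily inside the functions that need them:

```python
import json
import os
import urllib.request


def fetch_api(url):
    # Fetch and parse the JSON payload from the source API.
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())


def to_rows(records):
    # Flatten API records into tuples matching the RDS table columns.
    # The field names here ("id", "price", "ts") are placeholders.
    return [(r["id"], r["price"], r["ts"]) for r in records]


def notify_slack(message):
    # Post a failure message to a Slack channel via slack_sdk.
    from slack_sdk import WebClient  # lazy import: only needed on error

    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
    client.chat_postMessage(channel="#pipeline-alerts", text=message)


def lambda_handler(event, context):
    # Second Lambda: fetch the API and load into RDS, reporting any
    # failure (including DB connection errors) to Slack.
    try:
        rows = to_rows(fetch_api(os.environ["API_URL"]))
        import pymysql  # lazy import keeps the testable surface small

        conn = pymysql.connect(
            host=os.environ["RDS_HOST"],
            user=os.environ["RDS_USER"],
            password=os.environ["RDS_PASSWORD"],
            database=os.environ["RDS_DB"],
        )
        try:
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO prices (id, price, ts) VALUES (%s, %s, %s)",
                    rows,
                )
            conn.commit()
        finally:
            conn.close()
        return {"statusCode": 200, "loaded": len(rows)}
    except Exception as exc:
        notify_slack(f"Pipeline failure: {exc!r}")
        raise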

SPARK-ENABLED EXTRACTION AND LOADING INTO AWS RDS

Project Workflow

  • There are two parts to this project.
  • Part 1 fetches data from SEC.gov as a ZIP archive containing more than 8.5 lakh (850,000) JSON files, around 6 GB after uncompressing.
  • Using Apache Spark (PySpark) on Databricks, the JSON files are converted into PySpark DataFrames, with each JSON file representing a single row in the DataFrame. The DataFrame is then written back out as JSON and uploaded to AWS S3.
  • Part 2 fetches the data from AWS S3, applies the needed transformations, and loads it into an AWS RDS MySQL instance.
  • The data from S3 is converted into a PySpark DataFrame, and only the columns needed for RDS are isolated before upload.
  • Important functions used for the transformations are join, posexplode_outer, udf, concat, to_date, struct, and Row.
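The "one file = one row" step in Part 1 can be sketched without Spark. This stand-in uses only the standard library to show the shape of the transformation, whereas the actual project builds a PySpark DataFrame on Databricks:

```python
import json
from pathlib import Path


def files_to_rows(json_dir):
    # Each JSON file under json_dir becomes one row (a dict), mirroring
    # the "one file = one DataFrame row" layout of the PySpark job.
    rows = []
    for path in sorted(Path(json_dir).glob("*.json")):
        record = json.loads(path.read_text())
        record["source_file"] = path.name  # keep provenance as a column
        rows.append(record)
    return rows


def write_json_lines(rows, out_path):
    # Serialise the rows back to newline-delimited JSON, analogous to
    # the JSON output the project uploads to S3 for Part 2 to consume.
    with open(out_path, "w") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
```

In the real pipeline the equivalent of `rows` would be fed to `spark.createDataFrame`, and nested fields would then be unpacked with `posexplode_outer` and friends in Part 2.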

YouTube Data Harvesting and Warehousing

Project Workflow

  • Ability to input a YouTube channel ID and retrieve all the relevant data using the Google API.
  • Option to store the data in a MongoDB database as a data lake.
  • Ability to collect data for up to 10 different YouTube channels and store them in the data lake based on user requirements.
  • Option to select a channel name and migrate its data from the data lake to a MySQL database as tables.
  • Ability to search and retrieve data from the SQL database using different search options, including joining tables to get channel details.
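The retrieval and shaping steps could look like this sketch. `fetch_channel` assumes the google-api-python-client package and uses the YouTube Data API v3 `channels.list` endpoint; the document field names chosen for the MongoDB data lake are illustrative:

```python
def fetch_channel(channel_id, api_key):
    # Retrieve channel snippet + statistics with the YouTube Data API v3.
    from googleapiclient.discovery import build  # lazy third-party import

    youtube = build("youtube", "v3", developerKey=api_key)
    request = youtube.channels().list(part="snippet,statistics", id=channel_id)
    return request.execute()


def to_document(response):
    # Shape the API response into the document stored in the MongoDB
    # data lake; the output keys here are illustrative placeholders.
    item = response["items"][0]
    return {
        "channel_id": item["id"],
        "channel_name": item["snippet"]["title"],
        "subscribers": int(item["statistics"]["subscriberCount"]),
        "video_count": int(item["statistics"]["videoCount"]),
    }
```

Keeping `to_document` separate from the API call makes the MongoDB-to-MySQL migration step easy to test, since it only ever sees plain dicts.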

PhonePe Pulse Data Analysis 2018-2022

Project Workflow

  • The PhonePe payment app data is fetched in JSON format from a GitHub repo.
  • The JSON files are separated by quarter (every 3 months) for the years 2018-2022, for every state and district in India.
  • Using the Python os module, a pipeline iterates through each folder, reads the data from the JSON files, and converts it into a pandas DataFrame.
  • The JSON files contain details about transaction amounts and the locations where users made those transactions.
  • From the DataFrame, visualizations are built using Plotly and Streamlit, including geo, bar, line, pie, and area charts.
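The folder-walking pipeline can be sketched as follows. The folder layout (`<state>/<year>/<quarter>.json`) and the key path (`data` → `transactionData` → `paymentInstruments`) are assumptions based on the PhonePe Pulse aggregated-transaction files:

```python
import json
import os


def collect_rows(data_root):
    # Walk the <state>/<year>/<quarter>.json folder tree and build one
    # flat row per transaction category per file. The nested key path
    # follows the PhonePe Pulse aggregated-transaction layout.
    rows = []
    for dirpath, _dirnames, filenames in os.walk(data_root):
        for name in sorted(filenames):
            if not name.endswith(".json"):
                continue
            state = os.path.basename(os.path.dirname(dirpath))
            year = int(os.path.basename(dirpath))
            quarter = int(name.removesuffix(".json"))
            with open(os.path.join(dirpath, name)) as fh:
                payload = json.load(fh)
            for entry in payload["data"]["transactionData"]:
                instrument = entry["paymentInstruments"][0]
                rows.append(
                    {
                        "state": state,
                        "year": year,
                        "quarter": quarter,
                        "type": entry["name"],
                        "count": instrument["count"],
                        "amount": instrument["amount"],
                    }
                )
    return rows
```

The flat `rows` list drops straight into `pd.DataFrame(rows)`, which is what the Plotly/Streamlit charts consume.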

Twitter Scraping

Project Workflow

  • Based on user needs, tweets are extracted and uploaded into MongoDB through a Streamlit-based UI.
  • Users enter the tweet topic or hashtag, a start date, an end date, and the total number of tweets to extract in the app.
  • The app fetches the data using snscrape, converts it into a pandas DataFrame, and displays it in tabular format.
  • After checking the data, users have the option to download it as JSON or CSV, or to upload it into MongoDB.
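The scraping flow can be sketched like this. `build_query` composes the search-operator syntax snscrape passes through to Twitter search, and the tweet attribute names are an assumption (`rawContent` in recent snscrape versions; older versions used `content`):

```python
def build_query(topic, since, until):
    # Compose a search query in the operator syntax snscrape accepts,
    # e.g. "datascience since:2023-01-01 until:2023-02-01".
    return f"{topic} since:{since} until:{until}"


def scrape_tweets(query, limit):
    # Yield up to `limit` tweets as plain dicts, ready for a pandas
    # DataFrame (display) or a MongoDB insert_many (upload).
    import snscrape.modules.twitter as sntwitter  # lazy third-party import

    scraper = sntwitter.TwitterSearchScraper(query)
    for i, tweet in enumerate(scraper.get_items()):
        if i >= limit:
            break
        yield {
            "id": tweet.id,
            "date": tweet.date.isoformat(),
            "user": tweet.user.username,
            "content": tweet.rawContent,  # `tweet.content` in older versions
        }
```

Yielding plain dicts keeps the downstream choices open: `pd.DataFrame(scrape_tweets(q, n))` for the table view, or `collection.insert_many(list(...))` for the MongoDB upload.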

License

This project is licensed under the MIT License. Please review the license file for more details.

Contact

If you have any questions or suggestions regarding this project, feel free to reach out to me at pnrajk@gmail.com.