
UserInsight-Streaming-Data-Pipeline

Overview

The UserInsight-Streaming-Data-Pipeline is a real-time data processing pipeline that ingests data from the RandomUser API into Kafka, processes it with Apache Spark, and stores it in AWS S3. An AWS Lambda function is triggered when new data arrives in S3 and loads it into AWS Redshift. The data is then used to build a dashboard in Looker Studio. Key components such as Kafka and Spark can be easily installed and managed with Docker.
!! You can view the dashboard here. !!

Architecture


1. Fetch data from the RandomUser API and publish each record as a message to Kafka (a producer sketch follows this list).
2. Spark reads the data from Kafka and processes it.
3. The processed data is stored in S3, which serves as the data lake.
4. When a new file lands in S3, it triggers a Lambda function.
5. The Lambda function loads the data from S3 into Redshift.
6. The data in Redshift feeds dashboards in Looker Studio for insights and reporting.
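
For concreteness, here is a minimal sketch of step 1 in Python, assuming the kafka-python client and the environment variable names introduced in the setup section below (link_api, kafka_servers, kafka_topic_name); the actual producer script in this repository may be structured differently.

import json
import os
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=os.environ["kafka_servers"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Fetch one random user profile from the RandomUser API
    user = requests.get(os.environ["link_api"], timeout=10).json()["results"][0]
    # Publish the raw JSON record to the Kafka topic
    producer.send(os.environ["kafka_topic_name"], value=user)
    producer.flush()
    time.sleep(5)  # simple polling interval; adjust as needed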

Dashboard


I use Looker Studio to create dashboards using data from the data warehouse.

!! You can view the dashboard here. !!

A special note

While developing this project, I connected Looker Studio directly to AWS Redshift. However, due to AWS free-tier limits, Redshift cannot run continuously. As a result, the dashboard now uses data from a CSV file exported from Redshift, but it looks the same as it did when it was connected directly to Redshift.

Tools & Technologies

  • Cloud: Amazon Web Services (AWS)
  • Containerization: Docker, Docker Compose
  • Stream Processing: Apache Kafka, Apache Spark
  • Data Lake: AWS S3
  • Serverless Computing: AWS Lambda
  • Data Warehouse: AWS Redshift
  • Data Visualization: Looker Studio
  • Programming Language: Python

Set up

1. Clone the Repository

Clone the GitHub repository and navigate to the project directory:

git clone https://github.com/mikecerton/UserInsight-Streaming-Data-Pipeline.git
cd UserInsight-Streaming-Data-Pipeline

2. Set Up AWS Services

Ensure the following AWS resources are created before running the pipeline:

  • Amazon Redshift – for data warehousing
  • Amazon S3 – for storing processed data
  • AWS Lambda – for automating the load from S3 into Redshift (a handler sketch follows this list)
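
As a rough illustration of how the Lambda piece fits together, the handler below submits a Redshift COPY through the Redshift Data API (boto3's redshift-data client) when an S3 ObjectCreated event fires. The table name, the PARQUET format, and the redshift_cluster_id variable are assumptions for the sketch, not the exact code in this repository.

import os

import boto3

redshift_data = boto3.client("redshift-data")

def lambda_handler(event, context):
    # Build the S3 URI of the object that triggered this invocation
    record = event["Records"][0]["s3"]
    s3_uri = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # COPY the new file from S3 into a Redshift table (table name and format are illustrative)
    sql = (
        f"COPY user_insight.raw_users FROM '{s3_uri}' "
        f"IAM_ROLE '{os.environ['iam_role']}' FORMAT AS PARQUET;"
    )
    redshift_data.execute_statement(
        ClusterIdentifier=os.environ["redshift_cluster_id"],
        Database=os.environ["redshift_db"],
        DbUser=os.environ["redshift_user"],
        Sql=sql,
    )
    return {"status": "COPY submitted", "file": s3_uri}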

3. Configure Environment Variables

Create a .env file in the project root directory and add your AWS and Kafka credentials:

AWS_ACCESS_KEY_ID = your_value
AWS_SECRET_ACCESS_KEY = your_value
AWS_REGION = your_value

kafka_servers = your_value
kafka_cid = your_value
kafka_topic_name = your_value
link_api = your_value

s3_output_path = your_value

redshift_host = your_value
redshift_port = 5439
redshift_db = your_value
redshift_user = your_value
redshift_password = your_value
iam_role = your_value

Note: Keep your .env file secure and do not share it publicly.
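
If you read these values from Python, one common pattern is python-dotenv, as in the sketch below; this is an assumption about tooling, and the repository's scripts may load their configuration differently.

import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

kafka_servers = os.getenv("kafka_servers")
s3_output_path = os.getenv("s3_output_path")

# Fail fast if a required variable is missing
assert kafka_servers and s3_output_path, "Missing Kafka/S3 settings in .env"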

4. Start Docker Containers

Run the following command to start Kafka and Spark services:

docker-compose -f docker_kafka.yml -f docker_spark.yml up -d

5. Run the Spark Streaming Job

Execute the Spark job to process data from Kafka and store it in S3:

docker exec -it spark-worker /bin/bash
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.5 /opt/bitnami/my_spark/spark_stream_s3.py
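
For orientation, the core of a job like spark_stream_s3.py usually follows the pattern below: readStream from Kafka, parse the JSON payload, and writeStream to S3. The schema, output format, and checkpoint path here are illustrative assumptions rather than the repository's exact code.

import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("user_insight_stream").getOrCreate()

# Illustrative schema; the real job extracts more RandomUser fields
schema = StructType([
    StructField("gender", StringType()),
    StructField("email", StringType()),
    StructField("nat", StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", os.environ["kafka_servers"])
       .option("subscribe", os.environ["kafka_topic_name"])
       .option("startingOffsets", "latest")
       .load())

# Kafka values arrive as bytes; cast to string and parse the JSON payload
users = (raw.selectExpr("CAST(value AS STRING) AS json")
         .select(from_json(col("json"), schema).alias("data"))
         .select("data.*"))

query = (users.writeStream
         .format("parquet")
         .option("path", os.environ["s3_output_path"])
         .option("checkpointLocation", os.environ["s3_output_path"] + "/_checkpoints")
         .outputMode("append")
         .start())

query.awaitTermination()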

Disclaimer
