The UserInsight-Streaming-Data-Pipeline is a real-time data processing pipeline that ingests data from the RandomUser API into Kafka, processes it with Apache Spark, and stores it in AWS S3. An AWS Lambda function is triggered when new data arrives in S3 and loads it into AWS Redshift. The data is then used to build a dashboard in Looker Studio. Key components such as Kafka and Spark can be installed and managed easily with Docker.
!! You can view the dashboard here. !!
1. Data is fetched from the RandomUser API and sent to Kafka (see the producer sketch after this list).
2. Spark reads the data from Kafka and processes it.
3. The processed data is stored in S3 as a data lake.
4. When a new file is saved in S3, it triggers a Lambda function.
5. The Lambda function loads the data from S3 into Redshift.
6. Data from Redshift is used to create dashboards in Looker Studio for insights and reporting.
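As an illustration of step 1, the sketch below fetches a profile from the RandomUser API and publishes it to Kafka. This is a minimal sketch, assuming the kafka-python and requests packages and the variable names from the .env example further down; the actual producer script in the repository may differ.

```python
# Hedged sketch of step 1: fetch one profile from the RandomUser API and
# publish it to the Kafka topic. Assumes kafka-python and requests.
import json
import os

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=os.getenv("kafka_servers"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# RandomUser returns {"results": [...], "info": {...}}; keep the first profile.
response = requests.get(os.getenv("link_api", "https://randomuser.me/api/"))
user = response.json()["results"][0]

producer.send(os.getenv("kafka_topic_name"), value=user)
producer.flush()
```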
I use Looker Studio to create dashboards from the data in the data warehouse.
!! You can view the dashboard here. !!
While developing this project, I connected Looker Studio directly to AWS Redshift. However, because of AWS Free Tier limits, Redshift cannot run continuously. As a result, the dashboard now uses a CSV file exported from Redshift, but it looks the same as it did when connected directly to Redshift.
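For reference, a CSV like the one backing the dashboard could be exported from Redshift with a short script such as the sketch below. It assumes psycopg2 and the connection settings from the .env example further down; the table name `users` is a placeholder, not the repository's actual schema.

```python
# Hedged sketch: dump a Redshift table to CSV for Looker Studio.
# Assumes psycopg2 and the .env variable names used later in this README.
import csv
import os

import psycopg2

conn = psycopg2.connect(
    host=os.getenv("redshift_host"),
    port=os.getenv("redshift_port"),
    dbname=os.getenv("redshift_db"),
    user=os.getenv("redshift_user"),
    password=os.getenv("redshift_password"),
)

with conn, conn.cursor() as cur, open("users_export.csv", "w", newline="") as f:
    cur.execute("SELECT * FROM users;")  # placeholder table name
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row
    writer.writerows(cur)                                  # data rows
```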
- Cloud: Amazon Web Services (AWS)
- Containerization: Docker, Docker Compose
- Stream Processing: Apache Kafka, Apache Spark
- Data Lake: AWS S3
- Serverless Computing: AWS Lambda
- Data Warehouse: AWS Redshift
- Data Visualization: Looker Studio
- Programming Language: Python
Clone the GitHub repository and navigate to the project directory:
git clone https://github.com/mikecerton/UserInsight-Streaming-Data-Pipeline.git
cd UserInsight-Streaming-Data-Pipeline
Ensure the following AWS resources are created before running the pipeline:
- Amazon Redshift – for data warehousing
- Amazon S3 – for storing processed data
- AWS Lambda – for automating data transfers from S3 into Redshift (see the sketch below)
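For illustration, the Lambda step could look roughly like the sketch below: a handler that reacts to an S3 event and issues a Redshift COPY. This is only a sketch, assuming psycopg2 is packaged with the function and that the Lambda environment variables mirror the .env example below; the target table `users` and the Parquet format are placeholders, not the repository's exact code.

```python
# Hedged sketch of the S3 -> Redshift load step triggered by an S3 event.
import os
from urllib.parse import unquote_plus

import psycopg2

def lambda_handler(event, context):
    # Pick the bucket and object key out of the S3 event notification.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])

    conn = psycopg2.connect(
        host=os.environ["redshift_host"],
        port=os.environ["redshift_port"],
        dbname=os.environ["redshift_db"],
        user=os.environ["redshift_user"],
        password=os.environ["redshift_password"],
    )
    with conn, conn.cursor() as cur:
        # COPY the new object into a placeholder table using the IAM role.
        cur.execute(
            f"COPY users FROM 's3://{bucket}/{key}' "
            f"IAM_ROLE '{os.environ['iam_role']}' FORMAT AS PARQUET;"
        )
    conn.close()
    return {"status": "loaded", "object": f"s3://{bucket}/{key}"}
```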
Create a .env file in the project root directory and add your AWS and Kafka credentials:
AWS_ACCESS_KEY_ID = your_data
AWS_SECRET_ACCESS_KEY = your_data
AWS_REGION = your_data
kafka_servers = your_data
kafka_cid = your_data
kafka_topic_name = your_data
link_api = your_data
s3_output_path = your_data
redshift_host = your_data
redshift_port = your_data
redshift_db = your_data
redshift_user = your_data
redshift_password = your_data
iam_role = your_data
Note: Keep your .env file secure and do not share it publicly.
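The sketch below shows one way these settings could be read in Python, assuming the python-dotenv package; the repository's scripts may load them differently.

```python
# Minimal sketch: load the .env file and read a few of the settings above.
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root

kafka_servers = os.getenv("kafka_servers")
kafka_topic = os.getenv("kafka_topic_name")
s3_output_path = os.getenv("s3_output_path")

print(kafka_servers, kafka_topic, s3_output_path)
```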
Run the following command to start Kafka and Spark services:
docker-compose -f docker_kafka.yml -f docker_spark.yml up -d
Open a shell in the Spark worker container and submit the Spark job that processes data from Kafka and stores it in S3:
docker exec -it spark-worker /bin/bash
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.5 /opt/bitnami/my_spark/spark_stream_s3.py
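As a rough sketch of what spark_stream_s3.py might do (the actual job in the repository may differ), the job reads the Kafka topic as a stream and writes the raw JSON payloads to S3 as Parquet. The column name, checkpoint path, and output format below are assumptions.

```python
# Hedged sketch of a Kafka -> S3 structured streaming job.
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("user-insight-stream").getOrCreate()

# Read the Kafka topic as a streaming DataFrame.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", os.getenv("kafka_servers"))
    .option("subscribe", os.getenv("kafka_topic_name"))
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; keep the JSON payload as a string column.
users = stream.select(col("value").cast("string").alias("user_json"))

# Write each micro-batch to the S3 output path as Parquet.
query = (
    users.writeStream.format("parquet")
    .option("path", os.getenv("s3_output_path"))
    .option("checkpointLocation", "/tmp/checkpoints/user-insight")  # placeholder
    .outputMode("append")
    .start()
)
query.awaitTermination()
```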
- RandomUser:
  https://randomuser.me/documentation
- Apache Kafka:
  https://kafka.apache.org/documentation/
  https://hub.docker.com/r/apache/kafka
- Apache Spark:
  https://spark.apache.org/docs/latest/
  https://hub.docker.com/_/spark/
  https://bitnami.com/stacks/spark
  https://hub.docker.com/r/bitnami/spark/
- AWS:
  https://docs.aws.amazon.com/s3/
  https://docs.aws.amazon.com/redshift/
  https://docs.aws.amazon.com/lambda/