This project implements a scalable data pipeline architecture that combines Apache Spark's processing capabilities with AWS services for data storage, cataloging, and analysis. The pipeline efficiently processes multiple data formats and supports a range of visualization options for data analysis.
Supports multiple input file formats:
- CSV
- TXT
- Parquet
- JSON
- Other structured and semi-structured formats
Uses Spark Streaming for real-time data processing
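As a rough sketch of such a streaming job, a PySpark Structured Streaming application could watch an S3 prefix for newly arriving CSV files and write the results back as Parquet. The bucket paths, schema, and standalone master URL below are placeholders, not values defined by this project:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Connect to the standalone cluster; the master URL here is a placeholder.
spark = (
    SparkSession.builder
    .appName("data-pipeline-streaming")
    .master("spark://spark-master:7077")
    .getOrCreate()
)

# Streaming sources require an explicit schema; this one is illustrative only.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("value", DoubleType()),
])

# Watch the raw S3 prefix for new CSV files as they land.
raw = (
    spark.readStream
    .schema(schema)
    .option("header", True)
    .csv("s3a://my-bucket/raw/")
)

# Write the stream back to S3 as Parquet for downstream cataloging and queries.
query = (
    raw.writeStream
    .format("parquet")
    .option("path", "s3a://my-bucket/processed/")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```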
Apache Spark Infrastructure
- Driver: Manages application execution and coordinates processing
- Master Node: Orchestrates cluster resources and task distribution
- Worker Nodes: Execute distributed data processing tasks
Storage & Processing
- Amazon S3: Raw data storage and data lake implementation
- AWS Glue: Managed ETL service for data transformation
- AWS Data Catalog: Central metadata repository
- Glue Crawler: Automated metadata discovery and schema inference
- Amazon Redshift: Enterprise data warehouse for complex analytics
- Amazon Athena: Serverless query service for S3 data
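As a minimal sketch of how these services fit together, the Glue Crawler can be triggered to refresh the Data Catalog and the resulting table queried through Athena with boto3. The crawler, database, table, and bucket names below are placeholders, not resources defined by this project:

```python
import time
import boto3

# Placeholder resource names for illustration only.
CRAWLER_NAME = "raw-data-crawler"
DATABASE = "data_lake"
RESULTS_LOCATION = "s3://my-bucket/athena-results/"

glue = boto3.client("glue")
athena = boto3.client("athena")

# Trigger the crawler so new output in S3 is registered in the Data Catalog.
glue.start_crawler(Name=CRAWLER_NAME)

# Poll until the crawler returns to the READY state.
while glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] != "READY":
    time.sleep(15)

# Query the cataloged table with Athena; results land in the S3 results location.
response = athena.start_query_execution(
    QueryString="SELECT count(*) FROM processed_events",
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": RESULTS_LOCATION},
)
print("Athena query started:", response["QueryExecutionId"])
```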
Key Features
- Real-time data processing capabilities
- Scalable distributed processing
- Automated metadata management
- Flexible data visualization options
- Support for multiple data formats
- Serverless query capabilities
- Enterprise-grade data warehousing
- Configure AWS credentials and permissions: add your access and secret keys (`AWS_ACCESS_KEY`, `AWS_SECRET_KEY`) to the `.env` file.
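The `.env` file would then look something like the following (the values are placeholders for your own credentials; never commit real keys):

```
AWS_ACCESS_KEY=<your-access-key-id>
AWS_SECRET_KEY=<your-secret-access-key>
```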
- Run `make up` to start the Spark cluster, then `make run` to execute the Spark Streaming application.
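Assuming the project's Makefile defines these targets as described above, the two steps are:

```sh
make up    # start the Spark cluster (master and workers)
make run   # submit the Spark Streaming application to the cluster
```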
To contribute to this project, please follow the standard GitHub flow:
- Fork the repository
- Create a feature branch
- Submit a pull request