
Data Pipeline Architecture

Badges: pre-commit · makefile · docker

Overview

This project implements a scalable data pipeline architecture that combines Apache Spark's processing capabilities with AWS services for data storage, cataloging, and analysis. The pipeline efficiently processes multiple data formats and supports a range of visualization options for data analysis.

System Architecture

(Architecture diagram: img.png)

Architecture Components

Data Ingestion

Supports multiple input file formats:

  • CSV
  • TXT
  • Parquet
  • JSON
  • Other structured and semi-structured formats

Uses Spark Streaming for real-time data processing
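As a rough illustration, the sketch below shows how file-based streaming ingestion might look with Spark Structured Streaming. The input path, schema, and column names are placeholders, not values taken from this repository.

    # Minimal ingestion sketch; paths, schema, and columns are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("streaming-ingestion").getOrCreate()

    # Streaming file sources require an explicit schema.
    schema = StructType([
        StructField("id", StringType()),
        StructField("value", DoubleType()),
    ])

    # Watch a directory for new CSV files; swap "csv" for "json" or "parquet"
    # to ingest the other supported formats.
    stream_df = (
        spark.readStream
        .schema(schema)
        .option("header", "true")
        .format("csv")
        .load("/data/incoming/")
    )

    # Placeholder console sink, useful for local inspection.
    query = stream_df.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()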

Processing Layer

  • Apache Spark Infrastructure
    • Driver: Manages application execution and coordinates processing
    • Master Node: Orchestrates cluster resources and task distribution
    • Worker Nodes: Execute distributed data processing tasks
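To make the driver/master/worker split concrete, here is a hedged sketch of a Spark session pointed at a standalone cluster; the master URL and executor resource settings are assumptions, not values from this project's configuration.

    # Illustrative only: master URL and executor resources are assumed values.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("pipeline-processing")
        .master("spark://spark-master:7077")    # master node orchestrating the cluster
        .config("spark.executor.memory", "2g")  # resources granted to each worker executor
        .config("spark.executor.cores", "2")
        .getOrCreate()
    )

    # The driver builds the plan, the master schedules tasks, and the workers
    # execute the distributed aggregation below in parallel.
    df = spark.range(1_000_000).selectExpr("id", "id % 10 AS bucket")
    df.groupBy("bucket").count().show()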

AWS Integration

  • Storage & Processing

    • Amazon S3: Raw data storage and data lake implementation
    • AWS Glue: Managed ETL service for data transformation
    • AWS Data Catalog: Central metadata repository
    • Glue Crawler: Automated metadata discovery and schema inference
    • Amazon Redshift: Enterprise data warehouse for complex analytics
    • Amazon Athena: Serverless query service for S3 data
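A minimal sketch of the S3 hand-off, assuming the AWS_ACCESS_KEY/AWS_SECRET_KEY variables from the Getting Started section and a placeholder bucket name: Spark writes partitioned Parquet to S3, where a Glue Crawler can infer the schema and register it in the Data Catalog for Athena or Redshift to query.

    # Sketch only: bucket name, input path, and partition column are placeholders.
    import os
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3-sink")
        # Credentials come from the .env variables described in Getting Started.
        .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY"])
        .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_KEY"])
        .getOrCreate()
    )

    df = spark.read.parquet("/data/processed/")  # placeholder input location

    # Partitioned Parquet is a common layout for Glue Crawler schema inference
    # and efficient Athena queries over S3.
    (
        df.write
        .mode("append")
        .partitionBy("ingest_date")                           # assumed partition column
        .parquet("s3a://example-data-lake/curated/events/")   # placeholder bucket/prefix
    )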

Key Features

  • Real-time data processing capabilities
  • Scalable distributed processing
  • Automated metadata management
  • Flexible data visualization options
  • Support for multiple data formats
  • Serverless query capabilities
  • Enterprise-grade data warehousing

Getting Started

  • Configure AWS credentials and permissions by adding your access and secret keys to the .env file:
    AWS_ACCESS_KEY=
    AWS_SECRET_KEY=
  • Run the following commands to start the Spark cluster and launch the Spark Streaming application:
    make up
    make run
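Once data has landed in S3 and a Glue Crawler run has cataloged it, one way to verify the pipeline end to end is to issue an Athena query. The sketch below uses boto3; the region, database, table, and result location are assumptions, not values from this project.

    # Optional end-to-end check; region, database, table, and output path are assumed.
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")
    response = athena.start_query_execution(
        QueryString="SELECT COUNT(*) FROM events",        # placeholder table
        QueryExecutionContext={"Database": "data_lake"},   # placeholder database
        ResultConfiguration={
            "OutputLocation": "s3://example-data-lake/athena-results/"
        },
    )
    print("Started Athena query:", response["QueryExecutionId"])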

Contributing

To contribute to this project, please follow the standard GitHub flow:

  • Fork the repository
  • Create a feature branch
  • Submit a pull request
