Data-Streaming-Flink-Spark

This repository demonstrates a real-time data streaming and persistence workflow using Apache Flink, Apache Spark, and Grafana for monitoring. The pipeline streams data, persists it in Parquet format, and performs aggregations for analytical insights.


Prerequisites

Ensure the following software is installed on your system:

  • Apache Flink
  • Apache Spark
  • Python 3.x
  • Grafana

For installation guidance, refer to tutorials available on YouTube or other online resources.


Setup and Execution

Follow these steps to set up and execute the data streaming pipeline:

1. Start the Flink Cluster

  1. Navigate to the bin folder of your Flink installation.
  2. Run the command:

    ./start-cluster.sh

  3. Confirm the cluster is running with the jps command; the Flink JobManager and TaskManager processes should appear in its output.
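
Besides jps, another quick way to confirm the cluster is up is Flink's REST API, which the web UI serves on port 8081 by default. A minimal check in Python (the requests library is assumed to be installed):

    import requests

    # Flink's web UI / REST API listens on http://localhost:8081 by default.
    resp = requests.get("http://localhost:8081/overview", timeout=5)
    resp.raise_for_status()
    info = resp.json()
    print(f"Flink {info['flink-version']}: {info['taskmanagers']} TaskManager(s), "
          f"{info['slots-available']} slot(s) available")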

2. Stream Data with Flink

  1. Execute the following command to start streaming:

    python3 flink_streaming.py
  2. The streaming process will begin, and you can monitor its progress in the terminal.
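
The exact contents of flink_streaming.py depend on the dataset, but a minimal sketch of a PyFlink Table API job of this shape might look as follows (the source path, schema, and the use of a CSV filesystem source are all assumptions):

    from pyflink.table import EnvironmentSettings, TableEnvironment

    # Create a streaming TableEnvironment.
    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # Hypothetical source: a CSV file of sales events (path and schema are assumptions).
    t_env.execute_sql("""
        CREATE TABLE sales_source (
            order_id STRING,
            amount DOUBLE,
            order_time TIMESTAMP(3)
        ) WITH (
            'connector' = 'filesystem',
            'path' = 'file:///path/to/dataset.csv',
            'format' = 'csv'
        )
    """)

    # 'print' sink: writes every record to the terminal so progress is visible.
    t_env.execute_sql("""
        CREATE TABLE console_sink (
            order_id STRING,
            amount DOUBLE,
            order_time TIMESTAMP(3)
        ) WITH ('connector' = 'print')
    """)

    # Stream records from the source to the sink; wait() blocks while the job runs.
    t_env.execute_sql("INSERT INTO console_sink SELECT * FROM sales_source").wait()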

3. Persist Data with Spark

  1. Run the following command to process and persist data:

    python3 spark_persist.py


  2. This step will create seven Parquet files in a folder named spark_persisted_output located in your home directory.

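spark_persist.py is not reproduced here, but a minimal PySpark sketch of this step could look like the following (the input path and schema are assumptions; repartitioning to 7 is what would yield the seven Parquet part files mentioned above):

    from pathlib import Path
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark_persist").getOrCreate()

    # Hypothetical input: the data produced by the Flink streaming step.
    df = spark.read.csv("file:///path/to/streamed_data", header=True, inferSchema=True)

    # Repartitioning to 7 produces seven Parquet part files in the output folder.
    output_dir = str(Path.home() / "spark_persisted_output")
    df.repartition(7).write.mode("overwrite").parquet(output_dir)

    spark.stop()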

4. Perform Aggregations

  1. Execute the following command to perform the aggregations:

    python3 streaming_aggregates.py

  2. The output (for example, daily total revenue) will be stored in a folder named aggregated_output in your home directory.
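
Judging by the screenshots, the aggregations include metrics such as daily total revenue. A hedged sketch of how streaming_aggregates.py might compute this with PySpark over the persisted Parquet files (the order_time and amount column names are assumptions):

    from pathlib import Path
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streaming_aggregates").getOrCreate()

    # Read the Parquet files written by the persistence step.
    df = spark.read.parquet(str(Path.home() / "spark_persisted_output"))

    # Group by calendar day and sum the revenue column (column names are assumptions).
    daily_revenue = (
        df.groupBy(F.to_date("order_time").alias("day"))
          .agg(F.sum("amount").alias("total_revenue"))
          .orderBy("day")
    )

    # Write the result where step 4 expects it.
    daily_revenue.write.mode("overwrite").parquet(str(Path.home() / "aggregated_output"))

    spark.stop()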

5. Set Up Grafana Dashboard

  1. Download and install Grafana.
  2. Visit Grafana at http://localhost:3000 (default port).
  3. Create a dashboard and import the provided dashboard.json file to visualize the streaming and aggregated data.
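
As an alternative to importing through the web UI, the dashboard can be pushed via Grafana's HTTP API. A sketch using requests (the service-account API token is an assumption; dashboard.json is the file shipped in this repository):

    import json
    import requests

    GRAFANA_URL = "http://localhost:3000"
    API_TOKEN = "YOUR_API_TOKEN"  # assumption: a Grafana service-account token

    # Load the dashboard definition shipped with the repository.
    with open("dashboard.json") as f:
        dashboard = json.load(f)

    # POST to the dashboard API; overwrite=True replaces a dashboard with the same uid.
    resp = requests.post(
        f"{GRAFANA_URL}/api/dashboards/db",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"dashboard": dashboard, "overwrite": True},
    )
    resp.raise_for_status()
    print("Imported dashboard at:", resp.json().get("url"))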

Note

  • Update the file paths in each Python script (flink_streaming.py, spark_persist.py, streaming_aggregates.py) to match your system setup.
  • Ensure all required dependencies (for example, the apache-flink and pyspark Python packages) are installed and configured correctly.
  • You can download the original dataset from: Download

Repository Structure

Data-Streaming-Flink-Spark/
├── flink_streaming.py        # Flink streaming script
├── spark_persist.py          # Spark persistence script
├── streaming_aggregates.py   # Data aggregation script
├── dashboard.json            # Grafana dashboard configuration file
├── spark_persisted_output/   # Output folder for Spark persisted data
└── aggregated_output/        # Output folder for aggregated data

Contact

For queries or contributions, please contact: Tashfeen Abbasi
Email: abbasitashfeen7@gmail.com
