This repository demonstrates a real-time data streaming and persistence workflow using Apache Flink, Apache Spark, and Grafana for monitoring. The pipeline streams data, persists it in Parquet format, and performs aggregations for analytical insights.
Ensure the following software is installed on your system:
- Apache Flink
- Apache Spark
- Python 3.x
- Grafana
For installation guidance, refer to tutorials available on YouTube or other online resources.
Follow these steps to set up and execute the data streaming pipeline (rough sketches of each script appear after the notes below):

- Execute the following command to start streaming:

  ```bash
  python3 flink_streaming.py
  ```

  The streaming process will begin, and you can monitor its progress in the terminal.
- Run the following command to process and persist data:

  ```bash
  python3 spark_persist.py
  ```

  This step creates seven Parquet files in a folder named `spark_persisted_output` in your home directory.
- Execute the following command to perform data aggregation:

  ```bash
  python3 streaming_aggregates.py
  ```

  The output is stored in a folder named `aggregated_output` in your home directory.
- Download and install Grafana, then open it at `http://localhost:3000` (the default port).
- Create a dashboard and import the provided `dashboard.json` file to visualize the streaming and aggregated data.
Notes:

- Update the paths in each Python script (`flink_streaming.py`, `spark_persist.py`, `streaming_aggregates.py`) according to your system setup (see the path sketch below).
- Ensure all required dependencies are installed and configured correctly.
- You can download the original dataset from: Download
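A minimal sketch of what `flink_streaming.py` might look like, assuming a PyFlink DataStream job; the inline source and the `(sensor_id, value)` record shape are illustrative assumptions, not the repository's actual schema:

```python
# Sketch only: the real flink_streaming.py may use a different source and schema.
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment


def main():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)

    # Hypothetical inline source; a real job would read from a file, socket, or Kafka.
    records = env.from_collection(
        [("sensor-1", 21.5), ("sensor-2", 19.8), ("sensor-1", 22.1)],
        type_info=Types.TUPLE([Types.STRING(), Types.FLOAT()]),
    )

    # Print to stdout so progress is visible in the terminal.
    records.print()

    env.execute("flink_streaming")


if __name__ == "__main__":
    main()
```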
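A minimal sketch of what `spark_persist.py` might do. The `streamed_input` source folder and CSV format are assumptions; the seven Parquet files would follow from repartitioning to seven partitions before writing:

```python
# Sketch only: input location, format, and schema are assumptions.
from pathlib import Path

from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.appName("spark_persist").getOrCreate()

    # Hypothetical input: records produced by the streaming step.
    df = spark.read.csv(
        str(Path.home() / "streamed_input"), header=True, inferSchema=True
    )

    # Seven partitions yield the seven Parquet part files mentioned in the steps.
    out_dir = Path.home() / "spark_persisted_output"
    df.repartition(7).write.mode("overwrite").parquet(str(out_dir))

    spark.stop()


if __name__ == "__main__":
    main()
```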
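A minimal sketch of what `streaming_aggregates.py` might compute over the persisted data; the `sensor_id` and `value` column names are illustrative assumptions:

```python
# Sketch only: column names and the chosen aggregates are assumptions.
from pathlib import Path

from pyspark.sql import SparkSession, functions as F


def main():
    spark = SparkSession.builder.appName("streaming_aggregates").getOrCreate()

    df = spark.read.parquet(str(Path.home() / "spark_persisted_output"))

    # Per-key averages and counts as an example aggregation.
    aggregated = df.groupBy("sensor_id").agg(
        F.avg("value").alias("avg_value"),
        F.count("*").alias("record_count"),
    )

    aggregated.write.mode("overwrite").parquet(str(Path.home() / "aggregated_output"))
    spark.stop()


if __name__ == "__main__":
    main()
```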
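And for the path note above, one way the scripts could keep their locations adjustable in a single place (an assumption; they may hard-code absolute paths instead):

```python
# Sketch only: centralize the folders each script reads from or writes to.
from pathlib import Path

HOME = Path.home()
PERSISTED_DIR = HOME / "spark_persisted_output"  # written by spark_persist.py
AGGREGATED_DIR = HOME / "aggregated_output"      # written by streaming_aggregates.py
```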
Project structure:

```
Data-Streaming-Flink-Spark/
├── flink_streaming.py        # Flink streaming script
├── spark_persist.py          # Spark persistence script
├── streaming_aggregates.py   # Data aggregation script
├── dashboard.json            # Grafana dashboard configuration file
├── spark_persisted_output/   # Output folder for Spark persisted data
└── aggregated_output/        # Output folder for aggregated data
```
For queries or contributions, please contact:
Tashfeen Abbasi
Email: abbasitashfeen7@gmail.com