This repository demonstrates a real-time data streaming and persistence workflow using Apache Flink, Apache Spark, and Grafana for monitoring. The pipeline streams data, persists it in Parquet format, and performs aggregations for analytical insights.
Ensure the following software is installed on your system:
- Apache Flink
- Apache Spark
- Python 3.x
- Grafana
For installation guidance, refer to tutorials available on YouTube or other online resources.
Follow these steps to set up and execute the data streaming pipeline (rough sketches of each script appear after the notes below):

- Execute the following command to start streaming:

  ```bash
  python3 flink_streaming.py
  ```

  The streaming process will begin, and you can monitor its progress in the terminal.
- Run the following command to process and persist data:

  ```bash
  python3 spark_persist.py
  ```

  This step creates seven Parquet files in a folder named `spark_persisted_output` in your home directory.
- Execute the following command to perform data aggregation:

  ```bash
  python3 streaming_aggregates.py
  ```

  The output is stored in a folder named `aggregated_output` in your home directory.
- Download and install Grafana, then open it at `http://localhost:3000` (the default port).
- Create a dashboard and import the provided `dashboard.json` file to visualize the streaming and aggregated data.
Notes:

- Update the paths in each Python script (`flink_streaming.py`, `spark_persist.py`, `streaming_aggregates.py`) according to your system setup (see the path sketch below).
- Ensure all required dependencies are installed and configured correctly.
- You can download the original dataset from: Download
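A minimal sketch of what `flink_streaming.py` might look like, assuming a PyFlink DataStream job; the inline source and the `(sensor_id, value)` record shape are illustrative assumptions, not the repository's actual schema:

```python
# Sketch only: the real flink_streaming.py may use a different source and schema.
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment


def main():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)

    # Hypothetical inline source; a real job would read from a file, socket, or Kafka.
    records = env.from_collection(
        [("sensor-1", 21.5), ("sensor-2", 19.8), ("sensor-1", 22.1)],
        type_info=Types.TUPLE([Types.STRING(), Types.FLOAT()]),
    )

    # Print to stdout so progress is visible in the terminal.
    records.print()

    env.execute("flink_streaming")


if __name__ == "__main__":
    main()
```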
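A minimal sketch of what `spark_persist.py` might do. The `streamed_input` source folder and CSV format are assumptions; the seven Parquet files would follow from repartitioning to seven partitions before writing:

```python
# Sketch only: input location, format, and schema are assumptions.
from pathlib import Path

from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.appName("spark_persist").getOrCreate()

    # Hypothetical input: records produced by the streaming step.
    df = spark.read.csv(
        str(Path.home() / "streamed_input"), header=True, inferSchema=True
    )

    # Seven partitions yield the seven Parquet part files mentioned in the steps.
    out_dir = Path.home() / "spark_persisted_output"
    df.repartition(7).write.mode("overwrite").parquet(str(out_dir))

    spark.stop()


if __name__ == "__main__":
    main()
```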
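A minimal sketch of what `streaming_aggregates.py` might compute over the persisted data; the `sensor_id` and `value` column names are illustrative assumptions:

```python
# Sketch only: column names and the chosen aggregates are assumptions.
from pathlib import Path

from pyspark.sql import SparkSession, functions as F


def main():
    spark = SparkSession.builder.appName("streaming_aggregates").getOrCreate()

    df = spark.read.parquet(str(Path.home() / "spark_persisted_output"))

    # Per-key averages and counts as an example aggregation.
    aggregated = df.groupBy("sensor_id").agg(
        F.avg("value").alias("avg_value"),
        F.count("*").alias("record_count"),
    )

    aggregated.write.mode("overwrite").parquet(str(Path.home() / "aggregated_output"))
    spark.stop()


if __name__ == "__main__":
    main()
```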
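And for the path note above, one way the scripts could keep their locations adjustable in a single place (an assumption; they may hard-code absolute paths instead):

```python
# Sketch only: centralize the folders each script reads from or writes to.
from pathlib import Path

HOME = Path.home()
PERSISTED_DIR = HOME / "spark_persisted_output"  # written by spark_persist.py
AGGREGATED_DIR = HOME / "aggregated_output"      # written by streaming_aggregates.py
```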
Project structure:

```
Data-Streaming-Flink-Spark/
├── flink_streaming.py        # Flink streaming script
├── spark_persist.py          # Spark persistence script
├── streaming_aggregates.py   # Data aggregation script
├── dashboard.json            # Grafana dashboard configuration file
├── spark_persisted_output/   # Output folder for Spark persisted data
└── aggregated_output/        # Output folder for aggregated data
```
For queries or contributions, please contact:
Tashfeen Abbasi
Email: abbasitashfeen7@gmail.com