This repository houses an implementation of finding similar (frequently co-purchased) items using the A-Priori and PCY algorithms on Apache Kafka.
It was developed as part of an assignment for the course Fundamentals of Big Data Analytics (DS2004), using a 12 GB JSON file sampled from the 100+ GB Amazon_Reviews dataset.
- Apache Kafka for robust real-time data streaming.
- (Optional) Use of Azure VMs and Blobs, providing a scalable solution for large datasets.
├── preprocessing.py           # Preprocesses the data locally
├── sampling.py                # Randomly samples the original 100+GB dataset down to 15 GB
├── preprocessing_for_azure.py # Preprocesses and loads data to Azure Blob Storage
├── blob_to_kafka_producer.py  # Streams data from Azure Blob to Kafka
├── consumer1.py               # Kafka consumer implementing the Apriori algorithm
├── consumer2.py               # Kafka consumer implementing the PCY algorithm
├── consumer3.py               # Kafka consumer for anomaly detection
├── producer_for_1_2.py        # Kafka producer for the Apriori and PCY consumers
└── producer_for_3.py          # Kafka producer for the anomaly-detection consumer
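As a rough illustration of what the PCY consumer (`consumer2.py`) does on each batch of baskets, here is a standalone sketch of the algorithm's two passes. The bucket count, function names, and thresholds are illustrative, not the repository's actual code:

```python
from collections import Counter
from itertools import combinations

NUM_BUCKETS = 1009  # hypothetical bucket count; a prime reduces hash collisions


def pcy_first_pass(baskets, min_support):
    """Pass 1: count single items and hash every pair into a small bucket array."""
    item_counts = Counter()
    bucket_counts = [0] * NUM_BUCKETS
    for basket in baskets:
        item_counts.update(set(basket))
        for pair in combinations(sorted(set(basket)), 2):
            bucket_counts[hash(pair) % NUM_BUCKETS] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    frequent_buckets = {b for b, c in enumerate(bucket_counts) if c >= min_support}
    return frequent_items, frequent_buckets


def pcy_second_pass(baskets, min_support, frequent_items, frequent_buckets):
    """Pass 2: count only pairs of frequent items that hash to a frequent bucket."""
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if hash(pair) % NUM_BUCKETS in frequent_buckets:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}
```

PCY's advantage over plain Apriori is the bucket filter: pairs whose bucket never reached the support threshold can be skipped in the second pass, shrinking the candidate set.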
The first step is to download and preprocess the Amazon Metadata dataset.
├── preprocessing_for_azure.py if using Azure,
└── preprocessing.py if not.
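The core of either preprocessing script is parsing newline-delimited JSON and keeping only the fields the later stages need, streaming line by line so a multi-gigabyte file never sits in memory. A minimal sketch (the kept field names are assumptions; adjust them to the actual schema):

```python
import json

KEEP_FIELDS = ("asin", "title", "also_buy")  # assumed fields, not the repo's exact list


def preprocess_line(line):
    """Parse one newline-delimited JSON record, keeping only KEEP_FIELDS.

    Returns None for blank or malformed lines so the caller can skip them.
    """
    line = line.strip()
    if not line:
        return None
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return {k: record[k] for k in KEEP_FIELDS if k in record}


def preprocess_file(in_path, out_path):
    """Stream the input file record by record and write slimmed-down JSON lines."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            slim = preprocess_line(line)
            if slim:
                fout.write(json.dumps(slim) + "\n")
```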
Next, set up Kafka (and, optionally, Azure Blob Storage):
Then deploy the consumer scripts:
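For orientation, a producer such as `producer_for_1_2.py` might look like the sketch below, assuming the `kafka-python` client. The broker address, topic name, and input file path are placeholders, not values from the repository:

```python
import json


def serialize(record):
    """Encode a preprocessed record as the UTF-8 JSON bytes Kafka expects."""
    return json.dumps(record).encode("utf-8")


def main():
    # Imported lazily so the helper above is usable without a broker installed.
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # placeholder broker address
        value_serializer=serialize,
    )
    with open("preprocessed.json") as f:  # hypothetical output of preprocessing.py
        for line in f:
            producer.send("reviews", json.loads(line))  # "reviews" topic is a placeholder
    producer.flush()


if __name__ == "__main__":
    main()
```

Each consumer then subscribes to the same topic and applies its algorithm to the incoming stream.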
Sliding Window Approach and Approximation Techniques
This project leverages Apache Kafka and a sliding window approach for real-time data processing due to several key advantages:
Kafka's distributed architecture allows for horizontal scaling by adding more nodes to the cluster. This ensures the system can handle ever-increasing data volumes in e-commerce scenarios without performance degradation.
Traditional batch processing wouldn't be suitable for real-time analytics. The sliding window approach, implemented within Kafka consumers, enables processing data chunks (windows) as they arrive in the stream. This provides near real-time insights without waiting for the entire dataset.
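The windowing idea can be sketched independently of Kafka: keep only the most recent N baskets, and as each new basket arrives, retire the counts of the oldest one. The class name, window size, and use of pair counts here are illustrative, not the consumers' exact implementation:

```python
from collections import Counter, deque
from itertools import combinations


class SlidingWindowCounter:
    """Maintain pair counts over the most recent `window_size` baskets.

    As each basket arrives from the stream, the oldest one slides out,
    so counts always reflect recent activity rather than the full history.
    """

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.pair_counts = Counter()

    def add(self, basket):
        # Retire the basket about to be evicted before appending the new one.
        if len(self.window) == self.window.maxlen:
            oldest = self.window[0]
            for pair in combinations(sorted(set(oldest)), 2):
                self.pair_counts[pair] -= 1
        self.window.append(basket)
        for pair in combinations(sorted(set(basket)), 2):
            self.pair_counts[pair] += 1

    def frequent_pairs(self, min_support):
        return {p: c for p, c in self.pair_counts.items() if c >= min_support}
```

Inside a Kafka consumer, `add()` would be called once per message, with `frequent_pairs()` queried periodically to emit up-to-date results.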
Kafka's high throughput and low latency are crucial for e-commerce applications. With minimal delays in data processing, businesses can gain quicker insights into customer behavior and product trends, allowing for faster decision-making.
While Azure Blob Storage provides excellent cloud storage for the preprocessed data, and Azure VMs allow for easier clustering, it's Kafka that facilitates the real-time processing aspects crucial for this assignment's goals. The combination of Kafka's streaming capabilities and the sliding window approach within consumers unlocks the power of real-time analytics for e-commerce data.