This project implements a two-stage machine learning approach combining unsupervised clustering and supervised classification. The analysis was conducted using provided templates and follows specific criteria for both stages as part of the Dicoding Data Science Bootcamp (January 2025).
The analysis uses the "Customer Shopping Dataset - Retail Sales Data" from Kaggle, which contains shopping information from 10 different shopping malls in Istanbul between 2021 and 2023.
- invoice_no: Unique identifier for each transaction (Format: 'I' + 6-digit integer)
- customer_id: Unique identifier for each customer (Format: 'C' + 6-digit integer)
- gender: Customer's gender
- age: Customer's age
- category: Product category
- quantity: Number of items purchased
- price: Unit price in Turkish Liras (TL)
- payment_method: Payment type (cash, credit card, or debit card)
- invoice_date: Transaction date
- shopping_mall: Location of the transaction
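The column formats listed above can be checked before any modeling. The following is a minimal sketch using only the Python standard library; the `validate_row` helper, the sample record, and the lowercase spelling of the payment types are assumptions for illustration and are not taken from the notebooks.

```python
import re

# Patterns taken from the column descriptions:
# 'I' + 6 digits for invoice_no, 'C' + 6 digits for customer_id.
INVOICE_RE = re.compile(r"I\d{6}")
CUSTOMER_RE = re.compile(r"C\d{6}")

# Payment types named in the column descriptions; the exact spelling
# used in the CSV file is an assumption.
PAYMENT_METHODS = {"cash", "credit card", "debit card"}


def validate_row(row):
    """Return the names of the columns that look malformed in one record."""
    problems = []
    if not INVOICE_RE.fullmatch(str(row.get("invoice_no", ""))):
        problems.append("invoice_no")
    if not CUSTOMER_RE.fullmatch(str(row.get("customer_id", ""))):
        problems.append("customer_id")
    if str(row.get("payment_method", "")).strip().lower() not in PAYMENT_METHODS:
        problems.append("payment_method")
    if not (isinstance(row.get("quantity"), int) and row["quantity"] > 0):
        problems.append("quantity")
    if not (isinstance(row.get("price"), (int, float)) and row["price"] >= 0):
        problems.append("price")
    return problems


# Hypothetical record, invented for illustration only.
sample = {
    "invoice_no": "I123456",
    "customer_id": "C654321",
    "quantity": 2,
    "price": 150.0,
    "payment_method": "Credit Card",
}
print(validate_row(sample))
```

An empty list means the record matches every format described above; otherwise the list names the columns to inspect.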
- clustering.ipynb: Notebook for clustering analysis in Bahasa Indonesia
- classification.ipynb: Notebook for classification analysis in Bahasa Indonesia
- clustering_en.ipynb: English version of the clustering analysis notebook
- classification_en.ipynb: English version of the classification analysis notebook
- Dataset Requirements
  - Minimum two columns: one categorical column and one numerical column
  - This combination enables meaningful cluster formation
- Performance Metrics
  - Achieved Silhouette Score: ≥ 0.55
  - This score indicates well-formed clusters with good separation
- Cluster Interpretation
  - Detailed analysis of cluster characteristics
  - Data distribution within clusters
  - Insights derived from clustering results
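To make the 0.55 threshold concrete, here is a small sketch of how a mean silhouette coefficient can be computed. It is a plain-Python illustration under stated assumptions, not the code used in the notebooks: the `mean_silhouette` helper and the toy points (one encoded categorical value plus one scaled numerical value per row) are hypothetical. In practice, `sklearn.metrics.silhouette_score` from scikit-learn computes the same quantity.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: (encoded categorical value, scaled numerical value).
points = [(0.0, 0.10), (0.0, 0.15), (0.0, 0.12),
          (1.0, 0.80), (1.0, 0.85), (1.0, 0.90)]
labels = [0, 0, 0, 1, 1, 1]

score = mean_silhouette(points, labels)
passes = score >= 0.55  # threshold from the criteria above
print(f"silhouette = {score:.3f}, passes = {passes}")
```

Scores close to 1 mean each point sits much nearer to its own cluster than to the nearest neighboring cluster, which is what "good separation" refers to.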
- Dataset
  - Uses labeled data from clustering results
  - Labels from clustering serve as classification targets
- Model Performance
  - Minimum accuracy: 87% (both training and testing sets)
  - Minimum F1-Score: 87% (both training and testing sets)
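The two thresholds above can be checked together for each split. The sketch below is a plain-Python illustration, not the notebooks' code: the helper names and the label lists are hypothetical, and macro averaging for the F1-Score is an assumption, since the criteria do not state which averaging is used. With scikit-learn, `accuracy_score` and `f1_score` from `sklearn.metrics` are the usual equivalents.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, assumed)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def meets_criteria(y_true, y_pred, threshold=0.87):
    """True only if both accuracy and F1 reach the threshold."""
    return (accuracy(y_true, y_pred) >= threshold
            and macro_f1(y_true, y_pred) >= threshold)


# Hypothetical cluster labels and predictions for one split.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 2, 1]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
print(f"meets criteria = {meets_criteria(y_true, y_pred)}")
```

In this toy split a single error in the smallest class pulls the macro F1 below the threshold while accuracy stays above it, which is why the criteria track both metrics. The same check would be run once on the training set and once on the testing set.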
- Python 3.x
- Refer to the requirements.txt file for all required libraries.
✅ Completed
- Successfully implemented clustering analysis
- Achieved required Silhouette Score
- Completed classification with required accuracy metrics
- Provided detailed interpretations of the project results
- Provided recommendations for future submission criteria
This project is developed as part of my submission for the Dicoding Indonesia Data Science Bootcamp Batch 4 (2024).
If you find this project helpful or have ideas for improvements, I’d love to hear from you! Thank you :)