This repository contains the implementation of a predictive model for the GSTN Hackathon analytics challenge. The goal was to construct a model Fθ(X) → Ypred
that accurately estimates the target variable Yi
for unseen inputs Xi
. This project focused on advanced data preprocessing, handling imbalanced datasets, and developing an optimal neural network solution.
Dataset Provided:
- Training Data (Dtrain): Matrix of dimension
R(m × n)
- Test Data (test): Matrix of dimension
R(m1 × n)
- Target Variables:
Ytrain
: DimensionR(m × 1)
Ytest
: DimensionR(m1 × 1)
Objective: Develop a model that accurately predicts Ypred
for unseen test data.
-
Dropping Unnecessary Columns:
- Removed ID (irrelevant for predictions).
- Dropped Column9 due to excessive missing values.
-
Selective Row Removal:
- Dropped rows where the target variable was
0
and contained null values in key columns.
- Dropped rows where the target variable was
-
Handling Missing Values:
- Applied K-Nearest Neighbors (KNN) Imputer to fill missing values based on similar data points.
-
Balancing the Dataset:
- Used random sampling to create a balanced training set, addressing class imbalance for better model performance.
- Architecture:
- Hyperparameters tuned using Grid Search.
- Explored configurations of hidden layers, activation functions, and solvers.
- Results on Test Data:
- Accuracy: 97.58%
- Precision: 0.83
- Recall: 0.93
- F1-Score: 0.88
- AUC-ROC: 0.99
-
Ensured balanced class distribution in each fold.
-
Mean Results on Training Data:
- Accuracy: 98.20%
- Precision: 0.92
- Recall: 0.95
- F1-Score: 0.93
-
Test Data Results:
- Accuracy: 97.58%
- Precision: 0.83
- Recall: 0.93
- F1-Score: 0.88
- AUC-ROC: 0.99
- Fine-tuned hyperparameters including hidden layer sizes, activation functions, solvers, and learning rates.
- Improved Results on Test Data:
- Accuracy: 97.60%
- Precision: 0.84
- Recall: 0.93
- F1-Score: 0.89
- AUC-ROC: 0.99
-
Accuracy (97.58%)
- The model correctly classified 97.58% of test instances.
-
Precision (0.84)
- 84% of instances predicted as positive were true positives, reducing false positives.
-
Recall (0.93)
- 93% of actual positive cases were correctly detected, minimizing false negatives.
-
F1-Score (0.89)
- Balances precision and recall, reflecting overall performance.
-
AUC-ROC (0.99)
- Indicates excellent model capability in distinguishing between positive and negative classes.
- The Neural Network model, enhanced with Stratified K-Fold Cross-Validation and Grid Search, outperformed other approaches (e.g., linear and logistic regression).
- Achieved high performance metrics, making it robust for deployment in GST-related predictive analytics.
- Clone this repository:
git clone https://github.com/raged-pineapple/GST-Analytics-Hackathon.git