
Data Science Lab Project: Performance of Predictive Models (Supervised Learning Analysis) - Interpretability and Explainability Across Multiple Domains: Health, Environment, and Business (Classification Problems)

Objective:

Evaluate the performance of various supervised learning algorithms in predicting binary target variables and address overfitting across three diverse datasets from Kaggle.

Domains & Datasets:

  • Health: Cardiovascular Dataset

  • Environment: Weather in Australia

  • Business: Hotel Reservations

Key Questions:

  • Impact of sophisticated modeling methods

  • Model transferability across datasets

  • Effects of standardization techniques

  • Handling imbalanced datasets

  • Identifying optimal hyperparameters

  • Mitigating overfitting issues

Methodology:

  • Preprocessing: Data cleaning, handling missing values, and feature selection.

  • Exploratory Data Analysis (EDA): Identified key patterns and insights.

  • Modeling: Evaluated various supervised learning algorithms.

  • Performance Metrics: Accuracy, precision, recall, F1-score, and ROC AUC (see the sketch after this list).
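
For reference, a minimal scikit-learn sketch of this evaluation step is shown below. The file name, target column, label values, and split settings are illustrative placeholders, not the project's actual configuration.

```python
# Minimal evaluation sketch (illustrative only; "hotel_reservations.csv",
# "booking_status", and the label value "Canceled" are assumptions).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

df = pd.read_csv("hotel_reservations.csv")                  # any of the three datasets
X = pd.get_dummies(df.drop(columns=["booking_status"]))     # one-hot encode categoricals
y = (df["booking_status"] == "Canceled").astype(int)        # assumed binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("F1       :", f1_score(y_test, pred))
print("ROC AUC  :", roc_auc_score(y_test, proba))
```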

Techniques & Tools:

  • Python (Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn)

  • R (tidyverse, caret)

  • Machine Learning Models: Random Forest, Gradient Boosting, Logistic Regression, Decision Tree, KNN

  • Additional Techniques: Feature Engineering, Hyperparameter Tuning, Cross-Validation (illustrated in the sketch below)
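
A minimal sketch of a cross-validated hyperparameter search is shown below; the parameter grid and scoring metric are assumptions rather than the grids actually used, and X_train/y_train are as in the earlier sketch.

```python
# Illustrative grid search with stratified cross-validation; grid values
# and scoring are assumptions, not the project's actual search space.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [6, 10, 15, None],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=cv,
    n_jobs=-1,
)
search.fit(X_train, y_train)          # X_train/y_train as in the sketch above
print(search.best_params_, search.best_score_)
```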

Results:

Business Field - Hotel Reservations

  • Analyzed booking status data to predict cancellations.

  • Key Techniques: Random Forest Classifier, Learning curves, Overfitting analysis.

  • Achievements: High performance with a balanced model (max depth 10-15), ensuring robust generalization; a learning-curve sketch follows below.
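
A rough illustration of the learning-curve check is given below, assuming a depth of 12 from the reported 10-15 range; X_train/y_train follow the earlier sketch.

```python
# Illustrative learning-curve check for the cancellation model; max_depth=12
# is one value from the reported 10-15 range, not a confirmed setting.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(max_depth=12, random_state=42),
    X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="f1",
    n_jobs=-1,
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"n={n}: train={tr:.3f}  cv={va:.3f}")  # converging curves => good generalization
```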

Environmental Domain - Weather in Australia

  • Processed weather data to predict rainfall.

  • Key Techniques: Missing value handling, Feature selection, Random Forest Classifier.

  • Achievements: Optimal performance with 100 estimators and a max depth of 6, with diminishing returns beyond this setup; a pipeline sketch follows below.
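
A minimal pipeline sketch for this setup is shown below, assuming the layout of the public weatherAUS.csv file; the column names and imputation strategies are assumptions, not confirmed project details.

```python
# Sketch of a rainfall pipeline; "RainTomorrow" and the imputation choices
# are assumptions based on the public Weather in Australia dataset.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("weatherAUS.csv").dropna(subset=["RainTomorrow"])
X = df.drop(columns=["RainTomorrow"])
y = (df["RainTomorrow"] == "Yes").astype(int)

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

prep = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

# Reported sweet spot: 100 trees with a max depth of 6.
clf = Pipeline([("prep", prep),
                ("rf", RandomForestClassifier(n_estimators=100, max_depth=6,
                                              random_state=42))])
clf.fit(X, y)
```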

Health Sector - Cardiovascular Dataset

  • Cleaned and analyzed health data to predict cardiovascular diseases.

  • Key Techniques: Outlier removal, Class imbalance handling, Gradient Boosting.

  • Achievements: Balanced the class distribution via undersampling; best performance came from Gradient Boosting with resampling, with an optimal max depth of around 12. A sketch of this setup follows below.
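
A rough sketch of the undersampling plus Gradient Boosting setup is given below; the file layout, column name, and random undersampling shown here are assumptions about one simple way to implement the described approach.

```python
# Class-rebalancing sketch: random undersampling of the majority class via
# pandas sampling (one simple approach; the project's exact method may differ),
# followed by Gradient Boosting with the reported max depth.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("cardio_train.csv", sep=";")           # file layout is an assumption
majority = df[df["cardio"] == 0]
minority = df[df["cardio"] == 1]
balanced = pd.concat([
    majority.sample(len(minority), random_state=42),    # undersample majority class
    minority,
]).sample(frac=1, random_state=42)                      # shuffle rows

X = balanced.drop(columns=["cardio"])
y = balanced["cardio"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

gb = GradientBoostingClassifier(max_depth=12, random_state=42)  # depth reported above
gb.fit(X_train, y_train)
print("test accuracy:", gb.score(X_test, y_test))
```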

Impact:

  • Algorithm Performance: Provided insights into the behavior of different algorithms across various domains.

  • Generalization: Ensured models generalized well to unseen data, mitigating overfitting.

  • Trade-offs: Highlighted the trade-offs involved in model complexity and performance.

Key Achievements:

  • Developed robust frameworks for handling missing values and imbalanced datasets.

  • Demonstrated model stability and effective generalization through cross-validation and learning curve analysis.

  • Presented findings through performance metrics tables and visualizations, guiding future model selection.

  • Compiled comprehensive reports and delivered presentations to stakeholders, effectively communicating complex technical concepts and findings.

Skills Demonstrated:

  • Data Collection & Preprocessing: Mastered data cleaning, feature selection, and handling missing values.

  • Advanced Analytics: Applied and evaluated various machine learning models, ensuring optimal performance and generalization.

  • Technical Proficiency: Utilized Python and R for in-depth data analysis and model development.

  • Communication: Effectively communicated complex technical concepts and findings to non-technical audiences through reports and presentations.

Additional Highlights:

  • High-Impact Insights: Identified and validated the best imputation technique and model combination for each dataset (a rough comparison sketch follows below).

  • Innovative Approaches: Employed advanced imputation and machine learning techniques to solve real-world data challenges.
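
As a rough illustration of how imputation techniques can be compared per dataset, the sketch below scores each strategy with cross-validation; the strategies, model, and metric are assumptions, and X, y are a numeric feature matrix and binary target as built in the first sketch.

```python
# Compare imputation strategies by cross-validated score (illustrative only;
# strategy list, model, and scoring metric are assumptions).
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

imputers = {
    "mean":   SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn":    KNNImputer(n_neighbors=5),
}
for name, imputer in imputers.items():
    pipe = make_pipeline(imputer, RandomForestClassifier(random_state=42))
    score = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:>6}: mean ROC AUC = {score:.3f}")   # highest score picks the technique
```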

About

This project evaluates supervised learning algorithms using three diverse datasets from Kaggle. It focuses on model performance, hyperparameter tuning, data standardization, and managing overfitting, aiming to find the best algorithms and improve generalization.
