This case study takes a data-driven approach to predicting medical costs from demographic and lifestyle factors plus derived features such as BMI category, age group, smoker status, family size, risk group, and cost ranges. Using the Medical Cost Dataset from Kaggle, the project covers data cleaning, feature engineering, exploratory analysis, and two machine learning models, Linear Regression and Random Forest, that forecast medical expenses. The analysis identifies smoking, BMI, and risk group as key cost drivers, which points to the role of preventive healthcare in reducing expenses. The work shows how predictive modeling can yield actionable insights into healthcare costs.
- Ask: Problem Statement
- Prepare: Dataset & Understanding the Data
- Process: Data Cleaning & Preprocessing
- Analyze: Exploratory Data Analysis (EDA)
- Model: Predicting Medical Charges with Regression
- Share: Interactive Dashboard & Visualization
- Act: Insights and Future Improvements
- Project Resources
Healthcare costs are a significant concern for individuals and insurance companies. This project aims to predict medical costs based on demographic, lifestyle, and derived factors such as age, BMI, smoking status, and new categorical features. By leveraging machine learning techniques, the goal is to develop a model that predicts medical expenses for individuals.
- Target Variable: `charges` (Medical cost in USD)
- Objective: Build a regression model to predict medical charges based on individual features and engineered categories.
The dataset used for this project is the Medical Cost Dataset on Kaggle, which includes demographic and medical information about individuals.
- Age: Age of the individual
- Sex: Gender of the individual
- BMI: Body Mass Index
- Children: Number of children/dependents
- Smoker_binary: Smoker status encoded as `0` (Non-smoker) and `1` (Smoker).
- Region: Geographical region (e.g., Southeast, Southwest, etc.)
- BMI_category: BMI categorized into `Underweight`, `Normal`, `Overweight`, and `Obese`.
- Age_category: Ages grouped into categories such as `Young`, `Middle-aged`, and `Senior`.
- Smoker: Whether the individual is a smoker (`Yes`/`No`)
- Family_size: Derived from the `children` column into categories like `No children` and `Large family`.
- Risk_group: Individuals categorized into `Low-risk`, `Medium-risk`, and `High-risk` based on smoking and BMI.
- Cost_ranges: Medical charges grouped into ranges such as `Low`, `Medium`, and `High`.
- charges: The medical expenses in USD that an individual incurs.
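The derived columns above can be sketched as small bucketing functions. This is a minimal illustration, not the project's actual rules: the BMI cutoffs follow the common WHO adult bands, while the age bands, the `Small family` label, the family-size thresholds, and the risk rule are all assumptions made for the example.

```python
# Hypothetical helpers: every threshold below is an assumption for illustration.

def bmi_category(bmi: float) -> str:
    """Bucket BMI using the common WHO adult cutoffs (18.5, 25, 30)."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def age_category(age: int) -> str:
    """Assumed age bands: under 35, 35 to 54, 55 and over."""
    if age < 35:
        return "Young"
    if age < 55:
        return "Middle-aged"
    return "Senior"


def family_size(children: int) -> str:
    """Assumed split: 0 children, 1 to 2 children, 3 or more children."""
    if children == 0:
        return "No children"
    if children <= 2:
        return "Small family"  # hypothetical intermediate label
    return "Large family"


def smoker_binary(smoker: str) -> int:
    """Encode a yes/no smoker flag as 1/0 (case-insensitive)."""
    return 1 if smoker.strip().lower() == "yes" else 0


def risk_group(smoker: str, bmi: float) -> str:
    """Assumed rule: smoker and obese is high risk, either one is medium risk."""
    flags = smoker_binary(smoker) + (bmi_category(bmi) == "Obese")
    return ["Low-risk", "Medium-risk", "High-risk"][flags]


# Illustrative record using the dataset's column names.
row = {"age": 19, "bmi": 27.9, "children": 0, "smoker": "yes"}
derived = {
    "BMI_category": bmi_category(row["bmi"]),
    "Age_category": age_category(row["age"]),
    "Family_size": family_size(row["children"]),
    "Smoker_binary": smoker_binary(row["smoker"]),
    "Risk_group": risk_group(row["smoker"], row["bmi"]),
}
print(derived)
```

Keeping each rule in its own function makes the thresholds easy to audit and change if the real cutoffs differ.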
To prepare the data for analysis, several preprocessing steps were performed:
- Data Cleaning: Removed irrelevant or duplicate columns and handled missing values.
- Feature Engineering:
  - Derived new features like `BMI_category`, `Age_category`, and `Risk_group`.
  - Converted categorical features (`sex`, `smoker`, `region`) and derived features into numerical values.
  - Checked correlations between features and medical charges.
- Feature Scaling: Standardized numerical features such as `age` and `BMI` to improve model performance.
- Data Splitting: Divided the dataset into training (80%) and testing (20%) sets for model evaluation.
Exploratory analysis was performed to uncover insights and relationships between original and derived features.
- Understand data distribution and relationships.
- Identify significant predictors of medical costs.
- Visualize the data using histograms, scatterplots, bar charts, and correlation heatmaps.
Key Findings from EDA:
- Smokers incur significantly higher medical costs than non-smokers.
- Individuals categorized as Obese have higher average costs than those in lower BMI categories.
- High-risk groups, as defined by the `Risk_group` feature, have the highest average medical charges.
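Group-level comparisons like these are typically computed with a pandas `groupby`. The sketch below uses a made-up six-row sample, so the numbers only illustrate the mechanics and are not evidence for the findings; the lowercase column names are assumptions.

```python
import pandas as pd

# Made-up sample; charges are illustrative, not taken from the dataset.
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "bmi_category": ["Obese", "Normal", "Overweight", "Overweight", "Obese", "Normal"],
    "charges": [40000.0, 4000.0, 6000.0, 30000.0, 8000.0, 5000.0],
})

# Average charges per smoker status.
mean_by_smoker = df.groupby("smoker")["charges"].mean()

# Average charges per BMI category, highest first.
mean_by_bmi = (
    df.groupby("bmi_category")["charges"].mean().sort_values(ascending=False)
)

print(mean_by_smoker)
print(mean_by_bmi)
```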
Two models were trained to predict medical costs based on the identified features:
- Linear Regression:
  - RMSE: 5521.50
  - MAE: 3742.80
  - R-squared: 0.761
- Random Forest:
  - RMSE: 5514.34
  - MAE: 3944.00
  - R-squared: 0.754
- Both models performed comparably: Linear Regression had the higher R-squared (0.761 vs. 0.754) and the lower MAE, while Random Forest had a marginally lower RMSE.
- Smoker_binary, BMI_category, and Risk_group were identified as the most significant predictors of higher medical costs.
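For reference, a comparison of this kind can be set up with scikit-learn as sketched below. The data here is synthetic and generated only so the example runs on its own, so its metrics will not match the figures reported above; the coefficients, `n_estimators=100`, and the random seeds are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features (assumed relationship, not the dataset).
rng = np.random.default_rng(0)
n = 400
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n)
charges = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 3000, size=n)
X = np.column_stack([age, bmi, smoker])

X_train, X_test, y_train, y_test = train_test_split(
    X, charges, test_size=0.2, random_state=42
)


def evaluate(model):
    """Fit on the training split and report RMSE, MAE, and R-squared on the test split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "MAE": float(mean_absolute_error(y_test, pred)),
        "R2": float(r2_score(y_test, pred)),
    }


results = {
    "Linear Regression": evaluate(LinearRegression()),
    "Random Forest": evaluate(RandomForestRegressor(n_estimators=100, random_state=42)),
}

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Evaluating both models through the same function on the same split keeps the metrics directly comparable.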
An interactive Tableau Dashboard is under development to showcase:
- Predictions and model evaluation metrics.
- Key feature importance, including derived categories like `Risk_group` and `BMI_category`.
- Interactive exploration of medical cost trends based on user-selected features.
Interactive Tableau Dashboard
Key Insights:
- Smoking is a major contributor to higher medical costs, indicating a strong need for targeted smoking cessation programs.
- High-risk individuals consistently face higher medical charges, underscoring the importance of preventive care and personalized interventions.
- Categorical features like `BMI category` and `Risk group` offer valuable insights into medical cost patterns, making the model more interpretable and actionable.
Future Directions:
Feature Expansion:
- Include additional features such as family history, lifestyle factors, and socioeconomic status to enhance model accuracy.
- Explore advanced feature engineering (e.g., interaction terms, transformations) to uncover more complex data relationships.
Model Improvement:
- Experiment with advanced algorithms like XGBoost and Neural Networks for better prediction performance.
- Fine-tune models with hyperparameter optimization to improve accuracy.
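As one possible starting point for hyperparameter optimization, the sketch below runs a small grid search over a Random Forest with scikit-learn's `GridSearchCV`. The synthetic data, the parameter grid, and the 3-fold cross-validation are assumptions chosen to keep the example fast, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data so the example is self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(0, 0.5, size=200)

# Small, assumed grid; a real search would cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, so RMSE is negated
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_params, round(best_rmse, 3))
```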
Dashboard Enhancement:
- Improve interactive visualizations for better user experience and more actionable insights.
- Dataset: Medical Cost Dataset on Kaggle
- Cleaned Data: Google Sheets Link
- Tableau Dashboard: Interactive Tableau Dashboard