A credit risk assessment model that uses logistic regression on Weight of Evidence (WoE) transformed features to predict loan defaults. The model achieves an AUROC of 0.85 and a Gini coefficient of 0.71.
- Advanced credit risk modeling using logistic regression
- Extensive feature engineering with Weight of Evidence (WoE) transformation
- Handling of imbalanced data using oversampling techniques
- Comprehensive model evaluation with multiple metrics
- Built-in data preprocessing pipeline
- AUROC: 0.85
- Gini Coefficient: 0.71
- KS Statistic: 0.56
- Precision-Recall AUC: 0.98
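
The ranking metrics above are related: the Gini coefficient is 2 × AUROC − 1, and the KS statistic is the largest gap between the true positive rate and the false positive rate along the ROC curve. The following is a minimal sketch of how they could be computed with scikit-learn. The function name `ranking_metrics` and the synthetic scores are illustrative assumptions, not code or data from this repository.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve


def ranking_metrics(y_true, y_score):
    """Return AUROC, Gini, and KS for binary labels and predicted scores."""
    auroc = roc_auc_score(y_true, y_score)
    gini = 2 * auroc - 1                      # Gini derived from AUROC
    fpr, tpr, _ = roc_curve(y_true, y_score)
    ks = float(np.max(tpr - fpr))             # largest TPR/FPR separation
    return {"auroc": auroc, "gini": gini, "ks": ks}


# Synthetic, illustrative data only (not the loan dataset)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=1000), 0, 1)

metrics = ranking_metrics(y_true, y_score)
print(metrics)
```

Applying the same identity to the rounded AUROC of 0.85 gives 0.70, so the reported Gini of 0.71 is consistent with an unrounded AUROC of roughly 0.855.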
- Feature engineering using Weight of Evidence (WoE)
- Information Value (IV) calculations for feature selection
- Handling of categorical variables through dummy encoding
- Treatment of imbalanced data using Random Oversampling
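
As a rough illustration of the WoE and IV steps listed above, the sketch below computes both for a single categorical feature with pandas. It assumes a binary target where 1 marks a good loan and 0 a bad one; the column names `grade` and `good`, the helper `woe_iv`, and the toy counts are hypothetical and may differ from what the notebooks do.

```python
import numpy as np
import pandas as pd


def woe_iv(df, feature, target):
    """Return a per-category WoE table and the feature's total IV.

    Assumes target is 1 for good loans and 0 for bad loans.
    Categories with zero goods or zero bads would give infinite WoE,
    so real data usually needs binning or smoothing first.
    """
    table = df.groupby(feature)[target].agg(n_obs="count", n_good="sum")
    table["n_bad"] = table["n_obs"] - table["n_good"]
    table["dist_good"] = table["n_good"] / table["n_good"].sum()
    table["dist_bad"] = table["n_bad"] / table["n_bad"].sum()
    table["woe"] = np.log(table["dist_good"] / table["dist_bad"])
    table["iv_part"] = (table["dist_good"] - table["dist_bad"]) * table["woe"]
    return table, float(table["iv_part"].sum())


# Toy data: three grades with decreasing share of good loans
df = pd.DataFrame({
    "grade": ["A"] * 100 + ["B"] * 100 + ["C"] * 100,
    "good": [1] * 90 + [0] * 10 + [1] * 70 + [0] * 30 + [1] * 40 + [0] * 60,
})

woe_table, iv = woe_iv(df, "grade", "good")
print(woe_table[["n_good", "n_bad", "woe"]])
print("IV:", round(iv, 4))
```

Under this sign convention, a higher WoE means a category holds a larger share of good loans than of bad ones, and features with a higher IV separate the two classes more strongly.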
- Python 3.x
- Key Libraries:
  - scikit-learn for model building
  - pandas for data manipulation
  - numpy for numerical operations
  - matplotlib and seaborn for visualization
  - imbalanced-learn (imported as imblearn) for handling imbalanced data
  - yellowbrick for model visualization
├── data/
│   └── loan_data_2007_2014.csv
├── notebooks/
│   ├── 01_data_preprocessing.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_building.ipynb
├── src/
│   ├── preprocessing.py
│   ├── feature_engineering.py
│   └── model.py
├── models/
│   └── credit_risk_model.sav
└── README.md
- Clone the repository:
git clone https://github.com/username/credit-risk-modeling.git
- Install required packages:
pip install -r requirements.txt
- Run the notebooks in sequence (01, 02, then 03), starting with:
jupyter notebook notebooks/01_data_preprocessing.ipynb
The model uses a variety of features including:
- Loan characteristics (term, interest rate, grade)
- Borrower demographics (home ownership, employment length)
- Credit history (credit inquiries, revolving credit utilization)
- Payment behavior (total payments, recovery amounts)
- Geographic information (state-level data)
The model has been evaluated using multiple metrics:
- ROC curve analysis
- Precision-Recall curves
- KS statistics
- Confusion matrix
- Classification reports
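
The threshold-based and precision-recall evaluations listed above could be produced along these lines with scikit-learn. This is a sketch on made-up labels and scores; the 0.5 threshold is an assumption, since the cutoff used by the project is not stated here.

```python
import numpy as np
from sklearn.metrics import (
    auc,
    classification_report,
    confusion_matrix,
    precision_recall_curve,
)

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.45, 0.2, 0.7, 0.65])

threshold = 0.5  # assumed cutoff, not taken from the project
y_pred = (y_score >= threshold).astype(int)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall, and F1 as a text table
print(classification_report(y_true, y_pred))

# Area under the precision-recall curve (trapezoidal rule)
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)
print("PR AUC:", round(pr_auc, 3))
```

Because the confusion matrix and classification report depend on the chosen threshold while ROC, KS, and precision-recall curves do not, it helps to report the threshold alongside the first two.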
- Missing value treatment
- Feature engineering using WoE
- Information Value calculation
- Categorical variable encoding
- Feature selection based on IV
- Data balancing using oversampling
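
Two of the steps above, categorical encoding and balancing by oversampling, can be sketched with pandas and numpy alone. The `random_oversample` helper is a hypothetical stand-in written for this example; the project itself lists imblearn for this step, whose `RandomOverSampler` serves the same purpose. The column names and values are invented.

```python
import numpy as np
import pandas as pd


def random_oversample(X, y, random_state=0):
    """Duplicate rows of smaller classes at random until all classes match the largest."""
    rng = np.random.default_rng(random_state)
    counts = y.value_counts()
    target_n = counts.max()
    parts = []
    for label, n in counts.items():
        idx = y.index[y == label].to_numpy()
        # Draw extra rows with replacement (none for the largest class)
        extra = rng.choice(idx, size=target_n - n, replace=True)
        parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(parts)
    return X.loc[keep].reset_index(drop=True), y.loc[keep].reset_index(drop=True)


# Invented rows with hypothetical column names
raw = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT", "OWN", "RENT", "MORTGAGE", "RENT"],
    "int_rate": [13.5, 7.9, 10.2, 18.1, 6.5, 15.0, 9.9, 21.3],
    "good": [1, 1, 1, 0, 1, 1, 1, 0],
})

# Dummy-encode the categorical column, then balance the classes
X = pd.get_dummies(raw.drop(columns="good"), columns=["home_ownership"])
y = raw["good"]
X_bal, y_bal = random_oversample(X, y)
print(y_bal.value_counts())
```

Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.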
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and feedback, please open an issue in the repository.
- Thanks to all contributors who have helped with the development
- Special thanks to the scikit-learn and imbalanced-learn communities