
Chronic-Kidney-Disease-Classification

As part of my academic Machine Learning module, I worked in a team of five colleagues; we had to read two scientific articles and reproduce them using the various machine learning methods we had learned. Both articles work on the Chronic Kidney Disease dataset available on UCI. We made three Jupyter notebooks: one for each article, plus a final notebook containing our improvements.

For the first article:

  • We performed an exploratory analysis of the dataset
  • Checked for missing values and outliers
  • Imputed missing values using two statistical methods: the mean for quantitative variables and the mode for qualitative (categorical) variables (a sketch of this step follows this list).
  • Then we performed feature selection using Recursive Feature Elimination with Cross-Validation (RFECV), a wrapper method that repeatedly fits a model, ranks the features by importance, and eliminates the least significant ones. We used a Random Forest as the estimator.
  • For the modelling we used Support Vector Machine, Random Forest Classifier, Decision Tree Classifier and K-Nearest Neighbors. All models were trained on 75% of the data, using a stratified split because of the class imbalance (see the pipeline sketch after this list).
  • Finally, we evaluated the models with and without feature selection by comparing their accuracy, precision, recall, and F1 score.
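
Below is a minimal sketch of the imputation step, assuming the UCI CKD data has been loaded into a pandas DataFrame `df` (the column handling is illustrative, not the exact code from our notebooks):

```python
import pandas as pd

def impute_mean_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with their mean and categorical columns with their mode."""
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```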

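The feature-selection, training, and evaluation steps can be condensed into a sketch like the one below. It assumes `X` (imputed, numerically encoded features) and `y` (ckd / notckd labels) already exist; the hyperparameters are illustrative rather than the tuned values from our notebooks.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Stratified 75/25 split to preserve the ckd / notckd ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=42
)

# RFECV: recursively drop the least important features, scored by cross-validation
selector = RFECV(RandomForestClassifier(random_state=42), cv=5, scoring="accuracy")
selector.fit(X_train, y_train)
X_train_sel, X_test_sel = selector.transform(X_train), selector.transform(X_test)

models = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train_sel, y_train)
    pred = model.predict(X_test_sel)
    print(name,
          accuracy_score(y_test, pred),
          precision_score(y_test, pred, pos_label="ckd"),
          recall_score(y_test, pred, pos_label="ckd"),
          f1_score(y_test, pred, pos_label="ckd"))
```
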
For Article 2:

  • Data Preparation was done in the same manner.
  • We used Correlation-based Feature Selection (CFS); a simplified sketch follows this list.
  • For modelling we used three classifiers: K-Nearest Neighbors, Support Vector Machine and Naïve Bayes. We fitted them on the original data, then on the data after feature selection, and finally combined AdaBoost with CFS. For K-Nearest Neighbors, AdaBoost is not applicable because KNN does not support sample weights, so we introduced Weighted KNN instead (see the boosting sketch below).
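
The article relies on Correlation-based Feature Selection. A much-simplified, illustrative stand-in (not the exact CFS merit formula) is to keep features that correlate strongly with the target while dropping features that are nearly redundant with one already kept; `X` is assumed numeric, `y` label-encoded as 0/1, and the thresholds are arbitrary examples:

```python
import pandas as pd

def simple_cfs(X: pd.DataFrame, y: pd.Series, target_thresh=0.3, redundancy_thresh=0.8):
    # Rank features by absolute correlation with the class label
    corr_with_target = X.apply(lambda col: col.corr(y)).abs().sort_values(ascending=False)
    selected = []
    for feat in corr_with_target.index:
        if corr_with_target[feat] < target_thresh:
            break
        # Skip features that are almost redundant with an already-selected one
        if all(abs(X[feat].corr(X[kept])) < redundancy_thresh for kept in selected):
            selected.append(feat)
    return selected
```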

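For the boosting step, a sketch of what AdaBoost over the CFS-selected features could look like is shown below, with a distance-weighted KNN standing in where boosting is not applicable. The variables `X_train_cfs` and `y_train` are assumed to hold the CFS-filtered training data; the parameters are illustrative:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# AdaBoost needs base estimators that accept sample weights
# (the keyword is `base_estimator` in scikit-learn < 1.2)
ada_nb = AdaBoostClassifier(estimator=GaussianNB(), n_estimators=50)
ada_svm = AdaBoostClassifier(estimator=SVC(kernel="linear"), algorithm="SAMME", n_estimators=50)

# KNN has no sample weights to boost, so we weight neighbours by distance instead
weighted_knn = KNeighborsClassifier(n_neighbors=5, weights="distance")

for clf in (ada_nb, ada_svm, weighted_knn):
    clf.fit(X_train_cfs, y_train)
```
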
Finally, for our improvements:

  • We performed a thorough analysis of the missing values and ended up imputing each missing value in relation to values from other features. For example, if a patient had a missing Random Blood Glucose value and had diabetes, we imputed it with the mean Random Blood Glucose of the observations that had diabetes. We hoped that this way we could get a more accurate dataset (see the group-wise imputation sketch after this list).
  • For Feature Selection, we used RFECV.
  • For the modelling we used all the previously mentioned classifiers with hyperparameter tuning (as was the case for the articles as well), and we introduced an XGBoost classifier.
  • To treat the class imbalance, we used SMOTE oversampling to generate more observations of the minority class (notckd) and refitted the models on the augmented data (see the SMOTE + XGBoost sketch below).
  • Finally, we deployed a proof-of-concept dashboard on Streamlit Cloud, which you can access through this link: https://lnkd.in/eBbEtpFY
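
A minimal sketch of the group-wise imputation, assuming the DataFrame uses the common UCI short column names `bgr` (blood glucose random) and `dm` (diabetes mellitus); the exact column names in our notebooks may differ:

```python
import pandas as pd

def impute_by_group(df: pd.DataFrame, target_col: str = "bgr", group_col: str = "dm") -> pd.DataFrame:
    """Fill missing values of `target_col` with the column's mean within each `group_col` group."""
    df = df.copy()
    df[target_col] = df.groupby(group_col)[target_col].transform(lambda s: s.fillna(s.mean()))
    return df
```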

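The SMOTE and XGBoost additions could look roughly like the sketch below. It reuses the hypothetical `X_train_sel` / `X_test_sel` split from the earlier sketch, assumes the labels are encoded as 0/1 (XGBoost requires numeric classes), and needs the `imbalanced-learn` and `xgboost` packages; the hyperparameters are illustrative, not our tuned values:

```python
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

# Oversample the minority class (notckd) on the training split only
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train_sel, y_train)

# Refit on the augmented data; the other classifiers would be refitted the same way
xgb = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, eval_metric="logloss")
xgb.fit(X_res, y_res)
print(classification_report(y_test, xgb.predict(X_test_sel)))
```
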
In conclusion, this was a fruitful academic activity from which I learned a great deal. I'm glad to say that my team received the highest mark for this work, and I hope to continue growing my machine learning knowledge.
