This repository contains the coursework project focused on analyzing and predicting water quality using various machine learning techniques. The aim is to evaluate the effectiveness of different classifiers in predicting water quality based on several input parameters.
- Data Collection: Identification and extraction of relevant water quality data from external sources.
- Literature Review: Examination of existing research and literature to understand the domain and the applicability of various machine learning methods.
- Method Justification: Justification for the selection of specific data analysis and machine learning techniques.
- Model Implementation and Evaluation: Application and comparison of the effectiveness of different machine learning models, including K-Nearest Neighbors (KNN), Decision Tree Classifier (DTC), and Random Forest Classifier (RFC).
- Documentation: Preparation of the explanatory note detailing the methodology, analysis, and findings.
- Submission and Defense: Submission of the project for review and defense of the coursework.
The software developed for this project includes the following functionality:
- Loading the dataset.
- Data preprocessing and exploration.
- Applying machine learning models.
- Predicting water quality.
- Analyzing factors affecting water quality.
- Visualizing and analyzing the results.
- K-Nearest Neighbors (KNN)
- High impact: Total Dissolved Solids, Chloride, Sulfate.
- Low impact: Lead, Manganese, Iron.
- Random Forest Classifier (RFC)
- High impact: pH, Turbidity, Manganese.
- Low impact: Lead, Total Dissolved Solids.
- Decision Tree Classifier (DTC)
- High impact: pH, Chloride, Turbidity, Manganese.
- Low impact: Lead, Zinc, Sulfate.
The analysis of water quality using KNN, RFC, and DTC models provided insights into the most influential parameters affecting water quality. Among the three methods, Random Forest Classifier demonstrated the highest accuracy and effectiveness in classifying water quality, making it the most suitable method for this analysis.