DataSanitized is a curated repository of high-quality, pre-processed datasets that have undergone rigorous cleaning to resolve inconsistencies, anomalies, duplicates, and missing values. It provides data practitioners with structured, analysis-ready datasets for machine learning, deep learning, and advanced data-driven research, and is designed to slot cleanly into data engineering pipelines and preprocessing workflows.
Raw datasets often exhibit various data quality issues, such as missing values, duplicate records, outliers, and inconsistent formatting. DataSanitized mitigates these challenges by offering cleaned and pre-processed datasets accompanied by well-documented Jupyter Notebooks that detail the preprocessing methodologies and Exploratory Data Analysis (EDA) procedures.
This repository is designed to cater to a broad audience, including:
- Data Scientists & Analysts – To leverage structured datasets for feature engineering and predictive modeling.
- Machine Learning Practitioners – To obtain pre-processed data for supervised and unsupervised learning models.
- AI Researchers – To access well-structured datasets for developing and benchmarking novel algorithms.
- Academics & Students – To support hands-on learning in data preprocessing and analytics.
The repository follows a standardized hierarchical structure to ensure uniformity and ease of use. Each dataset is encapsulated within a dedicated directory containing the following components:
```
/dataset_name/
├── raw_data.csv / raw_data.xlsx          # Original, unprocessed dataset
├── data_preprocessing.ipynb              # Jupyter Notebook with the preprocessing pipeline
├── data_cleaned.csv / data_cleaned.xlsx  # Cleaned dataset post-preprocessing
└── data_eda.ipynb                        # Jupyter Notebook with EDA insights and visualizations
```
- raw_data.csv / raw_data.xlsx – The original dataset as obtained from its source, before any modifications.
- data_preprocessing.ipynb – A Jupyter Notebook detailing the preprocessing pipeline, including:
- Handling missing data (imputation, deletion, forward/backward fill).
- Encoding categorical features (one-hot encoding, label encoding, ordinal encoding).
- Removing duplicate entries and detecting anomalies.
- Outlier detection and treatment (IQR method, Z-score, isolation forests).
- Data transformation (scaling, normalization, standardization, log transformations).
- Feature selection and engineering techniques.
- data_cleaned.csv / data_cleaned.xlsx – The dataset post-preprocessing, suitable for direct integration into machine learning workflows.
- data_eda.ipynb – A comprehensive EDA Jupyter Notebook that includes:
- Descriptive statistical analysis (mean, median, variance, skewness, kurtosis).
- Data visualization (histograms, density plots, box plots, scatter matrices).
- Correlation heatmaps and feature importance evaluation.
- Insights into dataset distribution, trends, and anomalies.
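The preprocessing steps listed above can be sketched with a few lines of pandas and scikit-learn. This is a minimal illustration on a small synthetic table, not code from the repository's notebooks; the column names (`age`, `city`, `income`) are made up for the example:

```python
# Minimal sketch of the preprocessing pipeline: deduplication, imputation,
# IQR-based outlier treatment, one-hot encoding, and standardization.
# The data and column names are synthetic, for illustration only.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic "raw" data containing missing values, a duplicate row, and an outlier.
raw = pd.DataFrame({
    "age": [25, 32, np.nan, 45, 45, 120],        # NaN + an extreme outlier
    "city": ["NY", "LA", "NY", None, None, "SF"],
    "income": [50_000, 64_000, 58_000, 72_000, 72_000, 71_000],
})

df = raw.drop_duplicates().reset_index(drop=True)   # remove duplicate entries
df["age"] = df["age"].fillna(df["age"].median())    # impute missing numeric values
df["city"] = df["city"].fillna("unknown")           # impute missing categories

# Outlier treatment via the IQR rule: clip to [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age"] = df["age"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# One-hot encode the categorical feature.
df = pd.get_dummies(df, columns=["city"])

# Standardize the numeric columns (zero mean, unit variance).
num_cols = ["age", "income"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

print(df.shape)
```

The same steps appear in each dataset's data_preprocessing.ipynb, adapted to that dataset's columns and quality issues.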
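Similarly, the core of the EDA notebooks — descriptive statistics, skewness, and a correlation matrix — can be sketched as follows. The data here is synthetic and the feature names are illustrative:

```python
# Minimal sketch of the EDA steps: descriptive statistics, skewness,
# and a Pearson correlation matrix on a synthetic dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "feature_a": x,
    "feature_b": 2 * x + rng.normal(scale=0.1, size=200),  # built to correlate with feature_a
})

stats = df.describe()   # count, mean, std, min, quartiles, max per column
skew = df.skew()        # distribution asymmetry
corr = df.corr()        # Pearson correlation matrix
# For a visual heatmap, the notebooks use e.g. seaborn.heatmap(corr, annot=True).

print(corr.loc["feature_a", "feature_b"].round(2))
```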
Datasets in this repository are procured from reputable open-source platforms, ensuring a diverse collection of high-quality data. The primary sources include but are not limited to:
- Kaggle – A repository of publicly available datasets spanning diverse domains.
- UCI Machine Learning Repository – A well-known dataset collection for benchmarking machine learning models.
- Google Dataset Search – A meta-search engine for publicly accessible datasets.
- Data.gov – An extensive repository of government datasets for research and analysis.
- GitHub Open Datasets – A curated collection of publicly available datasets for data science applications.
- OpenML – A collaborative platform for machine learning datasets and benchmarks.
- Statista – A premium database for statistical insights and trend analysis.
- World Bank Open Data – A collection of economic, demographic, and financial datasets.
- Other publicly accessible government, academic, and enterprise-level datasets.
All datasets undergo stringent validation, preprocessing, and quality assurance before their inclusion in this repository.
Contributions to DataSanitized are highly encouraged. If you wish to contribute a dataset or enhance existing datasets, follow these structured steps:
- Fork this repository and create a new dataset folder within the main directory.
- Upload the raw dataset (`.csv` or `.xlsx`) along with a `README.md` explaining its source and structure.
- Implement the preprocessing and EDA steps in the respective Jupyter Notebooks (`data_preprocessing.ipynb` and `data_eda.ipynb`).
- Submit a Pull Request (PR) with a detailed summary of the modifications and enhancements.
- Ensure that all submissions adhere to standardized preprocessing methodologies and maintain uniformity in documentation.
To start working with DataSanitized, clone the repository:

```bash
git clone https://github.com/yourusername/DataSanitized.git
```

Install the Python libraries required to run the preprocessing and EDA notebooks:

```bash
pip install pandas numpy matplotlib seaborn scikit-learn plotly missingno
```

Then navigate to any dataset folder and launch Jupyter to run the preprocessing and exploratory analysis notebooks:

```bash
jupyter notebook
```
All datasets, preprocessing scripts, and analysis notebooks within this repository are distributed under the MIT License. Users are encouraged to credit this repository if utilizing the datasets for research, academic purposes, or industrial applications.
If a dataset originates from an external source, attribution should be provided as per the dataset's original licensing terms.
For inquiries, feedback, or collaboration opportunities, reach out via:
- LinkedIn: Rayyan Ashraf
- GitHub Issues: Open a new issue in the repository for support or feature requests.
Rayyan Ashraf – Curator and Maintainer of the DataSanitized Repository.
📌 Empowering data-driven research with high-quality, pre-processed datasets!