Skip to content

DataSanitized – A meticulously curated repository of pre-processed datasets, stripped of inconsistencies, anomalies, and redundancies. This resource provides high-quality, structured data suitable for advanced analysis, model training, and research applications, ensuring seamless integration into data pipelines and analytical workflows.

License

Notifications You must be signed in to change notification settings

etsryn/DataSanitized

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 

Repository files navigation

DataSanitized

DataSanitized is a meticulously curated repository of high-fidelity, pre-processed datasets that have undergone rigorous cleansing to eliminate inconsistencies, anomalies, redundancies, and missing values. This repository serves as a valuable resource for data practitioners by providing structured, analysis-ready datasets optimized for machine learning, deep learning, and advanced data-driven research applications. The repository ensures seamless integration into data engineering pipelines, facilitating efficient data preprocessing workflows.

Purpose

Raw datasets often exhibit various data quality issues, such as missing values, duplicate records, outliers, and inconsistent formatting. DataSanitized mitigates these challenges by offering cleaned and pre-processed datasets accompanied by well-documented Jupyter Notebooks that detail the preprocessing methodologies and Exploratory Data Analysis (EDA) procedures.

This repository is designed to cater to a broad audience, including:

  • Data Scientists & Analysts – To leverage structured datasets for feature engineering and predictive modeling.
  • Machine Learning Practitioners – To obtain pre-processed data for supervised and unsupervised learning models.
  • AI Researchers – To access well-structured datasets for developing and benchmarking novel algorithms.
  • Academicians & Students – To facilitate hands-on learning in data preprocessing and analytics.

Repository Structure

The repository follows a standardized hierarchical structure to ensure uniformity and ease of use. Each dataset is encapsulated within a dedicated directory containing the following components:

/dataset_name/
│── raw_data.csv / raw_data.xlsx          # Original, unprocessed dataset
│── data_preprocessing.ipynb              # Jupyter Notebook with preprocessing pipeline
│── data_cleaned.csv / data_cleaned.xlsx  # Cleaned dataset post-preprocessing
│── data_eda.ipynb                        # Jupyter Notebook with EDA insights and visualizations

File Descriptions

  1. raw_data.csv / raw_data.xlsx – The original dataset obtained from a source before any modifications.
  2. data_preprocessing.ipynb – A Jupyter Notebook detailing the preprocessing pipeline, including:
    • Handling missing data (imputation, deletion, forward/backward fill).
    • Encoding categorical features (one-hot encoding, label encoding, ordinal encoding).
    • Removing duplicate entries and detecting anomalies.
    • Outlier detection and treatment (IQR method, Z-score, isolation forests).
    • Data transformation (scaling, normalization, standardization, log transformations).
    • Feature selection and engineering techniques.
  3. data_cleaned.csv / data_cleaned.xlsx – The dataset post-preprocessing, suitable for direct integration into machine learning workflows.
  4. data_eda.ipynb – A comprehensive EDA Jupyter Notebook that includes:
    • Descriptive statistical analysis (mean, median, variance, skewness, kurtosis).
    • Data visualization (histograms, density plots, box plots, scatter matrices).
    • Correlation heatmaps and feature importance evaluation.
    • Insights into dataset distribution, trends, and anomalies.

Dataset Sources

Datasets in this repository are procured from reputable open-source platforms, ensuring a diverse collection of high-quality data. The primary sources include but are not limited to:

  • Kaggle – A repository of publicly available datasets spanning diverse domains.
  • UCI Machine Learning Repository – A well-known dataset collection for benchmarking machine learning models.
  • Google Dataset Search – A meta-search engine for publicly accessible datasets.
  • Data.gov – An extensive repository of government datasets for research and analysis.
  • GitHub Open Datasets – A curated collection of publicly available datasets for data science applications.
  • OpenML – A collaborative platform for machine learning datasets and benchmarks.
  • Statista – A premium database for statistical insights and trend analysis.
  • World Bank Open Data – A collection of economic, demographic, and financial datasets.
  • Other publicly accessible government, academic, and enterprise-level datasets.

All datasets undergo stringent validation, preprocessing, and quality assurance before their inclusion in this repository.


Contribution Guidelines

Contributions to DataSanitized are highly encouraged. If you wish to contribute a dataset or enhance existing datasets, follow these structured steps:

  1. Fork this repository and create a new dataset folder within the main directory.
  2. Upload the raw dataset (.csv or .xlsx) along with a README.md explaining its source and structure.
  3. Implement preprocessing and EDA steps in respective Jupyter Notebooks (data_preprocessing.ipynb and data_eda.ipynb).
  4. Submit a Pull Request (PR) with a detailed summary of the modifications and enhancements.
  5. Ensure that all submissions adhere to standardized preprocessing methodologies and maintain uniformity in documentation.

How to Use

Clone the Repository

To start working with DataSanitized, clone the repository using:

 git clone https://github.com/yourusername/DataSanitized.git

Install Dependencies

Ensure that the required Python libraries are installed to execute preprocessing and EDA scripts:

 pip install pandas numpy matplotlib seaborn scikit-learn plotly missingno

Execute Jupyter Notebooks

Navigate to any dataset folder and run Jupyter Notebooks for preprocessing and exploratory analysis:

 jupyter notebook

Licensing and Attribution

All datasets, preprocessing scripts, and analysis notebooks within this repository are distributed under the MIT License. Users are encouraged to credit this repository if utilizing the datasets for research, academic purposes, or industrial applications.

If a dataset originates from an external source, attribution should be provided as per the dataset's original licensing terms.


Contact & Support

For inquiries, feedback, or collaboration opportunities, reach out via:

  • LinkedIn: Rayyan Ashraf
  • GitHub Issues: Open a new issue in the repository for support or feature requests.

Maintainer

Rayyan Ashraf – Curator and Maintainer of the DataSanitized Repository.


📌 Empowering data-driven research with high-quality, pre-processed datasets!

About

DataSanitized – A meticulously curated repository of pre-processed datasets, stripped of inconsistencies, anomalies, and redundancies. This resource provides high-quality, structured data suitable for advanced analysis, model training, and research applications, ensuring seamless integration into data pipelines and analytical workflows.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published