Skip to content

This repository hosts a comprehensive analytics pipeline for processing and analyzing Los Angeles Crime Data (2020–Present). It covers extraction, cleaning, ETL, basic machine learning, and visualization, with outputs integrated into PowerBI for interactive dashboards.

Notifications You must be signed in to change notification settings

N0AH-14/Chicago-Crime

Repository files navigation

Table of Contents

  1. Overview
  2. Dataset & License
  3. Project Structure
  4. Installation
  5. Usage
  6. Machine Learning Models
  7. Visualization
  8. Future Enhancements
  9. License
  10. Acknowledgements
  11. Contact

Overview

The objective of this project is to build a scalable and modular pipeline using Python for:

  • Data Ingestion: Efficient extraction of a large CSV dataset in manageable chunks.
  • Data Transformation: Cleaning, normalization, and type conversion of raw data.
  • Data Storage: Loading processed data into a MySQL or MongoDB database with proper indexing.
  • Machine Learning: Applying basic classification (e.g., predicting arrest outcomes), clustering (identifying crime hotspots), and regression (forecasting yearly crime counts) algorithms.
  • Visualization: Generating summary statistics and plots for integration with a PowerBI dashboard. This solution is designed to support data-driven decision-making for public safety and urban management initiatives.

Dataset & License

  • Dataset: Crime Data from 2020 to Present (https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8/about_data) This dataset provides detailed records of crimes in Los Angeles from 2020 to the present.
  • Dataset License: The dataset is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. This means the data is free for public use without restrictions.

    License Summary: The affirmer relinquishes all Copyright and Related Rights in the work, allowing the public to use, modify, distribute, and build upon the work—even for commercial purposes—without the need for permission. For the complete legal text, please refer to the CC0 1.0 Universal Legal Code (https://creativecommons.org/publicdomain/zero/1.0/legalcode).


Project Structure

los-angeles-crime-data-analytics/
├── config.py              # Project configuration (paths, database credentials)
├── database.py            # Database connection & table creation module
├── transform.py           # Data cleaning & transformation routines
├── etl.py                 # ETL pipeline: extraction, transformation, loading
├── model.py               # Machine Learning models: classification, clustering, regression
├── visualization.py       # Visualization routines for summary statistics and plots
├── main.py                # Main script to execute the complete pipeline
├── data/
│   ├── crime_data.csv     # Raw dataset (to be placed here)
│   ├── raw/               # Optional: for intermediate raw data files
│   └── processed/         # Processed outputs (CSV exports, plots)
└── logs/
    └── project.log        # Log file capturing errors and pipeline progress

Installation

  1. Clone the repository:
    git clone https://github.com/yourusername/los-angeles-crime-data-analytics.git
    cd los-angeles-crime-data-analytics
  2. Set up a virtual environment (recommended):
    python -m venv venv
    source venv/bin/activate   # On Windows: venv\Scripts\activate
  3. Install the required dependencies:
    pip install -r requirements.txt
    Dependencies include: pandas, sqlalchemy, pymysql, scikit-learn, matplotlib, etc.
  4. Configure the project:
    • Update config.py with your database credentials and paths.
    • Ensure your database (MySQL or MongoDB) is running and accessible.

Usage

  1. Dataset Placement:
    Place the downloaded CSV file from the Los Angeles dataset into the data/ directory and rename it (if necessary) to crime_data.csv.
  2. Run the Pipeline:
    Execute the main script to start the ETL process, run machine learning models, and generate visual outputs:
    python main.py
  3. Monitor Logs:
    Check the logs/project.log file for detailed logging information and error messages.
  4. Dashboard Integration:
    Processed output files (e.g., summary CSVs and plots) in data/processed/ can be imported into PowerBI for interactive visualization.

Machine Learning Models

  • Classification:
    Predicts the likelihood of an arrest using features such as year, location coordinates, and other attributes.
  • Clustering:
    Uses clustering algorithms (e.g., KMeans) to identify and segment crime hotspots.
  • Regression:
    Forecasts crime counts over time (e.g., per year) using regression models such as RandomForestRegressor. Each model is implemented in model.py with appropriate evaluation metrics (accuracy, MSE, silhouette score).

Visualization

The visualization.py module generates:

  • Summary Statistics:
    CSV files containing descriptive statistics.
  • Trend Plots:
    Graphs (e.g., line charts) visualizing crime trends over time.
  • Clustered Data Export:
    CSV exports of clustered records for geospatial analysis. These outputs are intended for use in PowerBI to create interactive dashboards.

Future Enhancements

  • Scalability:
    Adapt the ETL pipeline for distributed processing using frameworks like Apache Spark.
  • Real-Time Analytics:
    Integrate streaming data sources for near real-time analysis.
  • Advanced Analytics:
    Incorporate deep learning models for anomaly detection and more sophisticated predictions.
  • Cloud Deployment:
    Explore containerization (Docker) and orchestration (Kubernetes) for scalable, cloud-based deployment.
  • Enhanced Dashboarding:
    Integrate additional visualization tools for richer interactivity and geospatial mapping.

License

This project is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. By using this repository, you agree that the project’s code and documentation are provided "as-is" without any warranties. For a full description of your rights under this license, please see the CC0 1.0 Universal Legal Code.

Acknowledgements

  • Dataset Provider:
    Los Angeles Data Portal – Crime Data from 2020 to Present
  • Industry Standards:
    The project is inspired by best practices in big data analytics, ETL pipeline design, and machine learning, following guidelines from PMI and ISO 21500.

Contact

For questions or further information, please contact:


This README is maintained as a living document and will be updated as the project evolves.

About

This repository hosts a comprehensive analytics pipeline for processing and analyzing Los Angeles Crime Data (2020–Present). It covers extraction, cleaning, ETL, basic machine learning, and visualization, with outputs integrated into PowerBI for interactive dashboards.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages