- Overview
- Dataset & License
- Project Structure
- Installation
- Usage
- Machine Learning Models
- Visualization
- Future Enhancements
- License
- Acknowledgements
- Contact
## Overview

The objective of this project is to build a scalable, modular Python pipeline for:

- Data Ingestion: Efficient extraction of a large CSV dataset in manageable chunks.
- Data Transformation: Cleaning, normalization, and type conversion of raw data.
- Data Storage: Loading processed data into a MySQL or MongoDB database with proper indexing.
- Machine Learning: Applying basic classification (e.g., predicting arrest outcomes), clustering (identifying crime hotspots), and regression (forecasting yearly crime counts) algorithms.
- Visualization: Generating summary statistics and plots for integration with a Power BI dashboard.

This solution is designed to support data-driven decision-making for public safety and urban management initiatives.
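The chunked ingestion step described above can be sketched with pandas. This is a minimal illustration, not the project's actual `etl.py`; the function name, chunk size, and column normalization are assumptions.

```python
import pandas as pd

def ingest_in_chunks(csv_source, chunk_size: int = 50_000):
    """Yield cleaned chunks of the raw CSV without loading it all into memory.

    Hypothetical helper: reads the file in `chunk_size`-row pieces and
    normalizes column names (e.g. "DATE OCC" -> "date_occ") on the way in.
    """
    for chunk in pd.read_csv(csv_source, chunksize=chunk_size, low_memory=False):
        chunk.columns = [c.strip().lower().replace(" ", "_") for c in chunk.columns]
        yield chunk
```

Because each chunk is yielded independently, downstream transformation and database loading can process the full dataset with bounded memory.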
## Dataset & License

- Dataset: Crime Data from 2020 to Present (https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8/about_data). This dataset provides detailed records of crimes reported in Los Angeles from 2020 to the present.
- Dataset License: The dataset is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication, which means the data is free for public use without restriction.

License Summary: The affirmer relinquishes all copyright and related rights in the work, allowing the public to use, modify, distribute, and build upon it, even for commercial purposes, without needing permission. For the complete legal text, please refer to the CC0 1.0 Universal Legal Code (https://creativecommons.org/publicdomain/zero/1.0/legalcode).
## Project Structure

```text
los-angeles-crime-data-analytics/
├── config.py           # Project configuration (paths, database credentials)
├── database.py         # Database connection & table creation module
├── transform.py        # Data cleaning & transformation routines
├── etl.py              # ETL pipeline: extraction, transformation, loading
├── model.py            # Machine learning models: classification, clustering, regression
├── visualization.py    # Visualization routines for summary statistics and plots
├── main.py             # Main script to execute the complete pipeline
├── data/
│   ├── crime_data.csv  # Raw dataset (to be placed here)
│   ├── raw/            # Optional: intermediate raw data files
│   └── processed/      # Processed outputs (CSV exports, plots)
└── logs/
    └── project.log     # Log file capturing errors and pipeline progress
```
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/los-angeles-crime-data-analytics.git
  cd los-angeles-crime-data-analytics
  ```

- Set up a virtual environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install the required dependencies (pandas, sqlalchemy, pymysql, scikit-learn, matplotlib, etc.):

  ```bash
  pip install -r requirements.txt
  ```

- Configure the project:
  - Update `config.py` with your database credentials and paths.
  - Ensure your database (MySQL or MongoDB) is running and accessible.
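As a rough idea of what the configuration might look like, here is a minimal sketch of a `config.py`. Every name below (`SQLALCHEMY_URI`, `CHUNK_SIZE`, the environment variables) is illustrative, not the project's actual settings; credentials are read from the environment rather than hard-coded.

```python
# config.py -- illustrative sketch only; adapt names and values to your setup.
import os

# Paths (relative to the repository root)
DATA_DIR = "data"
RAW_CSV_PATH = os.path.join(DATA_DIR, "crime_data.csv")
PROCESSED_DIR = os.path.join(DATA_DIR, "processed")
LOG_PATH = os.path.join("logs", "project.log")

# Database credentials, taken from the environment with local defaults
DB_USER = os.getenv("DB_USER", "root")
DB_PASSWORD = os.getenv("DB_PASSWORD", "")
DB_HOST = os.getenv("DB_HOST", "localhost")
DB_NAME = os.getenv("DB_NAME", "la_crime")
SQLALCHEMY_URI = f"mysql+pymysql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/{DB_NAME}"

# ETL tuning
CHUNK_SIZE = 50_000
```

Keeping credentials in environment variables avoids committing secrets to the repository.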
## Usage

- Dataset Placement: Place the downloaded CSV file from the Los Angeles dataset into the `data/` directory and rename it (if necessary) to `crime_data.csv`.
- Run the Pipeline: Execute the main script to start the ETL process, run the machine learning models, and generate visual outputs: `python main.py`
- Monitor Logs: Check the `logs/project.log` file for detailed logging information and error messages.
- Dashboard Integration: Processed output files (e.g., summary CSVs and plots) in `data/processed/` can be imported into Power BI for interactive visualization.
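To show how the stages above fit together, here is a minimal orchestration sketch in the spirit of `main.py`. The stage functions are passed in as parameters because their real names in `etl.py`, `model.py`, and `visualization.py` are not shown in this README; everything here is an assumption about shape, not the actual script.

```python
# Illustrative orchestration sketch -- not the project's actual main.py.
import logging
import os

os.makedirs("logs", exist_ok=True)
logging.basicConfig(
    filename=os.path.join("logs", "project.log"),
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def run_pipeline(extract, transform, load, train_models, make_plots):
    """Run ETL, modeling, and visualization in order, logging each stage."""
    logging.info("Starting ETL")
    raw = extract()
    clean = transform(raw)
    load(clean)
    logging.info("Training models")
    train_models(clean)
    logging.info("Generating visual outputs")
    make_plots(clean)
    logging.info("Pipeline complete")
```

Logging at stage boundaries is what makes `logs/project.log` useful for the "Monitor Logs" step: a failure is bracketed by the last stage message that was written.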
## Machine Learning Models

- Classification: Predicts the likelihood of an arrest using features such as year, location coordinates, and other attributes.
- Clustering: Uses clustering algorithms (e.g., KMeans) to identify and segment crime hotspots.
- Regression: Forecasts crime counts over time (e.g., per year) using regression models such as RandomForestRegressor.

Each model is implemented in `model.py` with appropriate evaluation metrics (accuracy, MSE, silhouette score).
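As an example of the clustering step, the hotspot segmentation could look like the sketch below, using KMeans on coordinate pairs and reporting the silhouette score mentioned above. The function name and parameters are illustrative assumptions, not the actual contents of `model.py`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def find_hotspots(coords: np.ndarray, n_clusters: int = 5):
    """Cluster (lat, lon) points into hotspots and report cluster quality.

    Returns the per-point labels, the cluster centers, and the silhouette
    score (closer to 1.0 means tighter, better-separated clusters).
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    labels = km.fit_predict(coords)
    score = silhouette_score(coords, labels)
    return labels, km.cluster_centers_, score
```

In practice the dataset's records with missing or zeroed coordinates would need to be filtered out before clustering, since they would otherwise form a spurious hotspot at the origin.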
## Visualization

The `visualization.py` module generates:

- Summary Statistics: CSV files containing descriptive statistics.
- Trend Plots: Graphs (e.g., line charts) visualizing crime trends over time.
- Clustered Data Export: CSV exports of clustered records for geospatial analysis.

These outputs are intended for use in Power BI to create interactive dashboards.
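A trend plot of the kind described above could be produced as follows. This is a hedged sketch rather than the project's actual `visualization.py`: the `year` column name and output path are assumptions.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def plot_yearly_trend(df: pd.DataFrame, out_path: str = "yearly_trend.png"):
    """Aggregate crime counts per year and save a line chart as a PNG.

    Assumes the cleaned DataFrame has a `year` column; returns the
    per-year counts so they can also be exported as a summary CSV.
    """
    counts = df.groupby("year").size()
    ax = counts.plot(kind="line", marker="o", title="Crimes per Year")
    ax.set_xlabel("Year")
    ax.set_ylabel("Reported crimes")
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()
    return counts
```

Saving static PNGs alongside the summary CSVs keeps the Power BI import simple: the dashboard can embed the images or recompute the same aggregations from the CSVs.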
## Future Enhancements

- Scalability: Adapt the ETL pipeline for distributed processing using frameworks such as Apache Spark.
- Real-Time Analytics: Integrate streaming data sources for near real-time analysis.
- Advanced Analytics: Incorporate deep learning models for anomaly detection and more sophisticated predictions.
- Cloud Deployment: Explore containerization (Docker) and orchestration (Kubernetes) for scalable, cloud-based deployment.
- Enhanced Dashboarding: Integrate additional visualization tools for richer interactivity and geospatial mapping.
## License

This project is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. By using this repository, you agree that the project's code and documentation are provided "as-is" without any warranties. For a full description of your rights under this license, please see the CC0 1.0 Universal Legal Code.
## Acknowledgements

- Dataset Provider: Los Angeles Data Portal – Crime Data from 2020 to Present
- Industry Standards: The project is inspired by best practices in big data analytics, ETL pipeline design, and machine learning, following guidelines from PMI and ISO 21500.
## Contact

For questions or further information, please contact:

- Name: Krishna Jodha
- Email: work.noah14@gmail.com

This README is maintained as a living document and will be updated as the project evolves.