This project implements an ETL (Extract, Transform, Load) process that migrates data from an OLTP (Online Transaction Processing) system into a star schema in a data warehouse. The ETL jobs are written in Kotlin on Apache Spark and orchestrated with Apache Airflow.
ETL Processing of MySQL tables (a minimal extraction sketch follows the list):
- T_CATEGORY: product categories.
- T_CUSTOMER: customer data and their addresses.
- T_ORDER: information about orders.
- T_ORDER_REL: information about products in orders.
- T_PRODUCT: product information.
- T_PROMO and T_PROMO_REL: information about promotions and the products affected by them.
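As a rough illustration, extracting one of these tables might look like the Kotlin/Spark sketch below. The JDBC URL, credentials, and database name are placeholders, not values from this project:

```kotlin
import org.apache.spark.sql.SparkSession

fun main() {
    // Local session for illustration; the real job would target the cluster's master URL.
    val spark = SparkSession.builder()
        .appName("wh-sales-extract")
        .master("local[*]")
        .getOrCreate()

    // Hypothetical MySQL connection settings; substitute your own.
    // Requires the MySQL JDBC driver on the classpath.
    val orders = spark.read()
        .format("jdbc")
        .option("url", "jdbc:mysql://localhost:3306/oltp")
        .option("dbtable", "T_ORDER")
        .option("user", "etl_user")
        .option("password", "etl_password")
        .load()

    orders.show(10)
    spark.stop()
}
```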
Data Transformation:
- Transforms data from the OLTP layout into a star schema suited to analytical queries, as sketched below.
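A minimal sketch of the kind of reshaping involved: joining order headers with order lines into a sales fact table. All column names (ID, ORDER_ID, PRODUCT_ID, CUSTOMER_ID, ORDER_DATE, QUANTITY) are assumptions for illustration, not the project's actual schema:

```kotlin
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row

// Joins T_ORDER with T_ORDER_REL to shape a sales fact table.
// Column names here are illustrative assumptions only.
fun buildFactSales(orders: Dataset<Row>, orderRel: Dataset<Row>): Dataset<Row> =
    orders
        .join(orderRel, orders.col("ID").equalTo(orderRel.col("ORDER_ID")))
        .select(
            orderRel.col("ORDER_ID"),
            orderRel.col("PRODUCT_ID"),
            orders.col("CUSTOMER_ID"),
            orders.col("ORDER_DATE"),
            orderRel.col("QUANTITY")
        )
```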
Orchestration:
- Utilizes Apache Airflow for scheduling and managing the ETL workflows.
Tech Stack:
- Programming Language: Kotlin
- Data Processing Framework: Apache Spark v3.3.2
- Workflow Orchestration: Apache Airflow
Prerequisites:
- Kotlin: ensure you have Kotlin installed.
- Apache Spark v3.3.2: ensure you have Apache Spark v3.3.2 installed.
- Apache Airflow: ensure you have Apache Airflow installed.
Installation:
- Clone the repository:
git clone https://github.com/MFurmanczyk/wh-sales.git
cd wh-sales
- Build the project:
./gradlew shadowJar
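With the Shadow plugin's defaults, the fat JAR should land under build/libs, typically with an -all suffix. The artifact and main-class names in the command below are assumptions, so adjust them to the actual build output:
spark-submit --class com.example.EtlMainKt --master local[*] build/libs/wh-sales-all.jar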
- Set up Apache Airflow:
cd airflow
docker-compose up
Running the ETL:
- Start the Apache Airflow web server and scheduler:
docker-compose up -d
- Move dag.py to Airflow's DAGs folder.
- Access the Airflow UI at http://localhost:8080 and trigger the ETL DAG (sales_dag).
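Alternatively, the DAG can be triggered from Airflow's CLI. When using the Docker setup, run it inside a container; the airflow-scheduler service name below is an assumption based on the standard Airflow docker-compose file and may differ in yours:
docker-compose exec airflow-scheduler airflow dags trigger sales_dag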
License:
This project is licensed under the MIT License - see the LICENSE file for details.