
NEW YORK CITY TLC TAXI DATA PIPELINE


Project Overview

This is an ETL (Extract-Transform-Load) data pipeline that uses DuckDB for extraction and loading and dbt for transformation. The data to be transformed comes from the NYC TLC website and covers the month of May 2024. A description of the data columns can be found here.

Objectives

The transformation stage builds the following models for further analysis:

  1. Route traffic model using the pick-up and drop-off locations (a sketch of this model appears after this list).

  2. Hourly server outage model using the drop-off location. This looks for drop-off locations that are prone to server outages when sending trip details to the server; such trips are marked as N. The analysis identifies which hours of the day are most affected by severe server outages and might need further action.

  3. Tip amount model. Analyzes the tips paid by different customers to different vendors.

  4. Daily-hourly traffic model. Analyzes passenger counts for every hour of the day for further passenger trend analysis.

  5. Pick-up trend model for assessing passenger counts across the various pick-up locations.
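As an illustration of objective 1, the route traffic model can be expressed as a simple aggregation over the trip data. The sketch below assumes the trips have been loaded into a table named yellow_trips_may_2024 (a hypothetical name); PULocationID and DOLocationID are the pick-up and drop-off location columns from the TLC data.

```sql
-- Hypothetical sketch: count trips for every pick-up/drop-off pair
select
    PULocationID as pickup_location_id,
    DOLocationID as dropoff_location_id,
    count(*)     as trip_count
from yellow_trips_may_2024   -- assumed name of the loaded trips table
group by PULocationID, DOLocationID
order by trip_count desc
```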

Data Extraction & Loading

DuckDB offers extensive options for data extraction. When exploring the structure and format of the data, it is helpful to call fetch_df() on the results of SQL queries (through DuckDB's Python client) so they come back as a DataFrame. How the final transformation models should look can be found here.
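As a sketch (the local file name below is an assumption based on the TLC naming convention), the raw parquet file can be queried directly in DuckDB to inspect its columns and to peek at a few rows before any transformation is written:

```sql
-- Inspect the column names and types of the raw trip data (file name is assumed)
DESCRIBE SELECT * FROM read_parquet('yellow_tripdata_2024-05.parquet');

-- Peek at a few rows to check the format
SELECT * FROM read_parquet('yellow_tripdata_2024-05.parquet') LIMIT 10;
```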

DuckDB is also used for loading the transformed data in table format, as defined in the transformation models using the `{{ config(materialized='table') }}` directive.
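A minimal staging model, assuming a hypothetical model name and file path, could look like the following; DuckDB materializes the query result as a physical table in the database file.

```sql
-- models/stg_yellow_trips.sql (hypothetical model name)
{{ config(materialized='table') }}

-- dbt-duckdb lets the model read the raw parquet file directly
select *
from read_parquet('yellow_tripdata_2024-05.parquet')
```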

Data Transformation

For the transformation, dbt (Data Build Tool) comes in very handy for expressing the transformation logic in plain SQL syntax. The transformation models are all chained to the first model for further analysis of the data.
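Chaining between models is done with dbt's ref() function. For example, a daily-hourly traffic model (objective 4) could build on the staging model sketched above; the model name is illustrative, while tpep_pickup_datetime and passenger_count are columns from the TLC data.

```sql
-- models/daily_hourly_traffic.sql (hypothetical model name)
{{ config(materialized='table') }}

select
    date_trunc('day', tpep_pickup_datetime) as trip_date,
    extract(hour from tpep_pickup_datetime) as trip_hour,
    sum(passenger_count)                    as total_passengers,
    count(*)                                as total_trips
from {{ ref('stg_yellow_trips') }}   -- chains this model to the staging model
group by 1, 2
order by 1, 2
```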

To initialize a dbt project together with the DuckDB OLAP database, run the following commands in order:

  • pip install dbt-duckdb

  • dbt init

  • dbt debug to test that everything is working fine before proceeding.

  • dbt run after defining the transformation models. In case of any errors, check the logs.

For a successful dbt run, the following should be printed on the terminal:

(Screenshot: output of a successful dbt run.)

Dashboard

The transformation models are then visualized using Power BI, which offers quick, interactive visualization charts covering the key KPIs.

(Screenshot: Power BI dashboard.)

Prefect Integration

The prefect-dbt-flow library offers quick and simple integration for orchestration and pipeline monitoring.

(Screenshot: Prefect orchestration flow.)
