This is an ETL (Extract-Transform-Load) data pipeline that uses DuckDB for extraction and loading and dbt for transformation. The data to be transformed comes from the NYC TLC website for the month of May 2024. A description of the data columns can be found here.
The transformation objectives include building the following transformation models for further analysis (a sketch of one of them is shown after this list):

- **Route traffic model** using the pick-up and drop-off locations.
- **Hourly daily server outage model** using the drop-off location. Look for drop-off locations that are prone to server outages when sending trip details to the server; such trips are marked as `N`. This analysis identifies which hours of the day are most affected by severe server outages and might need further action.
- **Tip amount model**. Analyze the tips paid by different customers to different vendors.
- **Daily-hourly traffic model**. A model to analyze the passenger count for every hour of each day for further passenger trend analysis.
- **Pick-up trend model** for assessing passenger counts in various pick-up locations.
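As an illustration, the route traffic model could be a simple aggregation over the pick-up and drop-off location columns. The table and column names below (`yellow_trips_may_2024`, `PULocationID`, `DOLocationID`) are assumptions based on the public TLC yellow-taxi schema, not names taken from this project:

```sql
-- Hypothetical sketch of the route traffic model:
-- trip counts per pick-up/drop-off location pair.
-- Table and column names are assumed, not taken from this project.
SELECT
    PULocationID AS pickup_location_id,
    DOLocationID AS dropoff_location_id,
    COUNT(*)     AS trip_count
FROM yellow_trips_may_2024
GROUP BY PULocationID, DOLocationID
ORDER BY trip_count DESC;
```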
For the data extraction, DuckDB offers extensive options for performing the extraction explicitly. In particular, `fetch_df()` is useful when running SQL queries to inspect the data structure and format. What the final transformation models look like can be found here.
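For example, DuckDB can read the TLC Parquet file directly and describe its schema before anything is loaded; from the Python client, the same queries can be executed and returned as a DataFrame with `fetch_df()`. The file name below is an assumption:

```sql
-- Inspect the column names and types of the raw file (file name assumed).
DESCRIBE SELECT * FROM read_parquet('yellow_tripdata_2024-05.parquet');

-- Preview a few rows to check the data format.
SELECT * FROM read_parquet('yellow_tripdata_2024-05.parquet') LIMIT 5;
```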
DuckDB is also used for loading the transformed data in table format, as defined in the transformation models using the `{{ config(materialized='table') }}` configuration.
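As a minimal sketch, a model file that should be persisted as a physical table starts with that config block. The model name, the upstream `stg_trips` model, and the column names below are illustrative assumptions, not taken from this project:

```sql
-- models/daily_hourly_traffic.sql (illustrative name)
{{ config(materialized='table') }}

-- Passenger counts per day and hour of the pick-up timestamp
-- (column names assumed from the public TLC yellow-taxi schema).
SELECT
    date_trunc('day', tpep_pickup_datetime) AS trip_date,
    extract(hour FROM tpep_pickup_datetime) AS trip_hour,
    SUM(passenger_count)                    AS total_passengers
FROM {{ ref('stg_trips') }}
GROUP BY 1, 2
ORDER BY trip_date, trip_hour
```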
For the transformation, dbt (Data Build Tool) comes in handy for handling the transformation logic using normal SQL syntax. The transformation models are all chained to the first model for further analysis of the data.
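For example, chaining is done with `ref()`: a downstream model selects from an upstream one, and dbt builds them in dependency order. The sketch below assumes a pick-up trend model built on top of a hypothetical `route_traffic` model; the names reuse the illustrative ones from the earlier sketches:

```sql
-- models/pickup_trend.sql (illustrative name)
{{ config(materialized='table') }}

-- Builds on the upstream route traffic model: total trips per pick-up location.
SELECT
    pickup_location_id,
    SUM(trip_count) AS total_trips
FROM {{ ref('route_traffic') }}
GROUP BY pickup_location_id
ORDER BY total_trips DESC
```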
To initialize a dbt project together with the DuckDB OLAP database, run the following commands in order:

- `pip install dbt-duckdb`
- `dbt init`
- `dbt debug` to test that everything is working before proceeding.
- `dbt run` after defining the transformation models. In case of any error, the logs should be checked.
For a successful run, dbt prints a completion summary on the terminal showing that all models were built without errors.
The transformation models are then visualized using Power BI, which offers quick, interactive visualization charts for the key KPIs.
The Prefect-dbt-flow library offers quick and simple integration for orchestration and pipeline monitoring.