This repo contains all my materials, notes and homework for the Data Engineering Zoomcamp.
I documented my learning in a Medium article.
Question 1. Understanding docker first run
Run docker with the python:3.12.8 image in an interactive mode, use the entrypoint bash. What's the version of pip in the image?
Answer:
In bash: docker run -it --entrypoint bash python:3.12.8
The image will run locally. To check the version of pip: pip --version
. It is version 24.3.1
.
Question 2. Understanding Docker networking and docker-compose
Given the following docker-compose.yaml, what is the hostname and port that pgadmin should use to connect to the postgres database?
Answer:
The container name with the postgres database is postgres
, located at port 5432
, so the answer is postgres:5432
.
Question 3. Trip Segmentation Count
During the period of October 1st 2019 (inclusive) and November 1st 2019 (exclusive), how many trips, respectively, happened:
- Up to 1 mile
- In between 1 (exclusive) and 3 miles (inclusive),
- In between 3 (exclusive) and 7 miles (inclusive),
- In between 7 (exclusive) and 10 miles (inclusive),
- Over 10 miles
Answer:
Query:
SELECT
CASE
WHEN TRIP_DISTANCE <= 1 THEN '1: <1'
WHEN TRIP_DISTANCE > 1
AND TRIP_DISTANCE <= 3 THEN '2: 1-3'
WHEN TRIP_DISTANCE > 3
AND TRIP_DISTANCE <= 7 THEN '3: 3-7'
WHEN TRIP_DISTANCE > 7
AND TRIP_DISTANCE <= 10 THEN '4: 7-10'
WHEN TRIP_DISTANCE > 10 THEN '5: 10+'
ELSE 'unknown'
END AS TRIP_DISTANCE_GROUP,
COUNT(*) AS TRIP_COUNT
FROM
GREEN_TAXI_TRIPS
WHERE
DATE_TRUNC('day', LPEP_DROPOFF_DATETIME) BETWEEN '2019-10-01' AND '2019-10-31'
GROUP BY
TRIP_DISTANCE_GROUP;
Result:
Question 4. Longest trip for each day
Which was the pick up day with the longest trip distance? Use the pick up time for your calculations.
Tip: For every day, we only care about one single trip with the longest distance.
Answer:
Query:
SELECT
DATE_TRUNC('day', LPEP_PICKUP_DATETIME) AS DATE,
MAX(TRIP_DISTANCE) AS MAX_DISTANCE
FROM
GREEN_TAXI_TRIPS
WHERE
DATE_TRUNC('day', LPEP_PICKUP_DATETIME) BETWEEN '2019-10-01' AND '2019-10-31'
GROUP BY
DATE_TRUNC('day', LPEP_PICKUP_DATETIME)
ORDER BY
MAX_DISTANCE DESC
LIMIT
1;
Result:
Question 5. Three biggest pickup zones
Which were the top pickup locations with over 13,000 in total_amount (across all trips) for 2019-10-18?
Consider only lpep_pickup_datetime when filtering by date.
Answer:
Query:
SELECT
ZONES."Zone",
ROUND(CAST(TOTAL_AMOUNT_PER_ZONE AS NUMERIC), 2) AS TOTAL_AMOUNT_PER_ZONE
FROM
(
SELECT
"PULocationID",
SUM(TOTAL_AMOUNT) AS TOTAL_AMOUNT_PER_ZONE
FROM
GREEN_TAXI_TRIPS
WHERE
DATE_TRUNC('day', LPEP_PICKUP_DATETIME) = '2019-10-18'
GROUP BY
"PULocationID"
) AS TOTAL_AMOUNT_AGG
JOIN ZONES ON "PULocationID" = "LocationID"
WHERE
TOTAL_AMOUNT_PER_ZONE > 13000;
Result:
Question 6. Largest tip
For the passengers picked up in October 2019 in the zone name "East Harlem North" which was the drop off zone that had the largest tip?
Note: it's tip , not trip
We need the name of the zone, not the ID.
Answer:
Query:
SELECT
DZ."Zone" AS "DOZone",
MAX(TRIPS.TIP_AMOUNT) AS MAX_TIP
FROM
GREEN_TAXI_TRIPS AS TRIPS
LEFT JOIN ZONES AS PZ ON TRIPS."PULocationID" = PZ."LocationID"
LEFT JOIN ZONES AS DZ ON TRIPS."DOLocationID" = DZ."LocationID"
WHERE
DATE_TRUNC('day', LPEP_PICKUP_DATETIME) BETWEEN '2019-10-01' AND '2019-10-31'
AND PZ."Zone" = 'East Harlem North'
GROUP BY
DZ."Zone"
ORDER BY
MAX_TIP DESC
LIMIT
1;
Result:
Question 7. Terraform Workflow
Which of the following sequences, respectively, describes the workflow for:
- Downloading the provider plugins and setting up backend,
- Generating proposed changes and auto-executing the plan
- Remove all resources managed by terraform`
Answer:
The required file is here.
The bash commands for the described workflow are the following:
$ terraform init
$ terraform apply -auto-approve
$ terraform destroy
I'm documenting my learning in a Medium article - coming soon.
Question 1. File Size
Within the execution for Yellow
Taxi data for the year 2020
and month 12
: what is the uncompressed file size (i.e. the output file yellow_tripdata_2020-12.csv
of the extract task)?
Answer:
In the GCS Bucket I can see that the uncompressed file size for the specified file is 128.3 MB.
Question 2. Rendered Value
What is the rendered value of the variable file
when the inputs taxi
is set to green
, year
is set to 2020
, and month
is set to 04
during execution?
Answer:
The variable file
is defined as follows: "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv"
. When rendered with the specified inputs, this generates the value green_tripdata_2020-04.csv
. This is also visible in the GCS Bucket:
Question 3. Number of rows (yellow, 2020)
How many rows are there for the Yellow
Taxi data for all CSV files in the year 2020?
Answer:
Query:
SELECT
COUNT(*) AS row_count
FROM
`dez-2025.taxi_data.yellow_tripdata`
WHERE
filename LIKE "%2020%";
Result:
Question 4. Number of rows (green, 2020)
How many rows are there for the Green
Taxi data for all CSV files in the year 2020?
Answer:
Query:
SELECT
COUNT(*) AS row_count
FROM
`dez-2025.taxi_data.green_tripdata`
WHERE
filename LIKE "%2020%";
Result:
Question 5. Number of rows (yellow, March 2021)
How many rows are there for the Yellow
Taxi data for the March 2021 CSV file?
Answer:
Query:
SELECT
COUNT(*) AS row_count
FROM
`dez-2025.taxi_data.yellow_tripdata`
WHERE
filename = "yellow_tripdata_2021-03.csv";
Result:
Question 6. Timezone for trigger
How would you configure the timezone to New York in a Schedule trigger?
Answer:
In the Kestra Documentation on Schedule Triggers we can find the following information: