Skip to content

Alessine/data-engineering-zoomcamp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Engineering Zoomcamp Cohort 2025

This repo contains all my materials, notes and homework for the Data Engineering Zoomcamp.

Module 1: Docker, SQL, Terraform

Learning in Public

I documented my learning in a Medium article.

Homework

Question 1. Understanding docker first run

Run docker with the python:3.12.8 image in an interactive mode, use the entrypoint bash. What's the version of pip in the image?

Answer:

In bash: docker run -it --entrypoint bash python:3.12.8 The image will run locally. To check the version of pip: pip --version. It is version 24.3.1.

Question 2. Understanding Docker networking and docker-compose

Given the following docker-compose.yaml, what is the hostname and port that pgadmin should use to connect to the postgres database?

Answer:

The container name with the postgres database is postgres, located at port 5432, so the answer is postgres:5432.

Question 3. Trip Segmentation Count

During the period of October 1st 2019 (inclusive) and November 1st 2019 (exclusive), how many trips, respectively, happened:

  • Up to 1 mile
  • In between 1 (exclusive) and 3 miles (inclusive),
  • In between 3 (exclusive) and 7 miles (inclusive),
  • In between 7 (exclusive) and 10 miles (inclusive),
  • Over 10 miles

Answer:

Query:

SELECT
	CASE
		WHEN TRIP_DISTANCE <= 1 THEN '1: <1'
		WHEN TRIP_DISTANCE > 1
		AND TRIP_DISTANCE <= 3 THEN '2: 1-3'
		WHEN TRIP_DISTANCE > 3
		AND TRIP_DISTANCE <= 7 THEN '3: 3-7'
		WHEN TRIP_DISTANCE > 7
		AND TRIP_DISTANCE <= 10 THEN '4: 7-10'
		WHEN TRIP_DISTANCE > 10 THEN '5: 10+'
		ELSE 'unknown'
	END AS TRIP_DISTANCE_GROUP,
	COUNT(*) AS TRIP_COUNT
FROM
	GREEN_TAXI_TRIPS
WHERE
	DATE_TRUNC('day', LPEP_DROPOFF_DATETIME) BETWEEN '2019-10-01' AND '2019-10-31'
GROUP BY
	TRIP_DISTANCE_GROUP;

Result:

query result for question 3

Question 4. Longest trip for each day

Which was the pick up day with the longest trip distance? Use the pick up time for your calculations.

Tip: For every day, we only care about one single trip with the longest distance.

Answer:

Query:

SELECT
	DATE_TRUNC('day', LPEP_PICKUP_DATETIME) AS DATE,
	MAX(TRIP_DISTANCE) AS MAX_DISTANCE
FROM
	GREEN_TAXI_TRIPS
WHERE
	DATE_TRUNC('day', LPEP_PICKUP_DATETIME) BETWEEN '2019-10-01' AND '2019-10-31'
GROUP BY
	DATE_TRUNC('day', LPEP_PICKUP_DATETIME)
ORDER BY
	MAX_DISTANCE DESC
LIMIT
	1;

Result:

query result for question 4

Question 5. Three biggest pickup zones

Which were the top pickup locations with over 13,000 in total_amount (across all trips) for 2019-10-18?

Consider only lpep_pickup_datetime when filtering by date.

Answer:

Query:

SELECT
	ZONES."Zone",
	ROUND(CAST(TOTAL_AMOUNT_PER_ZONE AS NUMERIC), 2) AS TOTAL_AMOUNT_PER_ZONE
FROM
	(
		SELECT
			"PULocationID",
			SUM(TOTAL_AMOUNT) AS TOTAL_AMOUNT_PER_ZONE
		FROM
			GREEN_TAXI_TRIPS
		WHERE
			DATE_TRUNC('day', LPEP_PICKUP_DATETIME) = '2019-10-18'
		GROUP BY
			"PULocationID"
	) AS TOTAL_AMOUNT_AGG
	JOIN ZONES ON "PULocationID" = "LocationID"
WHERE
	TOTAL_AMOUNT_PER_ZONE > 13000;

Result:

query result for question 5

Question 6. Largest tip

For the passengers picked up in October 2019 in the zone name "East Harlem North" which was the drop off zone that had the largest tip?

Note: it's tip , not trip

We need the name of the zone, not the ID.

Answer:

Query:

SELECT
	DZ."Zone" AS "DOZone",
	MAX(TRIPS.TIP_AMOUNT) AS MAX_TIP
FROM
	GREEN_TAXI_TRIPS AS TRIPS
	LEFT JOIN ZONES AS PZ ON TRIPS."PULocationID" = PZ."LocationID"
	LEFT JOIN ZONES AS DZ ON TRIPS."DOLocationID" = DZ."LocationID"
WHERE
	DATE_TRUNC('day', LPEP_PICKUP_DATETIME) BETWEEN '2019-10-01' AND '2019-10-31'
	AND PZ."Zone" = 'East Harlem North'
GROUP BY
	DZ."Zone"
ORDER BY
	MAX_TIP DESC
LIMIT
	1;

Result:

query result for question 6

Question 7. Terraform Workflow

Which of the following sequences, respectively, describes the workflow for:

  • Downloading the provider plugins and setting up backend,
  • Generating proposed changes and auto-executing the plan
  • Remove all resources managed by terraform`

Answer:

The required file is here.

The bash commands for the described workflow are the following:

$ terraform init
$ terraform apply -auto-approve
$ terraform destroy

Module 2: Orchestration with Kestra

Learning in Public

I'm documenting my learning in a Medium article - coming soon.

Homework

Question 1. File Size

Within the execution for Yellow Taxi data for the year 2020 and month 12: what is the uncompressed file size (i.e. the output file yellow_tripdata_2020-12.csv of the extract task)?

Answer:

In the GCS Bucket I can see that the uncompressed file size for the specified file is 128.3 MB.

file sizes in the GCS Bucket

Question 2. Rendered Value

What is the rendered value of the variable file when the inputs taxi is set to green, year is set to 2020, and month is set to 04 during execution?

Answer:

The variable file is defined as follows: "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv". When rendered with the specified inputs, this generates the value green_tripdata_2020-04.csv. This is also visible in the GCS Bucket:

file names in the GCS Bucket

Question 3. Number of rows (yellow, 2020)

How many rows are there for the Yellow Taxi data for all CSV files in the year 2020?

Answer:

Query:

SELECT
  COUNT(*) AS row_count
FROM
  `dez-2025.taxi_data.yellow_tripdata`
WHERE
  filename LIKE "%2020%";

Result:

query result for question 3

Question 4. Number of rows (green, 2020)

How many rows are there for the Green Taxi data for all CSV files in the year 2020?

Answer:

Query:

SELECT
  COUNT(*) AS row_count
FROM
  `dez-2025.taxi_data.green_tripdata`
WHERE
  filename LIKE "%2020%";

Result:

query result for question 4

Question 5. Number of rows (yellow, March 2021)

How many rows are there for the Yellow Taxi data for the March 2021 CSV file?

Answer:

Query:

SELECT
  COUNT(*) AS row_count
FROM
  `dez-2025.taxi_data.yellow_tripdata`
WHERE
  filename = "yellow_tripdata_2021-03.csv";

Result:

query result for question 4

Question 6. Timezone for trigger

How would you configure the timezone to New York in a Schedule trigger?

Answer:

In the Kestra Documentation on Schedule Triggers we can find the following information:

Screenshot from the kestra documentation

About

Materials from the Data Engineering Zoomcamp 2025

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published