Open Source Data Lake

This project creates a Big Data cluster using open source tools and was designed for learning purposes.

Feel free to study or replicate the content here.

Prerequisites 🛠️

  • Docker >= 25.0.4
  • Docker Compose >= 2.24.7
  • OpenTofu >= 1.6.2
  • 8GB of RAM or more

Available services ☑️

  • Airflow
  • Apache Spark
  • Hive Metastore (HMS)
  • Metabase
  • MinIO
  • OpenTofu
  • Trino

Project Architecture 🏯

Below is an illustration of the project architecture, showing the available services and their interactions.

Architecture

Getting started 🚀

Below is a step-by-step guide on how to run the services available in the project.

Environment variables

Copy the .env.default file to .env and fill in the blank variables.

  • POSTGRES_USER
  • POSTGRES_PASSWORD
  • MINIO_ROOT_USER
  • MINIO_ROOT_PASSWORD (MinIO requires a password of at least 8 characters)
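
If you need values for the password variables, one option among many is to generate a random token with Python:

import secrets

# Prints a URL-safe random string suitable for use as MINIO_ROOT_PASSWORD.
print(secrets.token_urlsafe(16))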

Common services

Create a network named datalake-network (only on first run).

docker network create datalake-network

Initialize all cluster services (This step may take a while to complete on the first run).

docker compose up -d [--scale trino-worker=<num>] [--scale spark-worker=<num>]

After the first startup, if you stop the services and want to start them again, prefix the docker compose up command with IS_RESUME=true (for example, IS_RESUME=true docker compose up -d). This prevents HMS from trying to recreate its database when the volume already exists.

Create MinIO Buckets (only on first run).

tofu init
tofu plan
tofu apply -auto-approve
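
To confirm that the buckets were created, you can browse MinIO at http://localhost:9001 or list them programmatically. The sketch below uses the minio Python client (pip install minio) and assumes the MinIO S3 API is published on localhost:9000; adjust the endpoint if your setup maps a different port.

import os

from minio import Minio

# Reuses the root credentials defined in .env.
client = Minio(
    "localhost:9000",
    access_key=os.environ["MINIO_ROOT_USER"],
    secret_key=os.environ["MINIO_ROOT_PASSWORD"],
    secure=False,
)
print([bucket.name for bucket in client.list_buckets()])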

Create Data Lake schemas (only on first run).

docker container exec datalake-trino-coordinator trino --execute "$(cat trino/schemas.sql)"

If you receive warnings that Trino is still initializing, wait a few seconds and try again.
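
As an additional check, the schemas can also be listed from the host with the trino Python client (pip install trino). This is just a sketch and assumes the coordinator is reachable on localhost:8081, as listed in the service links further below.

from trino.dbapi import connect

# Any username works; this Trino setup does not require a password.
conn = connect(host="localhost", port=8081, user="admin", catalog="hive")
cur = conn.cursor()
cur.execute("SHOW SCHEMAS")
print([row[0] for row in cur.fetchall()])  # expect raw, trusted and refined among the results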

Now the cluster is ready!

Airflow (optional)

Airflow was chosen for job orchestration for the simple reason that it is open source and the most widely used orchestrator on the market.

First you need to run the database migrations and create a user account. Just run the command below:

docker compose -f docker-compose.airflow.yml up airflow-init

If you want to customize the Airflow project directory, simply update the AIRFLOW_PROJ_DIR variable in the .env to a directory of interest.

The default directory is ./workspace.

Once airflow-init has finished, you can start Apache Airflow.

docker compose -f docker-compose.airflow.yml up -d

Access the URL http://localhost:8080 in your browser and log in using the default credentials airflow:airflow.

For more details, see the official Airflow documentation.

Metabase (optional)

Metabase was chosen (somewhat arbitrarily) as the BI tool for creating dashboards.

In the metabase/metabase.db directory, copy the file metabase.db.mv.db.default to metabase.db.mv.db. This way, Metabase will start with the proof-of-concept dashboard already created; if you prefer a clean deployment, skip this step.

Run the Metabase container with the command below:

docker compose -f docker-compose.metabase.yml up -d

Access the URL http://localhost:3000 in your browser and log in using the default credentials johnny@email.com:j0hnny, or create your own account.

If you want to create a new connection with Trino, use the settings below:

  • Database Type: Starburst
  • Display Name: Your choice
  • Host: datalake-trino-coordinator
  • Port: 8080
  • Catalog: hive
  • Schema (optional): raw|trusted|refined
  • Username: metabase
  • Password: no password is required

For more details, see the official Metabase documentation.

Access links to service interfaces

Service   URL                    Auth
Metabase  http://localhost:3000  johnny@email.com:j0hnny
Airflow   http://localhost:8080  airflow:airflow
Trino UI  http://localhost:8081  Any username, no password is required
Spark UI  http://localhost:8082  None
MinIO     http://localhost:9001  ${MINIO_ROOT_USER}:${MINIO_ROOT_PASSWORD}

Submit Spark jobs and use the Trino CLI ✨

To submit Spark jobs, you can use Airflow itself, which already provides a custom operator for this purpose, PySparkDockerOperator.
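
The operator's exact signature lives in the project code, but the idea can be sketched with the stock DockerOperator from the apache-airflow-providers-docker package, which reproduces the manual docker run invocation shown below (hypothetical job and file names; assumes Airflow 2.x):

from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

# Hypothetical DAG: run spark-submit inside the datalake-spark-image container,
# attached to datalake-network, just like the manual command below.
with DAG(
    dag_id="example_spark_submit",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    raw_job = DockerOperator(
        task_id="raw_tb_example",
        image="datalake-spark-image",
        network_mode="datalake-network",
        command=(
            "spark-submit --name raw_tb_example "
            "s3a://datalake-artifacts/example/pyspark_raw_example.py"
        ),
    )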

But it is also possible to submit Spark jobs manually, by uploading the artifacts (source code and dependencies) to MinIO and executing the following command.

docker run -t --rm [-e <key>=<value>] --network datalake-network datalake-spark-image <spark-submit-command>

Template of a spark-submit command, assuming that some settings are already defined in spark-defaults.conf.

spark-submit \
--name <job-name> \
[--conf <key>=<value>] \
... # other options
<s3-application-py> \
[application-args]

For more details, see the official Submitting Applications documentation.

Example of use:

# Build dependencies
cd workspace/project
pip install -r requirements.txt -t pyfiles
cd pyfiles && zip -r pyfiles.zip *

# Upload artifacts to s3a://datalake-artifacts/salario_minimo using the MinIO UI

# Spark job 1
docker run -t --rm -e BUCKET_DATALAKE_LANDING=datalake-landing --network datalake-network datalake-spark-image \
spark-submit \
--name raw_tb_salario_minimo \
s3a://datalake-artifacts/salario_minimo/pyspark_raw_salario_minimo.py

# Spark job 2
docker run -t --rm --network datalake-network datalake-spark-image \
spark-submit \
--name trusted_tb_salario_minimo \
--py-files s3a://datalake-artifacts/salario_minimo/pyfiles.zip \
s3a://datalake-artifacts/salario_minimo/pyspark_trusted_salario_minimo.py

# Check the status of running jobs at http://localhost:8082
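
The upload step above uses the MinIO UI, but it can also be scripted. Below is a sketch with the minio Python client, again assuming the S3 API on localhost:9000; the local file paths are illustrative and relative to workspace/project.

import os

from minio import Minio

client = Minio(
    "localhost:9000",
    access_key=os.environ["MINIO_ROOT_USER"],
    secret_key=os.environ["MINIO_ROOT_PASSWORD"],
    secure=False,
)

# Upload the job source files and the zipped dependencies built above.
client.fput_object("datalake-artifacts", "salario_minimo/pyspark_raw_salario_minimo.py", "pyspark_raw_salario_minimo.py")
client.fput_object("datalake-artifacts", "salario_minimo/pyspark_trusted_salario_minimo.py", "pyspark_trusted_salario_minimo.py")
client.fput_object("datalake-artifacts", "salario_minimo/pyfiles.zip", "pyfiles/pyfiles.zip")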

Check the results using the Trino CLI.

trino --server http://localhost:8081 --catalog hive
trino> select * from trusted.tb_salario_minimo limit 3;
#  vl_salario_minimo | nu_mes | nu_ano 
# -------------------+--------+--------
#              937.0 |      1 |   2017 
#             1039.0 |      1 |   2020 
#             1045.0 |      2 |   2020
# (3 rows)

Proof of Concept Project 🦄

In the workspace/project directory, we have a project that demonstrates how the cluster and Data Lake work. It consists of a simple data pipeline for ingestion, curation, and refinement, demonstrating the interaction between Spark, Hive, and MinIO.
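
For illustration only, a raw-layer ingestion step in this kind of pipeline looks roughly like the sketch below (hypothetical table and path names; spark-defaults.conf is assumed to already point Spark at MinIO and the Hive Metastore):

from pyspark.sql import SparkSession

# Hypothetical raw-layer job: read a file landed in MinIO and register it as a Hive table.
spark = SparkSession.builder.appName("raw_tb_example").enableHiveSupport().getOrCreate()

df = spark.read.option("header", True).csv("s3a://datalake-landing/example/")
df.write.mode("overwrite").saveAsTable("raw.tb_example")

spark.stop()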

Through the DAG dag.poder_compra, the necessary artifacts for executing Spark jobs are uploaded (including the source code files and their dependencies) and all stages of the pipeline are executed, resulting in a final table called trusted.tb_poder_compra.

Airflow DAG

Once table trusted.tb_poder_compra is created, we can explore it using Metabase.

Metabase Dashboard

That's all, folks!

Future work 🔮

  • Improve access control for users and services;
  • Use a password vault instead of a simple .env file;
  • Use Kubernetes to orchestrate containers.
