This project provides a basic, dockerized data engineering development and learning environment built on the following tools:
- PySpark (3.3.2)
- MinIO (AGPL v3) 🦩
- Jupyter Lab
This project is intended for learning only; never use it in production.
To run this project, you need a Linux OS (WSL included) and Docker Compose v2.
First, build the Docker image by typing `make build`. After that, type `make start` every time you want to start the services.
Once the build and start steps are done, type `make token` and copy the result.
Access http://localhost:8888, paste the token into the password/token field, and submit. If everything is right, you now have access to Jupyter Lab and can create Python scripts as usual.
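For a quick smoke test, the cell below is a minimal sketch that creates a SparkSession and runs a trivial job. The master URL is an assumption (it depends on the Spark master service name in `docker-compose.yml`); adjust it to match your setup.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hello-spark")
    # Assumption: the Spark master service is reachable as "spark-master"
    # on the default port inside the Docker network.
    .master("spark://spark-master:7077")
    .getOrCreate()
)

# A trivial job just to confirm the workers are reachable.
print(spark.range(1_000_000).count())
```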
Access http://localhost:9000 and sign in with these credentials:
- username: root
- password: root@password
Now you can create your own buckets to store and manipulate files, just like AWS S3 🍷.
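As a rough sketch of how PySpark can talk to MinIO through the S3A connector (assuming the image already ships the `hadoop-aws` jars; if not, they can be added via `spark.jars.packages`), the snippet below writes and reads a small DataFrame. The endpoint host `minio` and the bucket name `my-bucket` are assumptions; replace them with your actual service name and an existing bucket.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("minio-example")
    # Assumption: MinIO is reachable as "minio" inside the Docker network.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "root")
    .config("spark.hadoop.fs.s3a.secret.key", "root@password")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

# Write a small DataFrame to a bucket created beforehand in the MinIO console.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.write.mode("overwrite").parquet("s3a://my-bucket/example/")

# Read it back to confirm the round trip works.
spark.read.parquet("s3a://my-bucket/example/").show()
```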
Access http://localhost:8080 to inspect PySpark applications and workers (by default, the `docker-compose.yml` is configured to run 2 PySpark workers with 1 vCore and 2 GB of memory each).
To inspect the running stages, you can access http://localhost:4040 during execution.
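If you want something to watch on that page, a job with a shuffle produces more than one stage. The sketch below reuses the `spark` session from the previous examples.

```python
from pyspark.sql import functions as F

# groupBy forces a shuffle, so the job shows up as multiple stages at :4040.
df = spark.range(10_000_000).withColumn("bucket", F.col("id") % 100)
df.groupBy("bucket").count().collect()
```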
To stop all containers, type `make stop` in the terminal and wait for them all to be brought down.
An example using PySpark and MinIO through Jupyter is available at `workspace/sample.ipynb`.
When the containers run for the first time, a `workspace/` directory is created at the root of the project. This folder is shared between the host machine and the Jupyter workspace running inside the container. The `buckets/` directory is where MinIO persists the data it generates.
This ensures that, even if a container is restarted, all the code and data you have already created will persist.