Don't speak English? Click here to view this page in Portuguese
This project is a solution for data extraction, loading, and transformation (ELT) using Airflow, Meltano, Streamlit, and PostgreSQL. It extracts data from different sources, loads it into a PostgreSQL database, and visualizes the results in a Streamlit application.

It includes the Pre-Commit framework to manage and maintain pre-commit hooks, ensuring the code follows standards established by the Python community.
- Prerequisites
- Project Architecture
- Setup
- Accessing Services
- Running Meltano in the Terminal
- Stopping Services
- Troubleshooting
- Contribution
## Prerequisites

- Docker
- Docker Compose
- Make (optional but recommended)
## Project Architecture

JSONL (JSON Lines) was selected as the data storage format for this project because it is more flexible and works well in modern pipelines. Here are the main reasons:
- Simple Structure: It allows storing complex and nested data, such as lists and objects, without "flattening" the data.
- Stream Processing: Each line is an independent JSON object, making it suitable for processing large datasets without loading everything into memory.
- Compatibility: It is widely supported by modern ETL tools and APIs, making integration easier.
- Easy Debugging: An error in one line does not affect the entire file, making it easier to identify and fix issues.
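As a quick illustration of the streaming property, each line is a self-contained JSON object, so ordinary line-oriented tools can process one record at a time (a generic sketch, not tied to this project's data):

```shell
# write a tiny JSONL file: one standalone JSON object per line
printf '%s\n' '{"id": 1, "tags": ["a"]}' '{"id": 2, "tags": ["b", "c"]}' > sample.jsonl

# stream it record by record -- no need to load the whole file into memory
while IFS= read -r record; do
  echo "processing: $record"
done < sample.jsonl
```

A malformed line would only break the one record being read, which is what makes debugging large extracts manageable.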
## Setup

1. Clone the repository:

   ```shell
   git clone git@github.com:Robso-creator/elt_meltano_ind.git
   cd elt_meltano_ind
   ```

2. Create the `.env` file in the project root with the following content:

   ```shell
   POSTGRES_USER=postgres
   POSTGRES_PASSWORD=postgres
   ```

3. Check the Docker version to ensure it is installed correctly:

   ```shell
   docker --version
   ```

4. Build the Streamlit image and start the containers:

   ```shell
   make build  # Build the Streamlit image
   make up     # Start the containers
   ```
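The `.env` file created above is read automatically by Docker Compose. If you also want those values in an interactive shell (for example, to run `psql` by hand), a portable way to export them is (a sketch, assuming the two-line file shown above):

```shell
# recreate the .env file from the setup step (assumed two-line content)
printf 'POSTGRES_USER=postgres\nPOSTGRES_PASSWORD=postgres\n' > .env

# 'set -a' marks every variable assigned afterwards for export;
# sourcing the file then exports each KEY=VALUE pair
set -a
. ./.env
set +a

echo "$POSTGRES_USER"   # → postgres
```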
## Accessing Services

- **Airflow**: Access [localhost:8080](http://localhost:8080) to manage and execute the DAGs that extract data from sources and load it into the database.
- **Streamlit**: Access [localhost:8501](http://localhost:8501) to view the Streamlit application with the processed results.
## Running Meltano in the Terminal

To run Meltano directly in the terminal, use the following commands:

```shell
make enter-local
SOURCE=postgres YEAR=2025 MONTH=01 DAY=03 meltano run extract-postgres-to-jsonl
SOURCE=csv YEAR=2025 MONTH=01 DAY=03 meltano run extract-csv-to-jsonl
YEAR=2025 MONTH=01 DAY=03 meltano run load-jsonl-to-postgres
```
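The `SOURCE`, `YEAR`, `MONTH`, and `DAY` variables select which daily slice the jobs read and write. Purely as an illustration of how such variables are typically combined into a date-partitioned path (the actual layout is defined inside the Meltano project, so treat the path below as an assumption):

```shell
# hypothetical: how SOURCE/YEAR/MONTH/DAY might map to a partitioned file path
SOURCE=csv
YEAR=2025
MONTH=01
DAY=03
partition="output/${SOURCE}/${YEAR}-${MONTH}-${DAY}.jsonl"
echo "$partition"   # → output/csv/2025-01-03.jsonl
```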
## Stopping Services

To stop the containers, use the following commands:

```shell
make down  # Stop the containers
make rm    # Remove stopped containers and volumes
```
## Troubleshooting

If you cannot access the Airflow page and find the error `Already running on PID <PID>` in `make logs-webserver`, follow the steps below:

1. Stop the containers:

   ```shell
   make down
   ```

2. Check if any process is using port 8080:

   ```shell
   sudo lsof -i tcp:8080
   ```

3. If there is a process, kill it (replace `<PID>` with the number reported by `lsof`):

   ```shell
   sudo kill -9 <PID>
   ```

4. Remove the Airflow PID file:

   ```shell
   sudo rm -rf meltano/orchestrate/airflow-webserver.pid
   ```

5. Restart the containers:

   ```shell
   make up
   ```
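The PID-file cleanup above can also be wrapped in a small guard so it is safe to run even when no stale file exists (a hypothetical helper, not part of the project's Makefile):

```shell
# hypothetical helper: remove a stale Airflow webserver PID file if present
clear_stale_pid() {
  pid_file="$1"
  if [ -f "$pid_file" ]; then
    rm -f "$pid_file"
    echo "removed $pid_file"
  else
    echo "no PID file at $pid_file"
  fi
}

# usage (run from the project root):
clear_stale_pid "meltano/orchestrate/airflow-webserver.pid"
```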
## Contribution

Contributions are welcome! Feel free to open issues and pull requests.