This repository contains the flows and deployments for the data pipelines. The flows are written in Python using the Prefect 2 library.
In order to contribute to this repository, you will need to have a basic understanding of the Prefect library and how it works. Please find more information around how Prefect works in the Prefect documentation.
-
Clone the repository using git
git clone https://github.com/Mmoncadaisla/data-flows.git
-
Set up your local environment using docker-compose (see Getting started with local Development)
-
Create a new branch from the
development
branch (e.g.YourUsername/feature/my-new-feature
)git checkout -b YourUsername/feature/my-new-feature
-
Create your flow in the
dataflow
package and run it locally (see Running flows locally)NOTE: If your flow requires a new dependency, add it to the
setup.py
file, requirements will be automatically added to therunner
andagent
containers. -
Once you are happy with your flow, make sure you add unit tests for it within the
tests
directory (tests are designed usingpytest
, please see Testing for reference) -
If desired, add a deployment definition file to the
deployments
module (e.g:flow_name_deployment.py
). It is recommended to use parameters to have a separate deployment for production and staging environments. For example (find the full example here):def deploy_hello_flow(): deployment = Deployment.build_from_flow( flow=hello_flow, name="hello_flow_deployment", work_queue_name="default", storage=storage, path="hello_flow", tags=["staging"], infrastructure=infrastructure, schedule=CronSchedule(cron="0 12 1 * *", timezone="UTC"), ) deployment.apply()
The files to be committed at this point would be:
❯ git status On branch Mmoncadaisla/add-new-flow Untracked files: (use "git add <file>..." to include in what will be committed) dataflows/deployments/new_deployment.py dataflows/flows/new_flow.py tests/unit/flows/test_new_flow.py nothing added to commit but untracked files present (use "git add" to track)
-
If you have added a deployment definition file to the
deployments
module, add it to thedeployments._constants
module so that the CI can deploy your flow automatically upon flow/deployment file changes.from dataflows.deployments.hello_deployment import deploy_hello_flow from dataflows.deployments.my_new_deployment import deploy_new_flow FLOW_DEPLOYMENT_DICT = flow_deployment_dict = { "hello_flow": deploy_hello_flow, "new_flow": deploy_new_flow } DEPLOYMENT_FILE_FUNC_DICT = { "hello_deployment": deploy_hello_flow, "new_deployment": deploy_new_flow }
The files to be committed after modifying
_constants.py
:❯ git status On branch Mmoncadaisla/add-new-flow Changes not staged for commit: (use "git add <file>..." to update what will be committed) (use "git restore <file>..." to discard changes in working directory) modified: dataflows/deployments/_constants.py Untracked files: (use "git add <file>..." to include in what will be committed) dataflows/deployments/new_deployment.py dataflows/flows/new_flow.py tests/unit/flows/test_new_flow.py no changes added to commit (use "git add" and/or "git commit -a")
git add --all git commit -m "Add new flow" git push --set-upstream origin Mmoncadaisla/add-new-flow
After the changes are pushed, the CI will run the unit tests and the linter. If the tests pass, you can create a pull request against the
development
branch. -
Open a pull request against the
development
branch and add at least one reviewer. When the PR is created, the CI will run the unit tests and the linter. -
After tests have passed and the PR has been approved, merge the PR into the
development
branch and delete your branch. If a deployment file has been created, and the proper references were added to the_constants
module, it will automatically deploy your flow to Prefect Cloud!You can check this from the GitHub actions tab.
-
Congratulations! You have successfully contributed to the data pipelines! 🚀