forked from astronomer/astronomer-cosmos
Use `dbtRunner` in the DAG Processor when using `LoadMode.DBT_LS` if `dbt-core` is available (astronomer#1484)

This PR significantly improves Cosmos resource utilisation in the scheduler DAG Processor and in the worker nodes when dynamically converting dbt workflows into Airflow DAGs using `LoadMode.DBT_LS`. It adds support for using `dbtRunner` during DAG parsing if `dbt-core` and its adapters are installed in the same Python virtualenv as Airflow.

This change is particularly relevant given the way Airflow (2.x) parses DAGs not only as part of the scheduler loop but also whenever a task executes:

<img width="845" alt="Screenshot 2025-01-23 at 13 51 40" src="https://github.com/user-attachments/assets/90398307-a26c-4cbd-ae44-a6a5c1e0e98e" />

> Diagram extracted from the talk @pankajkoti and I gave at Airflow Summit 2024:
> https://airflowsummit.org/sessions/2024/overcoming-performance-hurdles-in-integrating-dbt-with-airflow/

When using `LoadMode.DBT_LS`, Cosmos runs `dbt ls` whenever Airflow parses the DAG and there is a cache miss. Suppose there is a Cosmos `DbtDag` with 200 concurrent tasks. If the dbt project changes, when Airflow and Cosmos attempt to parse the `DbtDag`, they will invalidate the `dbt ls` cache. If 200 Cosmos tasks execute concurrently when there is a cache miss, they will all run the same `dbt ls` command.

Until Cosmos 1.8, Cosmos would always create a subprocess for each command. If 200 tasks execute in a worker node, this would represent 400 processes (200 task processes plus 200 `dbt ls` subprocesses) attempting to run concurrently, leading to a large spike in CPU usage, and potentially memory usage, and to Out of Memory (OOM) errors.

While this change does not prevent the 200 tasks from attempting to run `dbt ls` concurrently, it avoids each of them creating an additional subprocess, optimising resource utilisation.

This change is heavily influenced by changes (astronomer#850) previously made by @jbandoro, who added support for Cosmos to use `dbtRunner`, instead of Python's subprocess, to execute dbt commands in the Airflow worker nodes when using `ExecutionMode.LOCAL`.
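The idea described above (run `dbt ls` in-process when `dbt-core` is importable, and otherwise fall back to a subprocess) can be sketched roughly as follows. This is a minimal illustration, not the code added by this PR: the helper names `is_dbt_runner_available`, `build_dbt_ls_args`, and `run_dbt_ls` are hypothetical, and the sketch assumes `dbt-core` 1.5 or later, which exposes `dbtRunner` in `dbt.cli.main`.

```python
import subprocess
from typing import List, Tuple


def is_dbt_runner_available() -> bool:
    """Return True if dbtRunner can be imported in the current virtualenv.

    Hypothetical helper: dbtRunner lives in dbt.cli.main from dbt-core 1.5.
    """
    try:
        from dbt.cli.main import dbtRunner  # noqa: F401
    except ImportError:
        return False
    return True


def build_dbt_ls_args(project_dir: str, profiles_dir: str) -> List[str]:
    """Build the argument list for `dbt ls`, without the leading `dbt`."""
    return [
        "ls",
        "--project-dir", project_dir,
        "--profiles-dir", profiles_dir,
        "--output", "json",
    ]


def run_dbt_ls(project_dir: str, profiles_dir: str) -> Tuple[str, List[str]]:
    """Run `dbt ls`, in-process when possible, and return (mode, output lines).

    Hypothetical helper, not Cosmos's actual API.
    """
    args = build_dbt_ls_args(project_dir, profiles_dir)

    if is_dbt_runner_available():
        # In-process invocation: no additional operating system process.
        from dbt.cli.main import dbtRunner

        result = dbtRunner().invoke(args)
        if not result.success:
            raise RuntimeError(f"dbt ls failed: {result.exception}")
        return "dbt_runner", list(result.result or [])

    # Fallback: spawn a `dbt` subprocess (one extra process per caller).
    completed = subprocess.run(
        ["dbt", *args], capture_output=True, text=True, check=True
    )
    return "subprocess", completed.stdout.splitlines()
```

With 200 concurrent tasks, the in-process branch keeps the count at roughly one process per task, whereas the fallback branch doubles it, which is the resource difference this PR targets.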
Closes: astronomer#865
Showing 3 changed files with 178 additions and 38 deletions.