Source freshness is useful for understanding if your data pipelines are in a healthy state and is a critical component of defining SLAs for your warehouse. Enabling freshness for sources also facilitates referencing the source freshness results in the selectors for a more efficient execution.
How to Remediate
Apply a source freshness block to the source definition. This can be implemented at either the source name or table name level.
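As a minimal sketch (the source, table, and column names here are placeholders), a freshness block in a sources .yml file could look like this, with the table-level block overriding the source-level default:
sources:
  - name: my_source                        # placeholder source name
    loaded_at_field: _loaded_at            # placeholder column dbt compares to the current timestamp
    freshness:                             # source-level block, inherited by the tables below
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: my_table
        freshness:                         # optional table-level override
          error_after: {count: 48, period: hour}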
fct_test_coverage
(source) contains metrics pertaining to project-wide test coverage.
Specifically, this model measures:
This package highlights areas of a dbt project that are misaligned with dbt Labs' best practices. Specifically, this package tests for:
In addition to tests, this package creates the model int_all_dag_relationships
which holds information about your DAG in a tabular format and can be queried using SQL in your Warehouse.
Currently, the following adapters are supported:
Check dbt Hub for the latest installation instructions, or read the docs for more information on installing packages.
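As a hedged sketch of the installation step (the version range below is a placeholder; take the exact pin from dbt Hub), the packages.yml entry typically looks like:
packages:
  - package: dbt-labs/dbt_project_evaluator
    version: [">=0.8.0", "<0.9.0"]   # placeholder range; check dbt Hub for the current release
Run dbt deps afterwards to install it.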
"},{"location":"#additional-setup-for-databrickssparkduckdbredshift","title":"Additional setup for Databricks/Spark/DuckDB/Redshift","text":"In your dbt_project.yml
, add the following config:
dispatch:\n - macro_namespace: dbt\n search_order: ['dbt_project_evaluator', 'dbt']\n
This is required because the project currently overrides a small number of dbt core macros in order to ensure the project can run across the listed adapters. The overridden macros are in the cross_db_shim directory.
"},{"location":"#how-it-works","title":"How It Works","text":"This package will:
Once you've installed the package, all you have to do is run dbt build --select package:dbt_project_evaluator.
Each test warning indicates the presence of a type of misalignment. To troubleshoot a misalignment:
on-run-end
hook to display the rules violations in the dbt logs (see displaying violations in the logs).
BigQuery's current support for recursive CTEs is limited, and Databricks SQL doesn't support recursive CTEs.
For those Data Warehouses, the model int_all_dag_relationships
needs to be created by looping CTEs instead. The number of loops is configured with max_depth_dag
and defaulted to 9. This means that dependencies between models of more than 9 levels of separation won't show in the model int_all_dag_relationships
but tests on the DAG will still be correct. With a number of loops higher than 9, BigQuery sometimes raises an error saying the query is too complex.
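If you need to change that limit on these adapters, a minimal sketch of the override in dbt_project.yml (shown here simply restating the default of 9) would be:
vars:
  dbt_project_evaluator:
    # number of CTE loops used to build int_all_dag_relationships on adapters without full recursive CTE support
    max_depth_dag: 9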
Once you have addressed all current misalignments in your project (either by fixing them or configuring exceptions), you can use this package as a CI check to ensure code changes don't introduce new misalignments. The setup will vary based on whether you are using dbt Cloud or dbt Core, but the general steps are as follows:
"},{"location":"ci-check/#1-override-test-severity-with-an-environment-variable","title":"1. Override test severity with an environment variable","text":"By default the tests in this package are configured with \"warn\" severity, we can override that for our CI jobs with an environment variable:
Create an environment variable to define the appropriate severity for each environment. In dbt Cloud, for example, we can easily create an environment variable DBT_PROJECT_EVALUATOR_SEVERITY
that is set to \"error\" for the Continuous Integration environment and \"warn\" for all other environments:
Note: It is also possible to use an environment variable for dbt Core, but the actual implementation will depend on how dbt is orchestrated.
Update your dbt_project.yml file to override the default severity for all tests in this package:
dbt_project.ymltests:\n dbt_project_evaluator:\n +severity: \"{{ env_var('DBT_PROJECT_EVALUATOR_SEVERITY', 'warn') }}\"\n
Note
You could follow a similar process to disable the models in this package for your production environment
dbt_project.ymlmodels:\n dbt_project_evaluator:\n +enabled: \"{{ env_var('DBT_PROJECT_EVALUATOR_ENABLED', 'true') | lower == 'true' | as_bool }}\"\n
Now, you can run this package as a step of your CI job/pipeline. In dbt Cloud, for example, you could update the commands of your CI job to:
dbt build --select state:modified+ --exclude package:dbt_project_evaluator\ndbt build --select package:dbt_project_evaluator\n
Or, if you've configured any exceptions, to:
dbt build --select state:modified+ --exclude package:dbt_project_evaluator\ndbt build --select package:dbt_project_evaluator dbt_project_evaluator_exceptions\n
Note
Ensure you have properly set up your dbt Cloud CI job using deferral and a webhook trigger by following this documentation.
"},{"location":"contributing/","title":"Contributing","text":"If you'd like to add models to flag new areas, please update this documentation and add an integration test (more details here)
"},{"location":"contributing/#running-docs-locally","title":"Running docs locally","text":"Docs are generated using Material for MkDocs. To test them locally, run the following commands (use a Python virtual environment):
pip install mkdocs-material\nmkdocs serve\n
Docs are then automatically pushed to the website as part of our CI/CD process. We use mike as part of the process to publish different versions of the docs.
"},{"location":"contributing/#recommended-vscode-extensions-to-help-with-writing-docs","title":"Recommended VSCode extensions to help with writing docs","text":"markdownlint
The config used in .vscode/settings.json
is the following:
\"markdownlint.config\": {\n \"ul-indent\": {\"indent\": 4},\n \"MD036\": false,\n \"MD046\": false,\n}\n
Markdown All in One
The model int_all_dag_relationships
(source), created with the package, lists all the dbt nodes (models, exposures, sources, metrics, seeds, snapshots) along with all their dependencies (including indirect ones) and the path between them.
Building additional models and snapshots on top of this model could allow:
"},{"location":"querying-the-dag/#creating-a-dashboard-that-provides-info-on-your-project","title":"Creating a dashboard that provides info on your project","text":"sql_complexity
from the table int_all_graph_resources
, based on the weights defined in the token_costs
variableref(int_all_dag_relationships)
with custom tests added for a specific use casefct_staging_dependent_on_staging
Modeling Source Fanout fct_source_fanout
Modeling Rejoining of Upstream Concepts fct_rejoining_of_upstream_concepts
Modeling Model Fanout fct_model_fanout
Modeling Downstream Models Dependent on Source fct_marts_or_intermediate_dependent_on_source
Modeling Direct Join to Source fct_direct_join_to_source
Modeling Duplicate Sources fct_duplicate_sources
Modeling Hard Coded References fct_hard_coded_references
Modeling Multiple Sources Joined fct_multiple_sources_joined
Modeling Root Models fct_root_models
Modeling Staging Models Dependent on Downstream Models fct_staging_dependent_on_marts_or_intermediate
Modeling Unused Sources fct_unused_sources
Modeling Models with Too Many Joins fct_too_many_joins
Testing Missing Primary Key Tests fct_missing_primary_key_tests
Testing Missing Source Freshness fct_sources_without_freshness
Testing Test Coverage fct_test_coverage
Documentation Undocumented Models fct_undocumented_models
Documentation Documentation Coverage fct_documentation_coverage
Documentation Undocumented Source Tables fct_undocumented_source_tables
Documentation Undocumented Sources fct_undocumented_sources
Structure Test Directories fct_test_directories
Structure Model Naming Conventions fct_model_naming_conventions
Structure Source Directories fct_source_directories
Structure Model Directories fct_model_directories
Performance Chained View Dependencies fct_chained_views_dependencies
Performance Exposure Parents Materializations fct_exposure_parents_materializations
Governance Public Models Without Contracts fct_public_models_without_contracts
Governance Exposures Dependent on Private Models fct_exposures_dependent_on_private_models
Governance Undocumented Public Models fct_undocumented_public_models
"},{"location":"customization/customization/","title":"Disabling checks from the package","text":"Note
This section describes how to completely deactivate tests from the package. If you are looking to deactivate models/sources from being tested, see excluding packages and paths.
All the tests done as part of the package are tied to fct
models.
If there is a particular test or set of tests that you do not want this package to execute, you can disable the corresponding fct
models as you would any other model in your dbt_project.yml
file
models:\n dbt_project_evaluator:\n marts:\n tests:\n # disable entire test coverage suite\n +enabled: false\n dag:\n # disable single DAG model\n fct_model_fanout:\n +enabled: false\n
"},{"location":"customization/exceptions/","title":"Configuring exceptions to the rules","text":"While the rules defined in this package are considered best practices, we realize that there might be exceptions to those rules and people might want to exclude given results to get passing tests despite not following all the recommendations.
An example would be excluding all models with names matching stg_..._unioned
from fct_multiple_sources_joined
as we might want to union 2 different tables representing the same data in some of our staging models and we don't want the test to fail for those models.
The package offers the ability to define a seed called dbt_project_evaluator_exceptions.csv
to list those exceptions we don't want to be reported. This seed must contain the following columns:
fct_name
: the name of the fact table for which we want to define exceptions (Please note that it is not possible to exclude specific models for all the coverage
tests, but there are variables available to configure those to the particular users' needs)
column_name
: the column name from fct_name
we will be looking at to define exceptions
id_to_exclude
: the values (or like
pattern) we want to exclude for column_name
comment
: a field where people can document why a given exception is legitimate
The following section describes the steps to follow to configure exceptions.
"},{"location":"customization/exceptions/#1-create-a-new-seed","title":"1. Create a new seed","text":"With our previous example, the seed dbt_project_evaluator_exceptions.csv
would look like:
fct_name,column_name,id_to_exclude,comment\nfct_multiple_sources_joined,child,stg_%_unioned,Models called _unioned can union multiple sources\n
which looks like the following when loaded in the warehouse
fct_name column_name id_to_exclude comment
fct_multiple_sources_joined child stg_%_unioned Models called _unioned can union multiple sources
"},{"location":"customization/exceptions/#2-deactivate-the-seed-from-the-original-package","title":"2. Deactivate the seed from the original package","text":"Only a single seed can exist with a given name. When using a custom one, we need to deactivate the blank one from the package by adding the following to our dbt_project.yml
seeds:\n dbt_project_evaluator:\n dbt_project_evaluator_exceptions:\n +enabled: false\n
"},{"location":"customization/exceptions/#3-run-the-seed-and-the-package","title":"3. Run the seed and the package","text":"We then run both the seed and the package by executing the following command:
dbt build --select package:dbt_project_evaluator dbt_project_evaluator_exceptions\n
"},{"location":"customization/excluding-packages-and-paths/","title":"Excluding packages or sources/models based on their path","text":"Note
This section describes how to entirely exclude models/sources and packages from being evaluated. If you want to document exceptions to the rules, see the section on exceptions, and if you want to deactivate entire tests you can follow the instructions on this page.
There might be cases where you want to exclude models/sources from being tested:
In that case, this package provides the ability to exclude whole packages and/or models and sources based on their path.
"},{"location":"customization/excluding-packages-and-paths/#configuration","title":"Configuration","text":"The variables exclude_packages
and exclude_paths_from_project
allow you to define a list of regex patterns to exclude from being reported as errors.
exclude_packages
accepts a list of package names to exclude from the tool. To exclude all packages except the current project, you can set it to [\"all\"]
exclude_paths_from_project
accepts a list of regular expressions of paths to exclude for the current project
<path/to/model.sql>
, allowing you to exclude packages, but also whole folders or individual models
<path/to/sources.yml>:<source_name>.<source_table_name>
(the pattern is different than for models because the path itself doesn't let us exclude individual sources)
Note
We currently don't allow excluding metrics and exposures; if those need to be entirely excluded, they can be deactivated from the project directly.
If you have a specific use case requiring this ability, please raise a GitHub issue to explain the situation you'd like to solve and we can revisit this decision!
"},{"location":"customization/excluding-packages-and-paths/#example-to-exclude-a-whole-package","title":"Example to exclude a whole package","text":"dbt_project.ymlvars:\n exclude_packages: [\"upstream_package\"]\n
"},{"location":"customization/excluding-packages-and-paths/#example-to-exclude-modelssources-in-a-given-path","title":"Example to exclude models/sources in a given path","text":"dbt_project.ymlvars:\n exclude_paths_from_project: [\"/models/legacy/\"]\n
"},{"location":"customization/excluding-packages-and-paths/#example-to-exclude-both-a-package-and-modelssources-in-2-different-paths","title":"Example to exclude both a package and models/sources in 2 different paths","text":"dbt_project.ymlvars:\n exclude_packages: [\"upstream_package\"]\n exclude_paths_from_project: [\"/models/legacy/\", \"/my_date_spine.sql\"]\n
"},{"location":"customization/excluding-packages-and-paths/#tips-and-tricks","title":"Tips and tricks","text":"Regular expressions are very powerful but can become complex. After defining your value for exclude_paths_from_project
, we recommend running the package and inspecting the model int_all_graph_resources
, checking if the value in the column is_excluded
matches your expectation.
A useful tool to debug regular expressions is regex101. You can provide a pattern and a list of strings to see which ones actually match the pattern.
"},{"location":"customization/issues-in-log/","title":"Displaying violations in the logs","text":"This package provides a macro that can be executed via an on-run-end
hook to display the package results in the logs in addition to storing those in the Data Warehouse.
To use it, you can add the following line in your dbt_project.yml
:
on-run-end: \"{{ dbt_project_evaluator.print_dbt_project_evaluator_issues() }}\"\n
The macro accepts two parameters:
format='table' (default) or format='csv'
quote='`' or quote='"'
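Putting the two documented parameters together, a sketch of the hook with non-default values could look like:
on-run-end: "{{ dbt_project_evaluator.print_dbt_project_evaluator_issues(format='csv', quote='`') }}"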
You can also log the results of your custom rules by applying dbt_project_evaluator.is_empty
to the custom models.
models:\n - name: my_custom_rule_model\n description: This is my custom project evaluator check \n tests:\n - dbt_project_evaluator.is_empty\n
"},{"location":"customization/overriding-variables/","title":"Overriding Variables","text":"Currently, this package uses different variables to adapt the models to your objectives and naming conventions. They can all be updated directly in dbt_project.yml
test_coverage_target: the minimum acceptable test coverage percentage (default: 100%)
documentation_coverage_target: the minimum acceptable documentation coverage percentage (default: 100%)
primary_key_test_macros: the set(s) of dbt tests used to check validity of a primary key (default: [[\"dbt.test_unique\", \"dbt.test_not_null\"], [\"dbt_utils.test_unique_combination_of_columns\"]])
enforced_primary_key_node_types: the set of node types for which you would like to enforce primary key test coverage. Valid options to include are model, source, snapshot, seed (default: [\"model\"])
Usage notes for primary_key_test_macros:
The primary_key_test_macros
variable determines how the fct_missing_primary_key_tests
(source) model evaluates whether the models in your project are properly tested for their grain. This variable is a list and each entry must be a list of test names in project_name.test_macro_name
format.
For each entry in the parent list, the logic in int_model_test_summary
will evaluate whether each model has all of the tests in that entry applied. If a model meets the criteria of any of the entries in the parent list, it will be considered a pass. The default behavior for this package will check for whether each model has either:
not_null
and unique
tests applied to a single column ORdbt_utils.unique_combination_of_columns
applied to the model.Each set of test(s) that define a primary key requirement must be grouped together in a sub-list to ensure they are evaluated together (e.g. [dbt.test_unique
, dbt.test_not_null
] ).
While it's not explicitly tested in this package, we strongly encourage adding a not_null
test on each of the columns listed in the dbt_utils.unique_combination_of_columns
tests. Alternatively, on Snowflake, consider dbt_constraints.test_primary_key
in the dbt Constraints package, which enforces each field in the primary key is non null.
# set your test and doc coverage to 75% instead\n# use the dbt_constraints.test_primary_key test to check for validity of your primary keys\n\nvars:\n dbt_project_evaluator:\n documentation_coverage_target: 75\n test_coverage_target: 75\n primary_key_test_macros: [[\"dbt_constraints.test_primary_key\"]]\n
"},{"location":"customization/overriding-variables/#dag-variables","title":"DAG Variables","text":"variable description default models_fanout_threshold
threshold for unacceptable model fanout for fct_model_fanout
3 models too_many_joins_threshold
threshold for the number of references to flag in fct_too_many_joins
7 references dbt_project.yml# set your model fanout threshold to 10 instead of 3 and too many joins from 6 instead of 7\n\nvars:\n dbt_project_evaluator:\n models_fanout_threshold: 10\n too_many_joins_threshold: 6\n
"},{"location":"customization/overriding-variables/#naming-convention-variables","title":"Naming Convention Variables","text":"variable description default model_types
: a list of the different types of models that define the layers of your dbt project (default: staging, intermediate, marts, other)
staging_folder_name: the name of the folder that contains your staging models (default: staging)
intermediate_folder_name: the name of the folder that contains your intermediate models (default: intermediate)
marts_folder_name: the name of the folder that contains your marts models (default: marts)
staging_prefixes: the list of acceptable prefixes for your staging models (default: stg_)
intermediate_prefixes: the list of acceptable prefixes for your intermediate models (default: int_)
marts_prefixes: the list of acceptable prefixes for your marts models (default: fct_, dim_)
other_prefixes: the list of acceptable prefixes for your other models (default: rpt_)
The model_types
, <model_type>_folder_name
, and <model_type>_prefixes
variables allow the package to check if models in the different layers are in the correct folders and have a correct prefix in their name. The default model types are the ones we recommend in our dbt Labs Style Guide.
If your model types are different, you can update the model_types
variable and create new variables for <model_type>_folder_name
and/or <model_type>_prefixes
.
# add an additional model type \"util\"\n\nvars:\n dbt_project_evaluator:\n model_types: ['staging', 'intermediate', 'marts', 'other', 'util']\n util_folder_name: 'util'\n util_prefixes: ['util_']\n
"},{"location":"customization/overriding-variables/#performance-variables","title":"Performance Variables","text":"variable description default chained_views_threshold
threshold for unacceptable length of chain of views for fct_chained_views_dependencies
4 dbt_project.ymlvars:\n dbt_project_evaluator:\n # set your chained views threshold to 8 instead of 4\n chained_views_threshold: 8\n
"},{"location":"customization/overriding-variables/#sql-code-analysis","title":"SQL code analysis","text":"variable description default comment_chars
a list of strings used for inline comments [\"--\"]
token_costs
a dictionary of SQL tokens (words) and their associated complexity weights, used to estimate model complexity; see the dbt_project.yml
file of the package"},{"location":"customization/overriding-variables/#execution","title":"Execution","text":"variable description default max_depth_dag
limits the maximum distance between nodes calculated in int_all_dag_relationships
9 for BigQuery and Spark, -1 for other adapters insert_batch_size
number of records inserted per batch when unpacking the graph into models 10000 Note on max_depth_dag
The default behavior for limiting the relationships calculated in the int_all_dag_relationships
model differs depending on your adapter.
int_all_dag_relationships
, is set by the max_depth_dag
variable, which is defaulted to 9. So by default, int_all_dag_relationships
contains a row for every path less than or equal to 9 nodes in length between two nodes in your DAG. This is because these adapters do not currently support recursive SQL, and queries often fail on more than 9 recursive joins.int_all_dag_relationships
by default contains a row for every single path between two nodes in your DAG. If you experience long runtimes for the int_all_dag_relationships
model, you may consider limiting the length of your generated DAG paths. To do this, set max_depth_dag: {{ whatever limit you want to enforce }}
. The value of max_depth_dag
must be greater than 2 for all DAG tests to work, and greater than chained_views_threshold
to ensure your performance tests to work. By default, the value of this variable for these adapters is -1, which the package interprets as \"no limit\".vars:\n dbt_project_evaluator:\n # update the number of records inserted from the graph from 10,000 to 500 to reduce query size\n insert_batch_size: 500\n # set the maximum distance between nodes to 5 \n max_depth_dag: 5\n
"},{"location":"customization/querying-columns-names-and-descriptions/","title":"Querying columns names and descriptions with SQL","text":"The model stg_columns
(source), created with the package, lists all the columns configured in all the dbt nodes (models, sources, tests, snapshots).
It will not list the columns of the models that have not explicitly been added to the YAML files.
You can use this model to help with questions such as:
You can create a custom test against {{ ref(stg_columns) }}
to test for your specific check! When running the package you'd need to make sure to also include children of the package's models by using the package:dbt_project_evalutator+
selector.
fct_documentation_coverage
with\n\nmodels as (\n select * from {{ ref('int_all_graph_resources') }}\n where resource_type = 'model'\n and not is_excluded\n),\n\nconversion as (\n select\n resource_id,\n case when is_described then 1 else 0 end as is_described_model,\n {% for model_type in var('model_types') %}\n case when model_type = '{{ model_type }}' then 1.0 else NULL end as is_{{ model_type }}_model,\n case when is_described and model_type = '{{ model_type }}' then 1.0 else 0 end as is_described_{{ model_type }}_model{% if not loop.last %},{% endif %}\n {% endfor %}\n\n from models\n),\n\nfinal as (\n select\n {{ dbt.current_timestamp() if target.type != 'trino' else 'current_timestamp(6)' }} as measured_at,\n cast(count(*) as {{ dbt.type_int() }}) as total_models,\n cast(sum(is_described_model) as {{ dbt.type_int() }}) as documented_models,\n round(sum(is_described_model) * 100.00 / count(*), 2) as documentation_coverage_pct,\n {% for model_type in var('model_types') %}\n round(\n {{ dbt_utils.safe_divide(\n numerator = \"sum(is_described_\" ~ model_type ~ \"_model) * 100\", \n denominator = \"count(is_\" ~ model_type ~ \"_model)\"\n ) }}\n , 2) as {{ model_type }}_documentation_coverage_pct{% if not loop.last %},{% endif %}\n {% endfor %}\n\n from models\n left join conversion\n on models.resource_id = conversion.resource_id\n)\n\nselect * from final\n
fct_documentation_coverage
(source) calculates the percent of enabled models in the project that have a configured description.
This model will raise a warn
error on a dbt build
or dbt test
if the documentation_coverage_pct
is less than 100%. You can set your own threshold by overriding the documentation_coverage_target
variable. See overriding variables section.
Reason to Flag
Good documentation for your dbt models will help downstream consumers discover and understand the datasets which you curate for them. The documentation for your project includes model code, a DAG of your project, any tests you've added to a column, and more.
How to Remediate
Apply a text description in the model's .yml
entry, or create a docs block in a markdown file, and use the {{ doc() }}
function in the model's .yml
entry.
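As a small sketch (the model and doc-block names are placeholders), the .yml entry could reference a docs block like this:
models:
  - name: my_model                                    # placeholder model name
    description: "{{ doc('my_model_description') }}"  # doc block defined in a markdown file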
Tip
We recommend that every model in your dbt project has at minimum a model-level description. This ensures that each model's purpose is clear to other developers and stakeholders when viewing the dbt docs site.
"},{"location":"rules/documentation/#undocumented-models","title":"Undocumented Models","text":"fct_undocumented_models
(source) lists every model with no description configured.
Reason to Flag
Good documentation for your dbt models will help downstream consumers discover and understand the datasets which you curate for them. The documentation for your project includes model code, a DAG of your project, any tests you've added to a column, and more.
How to Remediate
Apply a text description in the model's .yml
entry, or create a docs block in a markdown file, and use the {{ doc() }}
function in the model's .yml
entry.
Tip
We recommend that every model in your dbt project has at minimum a model-level description. This ensures that each model's purpose is clear to other developers and stakeholders when viewing the dbt docs site. Missing documentation should be addressed first for marts models, then for the rest of your project, to ensure that stakeholders in the organization can understand the data which is surfaced to them.
"},{"location":"rules/documentation/#undocumented-source-tables","title":"Undocumented Source Tables","text":"fct_undocumented_source_tables
(source) lists every source table with no description configured.
Reason to Flag
Good documentation for your dbt sources will help contributors to your project understand how and when data is loaded into your warehouse.
How to Remediate
Apply a text description in the table's .yml
entry, or create a docs block in a markdown file, and use the {{ doc() }}
function in the table's .yml
entry.
sources:\n - name: my_source\n tables:\n - name: my_table\n description: This is the source table description\n
"},{"location":"rules/documentation/#undocumented-sources","title":"Undocumented Sources","text":"fct_undocumented_sources
(source) lists every source with no description configured.
Reason to Flag
Good documentation for your dbt sources will help contributors to your project understand how and when data is loaded into your warehouse.
How to Remediate
Apply a text description in the source's .yml
entry, or create a docs block in a markdown file, and use the {{ doc() }}
function in the source's .yml
entry.
sources:\n - name: my_source\n description: This is the source description\n tables:\n - name: my_table\n
"},{"location":"rules/governance/","title":"Governance","text":"This set of rules provides checks on your project against dbt Labs' recommended best proactices for adding model governance features in dbt versions 1.5 and above.
"},{"location":"rules/governance/#public-models-without-contracts","title":"Public models without contracts","text":"fct_public_models_without_contract
(source) shows each model with access
configured as public, but is not a contracted model.
Example
report_1
is defined as a public model, but does not have the contract
configuration to enforce its datatypes.
# public model without a contract\nmodels:\n - name: report_1\n description: very important OKR reporting model\n access: public\n
Reason to Flag
Models with public access are free to be consumed by any downstream consumer. This implies a need for better guarantees around the model's data types and columns. Adding a contract to the model will ensure that the model always conforms to the datatypes, columns, and other constraints you expect.
How to Remediate
Edit the yml to include the contract configuration, as well as a column entry for all columns output by the model, including their datatype. While not strictly required for defining a contract, it's best practice to also document each column.
models:\n - name: report_1\n description: very important OKR reporting model\n access: public\n config:\n contract:\n enforced: true\n columns:\n - name: id \n data_type: integer\n
"},{"location":"rules/governance/#undocumented-public-models","title":"Undocumented public models","text":"fct_undocumented_public_models
(source) shows each model with access
configured as public that is not fully documented. This check is similar to fct_undocumented_models
(source), but is a stricter check that will highlight any public model that does not have a model-level description as well as descriptions on each of its columns.
Example
report_1
is defined as a public model, but does not have descriptions on the model and each column.
# public model without documentation\nmodels:\n - name: report_1\n access: public\n columns:\n - name: id\n
Reason to Flag
Models with public access are free to be consumed by any downstream consumer. This implies a need for higher standards for the model's usability for those consumers. Adding more documentation can help consumers understand how they should leverage the data from your public model.
How to Remediate
Edit the yml to include a model-level description, as well as a column entry with a description for all columns output by the model. While not strictly required for public models, these should likely also have contracts added (see the rule above).
models:\n - name: report_1\n description: very important OKR reporting model\n access: public\n columns:\n - name: id \n description: the primary key of my OKR model\n
"},{"location":"rules/governance/#exposures-dependent-on-private-models","title":"Exposures dependent on private models","text":"fct_exposures_dependent_on_private_models
(source) shows each relationship between a resource and an exposure where the parent resource is not a model with access
configured as public.
Example
Here's a sample DAG that shows direct exposure relationships.
If this were the yml for these two parent models, dim_model_7
would be flagged by this check, as it is not a public model.
models:\n - name: fct_model_6\n description: very important OKR reporting model\n access: public\n config:\n materialized: table\n contract:\n enforced: true\n columns:\n - name: id \n description: the primary key of my OKR model\n data_type: integer\n - name: dim_model_7\n description: excellent model\n access: private\n
Reason to Flag
Exposures show how and where your data is being consumed in downstream tools. These tools should read from public, trusted, contracted data sources. All models that are exposed to other tools should have that codified in their access
configuration.
How to Remediate
Edit the yml to fully expose the models that your exposures depend on. This rule will only flag models that are not public
, but best practices suggest you should also fully document and add contracts to these public models.
models:\n - name: fct_model_6\n description: very important OKR reporting model\n access: public\n config:\n materialized: table\n contract:\n enforced: true\n columns:\n - name: id \n description: the primary key of my OKR model\n data_type: integer\n - name: dim_model_7\n description: excellent model\n access: public\n
"},{"location":"rules/modeling/","title":"Modeling","text":""},{"location":"rules/modeling/#direct-join-to-source","title":"Direct Join to Source","text":"fct_direct_join_to_source
(source) shows each parent/child relationship where a model has a reference to both a model and a source.
Example
int_model_4
is pulling in both a model and a source.
Reason to Flag
We highly recommend having a one-to-one relationship between sources and their corresponding staging
model, and not having any other model reading from the source. Those staging
models are then the ones read from by the other downstream models.
This allows renaming your columns and doing minor transformations on your source data only once, keeping things consistent across all the models that will consume the source data.
How to Remediate
In our example, we would want to:
staging
model for our source data if it doesn't exist alreadystaging
model to other ones to create our downstream transformation instead of using the sourceAfter refactoring your downstream model to select from the staging layer, your DAG should look like this:
"},{"location":"rules/modeling/#downstream-models-dependent-on-source","title":"Downstream Models Dependent on Source","text":"fct_marts_or_intermediate_dependent_on_source
(source) shows each downstream model (marts
or intermediate
) that depends directly on a source node.
Example
fct_model_9
, a marts model, builds from source_1.table_5
a source.
Reason to Flag
We very strongly believe that a staging model is the atomic unit of data modeling. Each staging model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format. With that in mind, if a marts
or intermediate
type model joins directly to a {{ source() }}
node, there likely is a missing model that needs to be added.
How to Remediate
Add the reference to the appropriate staging
model to maintain an abstraction layer between your raw data and your downstream data artifacts.
After refactoring your downstream model to select from the staging layer, your DAG should look like this:
"},{"location":"rules/modeling/#duplicate-sources","title":"Duplicate Sources","text":"fct_duplicate_sources
(source) shows each database object that corresponds to more than one source node.
Example
Imagine you have two separate source nodes - source_1.table_5
and source_1.raw_table_5
.
But both source definitions point to the exact same location in your database - real_database
.real_schema
.table_5
.
sources:\n - name: source_1\n schema: real_schema\n database: real_database\n tables:\n - name: table_5\n - name: raw_table_5\n identifier: table_5\n
Reason to Flag
If your dbt project has multiple source nodes pointing to the exact same location in your data warehouse, you will have an inaccurate view of your lineage.
How to Remediate
Combine the duplicate source nodes so that each source database location only has a single source definition in your dbt project.
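Continuing the example above, one way the combined definition could look, keeping a single entry for the shared database location:
sources:
  - name: source_1
    schema: real_schema
    database: real_database
    tables:
      - name: table_5   # single definition for real_database.real_schema.table_5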
"},{"location":"rules/modeling/#hard-coded-references","title":"Hard Coded References","text":"fct_hard_coded_references
(source) shows each instance where a model contains hard coded reference(s).
Example
fct_orders
uses hard coded direct relation references (my_db.my_schema.orders
and my_schema.customers
).
with orders as (\n select * from my_db.my_schema.orders\n),\ncustomers as (\n select * from my_schema.customers\n)\nselect\n orders.order_id,\n customers.name\nfrom orders\nleft join customers on\n orders.customer_id = customers.id\n
Reason to Flag
Always use the ref
function when selecting from another model and the source
function when selecting from raw data, rather than using the direct relation reference (e.g. my_schema.my_table
). Direct relation references are determined via regex mapping here.
The ref
and source
functions are part of what makes dbt so powerful! Using these functions allows dbt to infer dependencies (and check that you haven't created any circular dependencies), properly generate your DAG, and ensure that models are built in the correct order. This also ensures that your current model selects from upstream tables and views in the same environment that you're working in.
How to Remediate
For each hard coded reference:
For the above example, our updated fct_orders.sql
file would look like:
with orders as (\n select * from {{ ref('orders') }}\n),\ncustomers as (\n select * from {{ ref('customers') }}\n)\nselect\n orders.order_id,\n customers.name\nfrom orders\nleft join customers on\n orders.customer_id = customers.id\n
"},{"location":"rules/modeling/#model-fanout","title":"Model Fanout","text":"fct_model_fanout
(source) shows all parents with more than 3 direct leaf children. You can set your own threshold for model fanout by overriding the models_fanout_threshold
variable. See overriding variables section.
Example
fct_model
has three direct leaf children.
Reason to Flag
This might indicate that some transformations should move to the BI layer, or that common business transformations should be moved upstream.
Exceptions
Some BI tools are better than others at joining and data exploration. For example, with Looker you could end your DAG after marts (i.e. fcts & dims) and join those artifacts together (with a little know how and setup time) to make your reports. For others, like Tableau, model fanouts might be more beneficial, as this tool prefers big tables over joins, so predefining some reports is usually more performant.
To exclude specific cases, check out the instructions in Configuring exceptions to the rules.
How to Remediate
Queries and transformations can move around between dbt and the BI tool, so how do we try to stay intentional about what we decide to put where?
You can think of dbt as our assembly line which produces expected outputs every time.
You can think of the BI layer as the place where we take the items produced from our assembly line to customize them in order to meet our stakeholder's needs.
Your dbt project needs a defined end point! Until the metrics server comes to fruition, you cannot possibly predefine every query or quandary your team might have. So decide as a team where that line is and maintain it.
"},{"location":"rules/modeling/#multiple-sources-joined","title":"Multiple Sources Joined","text":"fct_multiple_sources_joined
(source) shows each instance where a model references more than one source.
Example
model_1
references two source tables.
Reason to Flag
We very strongly believe that a staging model is the atomic unit of data modeling. Each staging model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format. With that in mind, two {{ source() }}
declarations in one staging model likely means we are not being composable enough and there are individual building blocks which could be broken out into their respective models.
Exceptions
Sometimes companies have a bunch of identical sources across systems. When these identical sources will only ever be used collectively, you should union them once and create a staging layer on the combined result.
To exclude specific cases, check out the instructions in Configuring exceptions to the rules.
How to Remediate
In this example specifically, those raw sources, source_1.table_1
and source_1.table_2
should each have their own staging model (stg_model_1
and stg_model_2
), as transitional steps, which will then be combined into a new int_model_2
. Alternatively, you could keep stg_model_2
and add base__
models as transitional steps.
To fix this, try out the codegen package! With this package you can dynamically generate the SQL for a staging (what they call base) model, which you will use to populate stg_model_1
and stg_model_2
directly from the source data. Create a new model int_model_2
. Afterwards, within int_model_2
, update your {{ source() }}
macros to {{ ref() }}
macros and point them to your newly built staging models. If you had type casting, field aliasing, or other simple improvements made in your original stg_model_2
SQL, then attempt to move that logic back to the new staging models instead. This will help colocate those transformations and avoid duplicate code, so that all downstream models can leverage the same set of transformations.
Post-refactor, your DAG should look like this:
or if you want to use base_ models and keep stg_model_2 as is:
"},{"location":"rules/modeling/#rejoining-of-upstream-concepts","title":"Rejoining of Upstream Concepts","text":"fct_rejoining_of_upstream_concepts
(source) contains all cases where one of the parent's direct children is ALSO the direct child of ANOTHER one of the parent's direct children. Only includes cases where the model \"in between\" the parent and child has NO other downstream dependencies.
Example
stg_model_1
, int_model_4
, and int_model_5
create a \"loop\" in the DAG. int_model_4
has no other downstream dependencies other than int_model_5
.
Reason to Flag
This could happen for a variety of reasons: Accidentally duplicating some business concepts in multiple data flows, hesitance to touch (and break) someone else\u2019s model, or perhaps trying to snowflake out or modularize everything without awareness of what will help build time.
As a general rule, snowflaking out models in a thoughtful manner allows for concurrency, but in this example nothing downstream can run until int_model_4
finishes, so it is not saving any time in parallel processing by being its own model. Since both int_model_4
and int_model_5
depend solely on stg_model_1
, there is likely a better way to write the SQL within one model (int_model_5
) and simplify the DAG, potentially at the expense of more rows of SQL within the model.
Exceptions
The one major exception to this would be when using a function from dbt_utils package, such as star
or get_column_values
, (or similar functions / packages) that require a relation as an argument input. If the shape of the data in the output of stg_model_1
is not the same as what you need for the input to the function within int_model_5
, then you will indeed need int_model_4
to create that relation, in which case, leave it.
To exclude specific cases, check out the instructions in Configuring exceptions to the rules.
How to Remediate
Barring jinja/macro/relation exceptions we mention directly above, to resolve this, simply bring the SQL contents from int_model_4
into a CTE within int_model_5
, and swap all {{ ref('int_model_4') }}
references to the new CTE(s).
Post-refactor, your DAG should look like this:
"},{"location":"rules/modeling/#root-models","title":"Root Models","text":"fct_root_models
(source) shows each model with 0 direct parents, meaning that the model cannot be traced back to a declared source or model in the dbt project.
Example
model_4
has no direct parents
Reason to Flag
This likely means that the model (model_4
below) contains raw table references, either to a raw data source, or another model in the project without using the {{ source() }}
or {{ ref() }}
functions, respectively. This means that dbt is unable to interpret the correct lineage of this model, and could result in mis-timed execution and/or circular references depending on the model\u2019s upstream dependencies.
Exceptions
This behavior may be observed in the case of a manually defined reference table that does not have any dependencies. A good example of this is a dim_calendar
table that is generated by the {{ dbt_utils.date_spine() }}
macro \u2014 this SQL logic is completely self contained, and does not require any external data sources to execute.
To exclude specific cases, check out the instructions in Configuring exceptions to the rules.
How to Remediate
Start by mapping any table references in the FROM
clause of the model definition to the models or raw tables that they draw from, and replace those references with the {{ ref() }}
if the dependency is another dbt model, or the {{ source() }}
function if the table is a raw data source (this may require the declaration of a new source table). Then, visualize this model in the DAG, and refactor as appropriate according to best practices.
fct_source_fanout
(source) shows each instance where a source is the direct parent of multiple resources in the DAG.
Example
source.table_1
has more than one direct child model.
Reason to Flag
Each source node should be referenced by a single model that performs basic operations, such as renaming, recasting, and other light transformations to maintain consistency throughout the project. The role of this staging model is to mirror the raw data but align it with project conventions. The staging model should act as a source of truth and a buffer: any model which depends on the data from a given source should reference the cleaned data in the staging model as opposed to referencing the source directly. This approach keeps the code DRY (any light transformations that need to be done on the raw data are performed only once). Minimizing references to the raw data will also make it easier to update the project should the format of the raw data change.
Exceptions
NoSQL databases or heavily nested data sources often have so much info json packed into a table that you need to break one raw data source into multiple base models.
To exclude specific cases, check out the instructions in Configuring exceptions to the rules.
How to Remediate
Create a staging model which references the source and cleans the raw data (e.g. renaming, recasting). Any models referencing the source directly should be refactored to point towards the staging model instead.
After refactoring the above example, the DAG would look something like this:
"},{"location":"rules/modeling/#staging-models-dependent-on-downstream-models","title":"Staging Models Dependent on Downstream Models","text":"fct_staging_dependent_on_marts_or_intermediate
(source) shows each staging model that depends on an intermediate or marts model, as defined by the naming conventions and folder paths specified in your project variables.
Example
stg_model_5
, a staging model, builds from fct_model_9
a marts model.
Reason to Flag
This likely represents a misnamed file. According to dbt best practices, staging models should only select from source nodes. Dependence on downstream models indicates that this model may need to be either renamed, or reconfigured to only select from source nodes.
How to Remediate
Rename the file in the child
column to use the appropriate prefix, or change the model's lineage by pointing the staging model to the appropriate {{ source() }}
.
After updating the model to use the appropriate {{ source() }}
function, your graph should look like this:
fct_staging_dependent_on_staging
(source) shows each parent/child relationship where models in the staging layer are dependent on each other.
Example
stg_model_2
is a parent of stg_model_4
.
Reason to Flag
This may indicate a change in naming is necessary, or that the child model should instead reference a source.
How to Remediate
You should either change the model type of the child
(maybe to an intermediate or marts model) or change the child's lineage to instead reference the appropriate {{ source() }}
.
In our example, we might realize that stg_model_4
is actually an intermediate model. We should move this file to the appropriate intermediate directory and update the file name to int_model_4
.
fct_unused_sources
(source) shows each source with 0 children.
Example
source.table_4
isn't being referenced.
Reason to Flag
This represents either a source that you have defined in YML but never brought into a model or a model that was deprecated and the corresponding rows in the source block of the YML file were not deleted at the same time. This simply represents the buildup of cruft in the project that doesn't need to be there.
How to Remediate
Navigate to the sources.yml
file (or whatever your company has called the file) that corresponds to the unused source. Within the YML file, remove the unused table name, along with descriptions or any other nested information.
sources:\n - name: some_source\n database: raw\n tables:\n - name: table_1\n - name: table_2\n - name: table_3\n - name: table_4 # <-- remove this line\n
"},{"location":"rules/modeling/#models-with-too-many-joins","title":"Models with Too Many Joins","text":"fct_too_many_joins
(source) shows models with a reference to too many other models or sources.
The number of different references to start raising errors is set to 7 by default, but you can set your own threshold by overriding the too_many_joins_threshold
variable. See overriding variables section.
Example
fct_model_1
directly references seven (7) staging models upstream.
Reason to Flag
This likely represents a model in which too much is being done. Having a model that references too many upstream models introduces a lot of code complexity, which can be challenging to understand and maintain.
How to Remediate
Bring together a reasonable number (typically 4 to 6) of entities or concepts (staging models, or perhaps other intermediate models) that will be joined with another similarly purposed intermediate model to generate a mart. Rather than having too many joins, we can join two intermediate models that each house a piece of the complexity, giving us increased readability, flexibility, testing surface area, and insight into our components.
"},{"location":"rules/performance/","title":"Performance","text":""},{"location":"rules/performance/#chained-view-dependencies","title":"Chained View Dependencies","text":"fct_chained_views_dependencies
(source) contains models that are dependent on chains of \"non-physically-materialized\" models (views and ephemerals), highlighting potential cases for improving performance by switching the materialization of model(s) within the chain to table or incremental.
This model will raise a warn
error on a dbt build
or dbt test
if the distance
between a given parent
and child
is greater than or equal to 4. You can set your own threshold for chained views by overriding the chained_views_threshold
variable. See overriding variables section.
Example
table_1
depends on a chain of 4 views (view_1
, view_2
, view_3
, and view_4
).
Reason to Flag
You may experience a long runtime for a model when it is built on top of a long chain of \"non-physically-materialized\" models (views and ephemerals). In the example above, nothing is really computed until you get to table_1
. At which point, it is going to run the query within view_4
, which will then have to run the query within view_3
, which will then have the run the query within view_2
, which will then have to run the query within view_1
. These will all be running at the same time, which creates a long runtime for table_1
.
How to Remediate
We can reduce this compilation time by changing the materialization strategy of some key upstream models to table or incremental to keep a minimum amount of compute in memory and prevent nesting of views. If, for example, we change the materialization of view_4
from a view to a table, table_1
will have a shorter runtime as it will have less compilation to do.
The best practice to determine top candidates for changing materialization from view
to table
:
fct_exposure_parents_materializations
(source) highlights instances where the resources referenced by exposures are either:
source
model
that does not use the table
or incremental
materializationExample
In this case, the parents of exposure_1
are not both materialized as tables -- dim_model_7
is ephemeral, while fct_model_6
is a table. This model would return a record for the dim_model_7 --> exposure_1
relationship.
Reason to Flag
Exposures should depend on the business logic you encoded into your dbt project (e.g. models or metrics) rather than raw untransformed sources. Additionally, models that are referenced by an exposure are likely to be used heavily in downstream systems, and therefore need to be performant when queried.
How to Remediate
If you have a source parent of an exposure, you should incorporate that raw data into your project in some way, then update the exposure to point to that model.
If necessary, update the materialized
configuration on the models returned in fct_exposure_parents_materializations
to either table
or incremental
. This can be done in individual model files using a config block, or for groups of models in your dbt_project.yml
file. See the docs on model configurations for more info!
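A minimal sketch of the dbt_project.yml approach (the project and folder names here are placeholders) might be:
models:
  my_project:               # placeholder: your project name from dbt_project.yml
    marts:                  # placeholder: folder containing the exposure's parent models
      +materialized: table  # or incremental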
fct_model_naming_conventions
(source) shows all cases where a model does NOT have the appropriate prefix.
Example
Consider model_8
which is nested in the marts
subdirectory:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 model_8.sql\n
This model should be renamed to either fct_model_8
or dim_model_8
.
Reason to Flag
Without appropriate naming conventions, a user querying the data warehouse might incorrectly assume the model type of a given relation. In order to explicitly name the model type in the data warehouse, we recommend appropriately prefixing your models in dbt.
Model Type Appropriate Prefixes
Staging stg_
Intermediate int_
Marts fct_
or dim_
Other rpt_
How to Remediate
For each model flagged, ensure the model type is defined and the model name is prefixed appropriately.
"},{"location":"rules/structure/#model-directories","title":"Model Directories","text":"fct_model_directories
(source) shows all cases where a model is NOT in the appropriate subdirectory:
Example
Consider stg_model_3
which is a staging model for source_2.table_3
:
But, stg_model_3.sql
is inappropriately nested in the subdirectory source_1
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 staging\n \u2514\u2500\u2500 source_1\n \u251c\u2500\u2500 stg_model_3.sql\n
This file should be moved into the subdirectory source_2
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 staging\n \u251c\u2500\u2500 source_1\n \u2514\u2500\u2500 source_2\n \u251c\u2500\u2500 stg_model_3.sql\n
Consider dim_model_7
which is a marts model but is inappropriately nested closest to the subdirectory intermediate
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u2514\u2500\u2500 intermediate\n \u251c\u2500\u2500 dim_model_7.sql\n
This file should be moved closest to the subdirectory marts
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u251c\u2500\u2500 dim_model_7.sql\n
Consider int_model_4
which is an intermediate model but is inappropriately nested closest to the subdirectory marts
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u251c\u2500\u2500 int_model_4.sql\n
This file should be moved closest to the subdirectory intermediate
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u2514\u2500\u2500 intermediate\n \u251c\u2500\u2500 int_model_4.sql\n
Reason to Flag
Because we often work with multiple data sources, in our staging directory, we create one subdirectory per source.
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 staging\n \u251c\u2500\u2500 braintree\n \u2514\u2500\u2500 stripe\n
Each staging directory contains:
This provides for clear repository organization, so that analytics engineers can quickly and easily find the information they need.
We might create additional folders for intermediate models but each file should always be nested closest to the folder name that matches their model type.
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u2514\u2500\u2500 fct_model_6.sql\n \u2514\u2500\u2500 intermediate\n \u2514\u2500\u2500 int_model_5.sql\n
How to Remediate
For each resource flagged, move the file from the current_file_path
to change_file_path_to
.
fct_source_directories
(source) shows all cases where a source definition is NOT in the appropriate subdirectory:
Example
Consider source_2.table_3
which is a source_2
source but it had been defined inappropriately in a source.yml
file nested in the subdirectory source_1
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 staging\n \u2514\u2500\u2500 source_1\n \u251c\u2500\u2500 source.yml\n
This definition should be moved into a source.yml
file nested in the subdirectory source_2
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 staging\n \u251c\u2500\u2500 source_1\n \u2514\u2500\u2500 source_2\n \u251c\u2500\u2500 source.yml\n
Reason to Flag
Because we often work with multiple data sources, in our staging directory, we create one subdirectory per source.
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 staging\n \u251c\u2500\u2500 braintree\n \u2514\u2500\u2500 stripe\n
Each staging directory contains:
This provides for clear repository organization, so that analytics engineers can quickly and easily find the information they need.
How to Remediate
For each source flagged, move the file from the current_file_path
to change_file_path_to
.
fct_test_directories
(source) shows all cases where model tests are NOT in the same subdirectory as the corresponding model.
Example
int_model_4
is located within marts/
. However, tests for int_model_4
are configured in staging/staging.yml
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u251c\u2500\u2500 int_model_4.sql\n \u2514\u2500\u2500 staging\n \u251c\u2500\u2500 staging.yml\n
A new yml file should be created in marts/
which contains all tests and documentation for int_model_4
, and for the rest of the models located in the marts/
directory:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u251c\u2500\u2500 int_model_4.sql\n \u251c\u2500\u2500 marts.yml\n \u2514\u2500\u2500 staging\n \u251c\u2500\u2500 staging.yml\n
Reason to Flag
Each subdirectory in models/
should contain one .yml file that includes the tests and documentation for all models within the given subdirectory. Keeping your repository organized in this way ensures that folks can quickly access the information they need.
How to Remediate
Move flagged tests from the yml file under current_test_directory
to the yml file under change_test_directory_to
(create a new yml file if one does not exist).
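For instance, if the tests for int_model_4 were moved out of staging/staging.yml, the new marts/marts.yml might look like the sketch below (the description and tests shown are placeholders for whatever was configured in the original file):
models:
  - name: int_model_4
    description: Placeholder description for the relocated model entry
    columns:
      - name: id
        tests:
          - not_null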
fct_missing_primary_key_tests
(source) lists every model that does not meet the minimum testing requirement of testing primary keys. Any model that does not have either
not_null
test and a unique
test applied to a single column ORdbt_utils.unique_combination_of_columns
test applied to a set of columns ORnot_null
constraint and a unique
test applied to a single columnwill be flagged by this model.
Reason to Flag
Tests are assertions you make about your models and other resources in your dbt project (e.g. sources, seeds and snapshots). Defining tests is a great way to confirm that your code is working correctly, and helps prevent regressions when your code changes. Models without proper tests on their grain are a risk to the reliability and scalability of your project.
How to Remediate
Apply a uniqueness test and a not null test to the column that represents the grain of your model in its schema entry. For contracted models, optionally replace the not null test with the not null constraint. For models that are unique across a combination of columns, we recommend adding a surrogate key column to your model, then applying these tests to that new column. See the surrogate_key
macro from dbt_utils for more info! Alternatively, you can use the dbt_utils.unique_combination_of_columns
test from dbt_utils
. Check out the overriding variables section to read more about configuring other primary key tests for your project!
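For example, for a hypothetical model whose grain is a single order_id column, and another whose grain is a combination of columns, the schema entries might look like this (model and column names are illustrative only):
models:
  - name: fct_orders                 # hypothetical model with a single-column grain
    columns:
      - name: order_id               # the column that represents the grain
        tests:
          - unique
          - not_null
  - name: fct_order_items            # hypothetical model with a compound grain
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_id
            - item_id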
Additional tests can be configured by applying a generic test in the model's .yml
entry or by creating a singular test in the tests
directory of your project.
Enforcing on more node types (Advanced)
You can optionally extend this test to apply to more node types (source
, snapshot
, seed
) by configuring the variable enforced_primary_key_node_types
to be a set of node types for which you wish to enforce primary key test coverage in addition to (or instead of) just models. Check out the overriding variables section for instructions.
Snapshots should always have a multi-field primary key in order to function, while sources and seeds may not. Depending on your expectations for duplicates and null values, different kinds of primary key tests may be appropriate. Consider your use case carefully.
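For example, a sketch of the override in dbt_project.yml that extends enforcement to seeds in addition to models:
vars:
  dbt_project_evaluator:
    # default is ['model']; valid options are model, source, snapshot, seed
    enforced_primary_key_node_types: ['model', 'seed']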
"},{"location":"rules/testing/#missing-source-freshness","title":"Missing Source Freshness","text":"fct_sources_without_freshness
(source) lists every source that does not have a source freshness threshold defined. Any source that does not have one or both of warn_after and error_after will be flagged by this model.
Reason to Flag
Source freshness is useful for understanding if your data pipelines are in a healthy state and is a critical component of defining SLAs for your warehouse. Enabling freshness for sources also facilitates referencing the source freshness results in the selectors for a more efficient execution.
How to Remediate
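For example, a freshness block added at the source level might look like the following sketch (source, table, and column names are placeholders; pick thresholds that match your pipelines' SLAs):
sources:
  - name: my_source                         # placeholder source name
    loaded_at_field: _etl_loaded_at         # timestamp column used to evaluate freshness
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: my_table                      # table-level freshness overrides are also possible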
"},{"location":"rules/testing/#apply-a-source-freshness-block-to-the-source-definition-this-can-be-implemented-at-either-the-source-name-or-table-name-level","title":"Apply a source freshness block to the source definition. This can be implemented at either the source name or table name level.","text":""},{"location":"rules/testing/#test-coverage","title":"Test Coverage","text":"fct_test_coverage
(source) contains metrics pertaining to project-wide test coverage. Specifically, this model measures:
test_coverage_pct
: the percentage of your models that have at least one test applied.test_to_model_ratio
: the ratio of the number of tests in your dbt project to the number of models in your dbt project<model_type>_test_coverage_pct
: the percentage of each of your model types that have at least one test applied.This model will raise a warn
error on a dbt build
or dbt test
if the test_coverage_pct
is less than 100%. You can set your own threshold by overriding the test_coverage_target
variable. You can adjust your own model types by overriding the model_types
variable. See overriding variables section.
Reason to Flag We recommend that every model in your dbt project has tests applied to ensure the accuracy of your data transformations.
How to Remediate
Apply a generic test in the model's .yml
entry, or create a singular test in the tests
directory of you project.
As explained above, we recommend that, at a minimum, every model should have not_null
and unique
tests set up on a primary key.
This package highlights areas of a dbt project that are misaligned with dbt Labs' best practices. Specifically, this package tests for:
In addition to tests, this package creates the model int_all_dag_relationships
which holds information about your DAG in a tabular format and can be queried using SQL in your Warehouse.
Currently, the following adapters are supported:
Check dbt Hub for the latest installation instructions, or read the docs for more information on installing packages.
"},{"location":"#additional-setup-for-databrickssparkduckdbredshift","title":"Additional setup for Databricks/Spark/DuckDB/Redshift","text":"In your dbt_project.yml
, add the following config:
dispatch:\n - macro_namespace: dbt\n search_order: ['dbt_project_evaluator', 'dbt']\n
This is required because the project currently overrides a small number of dbt core macros in order to ensure the project can run across the listed adapters. The overridden macros are in the cross_db_shim directory.
"},{"location":"#how-it-works","title":"How It Works","text":"This package will:
Once you've installed the package, all you have to do is run a dbt build --select package:dbt_project_evaluator
Each test warning indicates the presence of a type of misalignment. To troubleshoot a misalignment:
on-run-end
hook to display the rules violations in the dbt logs (see displaying violations in the logs)BigQuery current support for recursive CTEs is limited and Databricks SQL doesn't support recursive CTEs.
For those Data Warehouses, the model int_all_dag_relationships
needs to be created by looping CTEs instead. The number of loops is configured with max_depth_dag
and defaulted to 9. This means that dependencies between models of more than 9 levels of separation won't show in the model int_all_dag_relationships
but tests on the DAG will still be correct. With a number of loops higher than 9 BigQuery sometimes raises an error saying the query is too complex.
Once you have addressed all current misalignments in your project (either by fixing them or configuring exceptions), you can use this package as a CI check to ensure code changes don't introduce new misalignments. The setup will vary based on whether you are using dbt Cloud or dbt Core, but the general steps are as follows:
"},{"location":"ci-check/#1-override-test-severity-with-an-environment-variable","title":"1. Override test severity with an environment variable","text":"By default the tests in this package are configured with \"warn\" severity, we can override that for our CI jobs with an environment variable:
Create an environment variable to define the appropriate severity for each environment. In dbt Cloud, for example, we can easily create an environment variable DBT_PROJECT_EVALUATOR_SEVERITY
that is set to \"error\" for the Continuous Integration environment and \"warn\" for all other environments:
Note: It is also possible to use an environment variable for dbt Core, but the actual implementation will depend on how dbt is orchestrated.
Update your dbt_project.yml file to override the default severity for all tests in this package:
dbt_project.ymltests:\n dbt_project_evaluator:\n +severity: \"{{ env_var('DBT_PROJECT_EVALUATOR_SEVERITY', 'warn') }}\"\n
Note
You could follow a similar process to disable the models in this package for your production environment
dbt_project.ymlmodels:\n dbt_project_evaluator:\n +enabled: \"{{ env_var('DBT_PROJECT_EVALUATOR_ENABLED', 'true') | lower == 'true' | as_bool }}\"\n
Now, you can run this package as a step of your CI job/pipeline. In dbt Cloud, for example, you could update the commands of your CI job to:
dbt build --select state:modified+ --exclude package:dbt_project_evaluator\ndbt build --select package:dbt_project_evaluator\n
Or, if you've configured any exceptions, to:
dbt build --select state:modified+ --exclude package:dbt_project_evaluator\ndbt build --select package:dbt_project_evaluator dbt_project_evaluator_exceptions\n
Note
Ensure you have properly set up your dbt Cloud CI job using deferral and a webhook trigger by following this documentation.
"},{"location":"contributing/","title":"Contributing","text":"If you'd like to add models to flag new areas, please update this documentation and add an integration test (more details here)
"},{"location":"contributing/#running-docs-locally","title":"Running docs locally","text":"Docs are generated using Material for MkDocs. To test them locally, run the following commands (use a Python virtual environment):
pip install mkdocs-material\nmkdocs serve\n
Docs are then automatically pushed to the website as part of our CI/CD process. We use mike as part of the process to publish different versions of the docs.
"},{"location":"contributing/#recommended-vscode-extensions-to-help-with-writing-docs","title":"Recommended VSCode extensions to help with writing docs","text":"markdownlint
The config used in .vscode/settings.json
is the following:
\"markdownlint.config\": {\n \"ul-indent\": {\"indent\": 4},\n \"MD036\": false,\n \"MD046\": false,\n}\n
Markdown All in One
The model int_all_dag_relationships
(source), created with the package, lists all the dbt nodes (models, exposures, sources, metrics, seeds, snapshots) along with all their dependencies (including indirect ones) and the path between them.
Building additional models and snapshots on top of this model could allow:
"},{"location":"querying-the-dag/#creating-a-dashboard-that-provides-info-on-your-project","title":"Creating a dashboard that provides info on your project","text":"sql_complexity
from the table int_all_graph_resources
, based on the weights defined in the token_costs
variableref(int_all_dag_relationships)
with custom tests added for a specific use casefct_staging_dependent_on_staging
Modeling Source Fanout fct_source_fanout
Modeling Rejoining of Upstream Concepts fct_rejoining_of_upstream_concepts
Modeling Model Fanout fct_model_fanout
Modeling Downstream Models Dependent on Source fct_marts_or_intermediate_dependent_on_source
Modeling Direct Join to Source fct_direct_join_to_source
Modeling Duplicate Sources fct_duplicate_sources
Modeling Hard Coded References fct_hard_coded_references
Modeling Multiple Sources Joined fct_multiple_sources_joined
Modeling Root Models fct_root_models
Modeling Staging Models Dependent on Downstream Models fct_staging_dependent_on_marts_or_intermediate
Modeling Unused Sources fct_unused_sources
Modeling Models with Too Many Joins fct_too_many_joins
Testing Missing Primary Key Tests fct_missing_primary_key_tests
Testing Missing Source Freshness fct_sources_without_freshness
Testing Test Coverage fct_test_coverage
Documentation Undocumented Models fct_undocumented_models
Documentation Documentation Coverage fct_documentation_coverage
Documentation Undocumented Source Tables fct_undocumented_source_tables
Documentation Undocumented Sources fct_undocumented_sources
Structure Test Directories fct_test_directories
Structure Model Naming Conventions fct_model_naming_conventions
Structure Source Directories fct_source_directories
Structure Model Directories fct_model_directories
Performance Chained View Dependencies fct_chained_views_dependencies
Performance Exposure Parents Materializations fct_exposure_parents_materializations
Governance Public Models Without Contracts fct_public_models_without_contracts
Governance Exposures Dependent on Private Models fct_exposures_dependent_on_private_models
Governance Undocumented Public Models fct_undocumented_public_models
"},{"location":"customization/customization/","title":"Disabling checks from the package","text":"Note
This section describes how to completely deactivate tests from the package. If you are looking to deactivate models/sources from being tested, you can look at excluding packages and paths.
All the tests done as part of the package are tied to fct
models.
If there is a particular test or set of tests that you do not want this package to execute, you can disable the corresponding fct
models as you would any other model in your dbt_project.yml
file
models:\n dbt_project_evaluator:\n marts:\n tests:\n # disable entire test coverage suite\n +enabled: false\n dag:\n # disable single DAG model\n fct_model_fanout:\n +enabled: false\n
"},{"location":"customization/exceptions/","title":"Configuring exceptions to the rules","text":"While the rules defined in this package are considered best practices, we realize that there might be exceptions to those rules and people might want to exclude given results to get passing tests despite not following all the recommendations.
An example would be excluding all models with names matching with stg_..._unioned
from fct_multiple_sources_joined
as we might want to union 2 different tables representing the same data in some of our staging models and we don't want the test to fail for those models.
The package offers the ability to define a seed called dbt_project_evaluator_exceptions.csv
to list those exceptions we don't want to be reported. This seed must contain the following columns:
fct_name
: the name of the fact table for which we want to define exceptions (Please note that it is not possible to exclude specific models for all the coverage
tests, but there are variables available to configure those to the particular users' needs)column_name
: the column name from fct_name
we will be looking at to define exceptionsid_to_exclude
: the values (or like
pattern) we want to exclude for column_name
comment
: a field where people can document why a given exception is legitimateThe following section describes the steps to follow to configure exceptions.
"},{"location":"customization/exceptions/#1-create-a-new-seed","title":"1. Create a new seed","text":"With our previous example, the seed dbt_project_evaluator_exceptions.csv
would look like:
fct_name,column_name,id_to_exclude,comment\nfct_multiple_sources_joined,child,stg_%_unioned,Models called _unioned can union multiple sources\n
which looks like the following when loaded in the warehouse
fct_name column_name id_to_exclude comment fct_multiple_sources_joined child stg_%_unioned Models called _unioned can union multiple sources"},{"location":"customization/exceptions/#2-deactivate-the-seed-from-the-original-package","title":"2. Deactivate the seed from the original package","text":"Only a single seed can exist with a given name. When using a custom one, we need to deactivate the blank one from the package by adding the following to our dbt_project.yml
seeds:\n dbt_project_evaluator:\n dbt_project_evaluator_exceptions:\n +enabled: false\n
"},{"location":"customization/exceptions/#3-run-the-seed-and-the-package","title":"3. Run the seed and the package","text":"We then run both the seed and the package by executing the following command:
dbt build --select package:dbt_project_evaluator dbt_project_evaluator_exceptions\n
"},{"location":"customization/excluding-packages-and-paths/","title":"Excluding packages or sources/models based on their path","text":"Note
This section is describing how to entirely exclude models/sources and packages to be evaluated. If you want to document exceptions to the rules, see the section on exceptions and if you want to deactivate entire tests you can follow instructions from this page
There might be cases where you want to exclude models/sources from being tested:
In that case, this package provides the ability to exclude whole packages and/or models and sources based on their path
"},{"location":"customization/excluding-packages-and-paths/#configuration","title":"Configuration","text":"The variables exclude_packages
and exclude_paths_from_project
allow you to define a list of regex patterns to exclude from being reported as errors.
exclude_packages
accepts a list of package names to exclude from the tool. To exclude all packages except the current project, you can set it to [\"all\"]
exclude_paths_from_project
accepts a list of regular expressions of paths to exclude for the current project<path/to/model.sql>
, allowing you to exclude packages, but also whole folders or individual models
(the pattern is different than for models because the path itself doesn't let us exclude individual sources)Note
We currently don't allow excluding metrics and exposures, as if those need to be entirely excluded they could be deactivated from the project.
If you have a specific use case requiring this ability, please raise a GitHub issue to explain the situation you'd like to solve and we can revisit this decision !
"},{"location":"customization/excluding-packages-and-paths/#example-to-exclude-a-whole-package","title":"Example to exclude a whole package","text":"dbt_project.ymlvars:\n exclude_packages: [\"upstream_package\"]\n
"},{"location":"customization/excluding-packages-and-paths/#example-to-exclude-modelssources-in-a-given-path","title":"Example to exclude models/sources in a given path","text":"dbt_project.ymlvars:\n exclude_paths_from_project: [\"/models/legacy/\"]\n
"},{"location":"customization/excluding-packages-and-paths/#example-to-exclude-both-a-package-and-modelssources-in-2-different-paths","title":"Example to exclude both a package and models/sources in 2 different paths","text":"dbt_project.ymlvars:\n exclude_packages: [\"upstream_package\"]\n exclude_paths_from_project: [\"/models/legacy/\", \"/my_date_spine.sql\"]\n
"},{"location":"customization/excluding-packages-and-paths/#tips-and-tricks","title":"Tips and tricks","text":"Regular expressions are very powerful but can become complex. After defining your value for exclude_paths_from_project
, we recommend running the package and inspecting the model int_all_graph_resources
, checking if the value in the column is_excluded
matches your expectation.
A useful tool to debug regular expression is regex101. You can provide a pattern and a list of strings to see which ones actually match the pattern.
"},{"location":"customization/issues-in-log/","title":"Displaying violations in the logs","text":"This package provides a macro that can be executed via an on-run-end
hook to display the package results in the logs in addition to storing those in the Data Warehouse.
To use it, you can add the following line in your dbt_project.yml
:
on-run-end: \"{{ dbt_project_evaluator.print_dbt_project_evaluator_issues() }}\"\n
The macro accepts two parameters:
format='table'
(default) or format='csv'
quote='`'
or quote='\"'
You can also log the results of your custom rules by applying dbt_project_evaluator.is_empty
to the custom models.
models:\n - name: my_custom_rule_model\n description: This is my custom project evaluator check \n tests:\n - dbt_project_evaluator.is_empty\n
"},{"location":"customization/overriding-variables/","title":"Overriding Variables","text":"Currently, this package uses different variables to adapt the models to your objectives and naming conventions. They can all be updated directly in dbt_project.yml
test_coverage_target
the minimum acceptable test coverage percentage 100% documentation_coverage_target
the minimum acceptable documentation coverage percentage 100% primary_key_test_macros
the set(s) of dbt tests used to check validity of a primary key [[\"dbt.test_unique\", \"dbt.test_not_null\"], [\"dbt_utils.test_unique_combination_of_columns\"]]
enforced_primary_key_node_types
the set of node types for you you would like to enforce primary key test coverage. Valid options to include are model
, source
, snapshot
, seed
[\"model\"]
Usage notes for primary_key_test_macros:
The primary_key_test_macros
variable determines how the fct_missing_primary_key_tests
(source) model evaluates whether the models in your project are properly tested for their grain. This variable is a list and each entry must be a list of test names in project_name.test_macro_name
format.
For each entry in the parent list, the logic in int_model_test_summary
will evaluate whether each model has all of the tests in that entry applied. If a model meets the criteria of any of the entries in the parent list, it will be considered a pass. The default behavior for this package will check for whether each model has either:
not_null
and unique
tests applied to a single column ORdbt_utils.unique_combination_of_columns
applied to the model.Each set of test(s) that define a primary key requirement must be grouped together in a sub-list to ensure they are evaluated together (e.g. [dbt.test_unique
, dbt.test_not_null
] ).
While it's not explicitly tested in this package, we strongly encourage adding a not_null
test on each of the columns listed in the dbt_utils.unique_combination_of_columns
tests. Alternatively, on Snowflake, consider dbt_constraints.test_primary_key
in the dbt Constraints package, which enforces each field in the primary key is non null.
# set your test and doc coverage to 75% instead\n# use the dbt_constraints.test_primary_key test to check for validity of your primary keys\n\nvars:\n dbt_project_evaluator:\n documentation_coverage_target: 75\n test_coverage_target: 75\n primary_key_test_macros: [[\"dbt_constraints.test_primary_key\"]]\n
"},{"location":"customization/overriding-variables/#dag-variables","title":"DAG Variables","text":"variable description default models_fanout_threshold
threshold for unacceptable model fanout for fct_model_fanout
3 models too_many_joins_threshold
threshold for the number of references to flag in fct_too_many_joins
7 references dbt_project.yml# set your model fanout threshold to 10 instead of 3 and too many joins from 6 instead of 7\n\nvars:\n dbt_project_evaluator:\n models_fanout_threshold: 10\n too_many_joins_threshold: 6\n
"},{"location":"customization/overriding-variables/#naming-convention-variables","title":"Naming Convention Variables","text":"variable description default model_types
a list of the different types of models that define the layers of your dbt project staging, intermediate, marts, other staging_folder_name
the name of the folder that contains your staging models staging intermediate_folder_name
the name of the folder that contains your intermediate models intermediate marts_folder_name
the name of the folder that contains your marts models marts staging_prefixes
the list of acceptable prefixes for your staging models stg_ intermediate_prefixes
the list of acceptable prefixes for your intermediate models int_ marts_prefixes
the list of acceptable prefixes for your marts models fct_, dim_ other_prefixes
the list of acceptable prefixes for your other models rpt_ The model_types
, <model_type>_folder_name
, and <model_type>_prefixes
variables allow the package to check if models in the different layers are in the correct folders and have a correct prefix in their name. The default model types are the ones we recommend in our dbt Labs Style Guide.
If your model types are different, you can update the model_types
variable and create new variables for <model_type>_folder_name
and/or <model_type>_prefixes
.
# add an additional model type \"util\"\n\nvars:\n dbt_project_evaluator:\n model_types: ['staging', 'intermediate', 'marts', 'other', 'util']\n util_folder_name: 'util'\n util_prefixes: ['util_']\n
"},{"location":"customization/overriding-variables/#performance-variables","title":"Performance Variables","text":"variable description default chained_views_threshold
threshold for unacceptable length of chain of views for fct_chained_views_dependencies
4 dbt_project.ymlvars:\n dbt_project_evaluator:\n # set your chained views threshold to 8 instead of 4\n chained_views_threshold: 8\n
"},{"location":"customization/overriding-variables/#sql-code-analysis","title":"SQL code analysis","text":"variable description default comment_chars
a list of strings used for inline comments [\"--\"]
token_costs
a dictionary of SQL tokens (words) and associated complexity weight, used to estimate models complexity see in the dbt_project.yml
file of the package"},{"location":"customization/overriding-variables/#execution","title":"Execution","text":"variable description default max_depth_dag
limits the maximum distance between nodes calculated in int_all_dag_relationships
9 for bigquery and spark, -1 for other adatpters insert_batch_size
number of records inserted per batch when unpacking the graph into models 10000 Note on max_depth_dag
The default behavior for limiting the relationships calculated in the int_all_dag_relationships
model differs depending on your adapter.
int_all_dag_relationships
, is set by the max_depth_dag
variable, which is defaulted to 9. So by default, int_all_dag_relationships
contains a row for every path less than or equal to 9 nodes in length between two nodes in your DAG. This is because these adapters do not currently support recursive SQL, and queries often fail on more than 9 recursive joins.int_all_dag_relationships
by default contains a row for every single path between two nodes in your DAG. If you experience long runtimes for the int_all_dag_relationships
model, you may consider limiting the length of your generated DAG paths. To do this, set max_depth_dag: {{ whatever limit you want to enforce }}
. The value of max_depth_dag
must be greater than 2 for all DAG tests to work, and greater than chained_views_threshold
to ensure your performance tests work. By default, the value of this variable for these adapters is -1, which the package interprets as \"no limit\".
"},{"location":"customization/querying-columns-names-and-descriptions/","title":"Querying columns names and descriptions with SQL","text":"The model stg_columns
(source), created with the package, lists all the columns configured in all the dbt nodes (models, sources, tests, snapshots).
It will not list the columns of the models that have not explicitly been added to the YAML files.
You can use this model to help with questions such as:
You can create a custom test against {{ ref(stg_columns) }}
to test for your specific check! When running the package you'd need to make sure to also include children of the package's models by using the package:dbt_project_evalutator+
selector.
fct_documentation_coverage
with\n\nmodels as (\n select * from {{ ref('int_all_graph_resources') }}\n where resource_type = 'model'\n and not is_excluded\n),\n\nconversion as (\n select\n resource_id,\n case when is_described then 1 else 0 end as is_described_model,\n {% for model_type in var('model_types') %}\n case when model_type = '{{ model_type }}' then 1.0 else NULL end as is_{{ model_type }}_model,\n case when is_described and model_type = '{{ model_type }}' then 1.0 else 0 end as is_described_{{ model_type }}_model{% if not loop.last %},{% endif %}\n {% endfor %}\n\n from models\n),\n\nfinal as (\n select\n {{ dbt.current_timestamp() if target.type != 'trino' else 'current_timestamp(6)' }} as measured_at,\n cast(count(*) as {{ dbt.type_int() }}) as total_models,\n cast(sum(is_described_model) as {{ dbt.type_int() }}) as documented_models,\n round(sum(is_described_model) * 100.00 / count(*), 2) as documentation_coverage_pct,\n {% for model_type in var('model_types') %}\n round(\n {{ dbt_utils.safe_divide(\n numerator = \"sum(is_described_\" ~ model_type ~ \"_model) * 100\", \n denominator = \"count(is_\" ~ model_type ~ \"_model)\"\n ) }}\n , 2) as {{ model_type }}_documentation_coverage_pct{% if not loop.last %},{% endif %}\n {% endfor %}\n\n from models\n left join conversion\n on models.resource_id = conversion.resource_id\n)\n\nselect * from final\n
fct_documentation_coverage
(source) calculates the percent of enabled models in the project that have a configured description.
This model will raise a warn
error on a dbt build
or dbt test
if the documentation_coverage_pct
is less than 100%. You can set your own threshold by overriding the documentation_coverage_target
variable. See overriding variables section.
Reason to Flag
Good documentation for your dbt models will help downstream consumers discover and understand the datasets which you curate for them. The documentation for your project includes model code, a DAG of your project, any tests you've added to a column, and more.
How to Remediate
Apply a text description in the model's .yml
entry, or create a docs block in a markdown file, and use the {{ doc() }}
function in the model's .yml
entry.
Tip
We recommend that every model in your dbt project has at minimum a model-level description. This ensures that each model's purpose is clear to other developers and stakeholders when viewing the dbt docs site.
"},{"location":"rules/documentation/#undocumented-models","title":"Undocumented Models","text":"fct_undocumented_models
(source) lists every model with no description configured.
Reason to Flag
Good documentation for your dbt models will help downstream consumers discover and understand the datasets which you curate for them. The documentation for your project includes model code, a DAG of your project, any tests you've added to a column, and more.
How to Remediate
Apply a text description in the model's .yml
entry, or create a docs block in a markdown file, and use the {{ doc() }}
function in the model's .yml
entry.
Tip
We recommend that every model in your dbt project has at minimum a model-level description. This ensures that each model's purpose is clear to other developers and stakeholders when viewing the dbt docs site. Missing documentation should be addressed first for marts models, then for the rest of your project, to ensure that stakeholders in the organization can understand the data which is surfaced to them.
"},{"location":"rules/documentation/#undocumented-source-tables","title":"Undocumented Source Tables","text":"fct_undocumented_source_tables
(source) lists every source table with no description configured.
Reason to Flag
Good documentation for your dbt sources will help contributors to your project understand how and when data is loaded into your warehouse.
How to Remediate
Apply a text description in the table's .yml
entry, or create a docs block in a markdown file, and use the {{ doc() }}
function in the table's .yml
entry.
sources:\n - name: my_source\n tables:\n - name: my_table\n description: This is the source table description\n
"},{"location":"rules/documentation/#undocumented-sources","title":"Undocumented Sources","text":"fct_undocumented_sources
(source) lists every source with no description configured.
Reason to Flag
Good documentation for your dbt sources will help contributors to your project understand how and when data is loaded into your warehouse.
How to Remediate
Apply a text description in the source's .yml
entry, or create a docs block in a markdown file, and use the {{ doc() }}
function in the source's .yml
entry.
sources:\n - name: my_source\n description: This is the source description\n tables:\n - name: my_table\n
"},{"location":"rules/governance/","title":"Governance","text":"This set of rules provides checks on your project against dbt Labs' recommended best proactices for adding model governance features in dbt versions 1.5 and above.
"},{"location":"rules/governance/#public-models-without-contracts","title":"Public models without contracts","text":"fct_public_models_without_contract
(source) shows each model with access
configured as public, but is not a contracted model.
Example
report_1
is defined as a public model, but does not have the contract
configuration to enforce its datatypes.
# public model without a contract\nmodels:\n - name: report_1\n description: very important OKR reporting model\n access: public\n
Reason to Flag
Models with public access are free to be consumed by any downstream consumer. This implies a need for better guarantees around the model's data types and columns. Adding a contract to the model will ensure that the model always conforms to the datatypes, columns, and other constraints you expect.
How to Remediate
Edit the yml to include the contract configuration, as well as a column entry for all columns output by the model, including their datatype. While not strictly required for defining a contracts, it's best practice to also document each column as well.
models:\n - name: report_1\n description: very important OKR reporting model\n access: public\n config:\n contract:\n enforced: true\n columns:\n - name: id \n data_type: integer\n
"},{"location":"rules/governance/#undocumented-public-models","title":"Undocumented public models","text":"fct_undocumented_public_models
(source) shows each model with access
configured as public that is not fully documented. This check is similar to fct_undocumented_models
(source), but is a stricter check that will highlight any public model that does not have a model-level description as well descriptions on each of its columns.
Example
report_1
is defined as a public model, but does not descriptions on the model and each column.
# public model without documentation\nmodels:\n - name: report_1\n access: public\n columns:\n - name: id\n
Reason to Flag
Models with public access are free to be consumed by any downstream consumer. This implies a need for higher standards for the model's usability for those cosumers. Adding more documentation can help consumers understand how they should leverage the data from your public model.
How to Remediate
Edit the yml to include a model level description, as well as a column entry with a description for all columns output by the model. While not strictly required for public models, these should likely also have contracts added as well. (See above rule)
models:\n - name: report_1\n description: very important OKR reporting model\n access: public\n columns:\n - name: id \n description: the primary key of my OKR model\n
"},{"location":"rules/governance/#exposures-dependent-on-private-models","title":"Exposures dependent on private models","text":"fct_exposures_dependent_on_private_models
(source) shows each relationship between a resource and an exposure where the parent resource is not a model with access
configured as public.
Example
Here's a sample DAG that shows direct exposure relationships.
If this were the yml for these two parent models, dim_model_7
would be flagged by this check, as it is not a public model.
models:\n - name: fct_model_6\n description: very important OKR reporting model\n access: public\n config:\n materialized: table\n contract:\n enforced: true\n columns:\n - name: id \n description: the primary key of my OKR model\n data_type: integer\n - name: dim_model_7\n description: excellent model\n access: private\n
Reason to Flag
Exposures show how and where your data is being consumed in downstream tools. These tools should read from public, trusted, contracted data sources. All models that are exposed to other tools should have that codified in their access
configuration.
How to Remediate
Edit the yml to fully expose the models that your exposures depend on. This rule will only flag models that are not public
, but best practices suggest you should also fully document and add contracts to these public models as well.
models:\n - name: fct_model_6\n description: very important OKR reporting model\n access: public\n config:\n materialized: table\n contract:\n enforced: true\n columns:\n - name: id \n description: the primary key of my OKR model\n data_type: integer\n - name: dim_model_7\n description: excellent model\n access: public\n
"},{"location":"rules/modeling/","title":"Modeling","text":""},{"location":"rules/modeling/#direct-join-to-source","title":"Direct Join to Source","text":"fct_direct_join_to_source
(source) shows each parent/child relationship where a model has a reference to both a model and a source.
Example
int_model_4
is pulling in both a model and a source.
Reason to Flag
We highly recommend having a one-to-one relationship between sources and their corresponding staging
model, and not having any other model reading from the source. Those staging
models are then the ones read from by the other downstream models.
This allows renaming your columns and doing minor transformation on your source data only once and being consistent across all the models that will consume the source data.
How to Remediate
In our example, we would want to:
staging
model for our source data if it doesn't exist alreadystaging
model to other ones to create our downstream transformation instead of using the sourceAfter refactoring your downstream model to select from the staging layer, your DAG should look like this:
"},{"location":"rules/modeling/#downstream-models-dependent-on-source","title":"Downstream Models Dependent on Source","text":"fct_marts_or_intermediate_dependent_on_source
(source) shows each downstream model (marts
or intermediate
) that depends directly on a source node.
Example
fct_model_9
, a marts model, builds from source_1.table_5
a source.
Reason to Flag
We very strongly believe that a staging model is the atomic unit of data modeling. Each staging model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format. With that in mind, if a marts
or intermediate
type model joins directly to a {{ source() }}
node, there likely is a missing model that needs to be added.
How to Remediate
Add the reference to the appropriate staging
model to maintain an abstraction layer between your raw data and your downstream data artifacts.
After refactoring your downstream model to select from the staging layer, your DAG should look like this:
"},{"location":"rules/modeling/#duplicate-sources","title":"Duplicate Sources","text":"fct_duplicate_sources
(source) shows each database object that corresponds to more than one source node.
Example
Imagine you have two separate source nodes - source_1.table_5
and source_1.raw_table_5
.
But both source definitions point to the exact same location in your database - real_database
.real_schema
.table_5
.
sources:\n - name: source_1\n schema: real_schema\n database: real_database\n tables:\n - name: table_5\n - name: raw_table_5\n identifier: table_5\n
Reason to Flag
If you dbt project has multiple source nodes pointing to the exact same location in your data warehouse, you will have an inaccurate view of your lineage.
How to Remediate
Combine the duplicate source nodes so that each source database location only has a single source definition in your dbt project.
"},{"location":"rules/modeling/#hard-coded-references","title":"Hard Coded References","text":"fct_hard_coded_references
(source) shows each instance where a model contains hard coded reference(s).
Example
fct_orders
uses hard coded direct relation references (my_db.my_schema.orders
and my_schema.customers
).
with orders as (\n select * from my_db.my_schema.orders\n),\ncustomers as (\n select * from my_schema.customers\n)\nselect\n orders.order_id,\n customers.name\nfrom orders\nleft join customers on\n orders.customer_id = customers.id\n
Reason to Flag
Always use the ref
function when selecting from another model and the source
function when selecting from raw data, rather than using the direct relation reference (e.g. my_schema.my_table
). Direct relation references are determined via regex mapping here.
The ref
and source
functions are part of what makes dbt so powerful! Using these functions allows dbt to infer dependencies (and check that you haven't created any circular dependencies), properly generate your DAG, and ensure that models are built in the correct order. This also ensures that your current model selects from upstream tables and views in the same environment that you're working in.
How to Remediate
For each hard coded reference:
For the above example, our updated fct_orders.sql
file would look like:
with orders as (\n select * from {{ ref('orders') }}\n),\ncustomers as (\n select * from {{ ref('customers') }}\n)\nselect\n orders.order_id,\n customers.name\nfrom orders\nleft join customers on\n orders.customer_id = customers.id\n
"},{"location":"rules/modeling/#model-fanout","title":"Model Fanout","text":"fct_model_fanout
(source) shows all parents with more than 3 direct leaf children. You can set your own threshold for model fanout by overriding the models_fanout_threshold
variable. See overriding variables section.
Example
fct_model
has three direct leaf children.
Reason to Flag
This might indicate some transformations should move to the BI layer, or a common business transformations should be moved upstream.
Exceptions
Some BI tools are better than others at joining and data exploration. For example, with Looker you could end your DAG after marts (i.e. fcts & dims) and join those artifacts together (with a little know how and setup time) to make your reports. For others, like Tableau, model fanouts might be more beneficial, as this tool prefers big tables over joins, so predefining some reports is usually more performant.
To exclude specific cases, check out the instructions in Configuring exceptions to the rules.
How to Remediate
Queries and transformations can move around between dbt and the BI tool, so how do we try to stay effortful in what we decide to put where?
You can think of dbt as our assembly line which produces expected outputs every time.
You can think of the BI layer as the place where we take the items produced from our assembly line to customize them in order to meet our stakeholder's needs.
Your dbt project needs a defined end point! Until the metrics server comes to fruition, you cannot possibly predefine every query or quandary your team might have. So decide as a team where that line is and maintain it.
"},{"location":"rules/modeling/#multiple-sources-joined","title":"Multiple Sources Joined","text":"fct_multiple_sources_joined
(source) shows each instance where a model references more than one source.
Example
model_1
references two source tables.
Reason to Flag
We very strongly believe that a staging model is the atomic unit of data modeling. Each staging model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format. With that in mind, two {{ source() }}
declarations in one staging model likely means we are not being composable enough and there are individual building blocks which could be broken out into their respective models.
Exceptions
Sometimes companies have a bunch of identical sources across systems. When these identical sources will only ever be used collectively, you should union them once and create a staging layer on the combined result.
To exclude specific cases, check out the instructions in Configuring exceptions to the rules.
How to Remediate
In this example specifically, those raw sources, source_1.table_1
and source_1.table_2
should each have their own staging model (stg_model_1
and stg_model_2
), as transitional steps, which will then be combined into a new int_model_2
. Alternatively, you could keep stg_model_2
and add base__
models as transitional steps.
To fix this, try out the codegen package! With this package you can dynamically generate the SQL for a staging (what they call base) model, which you will use to populate stg_model_1
and stg_model_2
directly from the source data. Create a new model int_model_2
. Afterwards, within int_model_2
, update your {{ source() }}
macros to {{ ref() }}
macros and point them to your newly built staging models. If you had type casting, field aliasing, or other simple improvements made in your original stg_model_2
SQL, then attempt to move that logic back to the new staging models instead. This will help colocate those transformations and avoid duplicate code, so that all downstream models can leverage the same set of transformations.
Post-refactor, your DAG should look like this:
or if you want to use base_ models and keep stg_model_2 as is:
"},{"location":"rules/modeling/#rejoining-of-upstream-concepts","title":"Rejoining of Upstream Concepts","text":"fct_rejoining_of_upstream_concepts
(source) contains all cases where one of the parent's direct children is ALSO the direct child of ANOTHER one of the parent's direct children. Only includes cases where the model \"in between\" the parent and child has NO other downstream dependencies.
Example
stg_model_1
, int_model_4
, and int_model_5
create a \"loop\" in the DAG. int_model_4
has no other downstream dependencies other than int_model_5
.
Reason to Flag
This could happen for a variety of reasons: Accidentally duplicating some business concepts in multiple data flows, hesitance to touch (and break) someone else\u2019s model, or perhaps trying to snowflake out or modularize everything without awareness of what will help build time.
As a general rule, snowflaking out models in a thoughtful manner allows for concurrency, but in this example nothing downstream can run until int_model_4
finishes, so it is not saving any time in parallel processing by being its own model. Since both int_model_4
and int_model_5
depend solely on stg_model_1
, there is likely a better way to write the SQL within one model (int_model_5
) and simplify the DAG, potentially at the expense of more rows of SQL within the model.
Exceptions
The one major exception to this would be when using a function from dbt_utils package, such as star
or get_column_values
, (or similar functions / packages) that require a relation as an argument input. If the shape of the data in the output of stg_model_1
is not the same as what you need for the input to the function within int_model_5
, then you will indeed need int_model_4
to create that relation, in which case, leave it.
To exclude specific cases, check out the instructions in Configuring exceptions to the rules.
How to Remediate
Barring jinja/macro/relation exceptions we mention directly above, to resolve this, simply bring the SQL contents from int_model_4
into a CTE within int_model_5
, and swap all {{ ref('int_model_4') }}
references to the new CTE(s).
Post-refactor, your DAG should look like this:
"},{"location":"rules/modeling/#root-models","title":"Root Models","text":"fct_root_models
(source) shows each model with 0 direct parents, meaning that the model cannot be traced back to a declared source or model in the dbt project.
Example
model_4
has no direct parents
Reason to Flag
This likely means that the model (model_4
below) contains raw table references, either to a raw data source, or another model in the project without using the {{ source() }}
or {{ ref() }}
functions, respectively. This means that dbt is unable to interpret the correct lineage of this model, and could result in mis-timed execution and/or circular references depending on the model\u2019s upstream dependencies.
Exceptions
This behavior may be observed in the case of a manually defined reference table that does not have any dependencies. A good example of this is a dim_calendar
table that is generated by the {{ dbt_utils.date_spine() }}
macro \u2014 this SQL logic is completely self contained, and does not require any external data sources to execute.
To exclude specific cases, check out the instructions in Configuring exceptions to the rules.
How to Remediate
Start by mapping any table references in the FROM
clause of the model definition to the models or raw tables that they draw from, and replace those references with the {{ ref() }}
if the dependency is another dbt model, or the {{ source() }}
function if the table is a raw data source (this may require the declaration of a new source table). Then, visualize this model in the DAG, and refactor as appropriate according to best practices.
fct_source_fanout
(source) shows each instance where a source is the direct parent of multiple resources in the DAG.
Example
source.table_1
has more than one direct child model.
Reason to Flag
Each source node should be referenced by a single model that performs basic operations, such as renaming, recasting, and other light transformations to maintain consistency through out the project. The role of this staging model is to mirror the raw data but align it with project conventions. The staging model should act as a source of truth and a buffer- any model which depends on the data from a given source should reference the cleaned data in the staging model as opposed to referencing the source directly. This approach keeps the code DRY (any light transformations that need to be done on the raw data are performed only once). Minimizing references to the raw data will also make it easier to update the project should the format of the raw data change.
Exceptions
NoSQL databases or heavily nested data sources often have so much info json packed into a table that you need to break one raw data source into multiple base models.
To exclude specific cases, check out the instructions in Configuring exceptions to the rules.
How to Remediate
Create a staging model which references the source and cleans the raw data (e.g. renaming, recasting). Any models referencing the source directly should be refactored to point towards the staging model instead.
After refactoring the above example, the DAG would look something like this:
"},{"location":"rules/modeling/#staging-models-dependent-on-downstream-models","title":"Staging Models Dependent on Downstream Models","text":"fct_staging_dependent_on_marts_or_intermediate
(source) shows each staging model that depends on an intermediate or marts model, as defined by the naming conventions and folder paths specified in your project variables.
Example
stg_model_5
, a staging model, builds from fct_model_9
a marts model.
Reason to Flag
This likely represents a misnamed file. According to dbt best practices, staging models should only select from source nodes. Dependence on downstream models indicates that this model may need to be either renamed, or reconfigured to only select from source nodes.
How to Remediate
Rename the file in the child
column to use to appropriate prefix, or change the models lineage by pointing the staging model to the appropriate {{ source() }}
.
After updating the model to use the appropriate {{ source() }}
function, your graph should look like this:
fct_staging_dependent_on_staging
(source) shows each parent/child relationship where models in the staging layer are dependent on each other.
Example
stg_model_2
is a parent of stg_model_4
.
Reason to Flag
This may indicate a change in naming is necessary, or that the child model should instead reference a source.
How to Remediate
You should either change the model type of the child
(maybe to an intermediate or marts model) or change the child's lineage instead reference the appropriate {{ source() }}
.
In our example, we might realize that stg_model_4
is actually an intermediate model. We should move this file to the appropriate intermediate directory and update the file name to int_model_4
.
fct_unused_sources
(source) shows each source with 0 children.
Example
source.table_4
isn't being referenced.
Reason to Flag
This represents either a source that you have defined in YML but never brought into a model or a model that was deprecated and the corresponding rows in the source block of the YML file were not deleted at the same time. This simply represents the buildup of cruft in the project that doesn\u2019t need to be there.
How to Remediate
Navigate to the sources.yml
file (or whatever your company has called the file) that corresponds to the unused source. Within the YML file, remove the unused table name, along with descriptions or any other nested information.
sources:\n - name: some_source\n database: raw\n tables:\n - name: table_1\n - name: table_2\n - name: table_3\n - name: table_4 # <-- remove this line\n
"},{"location":"rules/modeling/#models-with-too-many-joins","title":"Models with Too Many Joins","text":"fct_too_many_joins
(source) shows models with a reference to too many other models or sources.
The number of different references to start raising errors is set to 7 by default, but you can set your own threshold by overriding the too_many_joins_threshold
variable. See overriding variables section.
Example
fct_model_1
directly references seven (7) staging models upstream.
Reason to Flag
This likely represents a model in which too much is being done. Having a model that too many upstream models introduces a lot of code complexity, which can be challenging to understand and maintain.
How to Remediate
Bringing together a reasonable number (typically 4 to 6) of entities or concepts (staging models, or perhaps other intermediate models) that will be joined with another similarly purposed intermediate model to generate a mart. Rather than having too many joins, we can join two intermediate models that each house a piece of the complexity, giving us increased readability, flexibility, testing surface area, and insight into our components.
"},{"location":"rules/performance/","title":"Performance","text":""},{"location":"rules/performance/#chained-view-dependencies","title":"Chained View Dependencies","text":"fct_chained_views_dependencies
(source) contains models that are dependent on chains of \"non-physically-materialized\" models (views and ephemerals), highlighting potential cases for improving performance by switching the materialization of model(s) within the chain to table or incremental.
This model will raise a warn
error on a dbt build
or dbt test
if the distance
between a given parent
and child
is greater than or equal to 4. You can set your own threshold for chained views by overriding the chained_views_threshold
variable. See overriding variables section.
Example
table_1
depends on a chain of 4 views (view_1
, view_2
, view_3
, and view_4
).
Reason to Flag
You may experience a long runtime for a model when it is build on top of a long chain of \"non-physically-materialized\" models (views and ephemerals). In the example above, nothing is really computed until you get to table_1
. At which point, it is going to run the query within view_4
, which will then have to run the query within view_3
, which will then have the run the query within view_2
, which will then have to run the query within view_1
. These will all be running at the same time, which creates a long runtime for table_1
.
How to Remediate
We can reduce this compilation time by changing the materialization strategy of some key upstream models to table or incremental to keep a minimum amount of compute in memory and prevent the nesting of views. If, for example, we change the materialization of view_4
from a view to a table, table_1
will have a shorter runtime as it will have less compilation to do.
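A minimal sketch of that change in dbt_project.yml (the project name my_project and the resource path are hypothetical; a config block in the model file works as well):
models:\n  my_project:\n    view_4:\n      +materialized: table # switch this model from a view to a table\n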
The best practice to determine top candidates for changing materialization from view
to table
:
fct_exposure_parents_materializations
(source) highlights instances where the resources referenced by exposures are either:
a source
a model that does not use the table or incremental materialization
Example
In this case, the parents of exposure_1
are not both materialized as tables -- dim_model_7
is ephemeral, while fct_model_6
is a table. This model would return a record for the dim_model_7 --> exposure_1
relationship.
Reason to Flag
Exposures should depend on the business logic you encoded into your dbt project (e.g. models or metrics) rather than raw untransformed sources. Additionally, models that are referenced by an exposure are likely to be used heavily in downstream systems, and therefore need to be performant when queried.
How to Remediate
If you have a source parent of an exposure, you should incorporate that raw data into your project in some way, then update the exposure to point to that model.
If necessary, update the materialized
configuration on the models returned in fct_exposure_parents_materializations
to either table
or incremental
. This can be done in individual model files using a config block, or for groups of models in your dbt_project.yml
file. See the docs on model configurations for more info!
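For example, a group of mart models could be configured together in dbt_project.yml (a sketch; my_project and the marts path are placeholders):
models:\n  my_project:\n    marts:\n      +materialized: table # ensure exposure parents are physically materialized\n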
fct_model_naming_conventions
(source) shows all cases where a model does NOT have the appropriate prefix.
Example
Consider model_8
which is nested in the marts
subdirectory:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 model_8.sql\n
This model should be renamed to either fct_model_8
or dim_model_8
.
Reason to Flag
Without appropriate naming conventions, a user querying the data warehouse might incorrectly assume the model type of a given relation. In order to explicitly name the model type in the data warehouse, we recommend appropriately prefixing your models in dbt.
Model Type | Appropriate Prefixes
Staging | stg_
Intermediate | int_
Marts | fct_ or dim_
Other | rpt_
How to Remediate
For each model flagged, ensure the model type is defined and the model name is prefixed appropriately.
"},{"location":"rules/structure/#model-directories","title":"Model Directories","text":"fct_model_directories
(source) shows all cases where a model is NOT in the appropriate subdirectory:
Example
Consider stg_model_3
which is a staging model for source_2.table_3
:
But, stg_model_3.sql
is inappropriately nested in the subdirectory source_1
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 staging\n \u2514\u2500\u2500 source_1\n \u251c\u2500\u2500 stg_model_3.sql\n
This file should be moved into the subdirectory source_2
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 staging\n \u251c\u2500\u2500 source_1\n \u2514\u2500\u2500 source_2\n \u251c\u2500\u2500 stg_model_3.sql\n
Consider dim_model_7
which is a marts model but is inappropriately nested closest to the subdirectory intermediate
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u2514\u2500\u2500 intermediate\n \u251c\u2500\u2500 dim_model_7.sql\n
This file should be moved closest to the subdirectory marts
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u251c\u2500\u2500 dim_model_7.sql\n
Consider int_model_4
which is an intermediate model but is inappropriately nested closest to the subdirectory marts
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u251c\u2500\u2500 int_model_4.sql\n
This file should be moved closest to the subdirectory intermediate
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u2514\u2500\u2500 intermediate\n \u251c\u2500\u2500 int_model_4.sql\n
Reason to Flag
Because we often work with multiple data sources, in our staging directory, we create one subdirectory per source.
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 staging\n \u251c\u2500\u2500 braintree\n \u2514\u2500\u2500 stripe\n
Each staging directory contains:
This provides for clear repository organization, so that analytics engineers can quickly and easily find the information they need.
We might create additional folders for intermediate models, but each file should always be nested closest to the folder whose name matches its model type.
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u2514\u2500\u2500 fct_model_6.sql\n \u2514\u2500\u2500 intermediate\n \u2514\u2500\u2500 int_model_5.sql\n
How to Remediate
For each resource flagged, move the file from the current_file_path
to change_file_path_to
.
fct_source_directories
(source) shows all cases where a source definition is NOT in the appropriate subdirectory:
Example
Consider source_2.table_3
which is a source_2
source but has been inappropriately defined in a source.yml
file nested in the subdirectory source_1
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 staging\n \u2514\u2500\u2500 source_1\n \u251c\u2500\u2500 source.yml\n
This definition should be moved into a source.yml
file nested in the subdirectory source_2
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 staging\n \u251c\u2500\u2500 source_1\n \u2514\u2500\u2500 source_2\n \u251c\u2500\u2500 source.yml\n
Reason to Flag
Because we often work with multiple data sources, in our staging directory, we create one subdirectory per source.
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u251c\u2500\u2500 marts\n \u2514\u2500\u2500 staging\n \u251c\u2500\u2500 braintree\n \u2514\u2500\u2500 stripe\n
Each staging directory contains:
This provides for clear repository organization, so that analytics engineers can quickly and easily find the information they need.
How to Remediate
For each source flagged, move the file from the current_file_path
to change_file_path_to
.
fct_test_directories
(source) shows all cases where model tests are NOT in the same subdirectory as the corresponding model.
Example
int_model_4
is located within marts/
. However, tests for int_model_4
are configured in staging/staging.yml
:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u251c\u2500\u2500 int_model_4.sql\n \u2514\u2500\u2500 staging\n \u251c\u2500\u2500 staging.yml\n
A new yml file should be created in marts/
which contains all tests and documentation for int_model_4
, and for the rest of the models located in the marts/
directory:
\u251c\u2500\u2500 dbt_project.yml\n\u2514\u2500\u2500 models\n \u2514\u2500\u2500 marts\n \u251c\u2500\u2500 int_model_4.sql\n \u251c\u2500\u2500 marts.yml\n \u2514\u2500\u2500 staging\n \u251c\u2500\u2500 staging.yml\n
Reason to Flag
Each subdirectory in models/
should contain one .yml file that includes the tests and documentation for all models within the given subdirectory. Keeping your repository organized in this way ensures that folks can quickly access the information they need.
How to Remediate
Move flagged tests from the yml file under current_test_directory
to the yml file under change_test_directory_to
(create a new yml file if one does not exist).
fct_missing_primary_key_tests
(source) lists every model that does not meet the minimum testing requirement of testing primary keys. Any model that does not have at least one of the following
not_null
test and a unique
test applied to a single column ORdbt_utils.unique_combination_of_columns
test applied to a set of columns ORnot_null
constraint and a unique
test applied to a single columnwill be flagged by this model.
Reason to Flag
Tests are assertions you make about your models and other resources in your dbt project (e.g. sources, seeds and snapshots). Defining tests is a great way to confirm that your code is working correctly, and helps prevent regressions when your code changes. Models without proper tests on their grain are a risk to the reliability and scalability of your project.
How to Remediate
Apply a uniqueness test and a not null test to the column that represents the grain of your model in its schema entry. For contracted models, optionally replace the not null test with the not null constraint. For models that are unique across a combination of columns, we recommend adding a surrogate key column to your model, then applying these tests to that new column. See the surrogate_key
macro from dbt_utils for more info! Alternatively, you can use the dbt_utils.unique_combination_of_columns
test from dbt_utils
. Check out the overriding variables section to read more about configuring other primary key tests for your project!
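A minimal sketch of a schema entry covering both approaches (model and column names are hypothetical):
models:\n  - name: fct_orders\n    columns:\n      - name: order_id # the grain of the model\n        tests:\n          - unique\n          - not_null\n  - name: fct_order_items # grain is a combination of columns\n    tests:\n      - dbt_utils.unique_combination_of_columns:\n          combination_of_columns:\n            - order_id\n            - item_id\n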
Additional tests can be configured by applying a generic test in the model's .yml
entry or by creating a singular test in the tests
directory of your project.
Enforcing on more node types (Advanced)
You can optionally extend this test to apply to more node types (source
, snapshot
, seed
) by configuring the variable enforced_primary_key_node_types
to be a set of node types for which you wish to enforce primary key test coverage in addition to (or instead of) just models. Check out the overriding variables section for instructions.
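A sketch of that override in dbt_project.yml (the selection of node types is illustrative):
vars:\n  dbt_project_evaluator:\n    enforced_primary_key_node_types: ["model", "source", "snapshot"] # node types held to the primary key requirement\n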
Snapshots should always have a multi-field primary key in order to function, while sources and seeds may not. Depending on your expectations for duplicates and null values, different kinds of primary key tests may be appropriate. Consider your use case carefully.
"},{"location":"rules/testing/#missing-source-freshness","title":"Missing Source Freshness","text":"fct_sources_without_freshness
(source) lists every source that does not have a source freshness threshold defined. Any source that does not have one or both of warn_after and error_after will be flagged by this model.
Reason to Flag
Source freshness is useful for understanding if your data pipelines are in a healthy state and is a critical component of defining SLAs for your warehouse. Enabling freshness for sources also facilitates referencing the source freshness results in the selectors for a more efficient execution.
How to Remediate
Apply a source freshness block to the source definition. This can be implemented at either the source name or table name level.
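A minimal sketch of a freshness block at the source name level (the source, table, loaded_at_field names, and thresholds are placeholders):
sources:\n  - name: some_source\n    loaded_at_field: _etl_loaded_at\n    freshness:\n      warn_after: {count: 12, period: hour}\n      error_after: {count: 24, period: hour}\n    tables:\n      - name: table_1\n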
"},{"location":"rules/testing/#test-coverage","title":"Test Coverage","text":"fct_test_coverage
(source) contains metrics pertaining to project-wide test coverage. Specifically, this models measures:
test_coverage_pct
: the percentage of your models that have minimum 1 test applied.test_to_model_ratio
: the ratio of the number of tests in your dbt project to the number of models in your dbt project<model_type>_test_coverage_pct
: the percentage of each of your model types that have minimum 1 test applied.This model will raise a warn
error on a dbt build
or dbt test
if the test_coverage_pct
is less than 100%. You can set your own threshold by overriding the test_coverage_target
variable. You can adjust your own model types by overriding the model_types
variable. See overriding variables section.
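A sketch of those overrides in dbt_project.yml (both values are illustrative; see the overriding variables section for the expected formats):
vars:\n  dbt_project_evaluator:\n    test_coverage_target: 0.9 # illustrative target instead of the 100% default\n    model_types: ["staging", "intermediate", "marts", "other"] # illustrative list of model types\n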
Reason to Flag
We recommend that every model in your dbt project has tests applied to ensure the accuracy of your data transformations.
How to Remediate
Apply a generic test in the model's .yml
entry, or create a singular test in the tests
directory of your project.
As explained above, we recommend at a minimum, every model should have not_null
and unique
tests set up on a primary key.