Commit c6c7073

Article about accepted values checks.
1 parent f94bc9a commit c6c7073

File tree

1 file changed

+129
-10
lines changed

docs/categories-of-data-quality-checks/how-to-validate-accepted-values-in-columns.md

+129-10
@@ -1,17 +1,52 @@
1-
# Detecting data quality issues with accepted values
2-
Read this guide to learn what types of data quality checks are supported in DQOps to detect issues related to accepted values.
3-
The data quality checks are configured in the `accepted_values` category in DQOps.
1+
# Asserting accepted values in columns
2+
Read this guide to learn how to verify that text and numeric columns contain only accepted values, and how to assert that all expected values are used in the tested columns.
43

54
## Accepted values category
6-
Data quality checks that are detecting issues related to accepted values are listed below.
5+
Data quality checks for asserting accepted values in columns are defined in the `accepted_values` category of data quality checks.
76

8-
![Column profiling result with most popular values](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/column-profiling-result-top-column-values-min.png){ loading=lazy }
7+
### What is an accepted value
8+
An accepted value is a well-known value that we expect to be used as one of the values in a column.
9+
10+
The examples of testing accepted values will use a 311 Austin municipal services call history table.
11+
The table contains requests from four counties in the Austin metro area: *Travis*, *Williamson*, *Hays*, and *Bastrop*.
12+
The column profiling results confirm that all service calls reported in the table are in these counties, written in uppercase.
13+
14+
![Column profiling result with most popular values](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/column-profiling-result-top-column-values-statistics-min.png){ loading=lazy }
15+
16+
17+
## Verify that ONLY accepted values are used
18+
The most common data quality issue affecting columns that store well-known values
19+
is the presence of values outside the list of expected values.
20+
We know all the possible values that should be stored in a column.
21+
We want to ensure that no unknown value is accidentally stored in the column.
22+
Invalid values appear because of typing mistakes or errors in the data transformation code.
23+
24+
DQOps has a dedicated data quality check for testing if a column contains only valid (expected) values.
25+
This data quality check has two variants for text and numeric data types.
26+
27+
- [*text_found_in_set_percent*](../checks/column/accepted_values/text-found-in-set-percent.md) for testing text columns
28+
29+
- [*number_found_in_set_percent*](../checks/column/accepted_values/number-found-in-set-percent.md) for testing numeric columns
30+
31+
### Configure the check in UI
32+
The *text_found_in_set_percent* check measures the percentage of rows in which the column value is one of the expected values.
33+
The `min_percent` parameter controls the minimum accepted percentage of rows.
34+
To verify that all rows contain only the expected values, set the parameter to 100%.
35+
36+
This data quality check also needs a list of expected values that the check uses to test values in the column.
37+
The values are specified in the `expected_values` parameter, which is a list (array).
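
The measurement described above can be sketched in a few lines of Python. This is only an illustration of the idea, not DQOps code: the function name `found_in_set_percent` and the sample rows are hypothetical, and reporting 100% for an empty column is an assumption.

```python
from typing import Iterable, Sequence


def found_in_set_percent(values: Sequence[str], expected_values: Iterable[str]) -> float:
    """Return the percentage of values that belong to the expected set."""
    expected = set(expected_values)
    if not values:
        # Assumption: an empty column has no invalid rows, so report 100%.
        return 100.0
    matching = sum(1 for value in values if value in expected)
    return 100.0 * matching / len(values)


# Hypothetical sample rows; the last value is a typing mistake.
counties = ["TRAVIS", "WILLIAMSON", "TRAVIS", "HAYS", "TRAVS"]
percent = found_in_set_percent(counties, ["TRAVIS", "WILLIAMSON", "HAYS", "BASTROP"])

# Mirrors a min_percent rule set to 100: any unexpected value fails the check.
passed = percent >= 100.0
print(percent, passed)
```

With `min_percent` at 100, a single misspelled county name is enough to fail the check, which is the behavior wanted for columns that store well-known values.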
938

1039
![Enabling text in set percent data quality check](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/daily-text-found-in-set-percent-data-quality-check-editor-min.png){ loading=lazy }
1140

41+
The list of the expected values is configured in a popup window.
42+
1243
![Adding a list of expected values in a data quality check](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/expected-values-list-popup-data-quality-check-min.png){ loading=lazy }
1344

1445

46+
### Configure the check in YAML
47+
DQOps stores the configuration of both the [*text_found_in_set_percent*](../checks/column/accepted_values/text-found-in-set-percent.md)
48+
and [*number_found_in_set_percent*](../checks/column/accepted_values/number-found-in-set-percent.md) checks in a YAML file.
49+
The following sample YAML file shows the configuration of a daily monitoring check that tests accepted values.
1550

1651
``` { .yaml linenums="1" hl_lines="12-15" }
1752
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
@@ -33,13 +68,33 @@ spec:
3368
min_percent: 100.0
3469
```
3570

71+
### Defining data dictionaries
72+
DQOps supports defining reusable data dictionaries.
73+
The data dictionaries are simple CSV files without a header line.
74+
Please read the [referencing data dictionaries](../dqo-concepts/configuring-data-quality-checks-and-rules.md#referencing-data-dictionaries)
75+
guide to learn more.
76+
77+
The screens for defining data dictionaries are found in the configuration section of the DQOps user interface.
78+
The following example shows how to add a data dictionary named *austin_counties.csv*.
79+
3680
![Adding data dictionary CSV file in DQOps](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/add-data-dictionary-editor-min.png){ loading=lazy }
3781

82+
The data dictionary list screen shows a dictionary reference token used in the data quality checks to reference the data dictionary.
83+
The token to access the *austin_counties.csv* dictionary is `${dictionary://austin_counties.csv}`.
84+
3885
![Data dictionary list screen for data quality checks](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/data-dictionary-list-screen-min.png){ loading=lazy }
3986

87+
### Referencing dictionaries in UI
88+
The dictionary reference is used as one of the values for the *expected_values* parameter.
89+
DQOps supports referencing multiple data dictionaries, which are merged.
90+
Mixing standalone values and data dictionaries is also supported.
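
The merging of dictionaries and standalone values can be pictured with the following Python sketch. It is a hypothetical illustration, not the DQOps implementation: the function names, the in-memory `dictionaries` mapping, reading only the first CSV column, and dropping duplicates are all assumptions made for the example.

```python
import csv
import io
import re
from typing import Dict, List

# Matches a whole entry of the form ${dictionary://file_name.csv}
DICTIONARY_TOKEN = re.compile(r"^\$\{dictionary://(?P<name>[^}]+)\}$")


def parse_dictionary(csv_text: str) -> List[str]:
    """Read a header-less CSV and return the first column of each non-empty row."""
    rows = csv.reader(io.StringIO(csv_text))
    return [row[0].strip() for row in rows if row and row[0].strip()]


def resolve_expected_values(entries: List[str], dictionaries: Dict[str, str]) -> List[str]:
    """Expand dictionary tokens and merge them with standalone values, keeping order."""
    merged: List[str] = []
    for entry in entries:
        match = DICTIONARY_TOKEN.match(entry)
        values = parse_dictionary(dictionaries[match.group("name")]) if match else [entry]
        for value in values:
            if value not in merged:  # assumption: duplicates across sources are dropped
                merged.append(value)
    return merged


# Hypothetical dictionary content, keyed by file name.
dictionaries = {"austin_counties.csv": "TRAVIS\nWILLIAMSON\nHAYS\nBASTROP\n"}
resolved = resolve_expected_values(
    ["${dictionary://austin_counties.csv}", "BURNET", "TRAVIS"], dictionaries
)
print(resolved)
```

The sketch treats an entry as a dictionary reference only when the whole entry matches the token pattern, so ordinary values pass through unchanged.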
91+
4092
![Referencing data dictionary in a text found in set percent data quality check](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/daily-text-found-in-set-percent-check-reference-dictionary-min.png){ loading=lazy }
4193

42-
``` { .yaml linenums="1" hl_lines="12-15" }
94+
### Referencing dictionaries in YAML
95+
When used in a YAML file, the data dictionary reference token should be wrapped in double quotes.
96+
97+
``` { .yaml linenums="1" hl_lines="12" }
4398
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
4499
apiVersion: dqo/v1
45100
kind: table
@@ -51,14 +106,40 @@ spec:
51106
daily_text_found_in_set_percent:
52107
parameters:
53108
expected_values:
54-
- "${dictionary://cities.csv}"
109+
- "${dictionary://austin_counties.csv}"
55110
error:
56111
min_percent: 100.0
57112
```
58113

114+
## Verify that ALL accepted values are in use
115+
DQOps has data quality checks to ensure all the expected values are used in the column.
116+
This type of check is useful for testing that the most common values are always used,
117+
especially to confirm that the expected values are present in every partition.
118+
The list of expected values can be a subset of all possible values.
119+
120+
DQOps has two types of checks for testing that all accepted values are in use.
121+
122+
- [*expected_text_values_in_use_count*](../checks/column/accepted_values/expected-text-values-in-use-count.md) tests text columns,
123+
the list of expected values contains texts.
124+
125+
- [*expected_numbers_in_use_count*](../checks/column/accepted_values/expected-numbers-in-use-count.md) tests numeric columns,
126+
the list of expected values contains numbers.
127+
128+
Despite the *_count* suffix in the names of these checks, they count the expected values
129+
that were not found in the column. This differs from other *_count* data quality checks in
130+
DQOps that count rows.
131+
132+
### Configure the check in UI
133+
The configuration of the [*expected_text_values_in_use_count*](../checks/column/accepted_values/expected-text-values-in-use-count.md)
134+
check is very similar to that of the *text_found_in_set_percent* check. The `max_missing` rule parameter configures the maximum number of expected values
135+
that can be missing in the column. Set `max_missing` to 0 to test that all the expected values are in use.
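
The `max_missing` rule can be illustrated with a short Python sketch. It is not DQOps code: the function name `missing_expected_values` and the sample rows are hypothetical, and the sketch simply follows the description above of counting expected values that do not occur in the column.

```python
from typing import Iterable, List


def missing_expected_values(values: Iterable[str], expected_values: Iterable[str]) -> List[str]:
    """Return the expected values that never occur in the column, in their listed order."""
    present = set(values)
    return [expected for expected in expected_values if expected not in present]


# Hypothetical sample rows in which one expected county never appears.
counties = ["TRAVIS", "WILLIAMSON", "TRAVIS", "HAYS"]
missing = missing_expected_values(counties, ["TRAVIS", "WILLIAMSON", "HAYS", "BASTROP"])

# Mirrors the max_missing rule: 0 means every expected value must be in use.
max_missing = 0
passed = len(missing) <= max_missing
print(missing, passed)
```

Raising `max_missing` to 1 in this sketch would make the same data pass, because only one expected value is absent.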
59136

60137
![Asserting that all expected text values are present in a column](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/all-expected-column-values-are-in-use-data-quality-check-min.png){ loading=lazy }
61138

139+
### Configure the check in YAML
140+
The configuration of the [*expected_text_values_in_use_count*](../checks/column/accepted_values/expected-text-values-in-use-count.md)
141+
check in YAML is straightforward.
142+
62143
``` { .yaml linenums="1" hl_lines="12-15" }
63144
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
64145
apiVersion: dqo/v1
@@ -79,14 +160,54 @@ spec:
79160
max_missing: 0
80161
```
81162

163+
### Data quality issue example
164+
The following example shows the difference between the
165+
[*text_found_in_set_percent*](../checks/column/accepted_values/text-found-in-set-percent.md) and
166+
[*expected_text_values_in_use_count*](../checks/column/accepted_values/expected-text-values-in-use-count.md) checks
167+
when the expected values list includes a value that never occurs in the column. *BURNET* is the name of a nearby county.
168+
The table does not have any rows containing that value.
169+
170+
The [*text_found_in_set_percent*](../checks/column/accepted_values/text-found-in-set-percent.md) check passes
171+
because every row contains one of the first four county names, even though *BURNET* is absent.
172+
The [*expected_text_values_in_use_count*](../checks/column/accepted_values/expected-text-values-in-use-count.md) check fails
173+
because no row contains the expected *BURNET* value.
174+
82175
![Detecting expected values that are missing in a column](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/text-values-found-vs-text-values-in-use-min.png){ loading=lazy }
83176

177+
178+
## Testing the most common values
179+
The column profiling result shows the number of occurrences of each column value,
180+
sorted from the most popular to the least popular values.
181+
The first two counties, *TRAVIS* and *WILLIAMSON*, account for almost all service requests.
182+
183+
We can use the [*expected_texts_in_top_values_count*](../checks/column/accepted_values/expected-texts-in-top-values-count.md)
184+
data quality check to ensure that these two county names are always the most common values in the column.
185+
84186
![Top values in a column to assert in a data quality check](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/column-profiling-result-top-two-column-values-min.png){ width="619px"; loading=lazy }
85187

188+
### Configure the check in UI
189+
The [*expected_texts_in_top_values_count*](../checks/column/accepted_values/expected-texts-in-top-values-count.md) data quality check
190+
has two parameters.
191+
192+
- The `expected_values` parameter is a list of values that should be most common in the column.
193+
194+
- The `top` parameter sets how many of the most common values are considered the top values.
195+
196+
The value for the `top` parameter should be at least equal to the number of expected values.
197+
When `top` equals the number of expected values, DQOps verifies that all the expected values are still the most common ones in the column.
198+
The `top` value can also be higher. For example, we can test that two expected values are always among the column's three most common texts.
199+
200+
The `max_missing` rule parameter controls the tolerance for missing expected common values.
201+
When the `max_missing` rule parameter is 0, DQOps must find all expected values at the top.
202+
A `max_missing` value of 1 allows one missing value, and so on.
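
The interplay of `expected_values`, `top`, and `max_missing` can be sketched in Python as follows. This is an illustration under stated assumptions, not the DQOps implementation: the function name `missing_from_top` and the sample rows are hypothetical, and ties between equally frequent values are broken by first occurrence, which is how `collections.Counter.most_common` orders them.

```python
from collections import Counter
from typing import Iterable, List


def missing_from_top(values: Iterable[str], expected_values: Iterable[str], top: int) -> List[str]:
    """Return the expected values that are not among the `top` most common values."""
    top_values = {value for value, _ in Counter(values).most_common(top)}
    return [expected for expected in expected_values if expected not in top_values]


# Hypothetical sample: TRAVIS and WILLIAMSON dominate, HAYS is third.
counties = ["TRAVIS"] * 6 + ["WILLIAMSON"] * 3 + ["HAYS"] * 2 + ["BASTROP"]

top_two_ok = missing_from_top(counties, ["TRAVIS", "WILLIAMSON"], top=2)
hays_in_top_two = missing_from_top(counties, ["TRAVIS", "HAYS"], top=2)
hays_in_top_three = missing_from_top(counties, ["TRAVIS", "HAYS"], top=3)

max_missing = 0
print(len(top_two_ok) <= max_missing)         # both expected values are in the top two
print(len(hays_in_top_two) <= max_missing)    # HAYS is only the third most common value
print(len(hays_in_top_three) <= max_missing)  # widening `top` to 3 lets HAYS qualify
```

Widening `top` beyond the number of expected values gives the check some tolerance for changes in ranking without accepting that an expected value has dropped out of the leading group entirely.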
86203

87204
![Asserting that expected values are in the top of most popular values in a column](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/text-values-in-top-most-popular-min.png){ loading=lazy }
88205

89-
``` { .yaml linenums="1" hl_lines="12-15" }
206+
### Configure the check in YAML
207+
The configuration of the [*expected_texts_in_top_values_count*](../checks/column/accepted_values/expected-texts-in-top-values-count.md)
208+
check in YAML is straightforward.
209+
210+
``` { .yaml linenums="1" hl_lines="12-16" }
90211
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
91212
apiVersion: dqo/v1
92213
kind: table
@@ -105,8 +226,6 @@ spec:
105226
max_missing: 0
106227
```
107228

108-
## Detecting accepted values issues
109-
How to detect accepted values data quality issues.
110229

111230
## Use cases
112231
| **Name of the example** | **Description** |
