docs/categories-of-data-quality-checks/how-to-validate-accepted-values-in-columns.md
+129-10
@@ -1,17 +1,52 @@
1
-
# Detecting data quality issues with accepted values
2
-
Read this guide to learn what types of data quality checks are supported in DQOps to detect issues related to accepted values.
3
-
The data quality checks are configured in the `accepted_values` category in DQOps.
1
+
# Asserting accepted values in columns
2
+
Read this guide to learn how to verify that text and numeric columns contain only accepted values, and how to assert that all expected values are used in the tested columns.
4
3
5
4
## Accepted values category
6
-
Data quality checks that are detecting issues related to accepted values are listed below.
5
+
The data quality checks for asserting accepted values in columns are defined in the `accepted_values` category.
7
6
8
-
{ loading=lazy }
7
+
### What is an accepted value
8
+
An accepted value is a well-known value that we expect to be used as one of the values in a column.
9
+
10
+
The examples in this guide use a table with the call history of Austin 311 municipal services.
11
+
The table contains requests from four counties in the Austin metro area: *Travis*, *Williamson*, *Hays*, and *Bastrop*.
12
+
The column profiling results confirm that all service calls in the table were reported in these counties, and that the county names are stored in uppercase.
13
+
14
+
{ loading=lazy }
15
+
16
+
17
+
## Verify that ONLY accepted values are used
18
+
The most common data quality issue affecting columns that store well-known values
19
+
is the presence of values outside the list of expected values.
20
+
We know all the possible values that should be stored in a column.
21
+
We want to ensure that no unknown value is accidentally stored in the column.
22
+
Invalid values appear because of typing mistakes or errors in the data transformation code.
23
+
24
+
DQOps has a dedicated data quality check for testing if a column contains only valid (expected) values.
25
+
This data quality check has two variants, one for text and one for numeric data types.
26
+
27
+
- [*text_found_in_set_percent*](../checks/column/accepted_values/text-found-in-set-percent.md) for testing text columns
28
+
29
+
- [*number_found_in_set_percent*](../checks/column/accepted_values/number-found-in-set-percent.md) for testing numeric columns
30
+
31
+
### Configure the check in UI
32
+
The *text_found_in_set_percent* check measures the percentage of rows in which the column value is one of the expected values.
33
+
The `min_percent` rule parameter controls the minimum accepted percentage of valid rows.
34
+
To verify that all rows contain only the expected values, set the parameter to 100%.
35
+
36
+
The check also needs the list of expected values against which the column values are tested.
37
+
The list is specified in the `expected_values` parameter, which is an array of values.
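
The logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how DQOps implements the check: the function names, the sample rows, and the treatment of null values (ignored here) are assumptions made for this example.

```python
from typing import Iterable, Optional, Sequence


def found_in_set_percent(values: Iterable[Optional[str]],
                         expected_values: Sequence[str]) -> float:
    """Return the percentage of non-null values that are in expected_values."""
    expected = set(expected_values)
    # Assumption for this sketch: null (None) values are not tested.
    tested = [v for v in values if v is not None]
    if not tested:
        return 100.0  # assumption: an empty column contains no invalid values
    valid = sum(1 for v in tested if v in expected)
    return 100.0 * valid / len(tested)


def passes_min_percent(percent: float, min_percent: float = 100.0) -> bool:
    """Mimic the rule: the measured percentage must reach min_percent."""
    return percent >= min_percent


counties = ["TRAVIS", "WILLIAMSON", "HAYS", "BASTROP"]
# Made-up sample rows; "Travs" simulates a typing mistake.
column = ["TRAVIS", "TRAVIS", "WILLIAMSON", "Travs", "HAYS"]

percent = found_in_set_percent(column, counties)
print(percent, passes_min_percent(percent))
```

With one invalid value out of five rows, the measured percentage falls below 100, so a `min_percent` of 100 would report an issue.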
9
38
10
39
{ loading=lazy }
11
40
41
+
The list of the expected values is configured in a popup window.
42
+
12
43
{ loading=lazy }
13
44
14
45
46
+
### Configure the check in YAML
47
+
DQOps stores the configuration of both the [*text_found_in_set_percent*](../checks/column/accepted_values/text-found-in-set-percent.md)
48
+
and [*number_found_in_set_percent*](../checks/column/accepted_values/number-found-in-set-percent.md) checks in a YAML file.
49
+
The following sample YAML file shows the configuration of a daily monitoring check that tests accepted values.
DQOps supports defining reusable data dictionaries.
73
+
The data dictionaries are simple CSV files without a header line.
74
+
Please read the [referencing data dictionaries](../dqo-concepts/configuring-data-quality-checks-and-rules.md#referencing-data-dictionaries)
75
+
guide to learn more.
76
+
77
+
The screens for defining data dictionaries are found in the configuration section of the DQOps user interface.
78
+
The following example shows how to add a data dictionary named *austin_counties.csv*.
79
+
36
80
{ loading=lazy }
37
81
82
+
The data dictionary list screen shows the reference token that data quality checks use to reference each dictionary.
83
+
The token to access the *austin_counties.csv* dictionary is `${dictionary://austin_counties.csv}`.
84
+
38
85
{ loading=lazy }
39
86
87
+
### Referencing dictionaries in UI
88
+
The dictionary reference is used as one of the values for the *expected_values* parameter.
89
+
DQOps supports referencing multiple data dictionaries, which are merged.
90
+
Mixing standalone values and data dictionaries is also supported.
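
To make the merging behavior concrete, here is a minimal Python sketch of how a list of expected values that mixes dictionary tokens and standalone values could be expanded. It is not DQOps code. The function names, the in-memory `dictionaries` mapping that stands in for the CSV files, the use of the first CSV column as the value, and the removal of duplicates are all assumptions made for this illustration.

```python
import csv
import io
import re
from typing import Dict, List

# Token format taken from the text: ${dictionary://<file name>}
DICTIONARY_TOKEN = re.compile(r"^\$\{dictionary://(?P<name>[^}]+)\}$")


def parse_dictionary(csv_text: str) -> List[str]:
    """Read a header-less CSV dictionary; the first cell of each row is a value."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row[0].strip() for row in reader if row and row[0].strip()]


def resolve_expected_values(entries: List[str],
                            dictionaries: Dict[str, str]) -> List[str]:
    """Expand dictionary tokens and merge them with standalone values, keeping order."""
    merged: List[str] = []
    for entry in entries:
        match = DICTIONARY_TOKEN.match(entry)
        if match:
            values = parse_dictionary(dictionaries[match.group("name")])
        else:
            values = [entry]  # a standalone value is used as-is
        for value in values:
            if value not in merged:  # assumption: duplicates are dropped
                merged.append(value)
    return merged


# Hypothetical stand-in for the dictionary file content.
dictionaries = {"austin_counties.csv": "TRAVIS\nWILLIAMSON\nHAYS\nBASTROP\n"}

expected = resolve_expected_values(
    ["${dictionary://austin_counties.csv}", "BURNET"], dictionaries)
print(expected)
```

The standalone *BURNET* entry is appended after the four values read from the dictionary, which mirrors mixing a dictionary reference with a manually typed value.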
91
+
40
92
{ loading=lazy }
41
93
42
-
```{ .yaml linenums="1" hl_lines="12-15" }
94
+
### Referencing dictionaries in YAML
95
+
When used in a YAML file, the data dictionary reference token should be wrapped in double quotes.
Despite the *_count* suffix in the names of these checks, they count the expected values
129
+
that were not found in the column. This differs from the other *_count* data quality checks in
130
+
DQOps that count rows.
131
+
132
+
### Configure the check in UI
133
+
The configuration of the [*expected_text_values_in_use_count*](../checks/column/accepted_values/expected-text-values-in-use-count.md)
134
+
check is very similar. The `max_missing` rule parameter configures the maximum number of expected values
135
+
that can be missing in the column. Set `max_missing` to 0 to test that all the expected values are in use.
59
136
60
137
{ loading=lazy }
61
138
139
+
### Configure the check in YAML
140
+
The configuration of the [*expected_text_values_in_use_count*](../checks/column/accepted_values/expected-text-values-in-use-count.md)
when an unreferenced value is configured in the expected values list. The *BURNET* value is a nearby county name.
168
+
The table does not have any rows containing that value.
169
+
170
+
The [*text_found_in_set_percent*](../checks/column/accepted_values/text-found-in-set-percent.md) check passes
171
+
because every value in the column is one of the expected county names, even though *BURNET* never appears.
172
+
The [*expected_text_values_in_use_count*](../checks/column/accepted_values/expected-text-values-in-use-count.md) check fails
173
+
because no row contains the expected *BURNET* value.
174
+
82
175
{ loading=lazy }
83
176
177
+
178
+
## Testing the most common values
179
+
The column profiling result shows the number of occurrences of each column value,
180
+
sorted from the most popular to the least popular values.
181
+
The first two counties, *TRAVIS* and *WILLIAMSON*, account for almost all service requests.
182
+
183
+
We can use the [*expected_texts_in_top_values_count*](../checks/column/accepted_values/expected-texts-in-top-values-count.md)
184
+
data quality check to ensure that these two county names are always the most common values in the column.
185
+
84
186
{ width="619px"; loading=lazy }
85
187
188
+
### Configure the check in UI
189
+
The [*expected_texts_in_top_values_count*](../checks/column/accepted_values/expected-texts-in-top-values-count.md) data quality check
190
+
has two parameters.
191
+
192
+
- The `expected_values` parameter is a list of values that should be most common in the column.
193
+
194
+
- The `top` parameter sets how many of the most common values are examined.
195
+
196
+
The value for the `top` parameter should be at least equal to the number of expected values.
197
+
When `top` equals the number of expected values, DQOps verifies that all the expected values are still the most common ones in the column.
198
+
The `top` value can also be higher. For example, we can test that two expected values are always among the column's three most common texts.
199
+
200
+
The `max_missing` rule parameter controls the tolerance for missing expected common values.
201
+
When the `max_missing` rule parameter is 0, DQOps must find all expected values at the top.
202
+
The parameter value 1 allows one missing value, and so on.
86
203
87
204
{ loading=lazy }
88
205
89
-
```{ .yaml linenums="1" hl_lines="12-15" }
206
+
### Configure the check in YAML
207
+
The configuration of the [*expected_texts_in_top_values_count*](../checks/column/accepted_values/expected-texts-in-top-values-count.md)