docs/categories-of-data-quality-checks/how-to-detect-data-quality-issues-in-text-fields.md
+165-15
@@ -1,12 +1,162 @@
1
-
# Detecting data quality issues with text
2
-
Read this guide to learn what types of data quality checks are supported in DQOps to detect issues related to text.
3
-
The data quality checks are configured in the `text` category in DQOps.
1
+
# Detecting out-of-range text values
2
+
Read this guide to learn how to find text values that are too short or too long, which are most likely invalid values stored in a database.
4
3
5
-
## Text category
6
-
Data quality checks that are detecting issues related to text are listed below.
4
+
The data quality checks that detect issues with too short or too long texts are configured in the `text` category in DQOps.
7
5
8
-
## Detecting text issues
9
-
How to detect text data quality issues.
6
+
## Text statistics
7
+
The statistics about text values are pretty simple.
8
+
We can analyze the length of text values and find the shortest or longest text values.
9
+
The statistics around text values are less sophisticated than calculating metrics for numeric values,
10
+
but observing the length of strings can still reveal many data quality issues.
11
+
12
+
### Issues with too short texts
13
+
Texts shorter than a reasonable minimum are a likely sign of data corruption.
14
+
We can't expect a phone number to be only two digits.
15
+
Likewise, an email that has just two letters must be wrong. It is not even long enough to include the domain name.
16
+
17
+
Texts that are too short can result from several causes:
18
+
19
+
- Someone accidentally truncated the text manually.
20
+
21
+
- A user entered incomplete values during data entry.
22
+
23
+
- The data was corrupted in transport.
24
+
25
+
- The data loading was interrupted because the platform ran out of disk space.
26
+
27
+
- A bug in the transformation logic truncated the text.
28
+
29
+
- The temporary variable or a column in a temporary table was too short, causing truncation.
30
+
31
+
To detect these types of possible data corruption, we can choose a reasonable minimum text length
32
+
that should fit the smallest valid value, such as a phone number.
33
+
34
+
Truncated texts that are too short lead to completeness and uniqueness issues.
35
+
If several identifiers are truncated to the same value, we will have duplicate data.
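
The link between truncation, too-short values, and duplicates can be illustrated with a minimal Python sketch. It is not DQOps code: the helper names `find_too_short` and `find_duplicates`, the sample identifiers, and the 6-character truncation are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def find_too_short(values, min_length):
    """Return the non-null values whose length is below min_length."""
    return [v for v in values if v is not None and len(v) < min_length]


def find_duplicates(values):
    """Return the distinct values that occur more than once, sorted."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n > 1)


# Hypothetical identifiers that are unique before loading.
ids = ["ORD-10021", "ORD-10022", "ORD-20001"]

# Simulate a temporary column that is only 6 characters wide.
truncated = [v[:6] for v in ids]

too_short = find_too_short(truncated, min_length=9)
duplicates = find_duplicates(truncated)
```

In this sketch every truncated identifier falls below the assumed minimum length of 9, and two of them collapse to the same value, so a single truncation bug shows up as both a length issue and a uniqueness issue.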
36
+
37
+
### Issues with too long texts
38
+
Texts that are longer than expected have different causes:
39
+
40
+
- The data is corrupted.
41
+
42
+
- Column values were concatenated together because a wrong separator was used to load the values.
43
+
44
+
- The column is used for different purposes than it was designed for.
45
+
For example, users use the phone column to enter additional comments about how to contact a person.
46
+
47
+
- The maximum column length was incorrectly estimated, and the column is unable to fit valid values, such as address lines.
48
+
49
+
Texts that are too long can also cause other problems.
50
+
51
+
- They take more storage.
52
+
53
+
- Indexes grow too big, and queries take longer to run.
54
+
55
+
- If the target column is shorter, valid texts will be truncated.
56
+
57
+
It is wise to determine the maximum length of the texts that we plan to store in a column.
58
+
A data quality check should monitor the data to find texts that are longer than expected.
59
+
60
+
## Text length checks
61
+
DQOps has several data quality checks for validating the length of texts.
62
+
They are very similar to each other, but they can still detect different types of length issues.
63
+
64
+
- The [*text_min_length*](../checks/column/text/text-min-length.md) check captures the length of the shortest text and validates it using a rule parameter.
65
+
The *actual_value* field in the data quality check results will show the length of the shortest identified text.
66
+
67
+
- The [*text_max_length*](../checks/column/text/text-max-length.md) check captures the length of the longest text and validates it using a rule parameter.
68
+
The *actual_value* field in the data quality check results will show the length of the longest identified text.
69
+
70
+
- The [*text_mean_length*](../checks/column/text/text-mean-length.md) check calculates the average text length.
71
+
The mean length is validated by a rule, and it must be in the range of accepted values.
72
+
73
+
- The [*text_length_below_min_length*](../checks/column/text/text-length-below-min-length.md) check finds texts shorter than a given minimum length.
74
+
75
+
- The [*text_length_below_min_length_percent*](../checks/column/text/text-length-below-min-length-percent.md) check counts texts shorter than a given
76
+
minimum length and calculates the percentage of too-short texts in the whole column.
77
+
78
+
- The [*text_length_above_max_length*](../checks/column/text/text-length-above-max-length.md) check finds texts longer than a given maximum length.
79
+
80
+
- The [*text_length_above_max_length_percent*](../checks/column/text/text-length-above-max-length-percent.md) check counts texts longer than a given maximum length
81
+
and calculates the percentage of too-long texts in the whole column.
82
+
83
+
- The [*text_length_in_range_percent*](../checks/column/text/text-length-in-range-percent.md) check measures the percentage of valid texts whose length
84
+
is within a minimum and maximum accepted length.
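
To make the differences between these checks concrete, the following Python sketch computes one metric per check over a small in-memory column. It is only an illustration, not how DQOps evaluates the checks. The function name `text_length_metrics`, the sample values, the inclusive length boundaries, and the choice to ignore nulls and compute percentages over non-null values are all assumptions.

```python
def text_length_metrics(values, min_length, max_length):
    """Compute illustrative text length metrics, keyed by check name.

    Assumptions: nulls are skipped, percentages are relative to the
    number of non-null values, and both length bounds are inclusive.
    """
    lengths = [len(v) for v in values if v is not None]
    total = len(lengths)
    if total == 0:
        return None  # nothing to measure

    below = sum(1 for n in lengths if n < min_length)
    above = sum(1 for n in lengths if n > max_length)
    in_range = total - below - above

    return {
        "text_min_length": min(lengths),
        "text_max_length": max(lengths),
        "text_mean_length": sum(lengths) / total,
        "text_length_below_min_length": below,
        "text_length_below_min_length_percent": 100.0 * below / total,
        "text_length_above_max_length": above,
        "text_length_above_max_length_percent": 100.0 * above / total,
        "text_length_in_range_percent": 100.0 * in_range / total,
    }


# Hypothetical phone column: one truncated value, one value with a comment.
phones = ["12", "555-0100", "555-0101 call after 5pm", None, "555-0102"]
metrics = text_length_metrics(phones, min_length=7, max_length=12)
```

The min and max metrics only report the single most extreme value, while the count and percentage metrics show how widespread the problem is, which is why the checks are similar but not interchangeable.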
85
+
86
+
### Profiling the text length
87
+
DQOps shows the text length statistics on the column's profile screen.
88
+
The values are:
89
+
90
+
- The **Text min length** shows the length of the shortest text in the column.
91
+
92
+
- The **Text max length** shows the length of the longest text in the column.
93
+
94
+
- The **Text mean length** shows the average text length.
95
+
96
+
{ loading=lazy }
97
+
98
+
## Minimum text length
99
+
The [*text_min_length*](../checks/column/text/text-min-length.md) check has two optional rule parameters.
100
+
101
+
- The **from** parameter configures the lower bound of the accepted range for the minimum text length.
102
+
103
+
- The **to** parameter configures the upper bound of the accepted range for the minimum text length.
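
The way a range rule with two optional bounds can be evaluated is sketched below in Python. This is a hypothetical illustration, not the DQOps rule implementation: the function name `passes_range_rule` is invented, `from_` stands in for the **from** parameter because `from` is a reserved word in Python, and treating both bounds as inclusive and a missing bound as unbounded is an assumption.

```python
def passes_range_rule(actual_value, from_=None, to=None):
    """Return True when actual_value lies inside the accepted range.

    Assumptions: both bounds are inclusive, and a bound that is
    not configured (None) is not enforced.
    """
    if from_ is not None and actual_value < from_:
        return False
    if to is not None and actual_value > to:
        return False
    return True


# The shortest text in a hypothetical phone column has 2 characters.
shortest_ok = passes_range_rule(2, from_=7)

# A shortest text of 10 characters fits the accepted range of 7 to 15.
in_range_ok = passes_range_rule(10, from_=7, to=15)
```

Setting only the lower bound covers the common case of detecting truncated values, while the upper bound can catch a column whose shortest value is unexpectedly long.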
104
+
105
+
### Verifying minimum text length in UI
106
+
The following screenshot shows configuring the [*text_min_length*](../checks/column/text/text-min-length.md) check
107
+
in the [DQOps data quality check editor](../dqo-concepts/dqops-user-interface-overview.md#check-editor).
108
+
109
+
{ loading=lazy }
110
+
111
+
### Verifying minimum text length in YAML
112
+
The configuration of the [*text_min_length*](../checks/column/text/text-min-length.md) check in YAML is simple.
that detects an increase or a decrease in the captured value (such as a maximum text length).
155
+
156
+
The following screenshot shows the configuration of a custom data quality check that detects changes
157
+
to the minimum or maximum text length.
158
+
159
+
{ loading=lazy }
10
160
11
161
## Use cases
12
162
|**Name of the example**|**Description**|
@@ -16,14 +166,14 @@ How to detect text data quality issues.
16
166
## List of text checks at a column level
17
167
| Data quality check name | Data quality dimension | Description | Standard check |
|[*text_max_length*](../checks/column/text/text-max-length.md)|Reasonableness|A column-level check that ensures that the length of text values in a column does not exceed the maximum accepted length.|:material-check-bold:|
20
-
|[*text_min_length*](../checks/column/text/text-min-length.md)|Reasonableness|A column-level check that ensures that the length of text in a column does not fall below the minimum accepted length.|:material-check-bold:|
21
-
|[*text_mean_length*](../checks/column/text/text-mean-length.md)|Reasonableness|A column-level check that ensures that the length of text values in a column does not exceed the mean accepted length.||
22
-
|[*text_length_below_min_length*](../checks/column/text/text-length-below-min-length.md)|Reasonableness|A column-level check that ensures that the number of text values in the monitored column with a length below the length defined by the user as a parameter does not exceed set thresholds.||
23
-
|[*text_length_below_min_length_percent*](../checks/column/text/text-length-below-min-length-percent.md)|Reasonableness|A column-level check that ensures that the percentage of text values in the monitored column with a length below the length defined by the user as a parameter does not fall below set thresholds.||
24
-
|[*text_length_above_max_length*](../checks/column/text/text-length-above-max-length.md)|Reasonableness|A column-level check that ensures that the number of text values in the monitored column with a length above the length defined by the user as a parameter does not exceed set thresholds.||
25
-
|[*text_length_above_max_length_percent*](../checks/column/text/text-length-above-max-length-percent.md)|Reasonableness|A column-level check that ensures that the percentage of text values in the monitored column with a length above the length defined by the user as a parameter does not fall below set thresholds.||
26
-
|[*text_length_in_range_percent*](../checks/column/text/text-length-in-range-percent.md)|Reasonableness|Column check that calculates the percentage of text values with a length below the indicated by the user length in a monitored column.||
169
+
|[*text_min_length*](../checks/column/text/text-min-length.md)|Reasonableness|This check finds the length of the shortest text in a column. DQOps validates the shortest length using a range rule. DQOps raises an issue when the minimum text length is outside a range of accepted values.|:material-check-bold:|
170
+
|[*text_max_length*](../checks/column/text/text-max-length.md)|Reasonableness|This check finds the length of the longest text in a column. DQOps validates the maximum length using a range rule. DQOps raises an issue when the maximum text length is outside a range of accepted values.|:material-check-bold:|
171
+
|[*text_mean_length*](../checks/column/text/text-mean-length.md)|Reasonableness|This check calculates the average text length in a column. DQOps validates the mean length using a range rule. DQOps raises an issue when the mean text length is outside a range of accepted values.||
172
+
|[*text_length_below_min_length*](../checks/column/text/text-length-below-min-length.md)|Reasonableness|This check finds texts that are shorter than the minimum accepted text length. It counts the number of texts that are too short and raises a data quality issue when too many invalid texts are found.||
173
+
|[*text_length_below_min_length_percent*](../checks/column/text/text-length-below-min-length-percent.md)|Reasonableness|This check finds texts that are shorter than the minimum accepted text length. It measures the percentage of too short texts and raises a data quality issue when too many invalid texts are found.||
174
+
|[*text_length_above_max_length*](../checks/column/text/text-length-above-max-length.md)|Reasonableness|This check finds texts that are longer than the maximum accepted text length. It counts the number of texts that are too long and raises a data quality issue when too many invalid texts are found.||
175
+
|[*text_length_above_max_length_percent*](../checks/column/text/text-length-above-max-length-percent.md)|Reasonableness|This check finds texts that are longer than the maximum accepted text length. It measures the percentage of texts that are too long and raises a data quality issue when too many invalid texts are found.||
176
+
|[*text_length_in_range_percent*](../checks/column/text/text-length-in-range-percent.md)|Reasonableness|This check verifies that the minimum and maximum lengths of text values are in the range of accepted values. It measures the percentage of texts with a valid length and raises a data quality issue when an insufficient number of texts have a valid length.||
docs/checks/column/index.md
+17-10
@@ -705,43 +705,50 @@ A column-level check that detects if the data type of the column has changed sin
705
705
## column-level text checks
706
706
Validates that the data in a text column has a valid range.
707
707
708
-
### [text max length](./text/text-max-length.md)
709
-
A column-level check that ensures that the length of text values in a column does not exceed the maximum accepted length.
708
+
### [text min length](./text/text-min-length.md)
709
+
This check finds the length of the shortest text in a column. DQOps validates the shortest length using a range rule.
710
+
DQOps raises an issue when the minimum text length is outside a range of accepted values.
710
711
711
712
712
713
713
-
### [text min length](./text/text-min-length.md)
714
-
A column-level check that ensures that the length of text in a column does not fall below the minimum accepted length.
714
+
### [text max length](./text/text-max-length.md)
715
+
This check finds the length of the longest text in a column. DQOps validates the maximum length using a range rule.
716
+
DQOps raises an issue when the maximum text length is outside a range of accepted values.
715
717
716
718
717
719
718
720
### [text mean length](./text/text-mean-length.md)
719
-
A column-level check that ensures that the length of text values in a column does not exceed the mean accepted length.
721
+
This check calculates the average text length in a column. DQOps validates the mean length using a range rule.
722
+
DQOps raises an issue when the mean text length is outside a range of accepted values.
720
723
721
724
722
725
723
726
### [text length below min length](./text/text-length-below-min-length.md)
724
-
A column-level check that ensures that the number of text values in the monitored column with a length below the length defined by the user as a parameter does not exceed set thresholds.
727
+
This check finds texts that are shorter than the minimum accepted text length. It counts the number of texts that are too short and raises a data quality issue when too many invalid texts are found.
725
728
726
729
727
730
728
731
### [text length below min length percent](./text/text-length-below-min-length-percent.md)
729
-
A column-level check that ensures that the percentage of text values in the monitored column with a length below the length defined by the user as a parameter does not fall below set thresholds.
732
+
This check finds texts that are shorter than the minimum accepted text length.
733
+
It measures the percentage of too short texts and raises a data quality issue when too many invalid texts are found.
730
734
731
735
732
736
733
737
### [text length above max length](./text/text-length-above-max-length.md)
734
-
A column-level check that ensures that the number of text values in the monitored column with a length above the length defined by the user as a parameter does not exceed set thresholds.
738
+
This check finds texts that are longer than the maximum accepted text length.
739
+
It counts the number of texts that are too long and raises a data quality issue when too many invalid texts are found.
735
740
736
741
737
742
738
743
### [text length above max length percent](./text/text-length-above-max-length-percent.md)
739
-
A column-level check that ensures that the percentage of text values in the monitored column with a length above the length defined by the user as a parameter does not fall below set thresholds.
744
+
This check finds texts that are longer than the maximum accepted text length.
745
+
It measures the percentage of texts that are too long and raises a data quality issue when too many invalid texts are found.
740
746
741
747
742
748
743
749
### [text length in range percent](./text/text-length-in-range-percent.md)
744
-
Column check that calculates the percentage of text values with a length below the indicated by the user length in a monitored column.
750
+
This check verifies that the minimum and maximum lengths of text values are in the range of accepted values.
751
+
It measures the percentage of texts with a valid length and raises a data quality issue when an insufficient number of texts have a valid length.