Commit 79cfe7c

committed
The guide of using text statistics checks. Also the text min length and text max length checks now are configured by a range.
1 parent e219741 commit 79cfe7c

File tree

55 files changed

+778
-548
lines changed


docs/categories-of-data-quality-checks/how-to-detect-data-quality-issues-in-text-fields.md

+165-15
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,162 @@
1-
# Detecting data quality issues with text
2-
Read this guide to learn what types of data quality checks are supported in DQOps to detect issues related to text.
3-
The data quality checks are configured in the `text` category in DQOps.
1+
# Detecting out-of-range text values
2+
Read this guide to learn how to find text values that are too short or too long, which are most likely invalid values stored in a database.
43

5-
## Text category
6-
Data quality checks that are detecting issues related to text are listed below.
4+
The data quality checks that detect issues with too short or too long texts are configured in the `text` category in DQOps.
75

8-
## Detecting text issues
9-
How to detect text data quality issues.
6+
## Text statistics
7+
The statistics about text values are pretty simple.
8+
We can analyze the length of text values and find the shortest or longest text values.
9+
The statistics around text values are less sophisticated than calculating metrics for numeric values,
10+
but observing the length of strings can still reveal many data quality issues.
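
The three statistics discussed here can be sketched in a few lines of Python. This is only an illustration of what is measured, not the way DQOps computes it; the function name `text_length_stats` and the choice to skip null values are assumptions made for this example.

```python
from statistics import mean
from typing import Iterable, Optional


def text_length_stats(values: Iterable[Optional[str]]) -> dict:
    """Return the shortest, longest and average length of the non-null texts."""
    # Assumption: null values are ignored rather than counted as length 0.
    lengths = [len(value) for value in values if value is not None]
    if not lengths:
        return {"min_length": None, "max_length": None, "mean_length": None}
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
    }


# Hypothetical phone numbers; "12" is clearly too short to be valid.
phone_numbers = ["+1 202 555 0143", "12", None, "+44 20 7946 0958"]
stats = text_length_stats(phone_numbers)
print(stats)
```

A minimum length of 2 in a phone number column stands out immediately, even though the mean length still looks plausible.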
11+
12+
### Issues with too short texts
13+
Texts shorter than a reasonable minimum are a likely sign of data corruption.
14+
We can't expect a phone number to be only two digits.
15+
Likewise, an email address that has just two letters must be wrong. It is not even long enough to include the domain name.
16+
17+
Texts that are too short usually have one of the following causes:
18+
19+
- Someone accidentally truncated the text manually.
20+
21+
- A user entered incomplete values during data entry.
22+
23+
- The data was corrupted in transport.
24+
25+
- The data loading was interrupted because the platform ran out of disk space.
26+
27+
- A bug in the transformation logic truncated the text.
28+
29+
- The temporary variable or a column in a temporary table was too short, causing truncation.
30+
31+
To detect these types of possible data corruption, we can choose a reasonable minimum text length
32+
that should fit the smallest valid value, such as a phone number.
33+
34+
Truncated texts that are too short lead to completeness and uniqueness issues.
35+
If several identifiers are truncated to the same value, we will have duplicate data.
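
The link between truncation and duplicates can be shown with a small sketch. The identifiers and the helper `find_collisions` are hypothetical and exist only for this illustration.

```python
from collections import Counter
from typing import Dict, List


def find_collisions(identifiers: List[str], max_length: int) -> Dict[str, int]:
    """Truncate each identifier to max_length and return the values that repeat."""
    truncated = [identifier[:max_length] for identifier in identifiers]
    counts = Counter(truncated)
    return {value: count for value, count in counts.items() if count > 1}


# Hypothetical customer identifiers that are unique before truncation.
customer_ids = ["CUST-000123", "CUST-000124", "CUST-000999"]

# Simulate a temporary column that keeps only the first 9 characters.
collisions = find_collisions(customer_ids, max_length=9)
print(collisions)
```

Two distinct identifiers collapse into the same truncated value, which a uniqueness check would then report as duplicates.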
36+
37+
### Issues with too long texts
38+
Texts that are longer than expected are caused by other problems.
39+
40+
- The data is corrupted.
41+
42+
- Column values were concatenated together because a wrong separator was used to load the values.
43+
44+
- The column is used for different purposes than it was designed for.
45+
Users use the phone column to enter additional comments about how to contact a person.
46+
47+
- The maximum column length was underestimated, so the column cannot fit valid values, such as address lines.
48+
49+
Texts that are too long can also cause other problems.
50+
51+
- They take more storage.
52+
53+
- Indexes are growing too big, and queries take longer to run.
54+
55+
- The target column length is shorter, and valid texts will be truncated.
56+
57+
It is wise to determine the maximum length of the texts that we plan to store in a column.
58+
A data quality check should monitor the data to find texts that are longer than expected.
59+
60+
## Text length checks
61+
DQOps has several data quality checks for validating the length of texts.
62+
They are very similar to each other, but they can still detect different types of length issues.
63+
64+
- The [*text_min_length*](../checks/column/text/text-min-length.md) check captures the length of the shortest text and validates it using a rule parameter.
65+
The *actual_value* field in the data quality check results will show the length of the shortest identified text.
66+
67+
- The [*text_max_length*](../checks/column/text/text-max-length.md) check captures the length of the longest text and validates it using a rule parameter.
68+
The *actual_value* field in the data quality check results will show the length of the longest identified text.
69+
70+
- The [*text_mean_length*](../checks/column/text/text-mean-length.md) check calculates the average text length.
71+
The mean length is validated by a rule, and it must be in the range of accepted values.
72+
73+
- The [*text_length_below_min_length*](../checks/column/text/text-length-below-min-length.md) check finds texts shorter than a given minimum length.
74+
75+
- The [*text_length_below_min_length_percent*](../checks/column/text/text-length-below-min-length-percent.md) check counts texts shorter than a given
76+
minimum length and calculates the percentage of too-short texts in the whole column.
77+
78+
- The [*text_length_above_max_length*](../checks/column/text/text-length-above-max-length.md) check finds texts longer than a given maximum length.
79+
80+
- The [*text_length_above_max_length_percent*](../checks/column/text/text-length-above-max-length-percent.md) check counts texts longer than a given maximum length
81+
and calculates the percentage of too-long texts in the whole column.
82+
83+
- The [*text_length_in_range_percent*](../checks/column/text/text-length-in-range-percent.md) check measures the percentage of valid texts whose length
84+
is within a minimum and maximum accepted length.
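
The counting and percentage checks in the list above differ only in what they report. The following sketch computes all of these measures side by side. It is an illustration, not the DQOps sensor logic: the function name, the exclusion of null values from the total, and the inclusive bounds are assumptions.

```python
from typing import Iterable, Optional


def length_check_metrics(values: Iterable[Optional[str]],
                         min_length: int,
                         max_length: int) -> dict:
    """Count texts that are too short or too long and derive the percentages."""
    if min_length > max_length:
        raise ValueError("min_length must not exceed max_length")

    # Assumption: null values are not counted as texts.
    texts = [value for value in values if value is not None]
    total = len(texts)
    below = sum(1 for text in texts if len(text) < min_length)
    above = sum(1 for text in texts if len(text) > max_length)
    in_range = total - below - above

    def percent(count: int) -> Optional[float]:
        return 100.0 * count / total if total else None

    return {
        "below_min_count": below,
        "below_min_percent": percent(below),
        "above_max_count": above,
        "above_max_percent": percent(above),
        "in_range_percent": percent(in_range),
    }


# Hypothetical column that should contain two-letter state codes.
state_codes = ["TX", "CA", "Texas", "N", None]
metrics = length_check_metrics(state_codes, min_length=2, max_length=2)
print(metrics)
```

The count-based measures suit columns where even a single invalid value matters, while the percentage-based measures are easier to apply to large tables where a small share of invalid texts is tolerated.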
85+
86+
### Profiling the text length
87+
DQOps shows the text length statistics on the column's profile screen.
88+
The values are:
89+
90+
- The **Text min length** shows the length of the shortest text in the column.
91+
92+
- The **Text max length** shows the length of the longest text in the column.
93+
94+
- The **Text mean length** shows the average text length.
95+
96+
![Data profiling a text column length in DQOps](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/data-profiling-text-column-length-in-dqops-min.png){ loading=lazy }
97+
98+
## Minimum text length
99+
The [*text_min_length*](../checks/column/text/text-min-length.md) check has two optional rule parameters.
100+
101+
- The **from** parameter configures the lower bound of the accepted range for the minimum text length.
102+
103+
- The **to** parameter configures the upper bound of the accepted range for the minimum text length.
104+
105+
### Verifying minimum text length in UI
106+
The following screenshot shows configuring the [*text_min_length*](../checks/column/text/text-min-length.md) check
107+
in the [DQOps data quality check editor](../dqo-concepts/dqops-user-interface-overview.md#check-editor).
108+
109+
![Configuring minimum text length in range data quality check in DQOps](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/min-text-length-in-range-data-quality-check-in-dqops-min.png){ loading=lazy }
110+
111+
### Verifying minimum text length in YAML
112+
The configuration of the [*text_min_length*](../checks/column/text/text-min-length.md) check in YAML is simple.
113+
114+
``` { .yaml linenums="1" hl_lines="13-16" }
115+
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
116+
apiVersion: dqo/v1
117+
kind: table
118+
spec:
119+
  columns:
120+
    state_name:
121+
      type_snapshot:
122+
        column_type: STRING
123+
        nullable: true
124+
      monitoring_checks:
125+
        daily:
126+
          text:
127+
            daily_text_min_length:
128+
              warning:
129+
                from: 4
130+
                to: 6
131+
```
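
How a range rule with two optional bounds could evaluate the captured value is sketched below. This is not the DQOps rule implementation: the function name `passes_range_rule` and the inclusive comparison on both bounds are assumptions made for the example.

```python
from typing import Optional


def passes_range_rule(actual_value: float,
                      range_from: Optional[float] = None,
                      range_to: Optional[float] = None) -> bool:
    """Return True when actual_value lies inside the configured range.

    A bound that is not configured (None) is not checked.
    Assumption: both bounds are inclusive.
    """
    if range_from is not None and actual_value < range_from:
        return False
    if range_to is not None and actual_value > range_to:
        return False
    return True


# With the warning range from the YAML sample (from: 4, to: 6):
print(passes_range_rule(4, range_from=4, range_to=6))  # shortest text has 4 characters
print(passes_range_rule(2, range_from=4, range_to=6))  # shortest text has 2 characters
```

Configuring only one of the two bounds turns the range into a simple lower or upper limit.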
132+
133+
## Text length anomalies
134+
DQOps does not provide built-in anomaly detection checks for the text length because
135+
that would extend the list of supported data quality checks, making the platform too complex to learn.
136+
137+
Instead, it is effortless to customize the built-in checks by combining the sensors:
138+
139+
- [*column/text/text_min_length*](../reference/sensors/column/text-column-sensors.md#text-min-length)
140+
sensor that finds the length of the shortest text,
141+
142+
- [*column/text/text_max_length*](../reference/sensors/column/text-column-sensors.md#text-max-length)
143+
sensor that finds the length of the longest text,
144+
145+
- [*column/text/text_mean_length*](../reference/sensors/column/text-column-sensors.md#text-mean-length)
146+
sensor that calculates the average length.
147+
148+
And one of the anomaly or change detection rules:
149+
150+
- [*percentile/anomaly_stationary_percentile_moving_average*](../reference/rules/Percentile.md#anomaly-stationary-percentile-moving-average)
151+
that finds anomalies in a 90-day time window,
152+
153+
- [*change/change_percent_1_day*](../reference/rules/Change.md#change-percent-1-day)
154+
that detects an increase or a decrease in the captured value (such as a maximum text length).
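
The idea behind a day-over-day change rule can be sketched as follows. The functions below are hypothetical and only illustrate the comparison of two consecutive readings; they are not the code of the *change_percent_1_day* rule, and the strict "greater than" comparison is an assumption.

```python
def percent_change(previous: float, current: float) -> float:
    """Return the absolute change between two readings as a percentage of the previous one."""
    if previous == 0:
        raise ValueError("previous value must be non-zero to compute a percent change")
    return 100.0 * abs(current - previous) / abs(previous)


def change_exceeds(previous: float, current: float, max_percent: float) -> bool:
    """Return True when the change between two readings is larger than max_percent."""
    return percent_change(previous, current) > max_percent


# Hypothetical readings of the maximum text length on two consecutive days.
yesterday_max_length = 50
today_max_length = 120

print(percent_change(yesterday_max_length, today_max_length))
print(change_exceeds(yesterday_max_length, today_max_length, max_percent=10.0))
```

A sudden jump in the maximum text length like this one often points to concatenated columns or comments entered into the wrong field.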
155+
156+
The following screenshot shows the configuration of a custom data quality check that detects changes
157+
to the minimum or maximum text length.
158+
159+
![Creating custom data quality check that detects anomalies in the maximum text length](https://dqops.com/docs/images/concepts/categories-of-data-quality-checks/creating-custom-text-maximum-length-anomaly-detection-check-dqops-min.png){ loading=lazy }
10160

11161
## Use cases
12162
| **Name of the example** | **Description** |
@@ -16,14 +166,14 @@ How to detect text data quality issues.
16166
## List of text checks at a column level
17167
| Data quality check name | Data quality dimension | Description | Standard check |
18168
|-------------------------|------------------------|-------------|-------|
19-
|[*text_max_length*](../checks/column/text/text-max-length.md)|Reasonableness|A column-level check that ensures that the length of text values in a column does not exceed the maximum accepted length.|:material-check-bold:|
20-
|[*text_min_length*](../checks/column/text/text-min-length.md)|Reasonableness|A column-level check that ensures that the length of text in a column does not fall below the minimum accepted length.|:material-check-bold:|
21-
|[*text_mean_length*](../checks/column/text/text-mean-length.md)|Reasonableness|A column-level check that ensures that the length of text values in a column does not exceed the mean accepted length.| |
22-
|[*text_length_below_min_length*](../checks/column/text/text-length-below-min-length.md)|Reasonableness|A column-level check that ensures that the number of text values in the monitored column with a length below the length defined by the user as a parameter does not exceed set thresholds.| |
23-
|[*text_length_below_min_length_percent*](../checks/column/text/text-length-below-min-length-percent.md)|Reasonableness|A column-level check that ensures that the percentage of text values in the monitored column with a length below the length defined by the user as a parameter does not fall below set thresholds.| |
24-
|[*text_length_above_max_length*](../checks/column/text/text-length-above-max-length.md)|Reasonableness|A column-level check that ensures that the number of text values in the monitored column with a length above the length defined by the user as a parameter does not exceed set thresholds.| |
25-
|[*text_length_above_max_length_percent*](../checks/column/text/text-length-above-max-length-percent.md)|Reasonableness|A column-level check that ensures that the percentage of text values in the monitored column with a length above the length defined by the user as a parameter does not fall below set thresholds.| |
26-
|[*text_length_in_range_percent*](../checks/column/text/text-length-in-range-percent.md)|Reasonableness|Column check that calculates the percentage of text values with a length below the indicated by the user length in a monitored column.| |
169+
|[*text_min_length*](../checks/column/text/text-min-length.md)|Reasonableness|This check finds the length of the shortest text in a column. DQOps validates the shortest length using a range rule. DQOps raises an issue when the minimum text length is outside a range of accepted values.|:material-check-bold:|
170+
|[*text_max_length*](../checks/column/text/text-max-length.md)|Reasonableness|This check finds the length of the longest text in a column. DQOps validates the maximum length using a range rule. DQOps raises an issue when the maximum text length is outside a range of accepted values.|:material-check-bold:|
171+
|[*text_mean_length*](../checks/column/text/text-mean-length.md)|Reasonableness|This check calculates the average text length in a column. DQOps validates the mean length using a range rule. DQOps raises an issue when the mean text length is outside a range of accepted values.| |
172+
|[*text_length_below_min_length*](../checks/column/text/text-length-below-min-length.md)|Reasonableness|This check finds texts that are shorter than the minimum accepted text length. It counts the number of texts that are too short and raises a data quality issue when too many invalid texts are found.| |
173+
|[*text_length_below_min_length_percent*](../checks/column/text/text-length-below-min-length-percent.md)|Reasonableness|This check finds texts that are shorter than the minimum accepted text length. It measures the percentage of too short texts and raises a data quality issue when too many invalid texts are found.| |
174+
|[*text_length_above_max_length*](../checks/column/text/text-length-above-max-length.md)|Reasonableness|This check finds texts that are longer than the maximum accepted text length. It counts the number of texts that are too long and raises a data quality issue when too many invalid texts are found.| |
175+
|[*text_length_above_max_length_percent*](../checks/column/text/text-length-above-max-length-percent.md)|Reasonableness|This check finds texts that are longer than the maximum accepted text length. It measures the percentage of texts that are too long and raises a data quality issue when too many invalid texts are found.| |
176+
|[*text_length_in_range_percent*](../checks/column/text/text-length-in-range-percent.md)|Reasonableness|This check verifies that the lengths of text values are within the accepted minimum and maximum lengths. It measures the percentage of texts with a valid length and raises a data quality issue when an insufficient number of texts have a valid length.| |
27177

28178

29179
**Reference and samples**

docs/checks/column/index.md

+17-10
Original file line numberDiff line numberDiff line change
@@ -705,43 +705,50 @@ A column-level check that detects if the data type of the column has changed sin
705705
## column-level text checks
706706
Validates that the data in a text column has a valid range.
707707

708-
### [text max length](./text/text-max-length.md)
709-
A column-level check that ensures that the length of text values in a column does not exceed the maximum accepted length.
708+
### [text min length](./text/text-min-length.md)
709+
This check finds the length of the shortest text in a column. DQOps validates the shortest length using a range rule.
710+
DQOps raises an issue when the minimum text length is outside a range of accepted values.
710711

711712

712713

713-
### [text min length](./text/text-min-length.md)
714-
A column-level check that ensures that the length of text in a column does not fall below the minimum accepted length.
714+
### [text max length](./text/text-max-length.md)
715+
This check finds the length of the longest text in a column. DQOps validates the maximum length using a range rule.
716+
DQOps raises an issue when the maximum text length is outside a range of accepted values.
715717

716718

717719

718720
### [text mean length](./text/text-mean-length.md)
719-
A column-level check that ensures that the length of text values in a column does not exceed the mean accepted length.
721+
This check calculates the average text length in a column. DQOps validates the mean length using a range rule.
722+
DQOps raises an issue when the mean text length is outside a range of accepted values.
720723

721724

722725

723726
### [text length below min length](./text/text-length-below-min-length.md)
724-
A column-level check that ensures that the number of text values in the monitored column with a length below the length defined by the user as a parameter does not exceed set thresholds.
727+
This check finds texts that are shorter than the minimum accepted text length. It counts the number of texts that are too short and raises a data quality issue when too many invalid texts are found.
725728

726729

727730

728731
### [text length below min length percent](./text/text-length-below-min-length-percent.md)
729-
A column-level check that ensures that the percentage of text values in the monitored column with a length below the length defined by the user as a parameter does not fall below set thresholds.
732+
This check finds texts that are shorter than the minimum accepted text length.
733+
It measures the percentage of too short texts and raises a data quality issue when too many invalid texts are found.
730734

731735

732736

733737
### [text length above max length](./text/text-length-above-max-length.md)
734-
A column-level check that ensures that the number of text values in the monitored column with a length above the length defined by the user as a parameter does not exceed set thresholds.
738+
This check finds texts that are longer than the maximum accepted text length.
739+
It counts the number of texts that are too long and raises a data quality issue when too many invalid texts are found.
735740

736741

737742

738743
### [text length above max length percent](./text/text-length-above-max-length-percent.md)
739-
A column-level check that ensures that the percentage of text values in the monitored column with a length above the length defined by the user as a parameter does not fall below set thresholds.
744+
This check finds texts that are longer than the maximum accepted text length.
745+
It measures the percentage of texts that are too long and raises a data quality issue when too many invalid texts are found.
740746

741747

742748

743749
### [text length in range percent](./text/text-length-in-range-percent.md)
744-
Column check that calculates the percentage of text values with a length below the indicated by the user length in a monitored column.
750+
This check verifies that the lengths of text values are within the accepted minimum and maximum lengths.
751+
It measures the percentage of texts with a valid length and raises a data quality issue when an insufficient number of texts have a valid length.
745752

746753

747754
