Commit 7a4fa87

[Feature] Update is_(not)_in_range (#87) to support max/min limits from col (#153)
## Changes
The `is_in_range` and `is_not_in_range` column functions now accept a column (name or expression) as the min/max limit, in addition to a literal value.

### Linked issues
Resolves #87

### Tests
- [x] manually tested
- [x] added unit tests
- [x] added integration tests

---------

Co-authored-by: Marcin Wojtyczka <marcin.wojtyczka@databricks.com>
1 parent c6d9c5f commit 7a4fa87

3 files changed

+105 -58 lines changed

docs/dqx/docs/reference.mdx

+20 -20
@@ -13,26 +13,26 @@ This page provides a reference for the quality rule functions (checks) available
 
 The following quality rules / functions are currently available:
 
-| Check | Description | Arguments |
-| -------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| is_not_null | Check if input column is not null | col_name: column name to check |
-| is_not_empty | Check if input column is not empty | col_name: column name to check |
-| is_not_null_and_not_empty | Check if input column is not null or empty | col_name: column name to check; trim_strings: boolean flag to trim spaces from strings |
-| value_is_in_list | Check if the provided value is present in the input column. | col_name: column name to check; allowed: list of allowed values |
-| value_is_not_null_and_is_in_list | Check if provided value is present if the input column is not null | col_name: column name to check; allowed: list of allowed values |
-| is_not_null_and_not_empty_array | Check if input array column is not null or empty | col_name: column name to check |
-| is_in_range | Check if input column is in the provided range (inclusive of both boundaries) | col_name: column name to check; min_limit: min limit; max_limit: max limit |
-| is_not_in_range | Check if input column is not within defined range (inclusive of both boundaries) | col_name: column name to check; min_limit: min limit value; max_limit: max limit value |
-| not_less_than | Check if input column is not less than the provided limit | col_name: column name to check; limit: limit value |
-| not_greater_than | Check if input column is not greater than the provided limit | col_name: column name to check; limit: limit value |
-| is_valid_date | Check if input column is a valid date | col_name: column name to check; date_format: date format (e.g. 'yyyy-mm-dd') |
-| is_valid_timestamp | Check if input column is a valid timestamp | col_name: column name to check; timestamp_format: timestamp format (e.g. 'yyyy-mm-dd HH:mm:ss') |
-| not_in_future | Check if input column defined as date is not in the future (future defined as current_timestamp + offset) | col_name: column name to check; offset: offset to use; curr_timestamp: current timestamp, if not provided current_timestamp() function is used |
-| not_in_near_future | Check if input column defined as date is not in the near future (near future defined as grater than current timestamp but less than current timestamp + offset) | col_name: column name to check; offset: offset to use; curr_timestamp: current timestamp, if not provided current_timestamp() function is used |
-| is_older_than_n_days | Check if input column is older than n number of days | col_name: column name to check; days: number of days; curr_date: current date, if not provided current_date() function is used |
-| is_older_than_col2_for_n_days | Check if one column is not older than another column by n number of days | col_name1: first column name to check; col_name2: second column name to check; days: number of days |
-| regex_match | Check if input column matches a given regex | col_name: column name to check; regex: regex to check; negate: if the condition should be negated (true) or not |
-| sql_expression | Check if input column is matches the provided sql expression, eg. a = 'str1', a > b | expression: sql expression to check; msg: optional message to output; name: optional name of the resulting column; negate: if the condition should be negated |
+| Check | Description | Arguments |
+| -------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| is_not_null | Check if input column is not null | col_name: column name to check |
+| is_not_empty | Check if input column is not empty | col_name: column name to check |
+| is_not_null_and_not_empty | Check if input column is not null or empty | col_name: column name to check; trim_strings: boolean flag to trim spaces from strings |
+| value_is_in_list | Check if the provided value is present in the input column. | col_name: column name to check; allowed: list of allowed values |
+| value_is_not_null_and_is_in_list | Check if provided value is present if the input column is not null | col_name: column name to check; allowed: list of allowed values |
+| is_not_null_and_not_empty_array | Check if input array column is not null or empty | col_name: column name to check |
+| is_in_range | Check if input column is in the provided range (inclusive of both boundaries) | col_name: column name to check; min_limit: min limit value; max_limit: max limit value; min_limit_col_expr: min limit column name or expr; max_limit_col_expr: max limit column name or expr |
+| is_not_in_range | Check if input column is not within defined range (inclusive of both boundaries) | col_name: column name to check; min_limit: min limit value; max_limit: max limit value; min_limit_col_expr: min limit column name or expr; max_limit_col_expr: max limit column name or expr |
+| not_less_than | Check if input column is not less than the provided limit | col_name: column name to check; limit: limit value |
+| not_greater_than | Check if input column is not greater than the provided limit | col_name: column name to check; limit: limit value |
+| is_valid_date | Check if input column is a valid date | col_name: column name to check; date_format: date format (e.g. 'yyyy-mm-dd') |
+| is_valid_timestamp | Check if input column is a valid timestamp | col_name: column name to check; timestamp_format: timestamp format (e.g. 'yyyy-mm-dd HH:mm:ss') |
+| not_in_future | Check if input column defined as date is not in the future (future defined as current_timestamp + offset) | col_name: column name to check; offset: offset to use; curr_timestamp: current timestamp, if not provided current_timestamp() function is used |
+| not_in_near_future | Check if input column defined as date is not in the near future (near future defined as grater than current timestamp but less than current timestamp + offset) | col_name: column name to check; offset: offset to use; curr_timestamp: current timestamp, if not provided current_timestamp() function is used |
+| is_older_than_n_days | Check if input column is older than n number of days | col_name: column name to check; days: number of days; curr_date: current date, if not provided current_date() function is used |
+| is_older_than_col2_for_n_days | Check if one column is not older than another column by n number of days | col_name1: first column name to check; col_name2: second column name to check; days: number of days |
+| regex_match | Check if input column matches a given regex | col_name: column name to check; regex: regex to check; negate: if the condition should be negated (true) or not |
+| sql_expression | Check if input column is matches the provided sql expression, eg. a = 'str1', a > b | expression: sql expression to check; msg: optional message to output; name: optional name of the resulting column; negate: if the condition should be negated |
 
 You can check implementation details of the rules [here](https://github.com/databrickslabs/dqx/blob/main/src/databricks/labs/dqx/col_functions.py).
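
To make the two new arguments in the table concrete, here is a minimal sketch of check definitions that take one bound from a literal and the other from a column. It assumes DQX's metadata-style check format (a list of dicts with `criticality` and `check` keys); the column names `order_qty`, `max_qty`, `discount`, `min_discount`, and `max_discount` are hypothetical. The definitions are only constructed here, not applied to a DataFrame.

```python
# Hypothetical check definitions; all column names are illustrative only.
checks = [
    {
        "criticality": "error",
        "check": {
            "function": "is_in_range",
            "arguments": {
                "col_name": "order_qty",
                "min_limit": 1,                   # literal lower bound
                "max_limit_col_expr": "max_qty",  # upper bound read from a column
            },
        },
    },
    {
        "criticality": "warn",
        "check": {
            "function": "is_not_in_range",
            "arguments": {
                "col_name": "discount",
                "min_limit_col_expr": "min_discount",  # both bounds read from columns
                "max_limit_col_expr": "max_discount",
            },
        },
    },
]

# Each bound must be supplied either as a literal or as a column expression.
for check in checks:
    args = check["check"]["arguments"]
    assert "min_limit" in args or "min_limit_col_expr" in args
    assert "max_limit" in args or "max_limit_col_expr" in args
```

String values passed as `*_col_expr` are shown as plain column names because the implementation in this commit wraps strings with `F.col`.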

src/databricks/labs/dqx/col_functions.py

+54 -16
@@ -281,20 +281,53 @@ def not_greater_than(col_name: str, limit: int | datetime.date | datetime.dateti
     )
 
 
+def _get_min_max_column_expr(
+    min_limit: int | datetime.date | datetime.datetime | str | None = None,
+    max_limit: int | datetime.date | datetime.datetime | str | None = None,
+    min_limit_col_expr: str | Column | None = None,
+    max_limit_col_expr: str | Column | None = None,
+) -> tuple[Column, Column]:
+    """Helper function to create a condition for the is_(not)_in_range functions.
+
+    :param min_limit: min limit value
+    :param max_limit: max limit value
+    :param min_limit_col_expr: min limit column name or expr
+    :param max_limit_col_expr: max limit column name or expr
+    :return: tuple containing min_limit_expr and max_limit_expr
+    :raises: ValueError when both min_limit/min_limit_col_expr or max_limit/max_limit_col_expr are null
+    """
+    if (min_limit is None and min_limit_col_expr is None) or (max_limit is None and max_limit_col_expr is None):
+        raise ValueError('Either min_limit / min_limit_col_expr or max_limit / max_limit_col_expr is empty')
+    if min_limit_col_expr is None:
+        min_limit_expr = F.lit(min_limit)
+    else:
+        min_limit_expr = F.col(min_limit_col_expr) if isinstance(min_limit_col_expr, str) else min_limit_col_expr
+    if max_limit_col_expr is None:
+        max_limit_expr = F.lit(max_limit)
+    else:
+        max_limit_expr = F.col(max_limit_col_expr) if isinstance(max_limit_col_expr, str) else max_limit_col_expr
+    return (min_limit_expr, max_limit_expr)
+
+
 def is_in_range(
     col_name: str,
-    min_limit: int | datetime.date | datetime.datetime,
-    max_limit: int | datetime.date | datetime.datetime,
+    min_limit: int | datetime.date | datetime.datetime | str | None = None,
+    max_limit: int | datetime.date | datetime.datetime | str | None = None,
+    min_limit_col_expr: str | Column | None = None,
+    max_limit_col_expr: str | Column | None = None,
 ) -> Column:
     """Creates a condition column that checks if a value is smaller than min limit or greater than max limit.
 
     :param col_name: column name
-    :param min_limit: min limit
-    :param max_limit: max limit
+    :param min_limit: min limit value
+    :param max_limit: max limit value
+    :param min_limit_col_expr: min limit column name or expr
+    :param max_limit_col_expr: max limit column name or expr
     :return: new Column
     """
-    min_limit_expr = F.lit(min_limit)
-    max_limit_expr = F.lit(max_limit)
+    min_limit_expr, max_limit_expr = _get_min_max_column_expr(
+        min_limit, max_limit, min_limit_col_expr, max_limit_col_expr
+    )
     condition = (F.col(col_name) < min_limit_expr) | (F.col(col_name) > max_limit_expr)
 
     return make_condition(
@@ -304,9 +337,9 @@ def is_in_range(
             F.lit("Value"),
             F.col(col_name),
             F.lit("not in range: ["),
-            F.lit(min_limit).cast("string"),
+            min_limit_expr.cast("string"),
             F.lit(","),
-            F.lit(max_limit).cast("string"),
+            max_limit_expr.cast("string"),
             F.lit("]"),
         ),
         f"{col_name}_not_in_range",
@@ -315,18 +348,23 @@ def is_in_range(
 
 def is_not_in_range(
     col_name: str,
-    min_limit: int | datetime.date | datetime.datetime,
-    max_limit: int | datetime.date | datetime.datetime,
+    min_limit: int | datetime.date | datetime.datetime | str | None = None,
+    max_limit: int | datetime.date | datetime.datetime | str | None = None,
+    min_limit_col_expr: str | Column | None = None,
+    max_limit_col_expr: str | Column | None = None,
 ) -> Column:
     """Creates a condition column that checks if a value is within min and max limits.
 
     :param col_name: column name
-    :param min_limit: min limit
-    :param max_limit: max limit
+    :param min_limit: min limit value
+    :param max_limit: max limit value
+    :param min_limit_col_expr: min limit column name or expr
+    :param max_limit_col_expr: max limit column name or expr
     :return: new Column
     """
-    min_limit_expr = F.lit(min_limit)
-    max_limit_expr = F.lit(max_limit)
+    min_limit_expr, max_limit_expr = _get_min_max_column_expr(
+        min_limit, max_limit, min_limit_col_expr, max_limit_col_expr
+    )
     condition = (F.col(col_name) > min_limit_expr) & (F.col(col_name) < max_limit_expr)
 
     return make_condition(
@@ -336,9 +374,9 @@ def is_not_in_range(
             F.lit("Value"),
             F.col(col_name),
             F.lit("in range: ["),
-            F.lit(min_limit).cast("string"),
+            min_limit_expr.cast("string"),
             F.lit(","),
-            F.lit(max_limit).cast("string"),
+            max_limit_expr.cast("string"),
             F.lit("]"),
         ),
         f"{col_name}_in_range",
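
The precedence rule introduced by `_get_min_max_column_expr` can be sketched without a Spark session. The example below is an illustration only, not the DQX implementation: `Lit` and `Col` are hypothetical stand-ins for `F.lit` and `F.col`, and `resolve_limit` / `resolve_limits` are names invented for this sketch. It shows that a column expression, when given, takes precedence over the literal, that a string is treated as a column name, and that a bound with neither value raises `ValueError`.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple, Union


@dataclass(frozen=True)
class Lit:
    """Hypothetical stand-in for a literal expression (F.lit in the diff)."""
    value: Any


@dataclass(frozen=True)
class Col:
    """Hypothetical stand-in for a column reference (F.col in the diff)."""
    name: str


Expr = Union[Lit, Col]


def resolve_limit(limit: Any = None, limit_col_expr: Optional[Union[str, Col]] = None) -> Expr:
    """Resolve one bound: the column expression wins over the literal when both are given."""
    if limit is None and limit_col_expr is None:
        raise ValueError("either a literal limit or a column expression is required")
    if limit_col_expr is None:
        return Lit(limit)
    # A plain string is interpreted as a column name, mirroring the F.col call in the diff.
    return Col(limit_col_expr) if isinstance(limit_col_expr, str) else limit_col_expr


def resolve_limits(
    min_limit: Any = None,
    max_limit: Any = None,
    min_limit_col_expr: Optional[Union[str, Col]] = None,
    max_limit_col_expr: Optional[Union[str, Col]] = None,
) -> Tuple[Expr, Expr]:
    """Resolve both bounds; each must come from a literal or a column expression."""
    return (
        resolve_limit(min_limit, min_limit_col_expr),
        resolve_limit(max_limit, max_limit_col_expr),
    )


# Mixed usage: literal lower bound, column-based upper bound.
lower, upper = resolve_limits(min_limit=1, max_limit_col_expr="max_qty")
print(lower, upper)
```

Resolving each bound independently keeps the two cases symmetric. The function in the diff validates both bounds up front in a single condition, which has the same effect.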
