refactor: ♻️ check column data types #1111
base: main
Review comment: Why did you rename it to […]? Might be good to rename all […].
Reply: From just data types? I thought that the function checking all data types in all columns in the data frame could be called […].
Reply: Good point about calling it one thing!
@@ -0,0 +1,216 @@
import json

import polars as pl

from seedcase_sprout.core.map_data_types import FRICTIONLESS_TO_POLARS
from seedcase_sprout.core.properties import FieldType

# https://datapackage.org/standard/table-schema/#boolean
BOOLEAN_VALUES = {"false", "False", "FALSE", "0", "true", "True", "TRUE", "1"}


def check_is_boolean(column_name: str) -> pl.Expr:
    """Checks if the column contains only boolean values.

    Failed values are marked with False.

    Args:
        column_name: The name of the column to check.

    Returns:
        A Polars expression for checking the column.
    """
    return pl.col(column_name).is_in(BOOLEAN_VALUES)

def check_is_type_by_conversion(column_name: str, data_type: FieldType) -> pl.Expr:
    """Checks if the column contains only values of the given type.

    The check is done by attempting to convert (cast) the column to the
    appropriate Polars data type. Values that fail the cast are marked with
    False.

    Args:
        column_name: The name of the column to check.
        data_type: The type of the column.

    Returns:
        A Polars expression for checking the column.
    """
    return (
        pl.col(column_name)
        # strict=False maps values that fail the cast to None instead of raising.
        .cast(FRICTIONLESS_TO_POLARS[data_type], strict=False)
        .is_not_null()
    )

def check_is_yearmonth(column_name: str) -> pl.Expr:
    """Checks if the column contains only yearmonth values.

    Failed values are marked with False.

    Args:
        column_name: The name of the column to check.

    Returns:
        A Polars expression for checking the column.
    """
    return (
        # Fail negative values starting with `-`.
        pl.col(column_name).str.starts_with("-").not_()
        & (pl.col(column_name) + "-01")
        # strict=False maps values that fail to parse to None instead of raising.
        .str.to_date(format="%Y-%m-%d", strict=False)
        .is_not_null()
    )

def check_is_datetime(data_frame: pl.DataFrame, column_name: str) -> pl.Expr:
    """Checks if the column contains only datetime values.

    Mixing values with and without timezone information is not allowed.
    Mixing values with different timezones is allowed, as they will be
    standardised before saving to Parquet.

    Failed values are marked with False.

    Args:
        data_frame: The data frame being operated on.
        column_name: The name of the column to check.

    Returns:
        A Polars expression for checking the column.
    """
    first_datetime = (
        data_frame.get_column(column_name)
        .drop_nulls()
        # strict=False maps values that fail to parse to None instead of raising.
        .str.to_datetime(strict=False)
        .first()
    )
    has_timezone = bool(first_datetime.tzinfo) if first_datetime else False
    datetime_format = "%Y-%m-%dT%H:%M:%S%.f" + ("%z" if has_timezone else "")

    return (
        # Fail negative values starting with `-`.
        pl.col(column_name).str.starts_with("-").not_()
        & pl.col(column_name)
        .str.replace("Z", "+00:00")
        .str.to_datetime(time_unit="ms", format=datetime_format, strict=False)
        .dt.convert_time_zone(time_zone="UTC")
        .is_not_null()
    )

Review discussion:
- On mixing timezone-aware and timezone-naive values: "How we handle these cases is very much up for discussion." / "Yea, this is tricky. I think failing if there is mixing is a good idea. What about if there is no timezone information at all?" / "If all datetimes are timezone-naive (= no timezone given) that's okay, this check will pass. Parquet is also happy with timezone-naive datetimes. You just cannot mix timezone-naive and timezone-aware datetimes."
- On `drop_nulls`: "Why drop nulls at this point?" / "Because if the column is, e.g., […]" [reply truncated in the page]
- On `.first()`: "first is taking the first value? Is this to just get to first datetime in the column?" / "Yeah! All this preamble is just to determine if the column is timezone-naive or timezone-aware. I'm not the biggest fan of it, so I'm very open to other ideas 😊"
- On the explicit format: "I'm explicitly setting the format because otherwise it passes datetimes that are not in the format specified by Frictionless as well." / "Does this line not add timezone if it isn't found? via the […]" / "Yeah: […]" [replies truncated in the page]

def check_is_date(column_name: str) -> pl.Expr:
    """Checks if the column contains only date values.

    Failed values are marked with False.

    Args:
        column_name: The name of the column to check.

    Returns:
        A Polars expression for checking the column.
    """
    return (
        # Fail negative values starting with `-`.
        pl.col(column_name).str.starts_with("-").not_()
        & pl.col(column_name).str.to_date(format="%Y-%m-%d", strict=False).is_not_null()
    )

def check_is_time(column_name: str) -> pl.Expr:
    """Checks if the column contains only time values.

    Failed values are marked with False.

    Args:
        column_name: The name of the column to check.

    Returns:
        A Polars expression for checking the column.
    """
    return (
        pl.col(column_name)
        .str.to_time(format="%H:%M:%S%.f", strict=False)
        .is_not_null()
    )

# https://stackoverflow.com/a/18690202
GEOPOINT_PATTERN = (
    r"^(?:[-+]?(?:[1-8]?\d(?:\.\d+)?|90(?:\.0+)?)),\s*"
    r"(?:[-+]?(?:180(?:\.0+)?|(?:1[0-7]\d|[1-9]?\d)(?:\.\d+)?))$"
)


def check_is_geopoint(column_name: str) -> pl.Expr:
    """Checks if the column contains only geopoint values.

    Failed values are marked with False.

    Args:
        column_name: The name of the column to check.

    Returns:
        A Polars expression for checking the column.
    """
    return pl.col(column_name).str.contains(GEOPOINT_PATTERN)

def check_is_json(column_name: str, expected_type: type[list | dict]) -> pl.Expr:
    """Checks if the column contains only JSON values.

    Failed values are marked with False.

    Warning: uses `map_elements` to check the formatting of each value and may run
    slowly on large datasets.

    Args:
        column_name: The name of the column to check.
        expected_type: The expected JSON type: an object or an array.

    Returns:
        A Polars expression for checking the column.
    """
    return pl.col(column_name).map_elements(
        lambda value: check_value_is_json(value, expected_type),
        return_dtype=pl.Boolean,
    )

Review discussion:
- On the warning: "I couldn't find a way of checking JSON using only Polars expressions. There is a […]" / "That's fine, we don't really need to support that for now."


def check_value_is_json(value: str, expected_type: type[list | dict]) -> bool:
    """Checks if the `value` is correctly formatted as a JSON object or array.

    Args:
        value: The value to check.
        expected_type: The expected JSON type: an object or an array.

    Returns:
        True if the value is a correct JSON type, False otherwise.
    """
    try:
        return isinstance(json.loads(value), expected_type)
    except json.JSONDecodeError:
        return False
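The helper can be exercised on its own; this sketch reproduces it as defined above, with made-up inputs:

```python
import json


def check_value_is_json(value: str, expected_type: type) -> bool:
    """Return True if `value` parses as JSON of the expected container type."""
    try:
        return isinstance(json.loads(value), expected_type)
    except json.JSONDecodeError:
        return False


print(check_value_is_json('[1, 2, 3]', list))  # True
print(check_value_is_json('{"a": 1}', list))   # False: valid JSON, wrong type
print(check_value_is_json('not json', dict))   # False: not parseable at all
```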

FRICTIONLESS_TO_COLUMN_CHECK = {
    "boolean": check_is_boolean,
    "integer": lambda col_name: check_is_type_by_conversion(col_name, "integer"),
    "number": lambda col_name: check_is_type_by_conversion(col_name, "number"),
    "year": lambda col_name: check_is_type_by_conversion(col_name, "year"),
    "yearmonth": check_is_yearmonth,
    "datetime": check_is_datetime,
    "date": check_is_date,
    "time": check_is_time,
    "geopoint": check_is_geopoint,
    "array": lambda value: check_is_json(value, list),
    "object": lambda value: check_is_json(value, dict),
    "geojson": lambda value: check_is_json(value, dict),
    "string": lambda _: pl.lit(True),
    "any": lambda _: pl.lit(True),
    "duration": lambda _: pl.lit(True),
}

Review discussion:
- On `duration`: "Should I spend time writing a Polars-compatible regex for Frictionless durations?" / "No hahah"
This file was deleted.

Review comment: All these functions map incorrect values to False and correct values to True. This way, we can run all checks on all values without terminating early due to an error. The context for these functions will show that the mapping is added in another column.