refactor: ♻️ check column data types #1111
Comment: Why did you rename it to [...]? Might be good to rename all [...]
Reply: From just data types? I thought that the function checking all data types in all columns in the data frame could be called [...]
Reply: Good point about calling it one thing!
@@ -0,0 +1,216 @@
import json

import polars as pl

from seedcase_sprout.core.map_data_types import FRICTIONLESS_TO_POLARS
from seedcase_sprout.core.properties import FieldType

# https://datapackage.org/standard/table-schema/#boolean
BOOLEAN_VALUES = {"false", "False", "FALSE", "0", "true", "True", "TRUE", "1"}


def check_is_boolean(field_name: str) -> pl.Expr:
    """Checks if the column contains only boolean values.

    Failed values are marked with False.

    Args:
        field_name: The name of the column to check.
Comment: Might be good to be consistent with the language (and/or clarify what each means). There's both "field" and "column" being mentioned. Maybe you can say "check if field (also called column)..."? Either way, we should be consistent here and also explicit what [...]
Reply: I'm happy to use [...]
    Returns:
        A Polars expression for checking the column.
    """
    return pl.col(field_name).is_in(BOOLEAN_VALUES)


def check_is_castable_type(field_name: str, field_type: FieldType) -> pl.Expr:
Comment: "castable" isn't super clear what it means. Maybe "given"? But I don't love that either... 🤔 Or just [...]
Reply: Maybe [...]
    """Checks if the column contains only values of the given type.

    The check is done by attempting to cast to the appropriate Polars data type.
    Failed values are marked with False.

    Args:
        field_name: The name of the column to check.
        field_type: The type of the field.

    Returns:
        A Polars expression for checking the column.
    """
    return (
        pl.col(field_name)
        .cast(FRICTIONLESS_TO_POLARS[field_type], strict=False)
        .is_not_null()
    )


def check_is_yearmonth(field_name: str) -> pl.Expr:
    """Checks if the column contains only yearmonth values.

    Failed values are marked with False.

    Args:
        field_name: The name of the column to check.

    Returns:
        A Polars expression for checking the column.
    """
    return (
        pl.col(field_name).str.starts_with("-").not_()
        & (pl.col(field_name) + "-01")
        .fill_null("0001-01-01")
        .str.to_date(format="%Y-%m-%d", strict=False)
        .is_not_null()
    )


def check_is_datetime(data_frame: pl.DataFrame, field_name: str) -> pl.Expr:
    """Checks if the column contains only datetime values.

    Mixing values with and without timezone information is not allowed.
    Mixing values with different timezones is allowed, as they will be standardised
    before saving to Parquet.

Comment (on lines +72 to +74): How we handle these cases is very much up for discussion.
Reply: Yea, this is tricky. I think failing if there is mixing is a good idea. What about if there is no timezone information at all?
Reply: If all datetimes are timezone-naive (= no timezone given) that's okay, this check will pass. Parquet is also happy with timezone-naive datetimes. You just cannot mix timezone-naive and timezone-aware datetimes.
    Failed values are marked with False.

    Args:
        data_frame: The data frame being operated on.
        field_name: The name of the column to check.

    Returns:
        A Polars expression for checking the column.
    """
    first_datetime = (
        data_frame.get_column(field_name)
        .drop_nulls()
Comment: Why drop nulls at this point?
Reply: Because if the column is, e.g., [...]
        .str.to_datetime(strict=False)
        .first()

Comment: first is taking the first value? Is this to just get the first datetime in the column?
Reply: Yeah! All this preamble is just to determine if the column is timezone-naive or timezone-aware. I'm not the biggest fan of it, so I'm very open to other ideas 😊
    )
    has_timezone = bool(first_datetime.tzinfo) if first_datetime else False
    default_datetime = "0001-01-01T00:00:00" + ("Z" if has_timezone else "")
    datetime_format = "%Y-%m-%dT%H:%M:%S%.f" + ("%z" if has_timezone else "")

Comment: I'm explicitly setting the format because otherwise it passes datetimes that are not in the format specified by Frictionless as well.
Comment: Does this line not add a timezone if it isn't found? via the [...]
Reply: Yeah: [...]

    return (
        pl.col(field_name).str.starts_with("-").not_()

Comment (with a suggested change, not preserved): This is right?
Reply: Yeah! But I guess more "fail" than "drop", because their True/False mapping will still be there, it will just be False.
        & pl.col(field_name)
        .fill_null(default_datetime)
        .str.replace("Z", "+00:00")
        .str.to_datetime(time_unit="ms", format=datetime_format, strict=False)
        .dt.convert_time_zone(time_zone="UTC")
        .is_not_null()
    )


def check_is_date(field_name: str) -> pl.Expr:
    """Checks if the column contains only date values.

    Failed values are marked with False.

    Args:
        field_name: The name of the column to check.

    Returns:
        A Polars expression for checking the column.
    """
    return (
        pl.col(field_name).str.starts_with("-").not_()
        & pl.col(field_name)
        .fill_null("0001-01-01")
        .str.to_date(format="%Y-%m-%d", strict=False)
        .is_not_null()
    )


def check_is_time(field_name: str) -> pl.Expr:
    """Checks if the column contains only time values.

    Failed values are marked with False.

    Args:
        field_name: The name of the column to check.

    Returns:
        A Polars expression for checking the column.
    """
    return (
        pl.col(field_name)
        .fill_null("00:00:00")
        .str.to_time(format="%H:%M:%S%.f", strict=False)
        .is_not_null()
    )


# https://stackoverflow.com/a/18690202
GEOPOINT_PATTERN = (
    r"^(?:[-+]?(?:[1-8]?\d(?:\.\d+)?|90(?:\.0+)?)),\s*"
    r"(?:[-+]?(?:180(?:\.0+)?|(?:1[0-7]\d|[1-9]?\d)(?:\.\d+)?))$"
)

Comment (on lines +142 to +146): Nice! haha


def check_is_geopoint(field_name: str) -> pl.Expr:
    """Checks if the column contains only geopoint values.

    Failed values are marked with False.

    Args:
        field_name: The name of the column to check.

    Returns:
        A Polars expression for checking the column.
    """
    return pl.col(field_name).str.contains(GEOPOINT_PATTERN)
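The latitude/longitude pattern can be exercised directly with Python's `re` to see what it accepts; the helper function and sample values below are illustrative only.

```python
import re

# Same pattern as above: latitude in -90..90, then longitude in -180..180
GEOPOINT_PATTERN = (
    r"^(?:[-+]?(?:[1-8]?\d(?:\.\d+)?|90(?:\.0+)?)),\s*"
    r"(?:[-+]?(?:180(?:\.0+)?|(?:1[0-7]\d|[1-9]?\d)(?:\.\d+)?))$"
)

def is_geopoint(value: str) -> bool:
    return re.match(GEOPOINT_PATTERN, value) is not None

# 91 is out of range for latitude
ok = [is_geopoint(v) for v in ["45.5, -122.6", "90, 180", "91, 0"]]
```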


def check_is_json(field_name: str, expected_type: type[list | dict]) -> pl.Expr:
    """Checks if the column contains only JSON values.

    Failed values are marked with False.

    Warning: uses `map_elements` to check the formatting of each value and may run
    slowly on large datasets.
Comment (on lines +168 to +169): I couldn't find a way of checking JSON using only Polars expressions. There is a [...]
Reply: That's fine, we don't really need to support that for now.

    Args:
        field_name: The name of the column to check.
        expected_type: The expected JSON type: an object or an array.

    Returns:
        A Polars expression for checking the column.
    """
    return pl.col(field_name).map_elements(
        lambda value: check_value_is_json(value, expected_type),
        return_dtype=pl.Boolean,
    )


def check_value_is_json(value: str, expected_type: type[list | dict]) -> bool:
    """Checks if the `value` is correctly formatted as a JSON object or array.

    Args:
        value: The value to check.
        expected_type: The expected JSON type: an object or an array.

    Returns:
        True if the value is a correct JSON type, False otherwise.
    """
    try:
        return isinstance(json.loads(value), expected_type)
    except json.JSONDecodeError:
        return False


FRICTIONLESS_TO_COLUMN_CHECK = {

Comment: Used here [...]
    "boolean": check_is_boolean,
    "integer": lambda col_name: check_is_castable_type(col_name, "integer"),
    "number": lambda col_name: check_is_castable_type(col_name, "number"),
    "year": lambda col_name: check_is_castable_type(col_name, "year"),
    "yearmonth": check_is_yearmonth,
    "datetime": check_is_datetime,
    "date": check_is_date,
    "time": check_is_time,
    "geopoint": check_is_geopoint,
    "array": lambda value: check_is_json(value, list),
    "object": lambda value: check_is_json(value, dict),
    "geojson": lambda value: check_is_json(value, dict),
    "string": lambda _: pl.lit(True),
    "any": lambda _: pl.lit(True),
    "duration": lambda _: pl.lit(True),
}

Comment: Should I spend time writing a Polars-compatible regex for Frictionless durations?
Reply: No hahah
This file was deleted.
All these functions map incorrect values to False and correct values to True. This way, we can run all checks on all values without terminating early due to an error. The context for these functions will show that the mapping is added in another column.