refactor: ♻️ check column data types #1111
base: main
Conversation
All these functions map incorrect values to False and correct values to True. This way, we can run all checks on all values without terminating early due to an error. The context for these functions will show that the mapping is added in another column.
""" | ||
return ( | ||
pl.col(field_name) | ||
.cast(FRICTIONLESS_TO_POLARS[field_type], strict=False) |
Here and below, `strict=False` means: map to `None` if the operation fails instead of throwing an error.
```python
        .first()
    )
    has_timezone = bool(first_datetime.tzinfo) if first_datetime else False
    datetime_format = "%Y-%m-%dT%H:%M:%S%.f" + ("%z" if has_timezone else "")
```
I'm explicitly setting the format because otherwise datetimes that are not in the format specified by Frictionless would also pass.
Does this line not add the timezone if it isn't found, via the `else ""`?
Yeah:
- If there is no timezone: `%Y-%m-%dT%H:%M:%S%.f`
- If there is a timezone: `%Y-%m-%dT%H:%M:%S%.f%z`
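The format selection above can be sketched with the standard library alone (the sample value is an assumption for illustration):

```python
from datetime import datetime

# Deduce timezone-awareness from the first value, then pick the format.
first_datetime = datetime.fromisoformat("2024-01-01T12:00:00+01:00")
has_timezone = bool(first_datetime.tzinfo)
datetime_format = "%Y-%m-%dT%H:%M:%S%.f" + ("%z" if has_timezone else "")
print(datetime_format)  # %Y-%m-%dT%H:%M:%S%.f%z
```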
Mixing values with and without timezone information is not allowed. | ||
Mixing values with different timezones is allowed, as they will be standardised | ||
before saving to Parquet. |
How we handle these cases is very much up for discussion.
Right now:
- Check passes if all datetimes have timezones but the timezones are not all the same. We have to convert them to the same timezone before saving to Parquet.
- Check fails if timezone-aware and timezone-naive datetimes are mixed. The rationale is that we just don't know what timezone timezone-naive datetimes are in. Alternatively, we could pass in this case as well, and force all datetimes to the same timezone regardless before saving to Parquet.
Yea, this is tricky. I think failing if there is mixing is a good idea. What about if there is no timezone information at all?
If all datetimes are timezone-naive (= no timezone given) that's okay, this check will pass. Parquet is also happy with timezone-naive datetimes. You just cannot mix timezone-naive and timezone-aware datetimes.
```python
    Warning: uses `map_elements` to check the formatting of each value and may run
    slowly on large datasets.
```
I couldn't find a way of checking JSON using only Polars expressions. There is a `json_decode` expression, but it has no `strict=False` mode, so if it encounters an error it will exit the expression chain.
That's fine, we don't really need to support that for now.
```python
    return False


FRICTIONLESS_TO_COLUMN_CHECK = {
```
Used here
"geojson": lambda value: check_is_json(value, dict), | ||
"string": lambda _: pl.lit(True), | ||
"any": lambda _: pl.lit(True), | ||
"duration": lambda _: pl.lit(True), |
Should I spend time writing a Polars-compatible regex for Frictionless durations?
No hahah
🎉 just some comments ⭐
```python
def check_is_boolean(field_name: str) -> pl.Expr:
    """Checks if the column contains only boolean values.

    Failed values are marked with False.

    Args:
        field_name: The name of the column to check.
```
Might be good to be consistent with the language (and/or clarify what each means). There's both field and column being mentioned. Maybe you can say "check if field (also called column)..."? Either way, we should be consistent here and also be explicit about what `field` means in these cases. I prefer `column` because that's clearer, but that's only applicable to tabular data. Which is fine because the code uses Polars, which needs a data frame. So maybe rename everything to `column` rather than `field`?
I'm happy to use `column`! `field` is from Frictionless' terminology, but I don't think we have to keep to that here.
```python
    return pl.col(field_name).is_in(BOOLEAN_VALUES)
```
```python
def check_is_castable_type(field_name: str, field_type: FieldType) -> pl.Expr:
```
```diff
-def check_is_castable_type(field_name: str, field_type: FieldType) -> pl.Expr:
+def check_is_given_type(field_name: str, field_type: FieldType) -> pl.Expr:
```

`castable` isn't super clear about what it means. Maybe "given"? But I don't love that either ... 🤔 Or just `check_is_field_type()`?
Maybe `check_is_type_by_conversion`? Because the point here is that the only "check" is trying to cast the value to the target data type.
```python
# https://stackoverflow.com/a/18690202
GEOPOINT_PATTERN = (
    r"^(?:[-+]?(?:[1-8]?\d(?:\.\d+)?|90(?:\.0+)?)),\s*"
    r"(?:[-+]?(?:180(?:\.0+)?|(?:1[0-7]\d|[1-9]?\d)(?:\.\d+)?))$"
)
```
Nice! haha
"geojson": lambda value: check_is_json(value, dict), | ||
"string": lambda _: pl.lit(True), | ||
"any": lambda _: pl.lit(True), | ||
"duration": lambda _: pl.lit(True), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No hahah
```python
# Datetime
DATETIME_BAD_VALUES_WHEN_TIMEZONE = [
    "",
```
What about when the first value is a correct timezone value? But the rest aren't?
Then the check will fail. We're using the first value only to deduce if there is timezone information or not, then we check each value in the column.
So, in this case the column is timezone-aware, so we will fail any value that is timezone-naive or just a badly formatted datetime.
```python
]


DATETIME_BAD_VALUES_WHEN_NO_TIMEZONE = [
    "",
```
Same comment as above. Also, what if the first value has a timezone, but the rest don't?
If the first value has no timezone information (= timezone-naive), then we will expect the rest of the values not to have a timezone either and check accordingly. So we will fail any value that is timezone-aware or just a badly formatted datetime.
As above, if the first value does have timezone information (= timezone-aware), then we will expect the rest of the values to have a timezone as well. So we will fail any value that is timezone-naive or just a badly formatted datetime.
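The deduce-then-check logic described above can be sketched with the standard library alone (the sample values are assumptions for illustration):

```python
from datetime import datetime

values = ["2024-01-01T00:00:00+01:00", "2024-01-01T00:00:00"]

# Deduce expected timezone-awareness from the first value, then check the rest.
first = datetime.fromisoformat(values[0])
expect_timezone = bool(first.tzinfo)

checks = []
for value in values:
    try:
        parsed = datetime.fromisoformat(value)
        checks.append(bool(parsed.tzinfo) == expect_timezone)
    except ValueError:
        checks.append(False)

print(checks)  # [True, False]: the naive second value fails the aware column
```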
Co-authored-by: Luke W. Johnston <lwjohnst86@users.noreply.github.com>
Description
Stacked PR [1/2]
This PR adds functions for checking the data type of columns in Polars data frames. More precisely, these functions check if all string values in a column can be converted to their target (Polars and Parquet) data type. They replace Pandera checks, so I deleted those.
The functions make more sense as part of the full check pipeline -- in the next PR.
This PR needs an in-depth review.
Checklist
just run-all