
refactor: ♻️ check column data types #1111

Open · wants to merge 12 commits into base: main

Changes from 8 commits
180 changes: 37 additions & 143 deletions poetry.lock

Large diffs are not rendered by default.

2 changes: 0 additions & 2 deletions pyproject.toml

@@ -32,8 +32,6 @@ requests = "^2.32.3"
 platformdirs = "^4.3.2"
 jsonschema = "^4.23.0"
 dacite = "^1.8.1"
-xmlschema = "^3.4.3"
-pandera = {extras = ["polars"], version = "^0.22.1"}

 [tool.poetry.group.test.dependencies]
 pytest = "^8.3.2"
207 changes: 207 additions & 0 deletions src/seedcase_sprout/core/sprout_checks/check_column_data_types.py
Contributor Author (@martonvago), Mar 6, 2025:

All these functions map incorrect values to False and correct values to True. This way, we can run all checks on all values without terminating early due to an error. The context for these functions will show that the mapping is added in another column.
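A minimal sketch of that pattern (hypothetical data and column names, not taken from this PR):

```python
import polars as pl

# Each check is an expression that maps bad values to False, so running it
# via with_columns adds a boolean result column without raising mid-check.
df = pl.DataFrame({"is_active": ["true", "FALSE", "maybe"]})
df = df.with_columns(
    pl.col("is_active")
    .is_in(["false", "False", "FALSE", "0", "true", "True", "TRUE", "1"])
    .alias("is_active_check")
)
# df["is_active_check"] → [True, True, False]
```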

Member:

Why did you rename it to column_data_types?

Might be good to rename all `field` to `column` below (see some of my other comments about it). To be consistent.

Contributor Author:

From just `data_types`? I thought that the function checking all data types in all columns in the data frame could be called check_data_types instead. And these functions here are very specifically about checking an entire column in any case.

Good point about calling it one thing!

@@ -0,0 +1,207 @@
import json

import polars as pl

from seedcase_sprout.core.map_data_types import FRICTIONLESS_TO_POLARS
from seedcase_sprout.core.properties import FieldType

# https://datapackage.org/standard/table-schema/#boolean
BOOLEAN_VALUES = {"false", "False", "FALSE", "0", "true", "True", "TRUE", "1"}


def check_is_boolean(field_name: str) -> pl.Expr:
"""Checks if the column contains only boolean values.

Failed values are marked with False.

Args:
field_name: The name of the column to check.
Member:

Might be good to be consistent with the language (and/or clarify what each means). There's both field and column being mentioned. Maybe you can say "check if field (also called column)..."? Either way, we should be consistent here and also be explicit about what field means in these cases. I prefer column because that's clearer, but that's only applicable to tabular data. Which is fine, because the code uses Polars, which needs a data frame. So maybe rename everything to column rather than field?

Contributor Author:

I'm happy to use column! field is from Frictionless' terminology, but I don't think we have to keep to that here.

Returns:
A Polars expression for checking the column.
"""
return pl.col(field_name).is_in(BOOLEAN_VALUES)


def check_is_castable_type(field_name: str, field_type: FieldType) -> pl.Expr:
Member:

Suggested change:
-def check_is_castable_type(field_name: str, field_type: FieldType) -> pl.Expr:
+def check_is_given_type(field_name: str, field_type: FieldType) -> pl.Expr:

castable isn't super clear about what it means. Maybe "given"? But I don't love that either ... 🤔 Or just check_is_field_type()?

Contributor Author:

Maybe check_is_type_by_conversion? Because the point here is that the only "check" is trying to cast the value to the target data type.

"""Checks if the column contains only values of the given type.

The check is done by attempting to cast to the appropriate Polars data type.
Failed values are marked with False.

Args:
field_name: The name of the column to check.
field_type: The type of the field.

Returns:
A Polars expression for checking the column.
"""
return (
pl.col(field_name)
.cast(FRICTIONLESS_TO_POLARS[field_type], strict=False)
Contributor Author:

Here and below, strict=False means: map to None if the operation fails, instead of throwing an error.

.is_not_null()
)
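A quick illustration of the strict=False behaviour described in the comment above (hypothetical values):

```python
import polars as pl

# strict=False turns failed casts into nulls instead of raising, so
# is_not_null() marks exactly the uncastable values as False.
pl.DataFrame({"n": ["1", "2.5", "abc"]}).with_columns(
    pl.col("n").cast(pl.Int64, strict=False).is_not_null().alias("n_check")
)
# n_check → [True, False, False]
```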


def check_is_yearmonth(field_name: str) -> pl.Expr:
"""Checks if the column contains only yearmonth values.

Failed values are marked with False.

Args:
field_name: The name of the column to check.

Returns:
A Polars expression for checking the column.
"""
return (
pl.col(field_name).str.starts_with("-").not_()
& (pl.col(field_name) + "-01")
.str.to_date(format="%Y-%m-%d", strict=False)
.is_not_null()
)
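For example (hypothetical values), appending "-01" turns a valid yearmonth into a full date that Polars can parse, while the starts_with guard rejects negative years:

```python
import polars as pl

from seedcase_sprout.core.sprout_checks.check_column_data_types import (
    check_is_yearmonth,
)

# "2024-02" + "-01" parses as a date; "2024-13" + "-01" does not (month 13);
# "-0100-01" is rejected up front because it starts with "-".
df = pl.DataFrame({"ym": ["2024-02", "2024-13", "-0100-01"]})
df.with_columns(check_is_yearmonth("ym").alias("ym_check"))
# ym_check → [True, False, False]
```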


def check_is_datetime(data_frame: pl.DataFrame, field_name: str) -> pl.Expr:
"""Checks if the column contains only datetime values.

Mixing values with and without timezone information is not allowed.
Mixing values with different timezones is allowed, as they will be standardised
before saving to Parquet.
Comment on lines +72 to +74

Contributor Author:

How we handle these cases is very much up for discussion. Right now:

  • The check passes if all datetimes have timezones but the timezones are not all the same. We have to convert them to the same timezone before saving to Parquet.
  • The check fails if timezone-aware and timezone-naive datetimes are mixed. The rationale is that we just don't know what timezone timezone-naive datetimes are in. Alternatively, we could pass in this case as well, and force all datetimes to the same timezone regardless before saving to Parquet.

Member:

Yeah, this is tricky. I think failing if there is mixing is a good idea. What about if there is no timezone information at all?

Contributor Author:

If all datetimes are timezone-naive (= no timezone given), that's okay: this check will pass. Parquet is also happy with timezone-naive datetimes. You just cannot mix timezone-naive and timezone-aware datetimes.


Failed values are marked with False.

Args:
data_frame: The data frame being operated on.
field_name: The name of the column to check.

Returns:
A Polars expression for checking the column.
"""
first_datetime = (
data_frame.get_column(field_name)
.drop_nulls()
Member:

Why drop nulls at this point?

Contributor Author:

Because if the column is, e.g., null, null, 2002-10-10T12:00:00+01:00, 2002-10-10T12:00:00+01:00, ..., then checking the first value for timezone information will give None (because it is null), and has_timezone will come out False. But here all non-null values are actually timezone-aware, so we want has_timezone to be True. (This has no effect on the original data frame.)

.str.to_datetime(strict=False)
.first()
Member:

first is taking the first value? Is this just to get the first datetime in the column?

Contributor Author:

Yeah! All this preamble is just to determine if the column is timezone-naive or timezone-aware. I'm not the biggest fan of it, so I'm very open to other ideas 😊

)
has_timezone = bool(first_datetime.tzinfo) if first_datetime else False
datetime_format = "%Y-%m-%dT%H:%M:%S%.f" + ("%z" if has_timezone else "")
Contributor Author:

I'm explicitly setting the format because otherwise it also passes datetimes that are not in the format specified by Frictionless.

Member:

Does this line not add the timezone if it isn't found? Via the else ""?

Contributor Author:

Yeah:

  • If there is no timezone: %Y-%m-%dT%H:%M:%S%.f
  • If there is a timezone: %Y-%m-%dT%H:%M:%S%.f%z


return (
pl.col(field_name).str.starts_with("-").not_()
Member:

Suggested change:
-pl.col(field_name).str.starts_with("-").not_()
+# Drop field values that start with `-` (negative).
+pl.col(field_name).str.starts_with("-").not_()

This is right?

Contributor Author:

Yeah! But I guess more "fail" than "drop", because their True/False mapping will still be there, it will just be False.

& pl.col(field_name)
.str.replace("Z", "+00:00")
.str.to_datetime(time_unit="ms", format=datetime_format, strict=False)
.dt.convert_time_zone(time_zone="UTC")
.is_not_null()
)
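A usage sketch with hypothetical values; note that this check also needs the data frame itself, to decide whether the column should be parsed as timezone-aware:

```python
import polars as pl

from seedcase_sprout.core.sprout_checks.check_column_data_types import (
    check_is_datetime,
)

# Both values carry (different) timezone offsets, so the inferred format
# ends in %z and both pass: they can all be converted to UTC later.
df = pl.DataFrame(
    {"ts": ["2002-10-10T12:00:00+01:00", "2002-10-10T17:00:00+05:00"]}
)
df.with_columns(check_is_datetime(df, "ts").alias("ts_check"))
# ts_check → [True, True]
# Mixing in a timezone-naive value like "2002-10-10T12:00:00" would make
# that value fail, since it cannot be parsed with the %z in the format.
# A column of only naive values passes, with a format that omits %z.
```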


def check_is_date(field_name: str) -> pl.Expr:
"""Checks if the column contains only date values.

Failed values are marked with False.

Args:
field_name: The name of the column to check.

Returns:
A Polars expression for checking the column.
"""
return (
pl.col(field_name).str.starts_with("-").not_()
& pl.col(field_name).str.to_date(format="%Y-%m-%d", strict=False).is_not_null()
)


def check_is_time(field_name: str) -> pl.Expr:
"""Checks if the column contains only time values.

Failed values are marked with False.

Args:
field_name: The name of the column to check.

Returns:
A Polars expression for checking the column.
"""
return (
pl.col(field_name).str.to_time(format="%H:%M:%S%.f", strict=False).is_not_null()
)


# https://stackoverflow.com/a/18690202
GEOPOINT_PATTERN = (
r"^(?:[-+]?(?:[1-8]?\d(?:\.\d+)?|90(?:\.0+)?)),\s*"
r"(?:[-+]?(?:180(?:\.0+)?|(?:1[0-7]\d|[1-9]?\d)(?:\.\d+)?))$"
)
Comment on lines +142 to +146

Member:

Nice! haha



def check_is_geopoint(field_name: str) -> pl.Expr:
"""Checks if the column contains only geopoint values.

Failed values are marked with False.

Args:
field_name: The name of the column to check.

Returns:
A Polars expression for checking the column.
"""
return pl.col(field_name).str.contains(GEOPOINT_PATTERN)
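For instance (hypothetical coordinates), the pattern enforces a latitude within ±90 and a longitude within ±180:

```python
import polars as pl

from seedcase_sprout.core.sprout_checks.check_column_data_types import (
    check_is_geopoint,
)

# "45.0, -93.2" is a valid lat/lon pair; latitude 91.0 is out of range.
df = pl.DataFrame({"loc": ["45.0, -93.2", "91.0, 10.0"]})
df.with_columns(check_is_geopoint("loc").alias("loc_check"))
# loc_check → [True, False]
```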


def check_is_json(field_name: str, expected_type: type[list | dict]) -> pl.Expr:
"""Checks if the column contains only JSON values.

Failed values are marked with False.

Warning: uses `map_elements` to check the formatting of each value and may run
slowly on large datasets.
Comment on lines +168 to +169

Contributor Author:

I couldn't find a way of checking JSON using only Polars expressions. There is a json_decode expression, but it has no strict=False mode, so if it encounters an error it will exit the expression chain.

Member:

That's fine, we don't really need to support that for now.


Args:
field_name: The name of the column to check.
expected_type: The expected JSON type: an object or an array.

Returns:
A Polars expression for checking the column.
"""
return pl.col(field_name).map_elements(
lambda value: check_value_is_json(value, expected_type),
return_dtype=pl.Boolean,
)


def check_value_is_json(value: str, expected_type: type[list | dict]) -> bool:
"""Checks if the `value` is correctly formatted as a JSON object or array.

Args:
value: The value to check.
expected_type: The expected JSON type: an object or an array.

Returns:
True if the value is a correct JSON type, False otherwise.
"""
try:
return isinstance(json.loads(value), expected_type)
except json.JSONDecodeError:
return False
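A short usage sketch (hypothetical values): only values that parse as the expected JSON type pass.

```python
import polars as pl

from seedcase_sprout.core.sprout_checks.check_column_data_types import (
    check_is_json,
)

# '[1, 2]' parses as a JSON array, not an object, so it fails here too.
df = pl.DataFrame({"meta": ['{"a": 1}', "[1, 2]", "not json"]})
df.with_columns(check_is_json("meta", dict).alias("meta_check"))
# meta_check → [True, False, False]
```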


FRICTIONLESS_TO_COLUMN_CHECK = {
Contributor Author:

Used here.

"boolean": check_is_boolean,
"integer": lambda col_name: check_is_castable_type(col_name, "integer"),
"number": lambda col_name: check_is_castable_type(col_name, "number"),
"year": lambda col_name: check_is_castable_type(col_name, "year"),
"yearmonth": check_is_yearmonth,
"datetime": check_is_datetime,
"date": check_is_date,
"time": check_is_time,
"geopoint": check_is_geopoint,
"array": lambda value: check_is_json(value, list),
"object": lambda value: check_is_json(value, dict),
"geojson": lambda value: check_is_json(value, dict),
"string": lambda _: pl.lit(True),
"any": lambda _: pl.lit(True),
"duration": lambda _: pl.lit(True),
Contributor Author:

Should I spend time writing a Polars-compatible regex for Frictionless durations?

Member:

No hahah
}
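A hedged sketch of how this mapping might be used (field names and data are made up; note that the "datetime" entry would need special handling, since check_is_datetime also takes the data frame):

```python
import polars as pl

from seedcase_sprout.core.sprout_checks.check_column_data_types import (
    FRICTIONLESS_TO_COLUMN_CHECK,
)

# Build one boolean check column per field from the Frictionless type map.
fields = {"id": "integer", "when": "date", "tags": "array"}
df = pl.DataFrame(
    {"id": ["1", "x"], "when": ["2024-01-02", "bad"], "tags": ["[1]", "{}"]}
)
df.with_columns(
    FRICTIONLESS_TO_COLUMN_CHECK[field_type](name).alias(f"{name}_check")
    for name, field_type in fields.items()
)
# id_check → [True, False]; when_check → [True, False]; tags_check → [True, False]
```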
94 changes: 0 additions & 94 deletions src/seedcase_sprout/core/sprout_checks/check_data_types.py

This file was deleted.
