feat: add context info to output #206
base: main
Conversation
```diff
@@ -91,6 +106,7 @@ class DQRuleColSet:

     columns: list[str]
+    check_func: Callable
     name: str = ""
```
I agree that having a name makes sense now, since you can identify the actual test by the other parameters in the output.
Please add a test for this; all existing tests skip the name.
```diff
@@ -0,0 +1,14 @@
+from pyspark.sql.types import StructType, StructField, ArrayType, StringType, TimestampType
+
+validation_result_schema = ArrayType(
```
There should be a way for users to provide custom key-value metadata (as a dict).
I would propose an additional field, e.g. `StructField("user_metadata", MapType(StringType(), StringType()), nullable=True)`, supplied using `extra_params`.
```diff
+    StructType(
+        [
+            StructField("name", StringType(), nullable=True),
+            StructField("rule", StringType(), nullable=True),
```
I think `message` would be better than `rule` here. This is really the message from the checks; `rule` could also mean a check, which is ambiguous.
```diff
     name: str = ""
     criticality: str = Criticality.ERROR.value
     filter: str | None = None
     check_func_args: list[Any] = field(default_factory=list)
     check_func_kwargs: dict[str, Any] = field(default_factory=dict)

     def __post_init__(self):
         # take the name from the alias of the column expression if not provided
```
We should also verify that the provided args and kwargs are actually correct. It should only be possible to create a `DQRule` if correct args/kwargs are provided for the given `check_func`. Otherwise we risk user code breaking when checks are applied. The validation should be done before checks are applied.
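One way to do that (a sketch, not the PR's code; `validate_check_func_args` is a hypothetical helper) is to bind the supplied args/kwargs against the check function's signature at rule-construction time, so invalid rules fail fast:

```python
import inspect
from typing import Any, Callable


def validate_check_func_args(
    check_func: Callable, args: list[Any], kwargs: dict[str, Any]
) -> None:
    """Fail fast if args/kwargs do not match check_func's signature.

    Hypothetical helper: calling this from DQRule.__post_init__ would
    reject invalid rules before any checks are applied.
    """
    try:
        inspect.signature(check_func).bind(*args, **kwargs)
    except TypeError as exc:
        raise ValueError(
            f"invalid arguments for check function {check_func.__name__!r}: {exc}"
        ) from exc


def is_not_null(col_name: str, trim_strings: bool = False):
    """Stand-in for a real check function."""


validate_check_func_args(is_not_null, ["a"], {"trim_strings": True})  # passes
```

`inspect.signature(...).bind` applies the normal Python call-binding rules, so missing, extra, or misnamed arguments are all caught without ever invoking the check function itself.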
Renamed the result schema; added a guide on how to access the quality checks output.
Pull Request Overview
This PR refactors the DQRule signature to replace the direct column expression with a check function (check_func) and introduces context information (such as run_time) in the output. Key changes include exporting the updated dq_result_schema, modifying DQRule and DQRuleColSet to use check_func along with added parameters (col_name, check_func_args, check_func_kwargs), and updating tests and demo code accordingly.
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
src/databricks/labs/dqx/schema | Exports and defines the updated dq_result_schema |
tests/resources/all_checks.yaml | Updates YAML check definitions |
src/databricks/labs/dqx/rule.py | Refactors DQRule to use check_func and incorporate context details |
src/databricks/labs/dqx/engine.py | Adjusts error/warning column creation using dq_result_schema and run_time |
tests/unit/* & demos/dqx_demo_library.py | Updates tests and demo usage to match the new DQRule signature |
Comments suppressed due to low confidence (1)
src/databricks/labs/dqx/rule.py:63
- [nitpick] Invoking `_get_check()` during `__post_init__` might execute the check function unnecessarily, potentially impacting performance or causing side effects if the function is resource-intensive. Consider deferring its evaluation until it is needed.

```python
object.__setattr__(self, "name", self.name if self.name else "col_" + get_column_name(self._get_check()))
```
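One way to defer that evaluation, sketched here with `functools.cached_property` on a simplified stand-in class (not the PR's actual `DQRule`, which is frozen and uses `object.__setattr__`):

```python
from dataclasses import dataclass
from functools import cached_property


@dataclass
class LazyRule:
    """Simplified stand-in: the check expression is built on first access
    instead of eagerly inside __post_init__."""

    col_name: str
    name: str = ""

    @cached_property
    def check(self) -> str:
        # Potentially expensive construction runs at most once, and only
        # when the check is actually needed.
        return f"col_{self.name or self.col_name}"


rule = LazyRule(col_name="a")  # nothing is built yet at construction time
```

With `cached_property`, the first access computes and caches the value in the instance's `__dict__`; subsequent accesses are plain attribute lookups.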
@pierre-monnet thank you for the PR. Very appreciated. There are a few issues and I will continue improving it. I hope that's fine for you. We have other PRs and I would like to get this done as soon as possible to avoid conflicts, as this PR introduces a lot of changes.
Pull Request Overview
This PR refactors and extends the data quality checking functionality by updating the DQRule signature to include context details (such as run_time) and by changing how result schemas are handled. Key changes include:
- Introducing dq_result_schema and updating cast types in engine.py for consistent structured output.
- Refactoring build_checks_by_metadata to use deep copies of arguments and new naming conventions.
- Adjusting tests and demos to align with the new DQRule interface and context handling.
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
src/databricks/labs/dqx/schema/`__init__.py` | Exports the new dq_result_schema. |
tests/resources/all_checks.yaml | Updates the YAML checks configuration. |
src/databricks/labs/dqx/schema/dq_result_schema.py | Introduces the new schema used for error/warning output. |
src/databricks/labs/dqx/engine.py | Updates build_checks_by_metadata and _create_results_map to use the new schema and context (e.g. run_time). |
tests/unit/test_apply_checks.py | Revises tests to use the new DQRule signature and verifies the new dq_result_schema output. |
tests/conftest.py | Adjusts file path string quotes for consistency. |
tests/unit/test_build_rules.py | Updates rule construction and metadata building to work with the refactored DQRule interface. |
demos/dqx_demo_library.py | Updates examples to use the new check_func and col_name parameters. |
src/databricks/labs/dqx/rule.py | Refactors DQRule to use check_func along with contextual information and introduces _get_check for consistency. |
Pull Request Overview
This PR refactors how data quality rules are defined and processed by updating the DQRule signature and related engine functionality. Key changes include replacing the legacy “check” field with “check_func” and its associated arguments, updating demo notebooks to use the new restart magic, and adjusting tests and YAML configurations to reflect these changes.
Reviewed Changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
src/databricks/labs/dqx/schema/`__init__.py` | Exposes the new dq_result_schema via the module's public interface |
tests/resources/all_checks.yaml | Updates the YAML checks definitions with no structural issues |
src/databricks/labs/dqx/schema/dq_result_schema.py | Introduces the dq_result_schema using PySpark types |
demos/dqx_demo_library.py | Adapts demos to the new DQRule signature and migration to %restart_python |
src/databricks/labs/dqx/rule.py | Refactors DQRule to use check_func and an internal _get_check method |
src/databricks/labs/dqx/engine.py | Updates build_checks_by_metadata and _create_results_map for new rule format |
demos/dqx_dlt_demo.py | Simplifies the restart command to %restart_python |
tests/unit/test_apply_checks.py | Adjusts tests to compare schema using the new dq_result_schema and fixed run_time |
tests/conftest.py, demos/dqx_demo_tool.py, tests/unit/test_build_rules.py | Minor quote style and expected output updates reflecting the core changes |
Comments suppressed due to low confidence (3)
src/databricks/labs/dqx/engine.py:254
- [nitpick] Ensure that using dq_result_schema for casting the empty errors and warnings columns is consistent across the codebase. Consider extracting this cast into a helper function to avoid duplication.

```python
F.lit(None).cast(dq_result_schema).alias(self._column_names[ColumnArguments.ERRORS])
```

src/databricks/labs/dqx/engine.py:282
- Verify that using F.filter with a lambda function on an array of structs is fully supported in your target Spark version to avoid runtime issues.

```python
m_col = F.filter(F.array(*check_cols), lambda v: v.getField("rule").isNotNull())
```

tests/unit/test_apply_checks.py:59
- [nitpick] Using ignore_nullable may mask potential schema mismatches. Confirm that schema nullability differences are acceptable for production use or consider stricter schema validation.

```python
assert_df_equality(df, expected_df, ignore_nullable=True)
```
Changes
Linked issues
I've changed how DQRule works by updating its signature to take context details. This makes DQRule quite close to DQRules now.
Also, I had to disable a pylint rule in the tests around the date patch; I'm not sure how to handle it differently.
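For reference, a sketch of the kind of date patch meant here, so run_time in the output compares deterministically against an expected dataframe (all names are illustrative, not the PR's actual test code):

```python
from datetime import datetime, timezone

FIXED_RUN_TIME = datetime(2025, 1, 1, tzinfo=timezone.utc)


class FrozenDatetime(datetime):
    """datetime subclass whose now() is pinned to a fixed instant."""

    @classmethod
    def now(cls, tz=None):
        return FIXED_RUN_TIME


# In a test this stand-in would be patched over the module under test, e.g.:
#   monkeypatch.setattr(rule_module, "datetime", FrozenDatetime)
# so every run_time produced during the test equals FIXED_RUN_TIME.
```

Overriding a classmethod on a `datetime` subclass like this is what typically trips a pylint signature/override warning, hence the disable comment mentioned above.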
Resolves #55
Tests