
feat: add context info to output #206

Open
wants to merge 17 commits into main

Conversation

@pierre-monnet (Contributor) commented Mar 8, 2025

Changes

I've changed how DQRule works by updating its signature to take context details. This brings DQRule quite close to DQRules now.
I also had to disable a pylint rule in the tests around the date patch; I'm not sure how to handle that differently.

Linked issues

Resolves #55
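For context, this is roughly what the updated signature looks like in use. A minimal sketch based on the parameter names visible in this PR's diffs and reviews (check_func, col_name); the import paths and the is_not_null helper are assumed from the existing DQX layout:

```python
# Sketch only: building a rule with the new signature. check_func and
# col_name are the parameters introduced by this PR; import paths are
# assumed from the existing package layout.
from databricks.labs.dqx.col_functions import is_not_null
from databricks.labs.dqx.rule import DQRule

rule = DQRule(
    name="col1_is_not_null",  # optional; derived from the column alias if omitted
    criticality="error",
    check_func=is_not_null,   # replaces the old direct column expression
    col_name="col1",
)
```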

Tests

  • manually tested
  • added unit tests
  • added integration tests

@pierre-monnet pierre-monnet requested a review from a team as a code owner March 8, 2025 21:52
@pierre-monnet pierre-monnet requested review from gergo-databricks and removed request for a team March 8, 2025 21:52
@@ -91,6 +106,7 @@ class DQRuleColSet:

columns: list[str]
check_func: Callable
name: str = ""
@mwojtyczka (Contributor) commented Mar 10, 2025

I agree that having a name makes sense now, since you can identify the actual test by the other parameters in the output.
Please add a test for this; all existing tests skip the name.
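A hypothetical shape for such a test (a sketch only; get_rules() and is_not_null are assumed from the DQX API, and the exact assertion depends on how DQRuleColSet propagates the name to per-column rules):

```python
# Hypothetical test exercising an explicit name on DQRuleColSet;
# get_rules() and is_not_null are assumed from the DQX API.
from databricks.labs.dqx.col_functions import is_not_null
from databricks.labs.dqx.rule import DQRuleColSet


def test_col_set_respects_explicit_name():
    rules = DQRuleColSet(
        columns=["col1", "col2"],
        check_func=is_not_null,
        name="custom_check",
    ).get_rules()
    # each generated per-column rule should carry the supplied name
    assert all(rule.name == "custom_check" for rule in rules)
```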

@@ -0,0 +1,14 @@
from pyspark.sql.types import StructType, StructField, ArrayType, StringType, TimestampType

validation_result_schema = ArrayType(
@mwojtyczka (Contributor) commented Mar 10, 2025

There should be a way for users to provide custom key-value metadata (as a dict).
I would propose an additional field, e.g. StructField("user_metadata", MapType(StringType(), StringType()), nullable=True), supplied via extra_params.
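Put together with the fields visible in this diff, the proposal would look roughly like this (a sketch: only the name and rule fields appear in the excerpts, run_time is inferred from the TimestampType import and the PR description, and any remaining fields are elided):

```python
# Sketch of the schema with the proposed user_metadata field. Only the
# fields visible in this PR's excerpts are shown; others are elided.
from pyspark.sql.types import (
    ArrayType,
    MapType,
    StringType,
    StructField,
    StructType,
    TimestampType,
)

validation_result_schema = ArrayType(
    StructType(
        [
            StructField("name", StringType(), nullable=True),
            StructField("rule", StringType(), nullable=True),
            StructField("run_time", TimestampType(), nullable=True),
            # proposed: user-supplied key/value metadata via extra_params
            StructField("user_metadata", MapType(StringType(), StringType()), nullable=True),
        ]
    )
)
```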

StructType(
[
StructField("name", StringType(), nullable=True),
StructField("rule", StringType(), nullable=True),
@mwojtyczka (Contributor) commented Mar 10, 2025

I think "message" would be better than "rule" here: this is really the message from the check, whereas "rule" could also mean a check itself, which is ambiguous.

name: str = ""
criticality: str = Criticality.ERROR.value
filter: str | None = None
check_func_args: list[Any] = field(default_factory=list)
check_func_kwargs: dict[str, Any] = field(default_factory=dict)

def __post_init__(self):
# take the name from the alias of the column expression if not provided
@mwojtyczka (Contributor) commented Mar 10, 2025

We should also verify that the provided args and kwargs are actually correct. It should only be possible to create a DQRule if valid args/kwargs are provided for the given check_func; otherwise we risk user code breaking when the checks are applied. The validation should happen before any checks run.
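One way to implement this, as a sketch: bind the supplied arguments against check_func's signature at construction time, so a bad rule fails fast instead of at apply time. This assumes the check function takes the column name as its first positional argument; the helper name is illustrative.

```python
# Sketch of early args/kwargs validation via signature binding. Assumes
# the check function takes the column name as its first positional
# argument; validate_check_func_call is an illustrative name.
import inspect
from typing import Any, Callable


def validate_check_func_call(
    check_func: Callable, col_name: str, args: list[Any], kwargs: dict[str, Any]
) -> None:
    """Raise before any checks run if the arguments do not fit check_func."""
    try:
        inspect.signature(check_func).bind(col_name, *args, **kwargs)
    except TypeError as e:
        raise ValueError(
            f"Invalid arguments for check function {check_func.__name__!r}: {e}"
        ) from e
```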

renamed result schema
added guide on how to access quality checks output
@alexott alexott requested a review from Copilot March 12, 2025 11:51
Copilot AI left a comment

Pull Request Overview

This PR refactors the DQRule signature to replace the direct column expression with a check function (check_func) and introduces context information (such as run_time) in the output. Key changes include exporting the updated dq_result_schema, modifying DQRule and DQRuleColSet to use check_func along with added parameters (col_name, check_func_args, check_func_kwargs), and updating tests and demo code accordingly.

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

  • src/databricks/labs/dqx/schema: exports and defines the updated dq_result_schema
  • tests/resources/all_checks.yaml: updates YAML check definitions
  • src/databricks/labs/dqx/rule.py: refactors DQRule to use check_func and incorporate context details
  • src/databricks/labs/dqx/engine.py: adjusts error/warning column creation using dq_result_schema and run_time
  • tests/unit/* & demos/dqx_demo_library.py: updates tests and demo usage to match the new DQRule signature
Comments suppressed due to low confidence (1)

src/databricks/labs/dqx/rule.py:63

  • [nitpick] Invoking _get_check() during __post_init__ might execute the check function unnecessarily, potentially impacting performance or causing side effects if the function is resource-intensive. Consider deferring its evaluation until it is needed.
object.__setattr__(self, "name", self.name if self.name else "col_" + get_column_name(self._get_check()))
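One way to act on this nitpick, sketched with simplified stand-ins for DQRule's internals: move the derived-name computation out of __post_init__ into a lazily evaluated property.

```python
# Sketch: defer the derived-name computation so the check function only
# runs when the name is first needed. cached_property writes directly to
# the instance __dict__, so it also works on frozen dataclasses. The real
# DQRule fields and _get_check/get_column_name logic are simplified here.
from dataclasses import dataclass
from functools import cached_property
from typing import Callable


@dataclass(frozen=True)
class LazyNamedRule:
    check_func: Callable
    name: str = ""

    def _get_check(self):
        return self.check_func()  # potentially expensive; not called at init

    @cached_property
    def resolved_name(self) -> str:
        # evaluated on first access, then cached
        return self.name if self.name else "col_" + str(self._get_check())
```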

@mwojtyczka (Contributor) commented Mar 12, 2025

@pierre-monnet thank you for the PR, much appreciated. There are a few issues, so I will continue improving it; I hope that's fine with you. We have other PRs in flight and I would like to get this one done as soon as possible to avoid conflicts, since it introduces a lot of changes.

@mwojtyczka mwojtyczka requested a review from alexott March 12, 2025 12:04
@mwojtyczka mwojtyczka requested a review from Copilot March 12, 2025 13:15

Copilot AI left a comment

Pull Request Overview

This PR refactors and extends the data quality checking functionality by updating the DQRule signature to include context details (such as run_time) and by changing how result schemas are handled. Key changes include:

  • Introducing dq_result_schema and updating cast types in engine.py for consistent structured output.
  • Refactoring build_checks_by_metadata to use deep copies of arguments and new naming conventions (illustrated in the sketch after this list).
  • Adjusting tests and demos to align with the new DQRule interface and context handling.
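Why the deep copies matter, as a self-contained illustration (the dict layout loosely mimics DQX's metadata check definitions; names are illustrative):

```python
# Without copy.deepcopy, mutating the arguments of a built rule would
# also mutate the caller's original check definition in place.
import copy

check = {"check": {"function": "is_not_null", "arguments": {"col_names": ["a"]}}}

args = copy.deepcopy(check["check"]["arguments"])
args["col_names"].append("b")

# the original check definition is untouched
assert check["check"]["arguments"]["col_names"] == ["a"]
```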

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated no comments.

  • src/databricks/labs/dqx/schema/__init__.py: exports the new dq_result_schema
  • tests/resources/all_checks.yaml: updates the YAML checks configuration
  • src/databricks/labs/dqx/schema/dq_result_schema.py: introduces the new schema used for error/warning output
  • src/databricks/labs/dqx/engine.py: updates build_checks_by_metadata and _create_results_map to use the new schema and context (e.g. run_time)
  • tests/unit/test_apply_checks.py: revises tests to use the new DQRule signature and verify the new dq_result_schema output
  • tests/conftest.py: adjusts file path string quotes for consistency
  • tests/unit/test_build_rules.py: updates rule construction and metadata building to work with the refactored DQRule interface
  • demos/dqx_demo_library.py: updates examples to use the new check_func and col_name parameters
  • src/databricks/labs/dqx/rule.py: refactors DQRule to use check_func along with contextual information and introduces _get_check for consistency
@mwojtyczka mwojtyczka requested a review from Copilot March 12, 2025 14:00

Copilot AI left a comment

Pull Request Overview

This PR refactors how data quality rules are defined and processed by updating the DQRule signature and related engine functionality. Key changes include replacing the legacy “check” field with “check_func” and its associated arguments, updating demo notebooks to use the new restart magic, and adjusting tests and YAML configurations to reflect these changes.

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated no comments.

  • src/databricks/labs/dqx/schema/__init__.py: exposes the new dq_result_schema via the module's public interface
  • tests/resources/all_checks.yaml: updates the YAML checks definitions with no structural issues
  • src/databricks/labs/dqx/schema/dq_result_schema.py: introduces the dq_result_schema using PySpark types
  • demos/dqx_demo_library.py: adapts demos to the new DQRule signature and migrates to %restart_python
  • src/databricks/labs/dqx/rule.py: refactors DQRule to use check_func and an internal _get_check method
  • src/databricks/labs/dqx/engine.py: updates build_checks_by_metadata and _create_results_map for the new rule format
  • demos/dqx_dlt_demo.py: simplifies the restart command to %restart_python
  • tests/unit/test_apply_checks.py: adjusts tests to compare schemas using the new dq_result_schema and a fixed run_time
  • tests/conftest.py, demos/dqx_demo_tool.py, tests/unit/test_build_rules.py: minor quote-style and expected-output updates reflecting the core changes
Comments suppressed due to low confidence (3)

src/databricks/labs/dqx/engine.py:254

  • [nitpick] Ensure that using dq_result_schema for casting the empty errors and warnings columns is consistent across the codebase. Consider extracting this cast into a helper function to avoid duplication.
F.lit(None).cast(dq_result_schema).alias(self._column_names[ColumnArguments.ERRORS])
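The suggested helper could look like this (a sketch; the function name is illustrative, and dq_result_schema is imported from the schema module this PR introduces):

```python
# Sketch of the suggested extraction; empty_result_column is an
# illustrative name, dq_result_schema comes from this PR's schema module.
from pyspark.sql import Column, functions as F

from databricks.labs.dqx.schema import dq_result_schema


def empty_result_column(alias: str) -> Column:
    """Typed NULL column reused for both the errors and warnings outputs."""
    return F.lit(None).cast(dq_result_schema).alias(alias)
```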

src/databricks/labs/dqx/engine.py:282

  • Verify that using F.filter with a lambda function on an array of structs is fully supported in your target Spark version to avoid runtime issues.
m_col = F.filter(F.array(*check_cols), lambda v: v.getField("rule").isNotNull())
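For reference, the flagged pattern in isolation; F.filter with a Python lambda is a higher-order DataFrame function available since Spark 3.1:

```python
# Self-contained illustration of the flagged pattern: F.filter with a
# lambda over an array<struct> column (requires Spark 3.1+).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [([("r1",), (None,)],)],
    "checks: array<struct<rule: string>>",
)
df.select(
    F.filter("checks", lambda v: v.getField("rule").isNotNull()).alias("non_null")
).show(truncate=False)
```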

tests/unit/test_apply_checks.py:59

  • [nitpick] Using ignore_nullable may mask potential schema mismatches. Confirm that schema nullability differences are acceptable for production use or consider stricter schema validation.
assert_df_equality(df, expected_df, ignore_nullable=True)
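What ignore_nullable actually relaxes, shown in isolation (a sketch using chispa directly; the frames differ only in declared nullability):

```python
# Isolated illustration of chispa's ignore_nullable: same data and types,
# only the declared nullability of the column differs.
from chispa import assert_df_equality
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
nullable = spark.createDataFrame(
    [("a",)], StructType([StructField("c", StringType(), nullable=True)])
)
non_nullable = spark.createDataFrame(
    [("a",)], StructType([StructField("c", StringType(), nullable=False)])
)

assert_df_equality(nullable, non_nullable, ignore_nullable=True)  # passes
# without ignore_nullable the same call would fail on a schema mismatch
```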
Successfully merging this pull request may close these issues.

[FEATURE]: Add context info to the output