
Commit ed300d9

feat: refactored tests and added tests

2 parents: 0d99710 + 27bc038

28 files changed: +1154 -551 lines

demos/dqx_demo_library.py (+2)

@@ -104,6 +104,7 @@
 """)
 
 dq_engine = DQEngine(WorkspaceClient())
+
 status = dq_engine.validate_checks(checks)
 print(status.has_errors)
 print(status.errors)
@@ -334,5 +335,6 @@ def ends_with_foo(col_name: str) -> Column:
 input_df = spark.createDataFrame([["str1"], ["foo"], ["str3"]], schema)
 
 dq_engine = DQEngine(WorkspaceClient())
+
 valid_and_quarantined_df = dq_engine.apply_checks_by_metadata(input_df, checks, globals())
 display(valid_and_quarantined_df)
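For context on the hunks above: `validate_checks` returns a status object that can be inspected before the checks are applied, and `apply_checks_by_metadata` reports issues as extra columns. A minimal sketch of that flow, assuming `spark` is available as in the demo notebook; the YAML rule, column name, and table name are illustrative placeholders, not part of this commit:

```python
import yaml
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

# illustrative check definition; "col1" is a placeholder column name
checks = yaml.safe_load("""
- criticality: error
  check:
    function: is_not_null_and_not_empty
    arguments:
      col_name: col1
""")

dq_engine = DQEngine(WorkspaceClient())

# validate the check definitions before applying them
status = dq_engine.validate_checks(checks)
if status.has_errors:
    raise ValueError(str(status.errors))

# apply the checks and report issues as additional columns
input_df = spark.read.table("catalog1.schema1.table1")  # placeholder table
valid_and_quarantined_df = dq_engine.apply_checks_by_metadata(input_df, checks)
```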

demos/dqx_demo_tool.py (+6 -6)

@@ -84,7 +84,7 @@
 
 ws = WorkspaceClient()
 dq_engine = DQEngine(ws)
-run_config = dq_engine.load_run_config(run_config="default", assume_user=True)
+run_config = dq_engine.load_run_config(run_config_name="default", assume_user=True)
 
 # read the input data, limit to 1000 rows for demo purpose
 input_df = spark.read.format(run_config.input_format).load(run_config.input_location).limit(1000)
@@ -101,14 +101,14 @@
 print(yaml.safe_dump(checks))
 
 # save generated checks to location specified in the default run configuration inside workspace installation folder
-dq_engine.save_checks(checks, run_config_name="default")
+dq_engine.save_checks_in_installation(checks, run_config_name="default")
 # or save it to an arbitrary workspace location
 #dq_engine.save_checks_in_workspace_file(checks, workspace_path="/Shared/App1/checks.yml")
 
 # COMMAND ----------
 
 # MAGIC %md
-# MAGIC ### Prepare checks manually (optional)
+# MAGIC ### Prepare checks manually and save in the workspace (optional)
 # MAGIC
 # MAGIC You can modify the check candidates generated by the profiler to suit your needs. Alternatively, you can create checks manually, as demonstrated below, without using the profiler.
 
@@ -161,7 +161,7 @@
 
 dq_engine = DQEngine(WorkspaceClient())
 # save checks to location specified in the default run configuration inside workspace installation folder
-dq_engine.save_checks(checks, run_config_name="default")
+dq_engine.save_checks_in_installation(checks, run_config_name="default")
 # or save it to an arbitrary workspace location
 #dq_engine.save_checks_in_workspace_file(checks, workspace_path="/Shared/App1/checks.yml")
 
@@ -175,7 +175,7 @@
 from databricks.labs.dqx.engine import DQEngine
 from databricks.sdk import WorkspaceClient
 
-run_config = dq_engine.load_run_config(run_config="default", assume_user=True)
+run_config = dq_engine.load_run_config(run_config_name="default", assume_user=True)
 
 # read the data, limit to 1000 rows for demo purpose
 bronze_df = spark.read.format(run_config.input_format).load(run_config.input_location).limit(1000)
@@ -186,7 +186,7 @@
 dq_engine = DQEngine(WorkspaceClient())
 
 # load checks from location defined in the run configuration
-checks = dq_engine.load_checks(assume_user=True, run_config_name="default")
+checks = dq_engine.load_checks_from_installation(assume_user=True, run_config_name="default")
 # or load checks from arbitrary workspace file
 # checks = dq_engine.load_checks_from_workspace_file(workspace_path="/Shared/App1/checks.yml")
 print(checks)
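The renames in this file all concern the installation-scoped API: the keyword argument is now `run_config_name`, and `save_checks`/`load_checks` become `save_checks_in_installation`/`load_checks_from_installation`. A minimal sketch assembling the renamed calls as used in this demo; it assumes DQX is installed in the workspace with a "default" run configuration and that `spark` is available as in the notebook:

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

ws = WorkspaceClient()
dq_engine = DQEngine(ws)

# the keyword is now run_config_name (was run_config)
run_config = dq_engine.load_run_config(run_config_name="default", assume_user=True)
input_df = spark.read.format(run_config.input_format).load(run_config.input_location)

# load_checks -> load_checks_from_installation (save_checks -> save_checks_in_installation)
checks = dq_engine.load_checks_from_installation(assume_user=True, run_config_name="default")
valid_df, quarantined_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
```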

docs/dqx/docs/demos.mdx (+2 -3)

@@ -4,8 +4,7 @@ sidebar_position: 4
 
 # Demos
 
-After the [installation](/docs/installation) of the framework,
-you can import the following notebooks in the Databricks workspace to try it out:
+Install the [installation](/docs/installation) framework, and import the following notebooks in the Databricks workspace to try it out:
 * [DQX Demo Notebook (library)](https://github.com/databrickslabs/dqx/blob/main/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
-* [DQX Demo Notebook (tool)](https://github.com/databrickslabs/dqx/blob/main/demos/dqx_demo_tool.py) - demonstrates how to use DQX when installed in the workspace, including usage of DQX dashboards.
+* [DQX Demo Notebook (tool)](https://github.com/databrickslabs/dqx/blob/main/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.
 * [DQX DLT Demo Notebook](https://github.com/databrickslabs/dqx/blob/main/demos/dqx_dlt_demo.py) - demonstrates how to use DQX with Delta Live Tables (DLT).

docs/dqx/docs/dev/contributing.mdx (+8 -7)

@@ -93,12 +93,9 @@ make lint
 make test
 ```
 
-Configure auth to Databricks workspace for integration testing by configuring credentials.
-
-If you want to run the tests from an IDE you must setup `.env` or `~/.databricks/debug-env.json` file
-(see [instructions](https://github.com/databrickslabs/pytester?tab=readme-ov-file#debug_env_name-fixture)).
-
-Setup required environment variables for executing integration tests and code coverage:
+Setup required environment variables for executing integration tests and code coverage using the command line.
+Note that integration tests are run automatically when you create a Pull Request in Github.
+You can also run them from a local machine by configuring authentication to a Databricks workspace as below:
 ```shell
 export DATABRICKS_HOST=https://<workspace-url>
 export DATABRICKS_CLUSTER_ID=<cluster-id>
@@ -119,9 +116,13 @@ Calculate test coverage and display report in html:
 make coverage
 ```
 
+If you want to be able to run integration tests from your IDE, you must setup `.env` or `~/.databricks/debug-env.json` file
+(see [instructions](https://github.com/databrickslabs/pytester?tab=readme-ov-file#debug_env_name-fixture)).
+The name of the debug environment that you define must be `ws`.
+
 ## Running CLI from the local repo
 
-Once you clone the repo locally and install Databricks CLI you can run labs CLI commands.
+Once you clone the repo locally and install Databricks CLI you can run labs CLI commands from the root of the repository.
 Similar to other databricks cli commands we can specify profile to use with `--profile`.
 
 Authenticate your current machine to your Databricks Workspace:
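The added IDE note above points to `~/.databricks/debug-env.json` with a debug environment named `ws`. As a rough illustration only: the sketch below assumes the pytester file layout (a top-level key per debug environment mapping to environment variables), which is an assumption based on the linked pytester instructions, not something defined in this commit.

```python
# Hypothetical helper: writes ~/.databricks/debug-env.json for IDE test runs.
# Assumes the pytester debug-env format: one top-level key per debug environment
# ("ws" here, as required above) mapping to the environment variables to inject.
import json
from pathlib import Path

debug_env = {
    "ws": {
        "DATABRICKS_HOST": "https://<workspace-url>",
        "DATABRICKS_CLUSTER_ID": "<cluster-id>",
        # add whatever auth variables your workspace needs (e.g. a token or profile)
    }
}

path = Path.home() / ".databricks" / "debug-env.json"
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(debug_env, indent=2))
```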

docs/dqx/docs/guide.mdx (+57 -24)

@@ -24,22 +24,23 @@ from databricks.labs.dqx.profiler.dlt_generator import DQDltGenerator
 from databricks.labs.dqx.engine import DQEngine
 from databricks.sdk import WorkspaceClient
 
-df = spark.read.table("catalog1.schema1.table1")
+input_df = spark.read.table("catalog1.schema1.table1")
 
+# profile input data
 ws = WorkspaceClient()
 profiler = DQProfiler(ws)
-summary_stats, profiles = profiler.profile(df)
+summary_stats, profiles = profiler.profile(input_df)
 
 # generate DQX quality rules/checks
 generator = DQGenerator(ws)
 checks = generator.generate_dq_rules(profiles) # with default level "error"
 
-# save checks in the workspace
 dq_engine = DQEngine(ws)
-# in arbitrary workspace location
+
+# save checks in arbitrary workspace location
 dq_engine.save_checks_in_workspace_file(checks, workspace_path="/Shared/App1/checks.yml")
-# in workspace location specified in the run config (only works if DQX is installed in the workspace)
-dq_engine.save_checks(checks, run_config_name="default")
+# save checks in the installation folder specified in the default run config (only works if DQX is installed in the workspace)
+dq_engine.save_checks_in_installation(checks, run_config_name="default")
 
 # generate DLT expectations
 dlt_generator = DQDltGenerator(ws)
@@ -153,9 +154,9 @@ Fields:
 - `check`: column expression containing "function" (check function to apply), "arguments" (check function arguments), and "col_name" (column name as `str` the check will be applied for) or "col_names" (column names as `array` the check will be applied for).
 - (optional) `name` for the check: autogenerated if not provided.
 
-#### Loading and execution methods
+### Loading and execution methods
 
-**Method 1: load checks from a workspace file in the installation folder**
+#### Method 1: Loading checks from a workspace file in the installation folder
 
 If DQX is installed in the workspace, you can load checks based on the run configuration:
 
@@ -164,9 +165,10 @@ from databricks.labs.dqx.engine import DQEngine
 from databricks.sdk import WorkspaceClient
 
 dq_engine = DQEngine(WorkspaceClient())
-
 # load check file specified in the run configuration
-checks = dq_engine.load_checks(assume_user=True, run_config_name="default")
+checks = dq_engine.load_checks_from_installation(assume_user=True, run_config_name="default")
+
+input_df = spark.read.table("catalog1.schema1.table1")
 
 # Option 1: apply quality rules on the dataframe and provide valid and invalid (quarantined) dataframes
 valid_df, quarantined_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
@@ -175,9 +177,7 @@ valid_and_quarantined_df = dq_engine.apply_checks_by_metadata_and_split(input_df
 valid_and_quarantined_df = dq_engine.apply_checks_by_metadata(input_df, checks)
 ```
 
-Checks are validated automatically as part of the `apply_checks_by_metadata_and_split` and `apply_checks_by_metadata` methods.
-
-**Method 2: load checks from a workspace file**
+#### Method 2: Loading checks from a workspace file
 
 The checks can also be loaded from any file in the Databricks workspace:
 
@@ -188,6 +188,8 @@ from databricks.sdk import WorkspaceClient
 dq_engine = DQEngine(WorkspaceClient())
 checks = dq_engine.load_checks_from_workspace_file(workspace_path="/Shared/App1/checks.yml")
 
+input_df = spark.read.table("catalog1.schema1.table1")
+
 # Option 1: apply quality rules on the dataframe and provide valid and invalid (quarantined) dataframes
 valid_df, quarantined_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
 
@@ -197,7 +199,7 @@ valid_and_quarantined_df = dq_engine.apply_checks_by_metadata(input_df, checks)
 
 Checks are validated automatically as part of the `apply_checks_by_metadata_and_split` and `apply_checks_by_metadata` methods.
 
-**Method 3: load checks from a local file**
+#### Method 3: Loading checks from a local file
 
 Checks can also be loaded from a file in the local file system:
 
@@ -208,6 +210,8 @@ from databricks.sdk import WorkspaceClient
 checks = DQEngine.load_checks_from_local_file("checks.yml")
 dq_engine = DQEngine(WorkspaceClient())
 
+input_df = spark.read.table("catalog1.schema1.table1")
+
 # Option 1: apply quality rules on the dataframe and provide valid and invalid (quarantined) dataframes
 valid_df, quarantined_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
 
@@ -217,13 +221,15 @@ valid_and_quarantined_df = dq_engine.apply_checks_by_metadata(input_df, checks)
 
 ### Quality rules defined as code
 
-**Method 1: using DQX classes**
+#### Method 1: Using DQX classes
 
 ```python
 from databricks.labs.dqx.col_functions import is_not_null, is_not_null_and_not_empty, value_is_in_list
-from databricks.labs.dqx.engine import DQEngine, DQRuleColSet, DQRule
+from databricks.labs.dqx.engine import DQEngine
+from databricks.labs.dqx.rule import DQRuleColSet, DQRule
 from databricks.sdk import WorkspaceClient
 
+
 dq_engine = DQEngine(WorkspaceClient())
 
 checks = DQRuleColSet( # define rule for multiple columns at once
@@ -239,16 +245,18 @@ checks = DQRuleColSet( # define rule for multiple columns at once
 check=value_is_in_list('col4', ['1', '2']))
 ]
 
+input_df = spark.read.table("catalog1.schema1.table1")
+
 # Option 1: apply quality rules on the dataframe and provide valid and invalid (quarantined) dataframes
 valid_df, quarantined_df = dq_engine.apply_checks_and_split(input_df, checks)
 
 # Option 2: apply quality rules on the dataframe and report issues as additional columns (`_warning` and `_error`)
 valid_and_quarantined_df = dq_engine.apply_checks(input_df, checks)
 ```
 
-See details of the check functions [here](/docs/reference#quality-rules--functions).
+See details of the check functions [here](/docs/reference#quality-rules).
 
-**Method 2: using yaml config**
+#### Method 2: Using yaml config
 
 ```python
 import yaml
@@ -282,22 +290,24 @@ checks = yaml.safe_load("""
 - 2
 """)
 
+input_df = spark.read.table("catalog1.schema1.table1")
+
 # Option 1: apply quality rules on the dataframe and provide valid and invalid (quarantined) dataframes
 valid_df, quarantined_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
 
 # Option 2: apply quality rules on the dataframe and report issues as additional columns (`_warning` and `_error`)
 valid_and_quarantined_df = dq_engine.apply_checks_by_metadata(input_df, checks)
 ```
 
-See details of the check functions [here](/docs/reference/#quality-rules--functions).
+See details of the check functions [here](/docs/reference#quality-rules).
 
 ### Integration with DLT (Delta Live Tables)
 
 DLT provides [expectations](https://docs.databricks.com/en/delta-live-tables/expectations.html) to enforce data quality constraints. However, expectations don't offer detailed insights into why certain checks fail.
 The example below demonstrates how to integrate DQX with DLT to provide comprehensive quality information.
-The DQX integration does not use expectations with DLT but DQX own methods.
+The DQX integration with DLT does not use DLT Expectations but DQX own methods.
 
-**Option 1: apply quality rules and quarantine bad records**
+#### Option 1: Apply quality rules and quarantine bad records
 
 ```python
 import dlt
@@ -326,7 +336,7 @@ def quarantine():
 return dq_engine.get_invalid(df)
 ```
 
-**Option 2: apply quality rules as additional columns (`_warning` and `_error`)**
+#### Option 2: Apply quality rules and report issues as additional columns
 
 ```python
 import dlt
@@ -367,6 +377,29 @@ After executing the command:
 Note: the dashboards are only using the quarantined data as input as defined during the installation process.
 If you change the quarantine table in the run config after the deployment (`quarantine_table` field), you need to update the dashboard queries accordingly.
 
-## Explore Quality Rules and Create Custom Checks
+## Quality Rules and Creation of Custom Checks
+
+Discover the full list of available data quality rules and learn how to define your own custom checks in our [Reference](/docs/reference#quality-rules) section.
+
+## Details on DQX Engine and Workspace Client
+
+To perform data quality checking with DQX, you need to create `DQEngine` object.
+The engine requires a Databricks workspace client for authentication and interaction with the Databricks workspace.
+
+When running the code on a Databricks workspace (e.g. in a notebook or as a job), the workspace client is automatically authenticated.
+For external environments (e.g. CI servers or local machines), you can authenticate using any method supported by the Databricks SDK. Detailed instructions are available in the [default authentication flow](https://databricks-sdk-py.readthedocs.io/en/latest/authentication.html#default-authentication-flow).
+
+If you use Databricks [configuration profiles](https://docs.databricks.com/dev-tools/auth.html#configuration-profiles) or Databricks-specific [environment variables](https://docs.databricks.com/dev-tools/auth.html#environment-variables) for authentication, you only need the following code to create a workspace client:
+```python
+from databricks.sdk import WorkspaceClient
+from databricks.labs.dqx.engine import DQEngine
+
+ws = WorkspaceClient()
+
+# use the workspace client to create the DQX engine
+dq_engine = DQEngine(ws)
+```
+
+For details on the specific methods available in the engine, visit to the [reference](/docs/reference#dq-engine-methods) section.
 
-Discover the full list of available data quality rules and learn how to define your own custom checks in our [Reference](/docs/reference) section.
+Information on testing applications that use `DQEngine` can be found [here](/docs/reference#testing-applications-using-dqx).
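The new "Details on DQX Engine and Workspace Client" section above mentions configuration profiles. As a small follow-on sketch: passing a named profile is standard Databricks SDK usage rather than something introduced by this commit, and the profile name below is a placeholder.

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

# authenticate with a named profile from ~/.databrickscfg ("my-profile" is a placeholder)
ws = WorkspaceClient(profile="my-profile")

# the authenticated client is then handed to the DQX engine as in the guide
dq_engine = DQEngine(ws)
```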
