# How to set up data quality monitoring and data observability for Google BigQuery

Data observability and data monitoring for Google BigQuery. Detect schema changes, data anomalies, volume fluctuations, and other data quality issues.
BigQuery is a fully managed enterprise data warehouse from Google Cloud that helps manage and analyze data with built-in features like machine learning, geospatial analysis, and business intelligence.
To add a BigQuery data source connection to DQOps, you need the following:
- A BigQuery service account with BigQuery > BigQuery Job User permission
- A service account key in JSON format for JSON key authentication; a sample gcloud command for creating one is shown after this list. For details, refer to Create and delete service account keys
- A working Google Cloud CLI if you want to use Google Application Credentials authentication
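If you plan to use JSON key authentication and already have a service account, a key file can be created with the Google Cloud CLI. This is a minimal sketch; the service account e-mail and key file name below are placeholders that you should replace with your own values.

```bash
# Hypothetical example: creates a new JSON key for the given service account
# and saves it to bigquery-key.json in the current directory.
gcloud iam service-accounts keys create bigquery-key.json \
    --iam-account=my-dqops-sa@my-project.iam.gserviceaccount.com
```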
To navigate to the BigQuery connection settings:
- Go to the Data Sources section and click the + Add connection button in the upper left corner.
- Select the BigQuery database type.
After navigating to the BigQuery connection settings, you will need to fill in its details.
| BigQuery connection settings | Property name in YAML configuration file | Description |
|------------------------------|-------------------------------------------|-------------|
| Connection name | | The name of the connection that will be created in DQOps. This will also be the name of the folder where the connection configuration files are stored. The name of the connection must be unique and consist of alphanumeric characters, hyphens, and underscores. |
| Source GCP project ID | source_project_id | The name of the project that has the datasets that will be imported. |
| Authentication mode to the Google Cloud | authentication_mode | The type of authentication mode to the Google Cloud. You can select from three options: Google Application Credentials, JSON Key Content, or JSON Key Path. |
| GCP project to create BigQuery jobs, where the authenticated principal has bigquery.jobs.create permission | jobs_create_project | The Google Cloud Platform project which will be used to create BigQuery jobs. In this project, the authenticated user must have the bigquery.jobs.create permission. You can select from three options: Create jobs in source project, Create jobs in default project from credentials, or Create jobs in selected billing project ID. |
| Billing GCP project ID | billing_project_id | The ID of the selected billing GCP project. In this project, the authenticated user must have the bigquery.jobs.create permission. This field is active when the "Create jobs in selected billing project ID" option is selected. |
| Quota GCP project ID | quota_project_id | The Google Cloud Platform project ID which is used for invocation. |
DQOps allows you to dynamically replace properties in connection settings with environment variables. To use it, simply change "clear text" to ${ENV_VAR} using the drop-down menu at the end of the variable entry field and type your variable.
For example:
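The Billing GCP project ID field could be set to ${BIGQUERY_BILLING_PROJECT} instead of a literal project name (the variable name here is only a placeholder; the variable must be defined in the environment where DQOps runs). In the stored connection YAML, such a reference would look roughly like this sketch:

```yaml
# Hypothetical fragment of a connection configuration that reads the
# billing project ID from an environment variable.
spec:
  provider_type: bigquery
  bigquery:
    source_project_id: my-source-project
    billing_project_id: ${BIGQUERY_BILLING_PROJECT}
```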
After filling in the connection settings, click the Test Connection button to test the connection.
Click the Save connection button when the test is successful; otherwise, you can check the details of what went wrong.
- Import the selected data resources (source schemas and tables) by clicking on the Import Tables button next to the name of the source schema from which you want to import tables.
- Select the tables you want to import or import all tables using the buttons in the upper right corner.
Upon import, you will receive a notification that new tables have been imported. You can then begin collecting basic statistics and profiling data by running default data profiling checks. Simply click on the Start profiling button to initiate this process.
!!! info "Automatically activated checks"
Once new tables are imported, DQOps automatically activates [profiling and monitoring checks](../dqo-concepts/definition-of-data-quality-checks/index.md) that are pre-enabled by [data quality policies](../dqo-concepts/data-observability.md#automatic-activation-of-checks).
These checks detect volume anomalies, data freshness anomalies, empty tables, table availability, schema changes, anomalies in the count of distinct values, and null percent anomalies. The profiling checks are scheduled
to run at 12:00 p.m. on the 1st day of every month, and the monitoring checks are scheduled to run daily at 12:00 p.m.
[**Profiling checks**](../dqo-concepts/definition-of-data-quality-checks/data-profiling-checks.md) are designed to assess
the initial data quality score of a data source. Profiling checks are also useful for exploring and experimenting with
various types of checks and determining the most suitable ones for regular data quality monitoring.
[**Monitoring checks**](../dqo-concepts/definition-of-data-quality-checks/data-observability-monitoring-checks.md) are
standard checks that monitor the data quality of a table or column. They can also be referred to as **Data Observability** checks.
These checks capture a single data quality result for the entire table or column.
To add a connection, run the following command in the DQOps Shell.
dqo> connection add
Fill in the data you will be asked for.
Connection name (--name): connection1
Database provider type (--provider):
[ 1] bigquery
[ 2] clickhouse
[ 3] databricks
[ 4] db2
[ 5] duckdb
[ 6] hana
[ 7] mariadb
[ 8] mysql
[ 9] oracle
[10] postgresql
[11] presto
[12] questdb
[13] redshift
[14] snowflake
[15] spark
[16] sqlserver
[17] teradata
[18] trino
Please enter one of the [] values: 1
Source GCP project ID (--bigquery-source-project-id") [dqo-ai-testing]: dqo-ai-testing
Billing GCP project ID (--bigquery-billing-project-id), leave null to use the default GCP project from credentials (dqo-ai-testing):
GCP Authentication Mode (--bigquery-authentication-mode) [google_application_credentials]:
[ 1] google_application_credentials (default)
[ 2] json_key_content
[ 3] json_key_path
Please enter one of the [] values:
GCP quota (billing) project ID (--bigquery-quota-project-id), leave blank to use the default GCP project from credentials (dqo-ai-testing):
Connection connection1 was successfully added.
Run 'table import -c=connection1' to import tables.
You can also run the command with parameters to add a connection in just a single step.
dqo> connection add --name=connection1
--provider=bigquery
--bigquery-source-project-id=bigquery-public-data
--bigquery-billing-project-id=dqo-ai-testing
--bigquery-quota-project-id=dqo-ai-testing
--bigquery-authentication-mode=google_application_credentials
After adding a connection, run table import -c=connection1 to select schemas and import tables.
DQOps will ask you to select the schema from which the tables will be imported.
You can also add the schema and table name as parameters to import tables in just a single step.
dqo> table import --connection={connection name}
--schema={schema name}
--table={table name}
DQOps supports the use of the asterisk character * as a wildcard when selecting schemas and tables, which can substitute any number of characters. For example, use pub* to find all schemas with names starting with "pub". The * character can be used at the beginning, middle, or end of the name.
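For instance, assuming a connection named connection1, the command below would import every table from all schemas whose names start with "pub" (the schema prefix is only an example):

```
dqo> table import --connection=connection1
                  --schema=pub*
                  --table=*
```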
Connection configurations are stored in the YAML files in the ./sources
folder. The name of the connection is also
the name of the folder where the configuration file is stored.
Below is a sample YAML file showing an example configuration of the BigQuery data source connection.
apiVersion: dqo/v1
kind: source
spec:
  provider_type: bigquery
  bigquery:
    source_project_id: project1
    jobs_create_project: create_jobs_in_default_project_from_credentials
    billing_project_id: XXXXXXX
    authentication_mode: google_application_credentials
    quota_project_id: project
Complete documentation of all connection parameters used in the spec.bigquery node is described in the reference section of the BigQueryParametersSpec YAML file format.
Application Default Credentials is a strategy used by the Google authentication libraries to automatically find credentials based on the application environment.
DQOps allows authentication using Google Application Credentials.
To provide your user credentials to DQOps, use the Google Cloud CLI:
- Install the gcloud CLI, if you haven't already.
- Run the gcloud auth application-default login command in your shell or command line to create your credential file.
- Log in at the login screen. The credentials will apply to all API calls that make use of the Application Default Credentials client library.

This method uses the Authentication mode to the Google Cloud set to Google Application Credentials.
With DQOps, you can configure credentials to access Google Cloud directly in the platform.
Please note that any credentials and secrets shared with the DQOps Cloud or DQOps SaaS instances are stored in the .credentials folder. This folder also contains the default credentials file for GCP named GCP_application_default_credentials.json.
$DQO_USER_HOME
├───...
└───.credentials
├───GCP_application_default_credentials.json
└─...
If you wish to use JSON key authentication, the content of the file must be replaced with a service account key in JSON format. You can find more details on how to Create and delete service account keys in the Google Cloud documentation.
!!! warning
If you do not replace the content of the file, the Application Default Credentials will be used.
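For reference, a service account key file downloaded from Google Cloud is a JSON document roughly like the sketch below; every value here is a placeholder, and a real key also contains additional certificate URL fields.

```json
{
  "type": "service_account",
  "project_id": "my-project",
  "private_key_id": "0123456789abcdef",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "my-dqops-sa@my-project.iam.gserviceaccount.com",
  "client_id": "123456789012345678901",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token"
}
```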
To set the credential file in DQOps, follow these steps:
- Open the Configuration section in the menu.
- Select Shared credentials from the tree view on the left.
- Click the edit link on the "GCP_application_default_credentials.json" file.
- In the text area, paste the key in JSON format, replacing the placeholder text.
- Click the Save button to save changes.
!!! tip "Use the Application Default Credentials after filling in the shared credential"
If you still want to use default credentials from Google Cloud CLI,
you must manually delete the .credentials/GCP_application_default_credentials.json file from the DQOps credentials.
- We have provided a variety of use cases that use openly available datasets from Google Cloud to help you use DQOps effectively. You can find the complete list of use cases here.
- DQOps allows you to keep track of the issues that arise during data quality monitoring and send alert notifications directly to Slack. Learn more about incidents and notifications.
- The data in the table often comes from different data sources and vendors or is loaded by different data pipelines. Learn how data grouping in DQOps can help you calculate separate data quality KPI scores for different groups of rows.