
Commit 2c485d9

Patryk Jatczak authored and piotrczarnas committed
Merged PR 2562: Getting started use CSV file as example
**IMPORTANT** Getting started use a github link to file which does not exist. Updating the github repository is required. Related work items: #11209
2 parents 99d2389 + 4ba6105 commit 2c485d9

File tree

5 files changed

+1137
-65
lines changed


docs/getting-started/add-data-source-connection.md

+61-38
Original file line numberDiff line numberDiff line change
@@ -3,8 +3,8 @@ This guide shows how to connect a data source to DQOps, import the metadata, and
33

44
## Overview
55

6-
After [installation and starting DQOps](installation.md), we describe how to add a connection to [BigQuery public dataset Austin Crime Data](https://console.cloud.google.com/marketplace/details/city-of-austin/austin-crime)
7-
using the user interface.
6+
After [installation and starting DQOps](installation.md), we describe how to add a connection to a CSV file using the user interface.
7+
The example file used in this **guide** is presented below.
88

99
For a full description of how to add a data source connection to other providers or add connection using the command-line shell,
1010
see [Working with DQOps section](../data-sources/index.md).
@@ -17,69 +17,92 @@ You can find more information about [navigating the DQOps user interface here](.
1717

1818
Links to some supported data sources are shown below.
1919

20-
[![CSV](https://dqops.com/docs/images/connections/csv-icon2.png){ class=glightbox-ignored-image }](../data-sources/csv.md)
21-
      [![Parquet](https://dqops.com/docs/images/connections/parquet-icon2.png){ class=glightbox-ignored-image }](../data-sources/parquet.md)
20+
[![Parquet](https://dqops.com/docs/images/connections/parquet-icon2.png){ class=glightbox-ignored-image }](../data-sources/parquet.md)
2221
      [![Athena](https://dqops.com/docs/images/connections/athena2.png){ class=glightbox-ignored-image }](../data-sources/athena.md)
2322
      [![PostgreSQL](https://dqops.com/docs/images/connections/postgresql.png){ class=glightbox-ignored-image }](../data-sources/postgresql.md)
23+
      [![BigQuery](https://dqops.com/docs/images/connections/bigquery.png){ class=glightbox-ignored-image }](../data-sources/bigquery.md)
2424

25-
## Prerequisite credentials
25+
## CSV file
2626

27-
To add a connection to a BigQuery data source to DQOps you need the following:
27+
To add a connection to a CSV file data source in DQOps, you need a CSV file to analyze.
2828

29-
- A BiqQuery service account with **BigQuery > BigQuery Job User** permission. [You can create a free trial Google Cloud account here](https://cloud.google.com/free).
30-
- A service account key in JSON format for JSON key authentication. For details refer to [Create and delete service account keys](https://cloud.google.com/iam/docs/keys-create-delete).
31-
- A working [Google Cloud CLI](https://cloud.google.com/sdk/docs/install) if you want to use [Google Application Credentials authentication](../data-sources/bigquery.md#using-google-application-credentials-authentication).
29+
You can also download the CSV file used in this guide.
30+
The table below presents a fragment of its content.
3231

33-
We have chosen to use BigQuery data source for this getting started guide because public BigQuery datasets are freely available,
34-
and you can query them within the GCP FREE tier monthly limit.
32+
| unique_key | address | census_tract | clearance_date | clearance_status | council_district_code | description | district | latitude | longitude | location | location_description | primary_type | timestamp | x_coordinate | y_coordinate | year | zipcode |
33+
|------------|--------------------------------|--------------|--------------------------------|------------------|-----------------------|--------------------------|----------|----------|-----------|----------|----------------------|--------------|--------------------------------|--------------|--------------|------|---------|
34+
| 2015821204 | "1713 MULLEN DR Austin, TX" | | 2015-03-25 12:00:00.000000 UTC | Not cleared | | THEFT | UK | | | | 1713 MULLEN DR | Theft | 2015-03-23 12:00:00.000000 UTC | | | 2015 | |
35+
| 2015150483 | "Austin, TX" | | 2015-01-27 12:00:00.000000 UTC | Not cleared | | RAPE | B | | | | nan | Rape | 2015-01-15 12:00:00.000000 UTC | | | 2015 | |
36+
| 2015331540 | "5510 S IH 35 SVRD Austin, TX" | | 2015-02-11 12:00:00.000000 UTC | Not cleared | | BURGLARY OF VEHICLE | UK | | | | 5510 S IH 35 SVRD | Theft | 2015-02-02 12:00:00.000000 UTC | | | 2015 | |
37+
| 2015331238 | "7928 US HWY 71 W Austin, TX" | | 2015-02-12 12:00:00.000000 UTC | Not cleared | | THEFT OF HEAVY EQUIPMENT | UK | | | | 7928 US HWY 71 W | Theft | 2015-02-02 12:00:00.000000 UTC | | | 2015 | |
38+
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3539

36-
## Add BigQuery connection using the user interface
40+
The file is a sample of the Austin Crime data from the BigQuery public dataset Austin Crime Data.
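
As a quick sanity check after downloading, rows like those shown above can be parsed with Python's standard `csv` module. This is only an illustrative sketch: the two inlined rows and the reduced column set are copied from the table above, not read from the full file.

```python
import csv
import io

# A two-row excerpt of austin_crime.csv, reduced to a few columns
# from the table above (many optional columns are empty in the sample).
sample = io.StringIO(
    "unique_key,address,clearance_status,district,primary_type,year\n"
    '2015821204,"1713 MULLEN DR Austin, TX",Not cleared,UK,Theft,2015\n'
    '2015150483,"Austin, TX",Not cleared,B,Rape,2015\n'
)

rows = list(csv.DictReader(sample))
print(len(rows))                          # 2
print([r["primary_type"] for r in rows])  # ['Theft', 'Rape']
```

Note that quoted fields such as `"1713 MULLEN DR Austin, TX"` contain commas, which is why a real CSV parser should be used rather than a plain `split(",")`.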
41+
42+
### Downloading the example file
43+
44+
To download the example CSV file, [open the GitHub webpage](https://github.com/dqops/dqo/blob/develop/dqops/sampledata/files/csv/austin_crime_sample/austin_crime.csv).
45+
46+
On the right side, you will see a three-dots button. Click it and select **Download** from the expanded list.
47+
48+
![Adding connection](https://dqops.com/docs/images/getting-started/github-download.png)
49+
50+
Download the austin_crime.csv file and open the download directory containing it.
51+
52+
To separate the downloaded file from other files we will not work with, put it in a new folder.
53+
In this guide, we created a new folder named __demo_files__ directly on the drive.
54+
55+
![Adding connection](https://dqops.com/docs/images/getting-started/file-explorer.png)
56+
57+
In our example, the file is placed at C:\demo_files\austin_crime.csv
58+
59+
Remember the absolute path to the file because you will use it when configuring the connection.
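
Since DQOps expects an absolute path, a small helper like the sketch below can confirm that the file exists before you configure the connection. `verify_csv_path` is a hypothetical helper for this guide, not part of DQOps.

```python
from pathlib import Path

def verify_csv_path(path_str: str) -> str:
    """Return the absolute path to a CSV file, raising if it is unusable."""
    path = Path(path_str).expanduser().resolve()
    if path.suffix.lower() != ".csv":
        raise ValueError(f"Not a .csv file: {path}")
    if not path.is_file():
        raise FileNotFoundError(f"CSV file not found: {path}")
    return str(path)

# Example with the path used in this guide (adjust to your system):
# verify_csv_path(r"C:\demo_files\austin_crime.csv")
```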
60+
61+
## Add CSV connection using the user interface
3762

3863
### **Navigate to the connection settings**
3964

40-
To navigate to the BigQuery connection settings:
65+
To navigate to the CSV connection settings:
4166

4267
1. Go to the **Data Sources** section and click **+ Add connection** button in the upper left corner.
4368

4469
![Adding connection](https://dqops.com/docs/images/working-with-dqo/adding-connections/adding-connection.png)
4570

46-
2. Select **BiqQuery** database type.
71+
2. Select **CSV** connection type.
4772

48-
![Selecting BigQuery database type](https://dqops.com/docs/images/working-with-dqo/adding-connections/adding-connection-bigquery.png)
73+
![Selecting CSV database type](https://dqops.com/docs/images/working-with-dqo/adding-connections/adding-connection-csv.png)
4974

5075
### **Fill in the connection settings**
5176

52-
After navigating to the BigQuery connection settings, you will need to fill in the connection details.
77+
After navigating to the CSV connection settings, you will need to fill in the connection details.
78+
79+
Fill in the required fields only. **Leave all other fields untouched**, such as the virtual schema name.
5380

54-
![Adding connection settings](https://dqops.com/docs/images/working-with-dqo/adding-connections/connection-settings-bigquery.png)
81+
![Adding connection](https://dqops.com/docs/images/getting-started/connection-settings-csv-filled1.png)
5582

56-
| BigQuery connection settings | Description |
57-
|------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
58-
| Connection name | The name of the connection that will be created in DQO. This will also be the name of the folder where the connection configuration files are stored. The name of the connection must be unique and consist of alphanumeric characters, hyphens and underscore. For example, "**testconnection**" |
59-
| Source GCP project ID | Name of the project that has datasets that will be imported. In our example, it is "**bigquery-public-data**". |
60-
| Authentication mode to the Google Cloud | Type of authentication mode to the Google Cloud. You can select from the 3 options:<br/>- Google Application Credentials,<br/>- JSON Key Content<br/> - JSON Key Path |
61-
| GCP project to create BigQuery jobs, where the authenticated principal has bigquery.jobs.create permission | Google Cloud Platform project which will be used to create BigQuery jobs. In this project, the authenticated user must have bigquery.jobs.create permission. You can select from the 3 options:<br/>- Create jobs in source project<br/>- Create jobs in default project from credentials<br/> - Create jobs in selected billing project ID.<br/>Please pick the third option *Create jobs in selected billing project ID*. You will need your own GCP project where you have permission to run BigQuery jobs. |
62-
| Billing GCP project ID | The ID of the selected billing GCP project. In this project, the authenticated user must have bigquery.jobs.create permission. This field is active when you select the "Create jobs in selected billing project ID" option. <br/> Please fill this field with the name of your own GCP project where you have the right to run BigQuery jobs. Alternatively, it can be your testing project where you are the **owner**. |
63-
| Quota GCP project ID | The Google Cloud Platform project ID which is used for BigQuery quota. You can leave this field empty. |
83+
| CSV connection settings | Description |
84+
|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
85+
| Connection name | The name of the connection that will be created in DQOps. This will also be the name of the folder where the connection configuration files are stored. The name of the connection must be unique and consist of alphanumeric characters. |
86+
| Path                     | An absolute path prefix to the parent directory that contains the data. The virtual schema name maps to this directory.                                                                                                                   |
6487
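
The two constraints in the table (an alphanumeric connection name and an absolute path) can be checked up front. The sketch below only illustrates those constraints; it is not DQOps's own validation logic, and the exact character set DQOps accepts may be broader.

```python
import re
from pathlib import PurePosixPath, PureWindowsPath

def validate_csv_connection(connection_name: str, path: str) -> None:
    """Check CSV connection settings against the rules in the table above."""
    # Connection name: alphanumeric characters only (other providers also
    # allow hyphens and underscores, so this may be stricter than DQOps).
    if not re.fullmatch(r"[A-Za-z0-9]+", connection_name):
        raise ValueError(f"Invalid connection name: {connection_name!r}")
    # Path: must be absolute; accept both Windows and POSIX styles.
    if not (PureWindowsPath(path).is_absolute()
            or PurePosixPath(path).is_absolute()):
        raise ValueError(f"Path is not absolute: {path!r}")

validate_csv_connection("connectioncsv", r"C:\demo_files")  # passes silently
```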

6588
After filling in the connection settings, click the **Test Connection** button to test the connection.
89+
The test will inform you if the path to the CSV file is incorrect.
6690

6791
Click the **Save** connection button when the test is successful to add a new connection.
6892
Otherwise, you can check the details of what went wrong.
69-
7093

7194
## **Import metadata using the user interface**
7295

7396
When you add a new connection, it will appear in the tree view on the left, and you will be redirected to the Import Metadata screen.
7497
Now we can import schemas and tables.
7598

76-
1. Import the "austin_crime" schema by clicking on the **Import Tables** button.
99+
1. Import the "files" schema by clicking on the **Import Tables** button.
77100

78-
![Importing schemas](https://dqops.com/docs/images/getting-started/importing-schema-austin-crime.png)
101+
![Importing schemas](https://dqops.com/docs/images/getting-started/importing-schema-files.png)
79102

80-
2. There is only one table in the dataset. Import the table by clicking **Import all tables** buttons in the upper right corner.
103+
2. Select the austin_crime.csv file by marking the checkbox and import it by clicking the **Import selected tables** button in the upper right corner.
81104

82-
![Importing tables](https://dqops.com/docs/images/getting-started/importing-tables-austin-crime.png)
105+
![Importing tables](https://dqops.com/docs/images/getting-started/importing-tables-austin-crime-csv.png)
83106

84107

85108
## Initiate automatic monitoring and review scheduling
@@ -95,22 +118,22 @@ Within the Advisor, you can collect basic statistics, run profiling checks, or m
95118

96119
To run basic statistics and profiling checks, click on the appropriate buttons on the Advisor.
97120

98-
We will evaluate the results from basic statistics and profiling checks at the next step of the Getting started.
121+
We will evaluate the results from basic statistics and profiling checks at the next step of the Getting started.
99122

100-
![Running basic statistics and profiling checks](https://dqops.com/docs/images/getting-started/running-basics-statistics-and-profiling-checks.png)
123+
![Running basic statistics and profiling checks](https://dqops.com/docs/images/getting-started/running-basics-statistics-and-profiling-checks-csv.png)
101124

102125
### Review scheduling with the Advisor
103126

104-
To review scheduling for profiling and daily monitoring checks, click on the **Review scheduling** button.
127+
To review scheduling for profiling and daily monitoring checks, click on the **Review scheduling** button.
105128

106-
![Review scheduling](https://dqops.com/docs/images/getting-started/review-scheduling.png)
129+
![Review scheduling](https://dqops.com/docs/images/getting-started/review-scheduling-csv.png)
107130

108131
You will be linked to the **Data Source** section, **Schedule** tab, where you can review scheduling settings for the added connection.
109132

110133
The scheduling is enabled by default. You can turn it off by clicking the notification icon in the upper right corner and
111-
then clicking the **Job scheduler** toggle button.
134+
then clicking the **Job scheduler** toggle button.
112135

113-
![Reviewing data source details](https://dqops.com/docs/images/getting-started/reviewing-data-source-section2.png)
136+
![Reviewing data source details](https://dqops.com/docs/images/getting-started/reviewing-data-source-section-csv1.png)
114137

115138

116139
## Explore the connection-level tabs in the Data sources section
@@ -138,9 +161,9 @@ At the table level in the **Data sources** section, there are the following tabs
138161
- **Date and time columns** - allows [configuring event and ingestion timestamp columns for timeliness checks](../working-with-dqo/run-data-quality-checks.md#configure-event-and-ingestion-timestamp-columns-for-timeliness-checks), as well as [date or datetime column for partition checks](../working-with-dqo/run-data-quality-checks.md#configure-date-or-datetime-column-for-partition-checks).
139162
- **Incident configuration** - allows configuring incidents. [Learn more about incidents](../working-with-dqo/managing-data-quality-incidents-with-dqops.md) that let you keep track of the issues that arise during data quality monitoring.
140163

141-
You can check the details of the imported table by expanding the tree view on the left and selecting the "crime" table.
164+
You can check the details of the imported table by expanding the tree view on the left and selecting the "austin_crime.csv" table.
142165

143-
![Reviewing table details](https://dqops.com/docs/images/getting-started/reviewing-table-details.png)
166+
![Reviewing table details](https://dqops.com/docs/images/getting-started/reviewing-table-details-csv.png)
144167

145168
## Next step
146169

docs/getting-started/index.md

+10-8
Original file line numberDiff line numberDiff line change
@@ -2,19 +2,21 @@
22
This guide contains a quick tutorial on how to get started with DQOps using the web interface, analyze a data source, and review the data quality results.
33

44
## Sample data
5-
In the example, we will a add connection to the [BigQuery public dataset Austin Crime Data](https://console.cloud.google.com/marketplace/details/city-of-austin/austin-crime).
6-
Next, we will run and review [Basic statistics](../working-with-dqo/collecting-basic-data-statistics.md), and automatically added profiling and monitoring [data quality checks](../dqo-concepts/definition-of-data-quality-checks/index.md).
5+
In the example, we will add a **connection to a CSV file** data source.
6+
The file contains a sample of [BigQuery public dataset Austin Crime Data](https://console.cloud.google.com/marketplace/details/city-of-austin/austin-crime).
7+
8+
Next, we will run and review [Basic statistics](../working-with-dqo/collecting-basic-data-statistics.md),
9+
and automatically added profiling and monitoring [data quality checks](../dqo-concepts/definition-of-data-quality-checks/index.md).
10+
711
Finally, we will review the data quality results on the [data quality dashboards](../dqo-concepts/types-of-data-quality-dashboards.md).
812

9-
!!! note "Google BigQuery is not the only supported data source"
13+
!!! note "Diverse connection options in DQOps"
1014

11-
We are using Google BigQuery in the *getting started* guide and [DQOps use cases](../examples/index.md) because
12-
Google provides sample datasets for free. You can reproduce all steps shown in this *getting started* guide
13-
on the same sample data.
15+
The CSV file connection is used in this *getting started* guide because no additional database configuration is needed.
1416

1517
The list of [data sources supported by DQOps](../data-sources/index.md) shows the connection screens to
16-
analyze data quality of other databases. The steps to connect to a different data source are the same as described in this
17-
*getting started* guide.
18+
analyze data quality of other databases.
19+
The steps to connect to a different data source are the same as described in this *getting started* guide.
1820

1921

2022
## Steps
