Merge pull request #587 from keboola/docs-backlog-256-and-360
Docs backlog 256, 360, ...
hhanova authored Mar 5, 2024
2 parents 1ff777c + d2ea17a commit 383f0cc
Showing 155 changed files with 702 additions and 704 deletions.
16 changes: 8 additions & 8 deletions catalog/multi-project/index.md
@@ -1,5 +1,5 @@
---
-title: Multi-project Architecture
+title: Multi-Project Architecture
permalink: /catalog/multi-project/
---

@@ -11,7 +11,7 @@ and destination systems, or when some of them are very complex or difficult to s
traditional extract-load process into isolated projects.

Let's say that you have a large source database and you want to provide that to data scientists. While it is
-very easy for a data scientist to set up a database extractor and get all the data, it might not be a good
+very easy for a data scientist to set up a database data source connector and get all the data, it might not be a good
*long-term* solution -- for example:
- The database has some cryptic names that are hard to understand and need to be described.
- The database structure is constantly changing due to a migration process in progress.
@@ -29,16 +29,16 @@ anything else really.
## Example Scenario
Let's say that you have an existing Keboola project that contains:

-- Oracle database extractor with the following configurations:
+- Oracle database data source connector with the following configurations:
- `ora-history` -- The source database is over 2TB; the largest table is `op_history`, which can be easily loaded [incrementally](/storage/tables/#incremental-loading) as records are only added; it is updated every 10 minutes.
- `ora-crm` -- Tables from a CRM system. Major updates are being made to the CRM, so the structure of the tables changes often. Every two weeks, a column is renamed or a table is split into two.
- `ora-common` -- Auxiliary tables with addresses and product names that are updated 4 times a year with data from the parent company.
- `ora-is` -- Extracts a bunch of `ORA_IS_XXX` tables, which represent a port of a legacy information system where column names had to fit into a 6-character limit.
- `ora-clients` -- The table `app_clients` is updated every working day at 2:30 UTC by a script that checks all connected clients.
-- SQL Server extractor with the following configuration:
+- SQL Server data source connector with the following configuration:
- `sql-main` -- Contains the entire database of the information system; it can be extracted at any time without problems, but it contains a
`user_sessions` table with session cookies that can be used to steal a user session. Security requires that the table must not leave the SQL Server.
-- Google Analytics extractor:
+- Google Analytics data source connector:
- There are three Google Analytics accounts. One -- for the main site -- is managed by the IT department; the other two are managed by the marketing department (the IT department does not want to allow the marketing department to access the main site).
- A bunch of transformations generating reports and writers.

@@ -59,17 +59,17 @@ The following image illustrates the usage pattern change:
{: .image-popup}
![Schema -- Multi project](/catalog/multi-project/multi-project-2.png)

-- All Oracle extractors have been isolated into a separate project --- `Oracle`:
+- All Oracle data source connectors have been isolated into a separate project --- `Oracle`:
- The Oracle DBAs take care of that project and share the `Oracle` bucket outside of it.
- Most tables are shared as they are extracted, with some transformations compensating for the changes in CRM tables. The `ora-is` tables are
processed via transformations where the useful columns are renamed from their 6-character names to meaningful names, and the more obscure
columns are omitted. The Oracle DBAs decided that it's actually easier than documenting the legacy tables. All of the Oracle tables are
formally considered to be current at least every 4th hour. However, they're updated more often, which the data scientists exploit in their
[event triggered orchestrations](/orchestrator/running/#event-trigger) (see the freshness-check sketch after this list).
-- The SQL Server extractors are in the `MS` project. All the tables are shared as they are extracted using a `MSSSQL` bucket. There are no
+- The SQL Server data source connectors are in the `MS` project. All the tables are shared as they are extracted using a `MSSSQL` bucket. There are no
transformations or other components. The project is only accessible to SQL Server DBAs. Formally the tables must be up to date at least
every 4th hour. In reality, they are updated every hour.
-- Google Analytics extractors are all in a single project `GA` maintained by the marketing department. The extractor for the main site (managed by the IT) is there too; it was only [externally authorized](/components/#external-authorization) by the IT department. The marketing department is responsible for basic cleaning of the data -- the project contains some cleaning transformations and the `ga_clean` bucket is shared to the organization. The tables in the bucket are updated every 10 minutes, though the formal requirement is that they are updated at least every 4th hour.
+- Google Analytics data source connectors are all in a single project `GA` maintained by the marketing department. The data source connector for the main site (managed by the IT) is there too; it was only [externally authorized](/components/#external-authorization) by the IT department. The marketing department is responsible for basic cleaning of the data -- the project contains some cleaning transformations and the `ga_clean` bucket is shared to the organization. The tables in the bucket are updated every 10 minutes, though the formal requirement is that they are updated at least every 4th hour.
- Everything else is in the project `Reporting`. It is possible to further split it into, e.g., a project taking care of newsletters and reporting.
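
The freshness requirements mentioned above (tables current at least every 4th hour) can be verified by a consuming project before it runs its own pipeline. Below is a minimal sketch of such a check; the Storage API endpoint path, the `X-StorageApi-Token` header, and the `lastImportDate` field are assumptions made for illustration -- verify them against the Storage API documentation for your stack.

```python
import os
from datetime import datetime, timedelta, timezone

import requests

# Assumed endpoint and field names -- not a documented contract.
STORAGE_API = "https://connection.keboola.com/v2/storage"
TOKEN = os.environ["KBC_STORAGE_TOKEN"]
MAX_AGE = timedelta(hours=4)  # the formal freshness requirement from the scenario


def is_fresh(table_id: str) -> bool:
    """Return True if the shared table was imported within the last 4 hours."""
    response = requests.get(
        f"{STORAGE_API}/tables/{table_id}",
        headers={"X-StorageApi-Token": TOKEN},
        timeout=30,
    )
    response.raise_for_status()
    # lastImportDate is assumed to be an ISO-8601 timestamp with a UTC offset.
    last_import = datetime.fromisoformat(response.json()["lastImportDate"])
    return datetime.now(timezone.utc) - last_import <= MAX_AGE


# Hypothetical table ID of the linked Oracle history table.
print(is_fresh("in.c-oracle.op_history"))
```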

While the above schema may look more complex and difficult to set up, it actually simplifies things a lot. It leads to a very simple
6 changes: 3 additions & 3 deletions components/branches/index.md
@@ -33,17 +33,17 @@ with the running configurations. There is no need to duplicate your project's da

### Data Pipelines

-When you create an extractor and then transform the data it produces using transformation it behaves the following way in branches:
+When you create a data source connector and then transform the data it produces using a transformation, it behaves the following way in branches:

-In production, you created an extractor that extracts your website requests data to a bucket called `in.c-requests`. Then you create a transformation that takes the data from `in.c-requests` and transforms it into aggregated visits to a bucket named `out.c-visits`. You've already executed the pipeline multiple times, so both buckets have production data in them.
+In production, you created a data source connector that extracts your website requests data to a bucket called `in.c-requests`. Then you create a transformation that takes the data from `in.c-requests` and transforms it into aggregated visits to a bucket named `out.c-visits`. You've already executed the pipeline multiple times, so both buckets have production data in them.

Now when you switch to a new branch and run the transformation, it will load the input data from `in.c-requests` and transform it. When it's about to write it back to storage, it will automatically prefix the output bucket with the ID of the branch - `out.c-1234-visits`. Your production data is left intact in `out.c-visits`.

<div class="alert alert-info" markdown="1">
The bucket name is automatically prefixed with the branch's numeric ID when a job writes to storage in a development branch.
</div>

-Now you run the extractor in the branch. It stores the data in a bucket prefixed with branch ID - `in.c-1234-requests`. You production data is again left intact in `in.c-requests`.
+Now you run the data source connector in the branch. It stores the data in a bucket prefixed with the branch ID - `in.c-1234-requests`. Your production data is again left intact in `in.c-requests`.
But when you now run the transformation, it will automatically check whether you have a branch version of the source bucket `in.c-requests`. Because you do have `in.c-1234-requests`, it will load the data from there.
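
As an illustration of this mapping (a simplified model, not Keboola's actual implementation), the branch behavior boils down to two rules: writes always target the branch-prefixed bucket, and reads prefer the branch-prefixed bucket when one exists:

```python
def branch_bucket(bucket: str, branch_id: int) -> str:
    """Prefix a bucket name with the branch ID, e.g. out.c-visits -> out.c-1234-visits."""
    stage, name = bucket.split(".c-", 1)
    return f"{stage}.c-{branch_id}-{name}"


def resolve_read_bucket(bucket: str, branch_id: int, existing: set[str]) -> str:
    """Read from the branch copy if it exists, otherwise fall back to production."""
    candidate = branch_bucket(bucket, branch_id)
    return candidate if candidate in existing else bucket


buckets = {"in.c-requests", "out.c-visits", "in.c-1234-requests"}
print(branch_bucket("out.c-visits", 1234))                  # out.c-1234-visits (write target)
print(resolve_read_bucket("in.c-requests", 1234, buckets))  # in.c-1234-requests (branch data exists)
print(resolve_read_bucket("out.c-visits", 1234, buckets))   # out.c-visits (no branch copy yet)
```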

<div class="alert alert-info" markdown="1">
16 changes: 8 additions & 8 deletions components/extractors/communication/email-attachments/index.md
@@ -8,16 +8,16 @@ redirect_from:
* TOC
{:toc}

-This extractor allows you to import data from email attachments to Keboola.
+This data source connector allows you to import data from email attachments to Keboola.
It extracts data from systems generating exports only as CSV files attached to an email.
It can also be used instead of repeated [manual imports of CSV](/tutorial/load/) files.

-Tables only get imported with the extractor running. The import is **not** triggered by an email
-being sent or received. When running, the extractor imports all emails received since its previous run.
-Therefore, it is a good idea to set up the extractor in a [**scheduled** orchestration](/orchestrator/running/#time-schedule).
+Tables only get imported with the data source connector running. The import is **not** triggered by an email
+being sent or received. When running, the data source connector imports all emails received since its previous run.
+Therefore, it is a good idea to set up the data source connector in a [**scheduled** orchestration](/orchestrator/running/#time-schedule).

## Configuration
-[Create a new configuration](/components/#creating-component-configuration) of the **Email Attachments** extractor.
+[Create a new configuration](/components/#creating-component-configuration) of the **Email Attachments** data source connector.
Each configuration corresponds to a single target email address. If you need
to import emails into different tables, it is advisable to create more configurations.
An email address for sending attachments will be generated when the configuration is created. Use it to send
@@ -48,11 +48,11 @@ The following conditions must be met:
Click **Run** and confirm.

If the extraction is successful, you will be able to check the processed data in the imported table by clicking on the link.
-There may be a delay between the time the email is sent, received, and picked up by the extractor.
+There may be a delay between the time the email is sent, received, and picked up by the data source connector.

{: .image-popup}
![Screenshot - Job Detail](/components/extractors/communication/email-attachments/email-attachments-2.png)

-*Note: When multiple valid emails are received between the extractor runs, they are imported into separate tables
-(`data1` -- `dataN`). If this is not desired, time the sending of the emails and configure the extractor orchestration
+***Note:** When multiple valid emails are received between the data source connector runs, they are imported into separate tables
+(`data1` -- `dataN`). If this is not desired, time the sending of the emails and configure the connector orchestration
to make sure only one email is processed at a time.*
16 changes: 8 additions & 8 deletions components/extractors/communication/email-imap/index.md
@@ -8,7 +8,7 @@ redirect_from:
* TOC
{:toc}

-This extractor allows you to automatically retrieve email contents and/or it's attachments via the [IMAP protocol](https://en.wikipedia.org/wiki/Internet_Message_Access_Protocol).
+This data source connector allows you to automatically retrieve email contents and/or its attachments via the [IMAP protocol](https://en.wikipedia.org/wiki/Internet_Message_Access_Protocol).
It supports incremental loads and IMAP queries to define specific criteria.

The IMAP protocol provides several advantages:
@@ -33,7 +33,7 @@ The IMAP protocol provides several advantages:
| Processors support | Use a processor to modify the outputs before saving them to storage, e.g., to process attachments to be stored in Tabular Storage |


-## Getting started
+## Getting Started


Have the IMAP service enabled on your email account. You will need the IMAP credentials (username, password) and the hostname and port of the IMAP server.
@@ -43,9 +43,9 @@ Please refer to your email provider for more information.
Note that the app fetches emails from the root `INBOX` folder. If you use labels and filters (in Gmail, for instance) that move the messages to a different folder,
please set the `imap_folder` configuration parameter.

-### Example Using GMAIL account
+### Example Using GMAIL Account

-- Enable and create [App Password](https://support.google.com/accounts/answer/185833?hl=en) that will be specific for your integration. Name it for instance as `Keboola Extractor`
+- Enable and create [App Password](https://support.google.com/accounts/answer/185833?hl=en) that will be specific for your integration. Name it for instance as `Keboola data source connector`
- Fill in your email address in the `Username` field.
- Fill in your generated App Password in the `Password` field.
- Fill in the Gmail IMAP address `imap.gmail.com` in the `IMAP host` field.
@@ -65,7 +65,7 @@ Fill in the `Username`, `Password` and the `Hostname` and `Port` of your provide

Click the `Add Row` button and name the row accordingly.

-### Search query
+### Search Query

Fill in a `Search query` to filter only the emails you want. By default, all emails are downloaded. The most common use case would be to filter the emails
by the subject and sender, e.g., `(FROM "sender-email@example.com" SUBJECT "the subject")`. You can create much more complex queries if needed.
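
The same search syntax can be tried outside Keboola with any IMAP client. For example, a short sketch using Python's standard `imaplib` (the host, credentials, and query string below are placeholders):

```python
import imaplib

# Placeholder host and credentials -- use your own IMAP server details.
imap = imaplib.IMAP4_SSL("imap.gmail.com", 993)
imap.login("user@example.com", "app-password")
imap.select("INBOX", readonly=True)

# The same criteria format as the connector's `Search query` field.
status, data = imap.search(None, '(FROM "sender-email@example.com" SUBJECT "the subject")')
message_ids = data[0].split()
print(f"{len(message_ids)} matching message(s)")

imap.logout()
```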
@@ -78,11 +78,11 @@ Refer to the [query syntax](query-syntax) for more examples.

The folder to get the emails from. Defaults to the root folder `INBOX`. In Gmail, for example, a label corresponds to a folder.

-### Mark seen
+### Mark as Seen

When checked, emails that have been extracted will be marked as seen in the inbox.

-### Period from date
+### Period from Date

Use this field to filter only emails received since the specified date. This field supports fixed dates in the `YYYY-MM-DD` format as well as
relative date periods, e.g., `yesterday`, `1 month ago`, `2 days ago`, etc. We recommend setting this to cover some safety interval, for example `2 days ago` when
@@ -92,7 +92,7 @@ scheduled to run every day. The data is always upserted incrementally, so there

Check this option to download email content.

-### Download attachments
+### Download Attachments

When set to true, the attachments will also be downloaded. You may use a regex pattern to filter only the attachments matching your definition.

8 changes: 4 additions & 4 deletions components/extractors/communication/gmail/index.md
@@ -8,10 +8,10 @@ redirect_from:
* TOC
{:toc}

-The Gmail Messages extractor allows you to fetch data from your Gmail account.
+The Gmail Messages data source connector allows you to fetch data from your Gmail account.

## Authorization
-[Create a new configuration](/components/#creating-component-configuration) of the **Gmail Messages** extractor.
+[Create a new configuration](/components/#creating-component-configuration) of the **Gmail Messages** connector.
Then click **Authorize Account** to [authorize the configuration](/components/#authorization).
**Your inbox is accessed as read only.**

@@ -52,7 +52,7 @@ then see it in the list of enabled APIs.
![Screenshot - Google API Console - Fill Credentials](/components/extractors/communication/gmail/google_console_detail.png)

6. Click **Create** and a pop-up window will display your new client ID and client secret credentials.
-7. You can now use these credentials in the **Custom Authorization** tab when authorizing the Google Analytics extractor.
+7. You can now use these credentials in the **Custom Authorization** tab when authorizing the Google Analytics connector.

{: .image-popup}
![Screenshot - Custom Authorization](/components/extractors/communication/gmail/custom-credentials.png)
@@ -70,7 +70,7 @@ For more detailed information about querying, follow Google's [Advanced Search](
Don't forget to **Save** the configuration.

## Produced Tables
-Data are always imported incrementally. The extractor produces several tables that can be joined together.
+Data are always imported incrementally. The data source connector produces several tables that can be joined together.

### Queries
Queries and their messages; it is good to know which query a message came from.
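
For instance, the queries table can be joined to the messages table on a shared key. The sketch below is purely illustrative -- the file names and the `query_id` join column are assumptions, not the connector's documented schema; check the actual produced tables in Storage for the real column names.

```python
import pandas as pd

# Hypothetical file names and join key -- adjust to the tables actually produced.
queries = pd.read_csv("in/tables/queries.csv")
messages = pd.read_csv("in/tables/messages.csv")

# Attach the originating query to each message.
messages_with_query = messages.merge(queries, on="query_id", how="left")
print(messages_with_query.head())
```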
4 changes: 2 additions & 2 deletions components/extractors/communication/google-calendar/index.md
@@ -8,12 +8,12 @@ redirect_from:
* TOC
{:toc}

-The Google Calendar extractor uses the [Google Calendar API](https://developers.google.com/calendar/) to download all
+The Google Calendar data source connector uses the [Google Calendar API](https://developers.google.com/calendar/) to download all
calendars available in your account, including events and their details (organizer, location, attendees, reminders,
notifications, attachments, description, etc.). It can also be used for downloading Google-owned calendars with all national holidays.

## Create New Configuration
-[Create a new configuration](/components/#creating-component-configuration) of the **Google Calendar** extractor.
+[Create a new configuration](/components/#creating-component-configuration) of the **Google Calendar** connector.
Then click **Authorize Account** to [authorize the configuration](/components/#authorization).

{: .image-popup}
6 changes: 3 additions & 3 deletions components/extractors/communication/index.md
@@ -1,13 +1,13 @@
---
-title: Communication Extractors
+title: Communication Data Source Connectors
permalink: /components/extractors/communication/
redirect_from:
- /extractors/communication/

---

-Extractors import data from external sources and integrate it to the Keboola environment.
-The following extractors support communication systems:
+Data source connectors import data from external sources and integrate it into the Keboola environment.
+The following data source connectors support communication systems:

- [Email Attachments](/components/extractors/communication/email-attachments/)
- [Gmail Messages](/components/extractors/communication/gmail/)
8 changes: 4 additions & 4 deletions components/extractors/communication/intercom/index.md
@@ -8,11 +8,11 @@ redirect_from:
* TOC
{:toc}

-This extractor fetches data from [Intercom](https://www.intercom.com/).
+This data source connector fetches data from [Intercom](https://www.intercom.com/).

## Configuration
Before you start, have a working Intercom account with a plan (a trial will work as well).
-[Create a new configuration](/components/#creating-component-configuration) of the **Intercom** extractor.
+[Create a new configuration](/components/#creating-component-configuration) of the **Intercom** connector.
Then click **Authorize Account** to [authorize the configuration](/components/#authorization).

Choose one of the configuration templates available: **Basic** or **Conversations** and **Save** the configuration.
@@ -343,8 +343,8 @@ This table lists all [Conversation parts](https://developers.intercom.com/interc
| `created_at` | The time the conversation part was created |
| `updated_at` | The last time the conversation part was updated |
| `notified_at` | The time the user was notified with the conversation part |
-| `assigned_to_type` | The tyoe of the admin that the conversation is assigned to |
-| `assigned_to_id` | The id of the admin that the conversation is assigned to (not null only when part_type: assignment) |
+| `assigned_to_type` | The type of the admin that the conversation is assigned to |
+| `assigned_to_id` | The ID of the admin that the conversation is assigned to (not null only when part_type: assignment) |
| `author_type` | The user or admin type that created the part |
| `author_id` | The user or admin ID that created the part |
| `external_id` | Undocumented |
6 changes: 2 additions & 4 deletions components/extractors/communication/ms-outlook/index.md
@@ -8,12 +8,10 @@ redirect_from:
* TOC
{:toc}

-Microsoft Outlook extractor for Office 365
-
-This extractor is based on IMAP. It allows you to download emails and their attachments from Office 365 accounts.
+The Microsoft Outlook data source connector for Office 365 is based on IMAP. It allows you to download emails and their attachments from Office 365 accounts.

## Authorization
-[Create a new configuration](/components/#creating-component-configuration) of the **MS Outlook** extractor.
+[Create a new configuration](/components/#creating-component-configuration) of the **MS Outlook** connector.
Then click **Authorize Account** to [authorize the configuration](/components/#authorization).

## IMAP Settings
