Rewrite Recap
This commit is pretty much a complete rewrite. I was really unhappy with how
complicated things had gotten with Recap, especially the Python API. I've
rewritten things to add:

* A very simple REPL
* A FastAPI-like metadata crawling API
* A basic data catalog
* A basic crawler
* A storage layer with a graph-like API

These changes make it much easier to work with Recap in Python. They also lay
the groundwork for the complex schema conversion features that I want to write.

There's way too much to document in this commit message, so see the updated
docs for more information.
criccomini committed Feb 27, 2023
1 parent f365a45 commit f2d40c9
Showing 66 changed files with 2,672 additions and 2,824 deletions.
107 changes: 84 additions & 23 deletions README.md
Original file line number Diff line number Diff line change
@@ -4,7 +4,7 @@
</h1>

<p align="center">
<i>A dead simple data catalog for engineers</i>
<i>A metadata toolkit written in Python</i>
</p>

<p align="center">
@@ -14,40 +14,101 @@
<a href="https://github.com/PyCQA/pylint"><img alt="pylint" src="https://img.shields.io/badge/linting-pylint-yellowgreen"></a>
</p>

## About
# About

Recap makes it easy for engineers to build infrastructure and tools that need metadata. Unlike traditional data catalogs, Recap is designed to power software. Read [Recap: A Data Catalog for People Who Hate Data Catalogs](https://cnr.sh/essays/recap-for-people-who-hate-data-catalogs) to learn more.
Recap is a Python library that helps you build tools for data quality, data governance, data profiling, data lineage, data contracts, and schema conversion.

## Features

* Supports major cloud data warehouses and Postgres
* No external system dependencies required
* Designed for the [CLI](https://docs.recap.cloud/latest/cli/)
* Runs as a [Python API](https://docs.recap.cloud/latest/api/recap.analyzers/) or [REST API](https://docs.recap.cloud/latest/rest/)
* Fully [pluggable](https://docs.recap.cloud/latest/guides/plugins/)
* Compatible with [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) filesystems and [SQLAlchemy](https://www.sqlalchemy.org) databases.
* Built-in support for [Parquet](https://parquet.apache.org), CSV, TSV, and JSON files.
* Includes [Pandas](https://pandas.pydata.org) for data profiling.
* Uses [Pydantic](https://pydantic.dev) for metadata models.
* Convenient [CLI](cli.md), [Python API](api/recap.analyzers.md), and [REST API](rest.md)
* No external system dependencies.

## Installation

pip install recap-core

## Commands
## Usage

* `recap catalog list` - List a data catalog directory.
* `recap catalog read` - Read metadata from the data catalog.
* `recap catalog search` - Search the data catalog for metadata.
* `recap crawl` - Crawl infrastructure and save metadata in the data catalog.
* `recap plugins analyzers` - List all analyzer plugins.
* `recap plugins browsers` - List all browser plugins.
* `recap plugins catalogs` - List all catalog plugins.
* `recap plugins commands` - List all command plugins.
* `recap serve` - Start Recap's REST API.
Grab schemas from filesystems:

## Getting Started
```python
schema("s3://corp-logs/2022-03-01/0.json")
```

And databases:

```python
schema("snowflake://ycbjbzl-ib10693/TEST_DB/PUBLIC/311_service_requests")
```

In a standardized format:

```json
{
  "fields": [
    {
      "name": "unique_key",
      "type": "VARCHAR",
      "nullable": false,
      "comment": "The service request tracking number."
    },
    {
      "name": "complaint_description",
      "type": "VARCHAR",
      "nullable": true,
      "comment": "Service request type"
    }
  ]
}
```
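Because the output is plain JSON, downstream tools can consume it with nothing but a JSON parser. A small standalone sketch (not part of Recap) that pulls the required, non-nullable fields out of the schema above:

```python
import json

# The standardized schema format shown above, inlined for the example
schema_json = """
{
  "fields": [
    {"name": "unique_key", "type": "VARCHAR", "nullable": false,
     "comment": "The service request tracking number."},
    {"name": "complaint_description", "type": "VARCHAR", "nullable": true,
     "comment": "Service request type"}
  ]
}
"""

schema = json.loads(schema_json)

# Non-nullable fields are effectively required columns
required = [f["name"] for f in schema["fields"] if not f["nullable"]]
print(required)  # ['unique_key']
```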

See what schemas used to look like:

```python
schema("snowflake://ycbjbzl-ib10693/TEST_DB/PUBLIC/311_service_requests", datetime(2023, 1, 1))
```

Build metadata extractors:

```python
@registry.metadata("s3://{path:path}.json", include_df=True)
@registry.metadata("bigquery://{project}/{dataset}/{table}", include_df=True)
def pandas_describe(df: DataFrame, *_) -> BaseModel:
    # describe() returns a DataFrame; convert it to a dict before validation
    description_dict = df.describe(include="all").to_dict()
    return PandasDescription.parse_obj(description_dict)
```
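The extractor above hinges on the shape of `df.describe()` output. A quick standalone look at that shape using pandas alone (no Recap required; the `requests` column is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"requests": [1, 2, 3, 4]})

# describe() returns a DataFrame of summary stats;
# to_dict() yields {column_name: {stat_name: value}}
description_dict = df.describe(include="all").to_dict()
print(description_dict["requests"]["mean"])  # 2.5
```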

See the [Quickstart](https://docs.recap.cloud/latest/quickstart/) page to get started.
Crawl your data:

## Warning
```python
crawl("s3://corp-logs")
crawl("bigquery://floating-castle-728053")
```

> ⚠️ This package is still under development and may not be stable. The API may break at any time.
And read the results:

```python
search("json_extract(metadata_obj, '$.count') > 9999", PandasDescription)
```

See where data comes from:

```python
writers("bigquery://floating-castle-728053/austin_311/311_service_requests")
```

And where it's going:

```python
readers("bigquery://floating-castle-728053/austin_311/311_service_requests")
```

All cached in Recap's catalog.

## Getting Started

Recap is still a little baby application. It's going to wake up crying in the middle of the night. It's going to vomit on the floor once in a while. But if you give it some love and care, it'll be worth it. As time goes on, it'll grow up and be more mature. Bear with it.
See the [Quickstart](quickstart.md) page to get started.
1 change: 0 additions & 1 deletion docs/api/recap.analyzers.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/api/recap.browsers.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/api/recap.catalog.md
@@ -0,0 +1 @@
::: recap.catalog
1 change: 0 additions & 1 deletion docs/api/recap.catalogs.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/api/recap.crawler.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/api/recap.integrations.md
@@ -0,0 +1 @@
::: recap.integrations
1 change: 1 addition & 0 deletions docs/api/recap.metadata.md
@@ -0,0 +1 @@
::: recap.metadata
1 change: 0 additions & 1 deletion docs/api/recap.paths.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/api/recap.repl.md
@@ -0,0 +1 @@
::: recap.repl
1 change: 1 addition & 0 deletions docs/api/recap.storage.md
@@ -0,0 +1 @@
::: recap.storage
2 changes: 1 addition & 1 deletion docs/cli.md
@@ -1,6 +1,6 @@
# Recap CLI

Execute Recap's CLI using the `recap` command. Recap's CLI is pluggable, so the `recap` command will have subcommands for each plugin you've installed. By default, Recap ships with the following command plugins.
Execute Recap's CLI using the `recap` command. The CLI allows you to crawl, search, and read metadata from live systems (using `--refresh`) and Recap's catalog.

::: mkdocs-typer
:module: recap.cli
39 changes: 18 additions & 21 deletions docs/guides/configuration.md
@@ -1,31 +1,28 @@
Though Recap's CLI can run without any configuration, you might want to configure Recap using a config file. Recap uses [Dynaconf](https://www.dynaconf.com/) for its configuration system.
Though Recap's CLI can run without any configuration, you might want to configure Recap. Recap uses Pydantic's [BaseSettings](https://docs.pydantic.dev/usage/settings/) class for its configuration system.

## Config Locations
## Configs

Configuration is stored in `~/.recap/settings.toml` by default. You can override the default location by setting the `SETTINGS_FILE_FOR_DYNACONF` environment variable. See Dynaconf's [documentation](https://www.dynaconf.com/configuration/#envvar) for more information.
See Recap's [config.py](https://github.com/recap-cloud/recap/blob/main/recap/config.py) for all available configuration parameters.

## Schema
Commonly set environment variables include:

Recap's `settings.toml` has two main sections: `catalog` and `crawlers`.
```bash
RECAP_STORAGE_SETTINGS__URL=http://localhost:8000/storage
RECAP_LOGGING_CONFIG_FILE=/tmp/logging.toml
```

* The `catalog` section configures the storage layer; it uses SQLite by default. Run `recap plugins catalogs` to see other options.
* The `crawlers` section defines infrastructure to crawl. Only the `url` field is required. You may optionally specify analyzer `excludes` and path `filters` as well.
!!! note

```toml
[catalog]
plugin = "recap"
url = "http://localhost:8000"
Note the double-underscore (_dunder_) in the `URL` environment variable. This is a common way to set nested dictionary and object values in Pydantic's `BaseSettings` classes. You can also set JSON objects like `RECAP_STORAGE_SETTINGS='{"url": "http://localhost:8000/storage"}'`. See Pydantic's [settings management](https://docs.pydantic.dev/usage/settings/) page for more information.

[[crawlers]]
url = "postgresql://username@localhost/some_db"
excludes = [
"sqlalchemy.profile"
]
filters = [
"/**/tables/some_table"
]
```
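To see how the double-underscore convention in the note maps environment variables onto nested settings, here's a toy re-implementation of the parsing. Pydantic does this internally; this sketch only illustrates the mechanics:

```python
import os

def nested_from_env(prefix: str, delimiter: str = "__") -> dict:
    """Mimic dunder env parsing: RECAP_STORAGE_SETTINGS__URL
    becomes {"storage_settings": {"url": ...}}."""
    result: dict = {}
    for key, value in os.environ.items():
        if not key.startswith(prefix):
            continue
        # Strip the prefix, then split nested keys on the delimiter
        parts = key[len(prefix):].lower().split(delimiter)
        node = result
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return result

os.environ["RECAP_STORAGE_SETTINGS__URL"] = "http://localhost:8000/storage"
settings = nested_from_env("RECAP_")
print(settings["storage_settings"]["url"])  # http://localhost:8000/storage
```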
## Dotenv

Recap supports [.env](https://www.dotenv.org) files to manage environment variables. Simply create a `.env` in your current working directory and use Recap as usual. Pydantic handles the rest.
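For example, a `.env` file in your working directory might look like this (values are illustrative, mirroring the environment variables above):

```shell
# .env — loaded automatically by Pydantic when Recap starts
RECAP_STORAGE_SETTINGS__URL=http://localhost:8000/storage
RECAP_LOGGING_CONFIG_FILE=/tmp/logging.toml
```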

## Home

`RECAP_HOME` defines where Recap looks for storage and secret files. By default, `RECAP_HOME` is set to `~/.recap`.

## Secrets

Do not store database credentials in your `settings.toml`; use Dynaconf's secret management instead. See Dynaconf's [documentation](https://www.dynaconf.com/secrets/) for more information.
You can set environment variables with secrets in them using Pydantic's [secret handling mechanism](https://docs.pydantic.dev/usage/settings/#secret-support). By default, Recap looks for secrets in `$RECAP_HOME/.secrets`.
4 changes: 2 additions & 2 deletions docs/guides/logging.md
@@ -2,7 +2,7 @@ Recap uses Python's standard [logging](https://docs.python.org/3/library/logging

## Customizing

You can customize Recap's log output. Set the `logging.config.path` value in your `settings.toml` file to point at a [TOML](https://toml.io) file that conforms to Python's [dictConfig](https://docs.python.org/3/library/logging.config.html#logging-config-dictschema) schema.
You can customize Recap's log output. Set the `RECAP_LOGGING_CONFIG_FILE` environment variable to point to a [TOML](https://toml.io) file that conforms to Python's [dictConfig](https://docs.python.org/3/library/logging.config.html#logging-config-dictschema) schema.

```toml
version = 1
@@ -30,4 +30,4 @@ propagate = false
handlers = ['default']
level = "INFO"
propagate = false
```
```
103 changes: 0 additions & 103 deletions docs/guides/plugins.md

This file was deleted.

