Rewrite Recap
This commit is pretty much a complete rewrite. I was really unhappy with how
complicated things had gotten with Recap, especially the Python API. I've
rewritten things to add:

* A very simple REPL
* A FastAPI-like metadata crawling API
* A basic data catalog
* A basic crawler
* A storage layer with a graph-like API

These changes make it much easier to work with Recap in Python. They also lay
the groundwork for the complex schema conversion features that I want to write.

There's way too much to document in this commit message, so see the updated
docs for more information.
criccomini committed Feb 27, 2023
1 parent f365a45 commit f2d40c9
Showing 66 changed files with 2,672 additions and 2,824 deletions.
107 changes: 84 additions & 23 deletions README.md
Original file line number Diff line number Diff line change
@@ -4,7 +4,7 @@
</h1>

<p align="center">
<i>A dead simple data catalog for engineers</i>
<i>A metadata toolkit written in Python</i>
</p>

<p align="center">
@@ -14,40 +14,101 @@
<a href="https://github.com/PyCQA/pylint"><img alt="pylint" src="https://img.shields.io/badge/linting-pylint-yellowgreen"></a>
</p>

## About
# About

Recap makes it easy for engineers to build infrastructure and tools that need metadata. Unlike traditional data catalogs, Recap is designed to power software. Read [Recap: A Data Catalog for People Who Hate Data Catalogs](https://cnr.sh/essays/recap-for-people-who-hate-data-catalogs) to learn more.
Recap is a Python library that helps you build tools for data quality, data governance, data profiling, data lineage, data contracts, and schema conversion.

## Features

* Supports major cloud data warehouses and Postgres
* No external system dependencies required
* Designed for the [CLI](https://docs.recap.cloud/latest/cli/)
* Runs as a [Python API](https://docs.recap.cloud/latest/api/recap.analyzers/) or [REST API](https://docs.recap.cloud/latest/rest/)
* Fully [pluggable](https://docs.recap.cloud/latest/guides/plugins/)
* Compatible with [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) filesystems and [SQLAlchemy](https://www.sqlalchemy.org) databases.
* Built-in support for [Parquet](https://parquet.apache.org), CSV, TSV, and JSON files.
* Includes [Pandas](https://pandas.pydata.org) for data profiling.
* Uses [Pydantic](https://pydantic.dev) for metadata models.
* Convenient [CLI](cli.md), [Python API](api/recap.analyzers.md), and [REST API](rest.md)
* No external system dependencies.

## Installation

pip install recap-core

## Commands
## Usage

* `recap catalog list` - List a data catalog directory.
* `recap catalog read` - Read metadata from the data catalog.
* `recap catalog search` - Search the data catalog for metadata.
* `recap crawl` - Crawl infrastructure and save metadata in the data catalog.
* `recap plugins analyzers` - List all analyzer plugins.
* `recap plugins browsers` - List all browser plugins.
* `recap plugins catalogs` - List all catalog plugins.
* `recap plugins commands` - List all command plugins.
* `recap serve` - Start Recap's REST API.
Grab schemas from filesystems:

## Getting Started
```python
schema("s3://corp-logs/2022-03-01/0.json")
```

And databases:

```python
schema("snowflake://ycbjbzl-ib10693/TEST_DB/PUBLIC/311_service_requests")
```

In a standardized format:

```json
{
  "fields": [
    {
      "name": "unique_key",
      "type": "VARCHAR",
      "nullable": false,
      "comment": "The service request tracking number."
    },
    {
      "name": "complaint_description",
      "type": "VARCHAR",
      "nullable": true,
      "comment": "Service request type"
    }
  ]
}
```
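Because the output is plain JSON, downstream tools can consume it with nothing but a JSON parser. A small standalone sketch (not part of Recap) that pulls the required, non-nullable fields out of the schema above:

```python
import json

# The standardized schema format shown above, inlined for the example
schema_json = """
{
  "fields": [
    {"name": "unique_key", "type": "VARCHAR", "nullable": false,
     "comment": "The service request tracking number."},
    {"name": "complaint_description", "type": "VARCHAR", "nullable": true,
     "comment": "Service request type"}
  ]
}
"""

schema = json.loads(schema_json)

# Non-nullable fields are effectively required columns
required = [f["name"] for f in schema["fields"] if not f["nullable"]]
print(required)  # ['unique_key']
```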

See what schemas used to look like:

```python
schema("snowflake://ycbjbzl-ib10693/TEST_DB/PUBLIC/311_service_requests", datetime(2023, 1, 1))
```

Build metadata extractors:

```python
@registry.metadata("s3://{path:path}.json", include_df=True)
@registry.metadata("bigquery://{project}/{dataset}/{table}", include_df=True)
def pandas_describe(df: DataFrame, *_) -> BaseModel:
    # describe() returns a DataFrame; convert it to a dict before validation
    description_dict = df.describe(include="all").to_dict()
    return PandasDescription.parse_obj(description_dict)
```
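The extractor above hinges on the shape of `df.describe()` output. A quick standalone look at that shape using pandas alone (no Recap required; the `requests` column is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"requests": [1, 2, 3, 4]})

# describe() returns a DataFrame of summary stats;
# to_dict() yields {column_name: {stat_name: value}}
description_dict = df.describe(include="all").to_dict()
print(description_dict["requests"]["mean"])  # 2.5
```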

See the [Quickstart](https://docs.recap.cloud/latest/quickstart/) page to get started.
Crawl your data:

## Warning
```python
crawl("s3://corp-logs")
crawl("bigquery://floating-castle-728053")
```

> ⚠️ This package is still under development and may not be stable. The API may break at any time.
And read the results:

```python
search("json_extract(metadata_obj, '$.count') > 9999", PandasDescription)
```

See where data comes from:

```python
writers("bigquery://floating-castle-728053/austin_311/311_service_requests")
```

And where it's going:

```python
readers("bigquery://floating-castle-728053/austin_311/311_service_requests")
```

All cached in Recap's catalog.

## Getting Started

Recap is still a little baby application. It's going to wake up crying in the middle of the night. It's going to vomit on the floor once in a while. But if you give it some love and care, it'll be worth it. As time goes on, it'll grow up and be more mature. Bear with it.
See the [Quickstart](quickstart.md) page to get started.
1 change: 0 additions & 1 deletion docs/api/recap.analyzers.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/api/recap.browsers.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/api/recap.catalog.md
@@ -0,0 +1 @@
::: recap.catalog
1 change: 0 additions & 1 deletion docs/api/recap.catalogs.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/api/recap.crawler.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/api/recap.integrations.md
@@ -0,0 +1 @@
::: recap.integrations
1 change: 1 addition & 0 deletions docs/api/recap.metadata.md
@@ -0,0 +1 @@
::: recap.metadata
1 change: 0 additions & 1 deletion docs/api/recap.paths.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/api/recap.repl.md
@@ -0,0 +1 @@
::: recap.repl
1 change: 1 addition & 0 deletions docs/api/recap.storage.md
@@ -0,0 +1 @@
::: recap.storage
2 changes: 1 addition & 1 deletion docs/cli.md
@@ -1,6 +1,6 @@
# Recap CLI

Execute Recap's CLI using the `recap` command. Recap's CLI is pluggable, so the `recap` command will have subcommands for each plugin you've installed. By default, Recap ships with the following command plugins.
Execute Recap's CLI using the `recap` command. The CLI allows you to crawl, search, and read metadata from live systems (using `--refresh`) and Recap's catalog.

::: mkdocs-typer
:module: recap.cli
39 changes: 18 additions & 21 deletions docs/guides/configuration.md
@@ -1,31 +1,28 @@
Though Recap's CLI can run without any configuration, you might want to configure Recap using a config file. Recap uses [Dynaconf](https://www.dynaconf.com/) for its configuration system.
Though Recap's CLI can run without any configuration, you might want to configure Recap. Recap uses Pydantic's [BaseSettings](https://docs.pydantic.dev/usage/settings/) class for its configuration system.

## Config Locations
## Configs

Configuration is stored in `~/.recap/settings.toml` by default. You can override the default location by setting the `SETTINGS_FILE_FOR_DYNACONF` environment variable. See Dynaconf's [documentation](https://www.dynaconf.com/configuration/#envvar) for more information.
See Recap's [config.py](https://github.com/recap-cloud/recap/blob/main/recap/config.py) for all available configuration parameters.

## Schema
Commonly set environment variables include:

Recap's `settings.toml` has two main sections: `catalog` and `crawlers`.
```bash
RECAP_STORAGE_SETTINGS__URL=http://localhost:8000/storage
RECAP_LOGGING_CONFIG_FILE=/tmp/logging.toml
```

* The `catalog` section configures the storage layer; it uses SQLite by default. Run `recap plugins catalogs` to see other options.
* The `crawlers` section defines infrastructure to crawl. Only the `url` field is required. You may optionally specify analyzer `excludes` and path `filters` as well.
!!! note

```toml
[catalog]
plugin = "recap"
url = "http://localhost:8000"
Note the double-underscore (_dunder_) in the `URL` environment variable. This is a common way to set nested dictionary and object values in Pydantic's `BaseSettings` classes. You can also set JSON objects like `RECAP_STORAGE_SETTINGS='{"url": "http://localhost:8000/storage"}'`. See Pydantic's [settings management](https://docs.pydantic.dev/usage/settings/) page for more information.

[[crawlers]]
url = "postgresql://username@localhost/some_db"
excludes = [
"sqlalchemy.profile"
]
filters = [
"/**/tables/some_table"
]
```
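To see how the double-underscore convention in the note maps environment variables onto nested settings, here's a toy re-implementation of the parsing. Pydantic does this internally; this sketch only illustrates the mechanics:

```python
import os

def nested_from_env(prefix: str, delimiter: str = "__") -> dict:
    """Mimic dunder env parsing: RECAP_STORAGE_SETTINGS__URL
    becomes {"storage_settings": {"url": ...}}."""
    result: dict = {}
    for key, value in os.environ.items():
        if not key.startswith(prefix):
            continue
        # Strip the prefix, then split nested keys on the delimiter
        parts = key[len(prefix):].lower().split(delimiter)
        node = result
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return result

os.environ["RECAP_STORAGE_SETTINGS__URL"] = "http://localhost:8000/storage"
settings = nested_from_env("RECAP_")
print(settings["storage_settings"]["url"])  # http://localhost:8000/storage
```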
## Dotenv

Recap supports [.env](https://www.dotenv.org) files to manage environment variables. Simply create a `.env` in your current working directory and use Recap as usual. Pydantic handles the rest.
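For example, a `.env` file in your working directory might look like this (values are illustrative, mirroring the environment variables above):

```shell
# .env — loaded automatically by Pydantic when Recap starts
RECAP_STORAGE_SETTINGS__URL=http://localhost:8000/storage
RECAP_LOGGING_CONFIG_FILE=/tmp/logging.toml
```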

## Home

`RECAP_HOME` defines where Recap looks for storage and secret files. By default, `RECAP_HOME` is set to `~/.recap`.

## Secrets

Do not store database credentials in your `settings.toml`; use Dynaconf's secret management instead. See Dynaconf's [documentation](https://www.dynaconf.com/secrets/) for more information.
You can set environment variables with secrets in them using Pydantic's [secret handling mechanism](https://docs.pydantic.dev/usage/settings/#secret-support). By default, Recap looks for secrets in `$RECAP_HOME/.secrets`.
4 changes: 2 additions & 2 deletions docs/guides/logging.md
@@ -2,7 +2,7 @@ Recap uses Python's standard [logging](https://docs.python.org/3/library/logging

## Customizing

You can customize Recap's log output. Set the `logging.config.path` value in your `settings.toml` file to point at a [TOML](https://toml.io) file that conforms to Python's [dictConfig](https://docs.python.org/3/library/logging.config.html#logging-config-dictschema) schema.
You can customize Recap's log output. Set the `RECAP_LOGGING_CONFIG_FILE` environment variable to point to a [TOML](https://toml.io) file that conforms to Python's [dictConfig](https://docs.python.org/3/library/logging.config.html#logging-config-dictschema) schema.

```toml
version = 1
@@ -30,4 +30,4 @@ propagate = false
handlers = ['default']
level = "INFO"
propagate = false
```
```
103 changes: 0 additions & 103 deletions docs/guides/plugins.md

This file was deleted.

