feat: Experimental unity catalog client #20798

Merged
merged 5 commits into from
Jan 20, 2025

nameexhaustion
Collaborator

@nameexhaustion nameexhaustion commented Jan 20, 2025

Introduces an experimental unity catalog client. Note that the API is unstable and subject to change.

The initial version in this PR supports:

  • Listing catalogs, schemas and tables
  • Retrieving table information
  • Reading a table as a LazyFrame for the following data_source_formats:
    • DELTA
    • PARQUET
    • CSV
import polars as pl
from pprint import pprint

# See https://github.com/unitycatalog/unitycatalog for the unity catalog server.
catalog = pl.Catalog("http://localhost:8080")

pprint(catalog.list_catalogs())
# [{"comment": "Main catalog", "name": "unity"}]
pprint(catalog.list_schemas("unity"))
# [{"comment": "Default schema", "name": "default"}]
pprint(catalog.list_tables("unity", "default"))
# [
#     {
#         "columns": [
#             {
#                 "comment": "ID primary key",
#                 "name": "id",
#                 "partition_index": None,
#                 "position": 0,
#                 "type_interval_type": None,
#                 "type_text": "int",
#             },
#             ...,
#         ],
#         "comment": "Managed table",
#         "data_source_format": "DELTA",
#         "name": "marksheet",
#         "storage_location": "file:///Users/nxs/git/unitycatalog/etc/data/managed/unity/default/tables/marksheet/",
#         "table_id": "c389adfa-5c8f-497b-8f70-26c2cca4976d",
#         "table_type": "MANAGED",
#     },
#     ...,
# ]
pprint(catalog.get_table_info("unity", "default", "numbers"))
# {
#     "columns": [
#         {
#             "comment": "Int column",
#             "name": "as_int",
#             "partition_index": None,
#             "position": 0,
#             "type_interval_type": None,
#             "type_text": "int",
#         },
#         ...,
#     ],
#     "comment": "External table",
#     "data_source_format": "DELTA",
#     "name": "numbers",
#     "storage_location": "file:///Users/nxs/git/unitycatalog/etc/data/external/unity/default/tables/numbers/",
#     "table_id": "32025924-be53-4d67-ac39-501a86046c01",
#     "table_type": "EXTERNAL",
# }
print(q := catalog.scan_table("unity", "default", "numbers"))
# naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

# Parquet SCAN [/Users/nxs/git/unitycatalog/etc/data/external/unity/default/tables/numbers/d1df15d1-33d8-45ab-ad77-465476e2d5cd-000.parquet]
# PROJECT */2 COLUMNS
print(q.collect())
# shape: (15, 2)
# ┌────────┬────────────┐
# │ as_int ┆ as_double  │
# │ ---    ┆ ---        │
# │ i32    ┆ f64        │
# ╞════════╪════════════╡
# │ 564    ┆ 188.755356 │
# │ 755    ┆ 883.610563 │
# │ 644    ┆ 203.439559 │
# │ 75     ┆ 277.880219 │
# │ 42     ┆ 403.857969 │
# │ …      ┆ …          │
# │ 294    ┆ 209.322436 │
# │ 150    ┆ 329.197303 │
# │ 539    ┆ 425.661029 │
# │ 247    ┆ 477.742227 │
# │ 958    ┆ 509.371273 │
# └────────┴────────────┘
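The three listing calls compose into a walk over the whole catalog. Below is a minimal sketch, assuming the dict-shaped return values shown in the example output above (the API is experimental, so these shapes may change). `iter_table_paths` and `FakeCatalog` are hypothetical names used only for illustration; the stub stands in for a running unity catalog server.

```python
from typing import Any, Iterator


def iter_table_paths(catalog: Any) -> Iterator[tuple[str, str, str]]:
    """Yield (catalog, schema, table) name triples from a catalog client.

    Assumes each list_* call returns a list of dicts with a "name" key,
    as in the example output above.
    """
    for cat in catalog.list_catalogs():
        for schema in catalog.list_schemas(cat["name"]):
            for table in catalog.list_tables(cat["name"], schema["name"]):
                yield cat["name"], schema["name"], table["name"]


class FakeCatalog:
    """Hypothetical in-memory stand-in for pl.Catalog, for demonstration only."""

    def list_catalogs(self):
        return [{"comment": "Main catalog", "name": "unity"}]

    def list_schemas(self, catalog_name):
        return [{"comment": "Default schema", "name": "default"}]

    def list_tables(self, catalog_name, schema_name):
        return [{"name": "marksheet"}, {"name": "numbers"}]


paths = list(iter_table_paths(FakeCatalog()))
print(paths)
```

With a real server, the same function should accept the `catalog` object from the example, and each triple can be passed straight to `catalog.scan_table(...)`.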

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jan 20, 2025

codecov bot commented Jan 20, 2025

Codecov Report

Attention: Patch coverage is 25.19201% with 487 lines in your changes missing coverage. Please review.

Project coverage is 79.62%. Comparing base (3696e53) to head (db564e0).
Report is 70 commits behind head on main.

Files with missing lines                           Patch %   Lines
crates/polars-python/src/catalog/mod.rs            0.56%     177 Missing ⚠️
crates/polars-io/src/catalog/unity/client.rs       0.00%     102 Missing ⚠️
crates/polars-io/src/catalog/schema.rs             68.75%    55 Missing ⚠️
crates/polars-io/src/catalog/unity/utils.rs        0.00%     53 Missing ⚠️
py-polars/polars/catalog.py                        48.75%    41 Missing ⚠️
crates/polars-lazy/src/scan/catalog.rs             0.00%     31 Missing ⚠️
crates/polars-io/src/utils/other.rs                0.00%     21 Missing ⚠️
crates/polars-io/src/path_utils/hugging_face.rs    0.00%     4 Missing ⚠️
crates/polars-python/src/utils.rs                  0.00%     3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20798      +/-   ##
==========================================
- Coverage   79.78%   79.62%   -0.17%     
==========================================
  Files        1561     1568       +7     
  Lines      222015   222669     +654     
  Branches     2533     2543      +10     
==========================================
+ Hits       177135   177295     +160     
- Misses      44296    44790     +494     
  Partials      584      584              


@ion-elgreco
Contributor

Ah, we actually also have a PR open for this: delta-io/delta-rs#3078. We could have shared components of the client.


let args = ScanArgsParquet {
schema,
allow_missing_columns: matches!(data_source_format, DataSourceFormat::Delta),
Contributor

Plainly reading Delta parquet files is not a safe operation; you have to check the protocol versions to see whether you are allowed to read the table.

Collaborator Author

Hello, thanks for the review!

This branch is only hit if data_source_format=PARQUET. Are there still protocol version checks to worry about in this case?

For data_source_format=DELTA I am using the Python-side scan_delta.

Contributor

In that case it should be fine! :)
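The format-based dispatch discussed in this thread can be sketched roughly as follows. This is an illustration only, not the client's actual code: the thread confirms that DELTA tables go through the Python-side `scan_delta` and that the reviewed branch handles PARQUET, while routing CSV to `scan_csv` is an assumption. `SCANNERS` and `choose_scanner` are hypothetical names; the values are names of existing polars scan functions.

```python
# Hypothetical mapping from data_source_format to the name of the polars
# scan function that would handle it. The real client's internals may differ.
SCANNERS = {
    "DELTA": "scan_delta",      # confirmed in the thread (Python-side scan_delta)
    "PARQUET": "scan_parquet",  # the branch under review
    "CSV": "scan_csv",          # assumption
}


def choose_scanner(data_source_format):
    """Return the name of the scan function for a given table format."""
    if data_source_format is None:
        raise ValueError("table has no data_source_format")
    try:
        return SCANNERS[data_source_format.upper()]
    except KeyError:
        raise NotImplementedError(
            f"unsupported data_source_format: {data_source_format}"
        ) from None


print(choose_scanner("DELTA"))
```

Keeping Delta on its own path matters here because the Delta reader is the one that can honor the protocol checks raised in the review, whereas a plain parquet scan cannot.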

@ritchie46 ritchie46 merged commit bf57bde into pola-rs:main Jan 20, 2025
28 checks passed
@nameexhaustion nameexhaustion deleted the catalog branch January 24, 2025 13:02
@nrccua-timr

y'all are amazing!!!! been waiting for this for over a year!

@c-peters c-peters added the accepted Ready for implementation label Jan 27, 2025
@pustoladxc

pustoladxc commented Feb 7, 2025

@nameexhaustion it's awesome to see integration with Unity Catalog, thanks for that! Can you tell me if there are plans to fix the issue I came across?

I'm using Delta Live Tables, they are managed tables and they don't expose storage_location nor data_source_format. Both of these attributes are None when returned by catalog.get_table_info.
Because storage_location is missing, catalog.scan_table fails due to ValueError: cannot scan catalog table: no storage_location found.

Are there plans, and is this even possible, to scan DLT with polars? Thanks!
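One way to surface this condition before calling `scan_table` is to inspect the table info first. The sketch below assumes `get_table_info` returns a dict shaped like the example output in the PR description (the API is experimental and the return type may change); `missing_scan_fields` is a hypothetical helper, not part of polars.

```python
def missing_scan_fields(table_info: dict) -> list:
    """Return the fields needed for scanning that are absent or None."""
    required = ("storage_location", "data_source_format")
    return [key for key in required if table_info.get(key) is None]


# Table info as reported above for a Delta Live Table (both fields None).
dlt_info = {"name": "dlt_table", "storage_location": None, "data_source_format": None}

# Table info taken from the PR description's example output.
external_info = {
    "name": "numbers",
    "storage_location": "file:///Users/nxs/git/unitycatalog/etc/data/external/unity/default/tables/numbers/",
    "data_source_format": "DELTA",
}

print(missing_scan_fields(dlt_info))
print(missing_scan_fields(external_info))
```

A non-empty result means `scan_table` cannot locate the data, so the caller can skip the table or report it instead of hitting the ValueError.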

@nrccua-timr

@pustoladxc I had the same issue but got around it by setting the region through the storage_options param passed to scan_table (e.g. storage_options={"AWS_REGION": "us-east-2"}).

@pustoladxc

@nrccua-timr appreciate your answer, but unfortunately this does not work for me. I'm on Azure and there are no "region" or "location" options available here. I tried all the storage-related options but none helped.

I remember that when writing to an external table (with storage location available) providing AZURE_STORAGE_ACCOUNT_NAME was, for some reason, necessary indeed.

So you say that for AWS users reading from a Delta Live Table works after providing AWS_REGION, correct?

@nrccua-timr

nrccua-timr commented Feb 7, 2025

@pustoladxc There were a few other issues. For example, our Databricks Delta Live Tables were created about two years ago and weren't compatible with polars. However, when I tried the scan_table function on a newly created table, it worked (granted, I needed to disable DeletionVectors because they aren't supported by deltalake, which is what polars uses as the backend to interact with unity catalog). Basically, the polars team has some more work to do for a smoother experience... but this implementation, as a beginning, is greatly appreciated!

6 participants