Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pymongoarrow replaces existing codecs #262

Open
norrbom opened this issue Jan 21, 2025 · 5 comments
Open

pymongoarrow replaces existing codecs #262

norrbom opened this issue Jan 21, 2025 · 5 comments

Comments

@norrbom
Copy link

norrbom commented Jan 21, 2025

pymongoarrow replaces existing codecs

The api.write() method replaces any existing codecs.

The collection TypeRegistry is replaced with a new instance, effectively removing any existing custom codecs.
https://github.com/mongodb-labs/mongo-arrow/blob/1.6.0/bindings/python/pymongoarrow/api.py#L419

A related issue is that pymongoarrow uses pyarrow.to_pylist() to convert a pyarrow.Table to Python objects as an intermediate step before converting to raw BSON.

Pyarrow converts Arrow decimal128 types to Python Decimal objects. However, BSON cannot handle Decimal types, necessitating the use of a custom codec.
https://github.com/mongodb-labs/mongo-arrow/blob/1.6.0/bindings/python/pymongoarrow/api.py#L385

@ShaneHarvey
Copy link
Collaborator

Thanks, for reporting. I'm curious, how did you discover this? Specifically, what type codecs are you configuring (Decimal -> Decimal128, something else)? I'm wondering if those converters should be enabled by default or if we should really inherit the current TypeRegistry.

@blink1073
Copy link
Member

I created https://jira.mongodb.org/browse/INTPYTHON-497, which will require a change to pymongo to enable extending an existing TypeRegistry.

@norrbom
Copy link
Author

norrbom commented Feb 13, 2025

Thanks, for reporting. I'm curious, how did you discover this? Specifically, what type codecs are you configuring (Decimal -> Decimal128, something else)? I'm wondering if those converters should be enabled by default or if we should really inherit the current TypeRegistry.

My use case is to read Parquet and write to MongoDB.

This is what happens:

  • Read from Parquet using pyarrow.parquet.read_table() -> pyarrow.Table
  • PyMongoArrow converts Arrow to Python (via pyarrow.RecordBatch.to_pylist()). Decimal128 becomes Decimal(), which is correct
  • PyMongoArrow converts Python to BSON using the Python BSON module → this is where it goes wrong; BSON cannot handle Decimal().

Yes, I believe it makes sense to have the Decimal → Decimal128 conversion as the default. Ideally, PyMongoArrow should manage all type conversions from Parquet to BSON.

Do you have any plans to open-source the Parquet-to-BSON conversion code (and vice versa) used in Atlas, its Go right?

@blink1073
Copy link
Member

Hi @norrbom, I'd consider it a bug for any Parquet types to not be handled by PyMongoArrow. I opened https://jira.mongodb.org/browse/INTPYTHON-520 to track the bug. I can't speak to Atlas Data Federation, that's a whole different part of MongoDB 😄.

@blink1073
Copy link
Member

In #278 we're adding support for the PyArrow Decimal128 type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants