-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-35331: [Python] Expose Parquet sorting metadata #37665
GH-35331: [Python] Expose Parquet sorting metadata #37665
Conversation
191309d
to
8b9fffe
Compare
b31ef33
to
c6bb3fb
Compare
c6bb3fb
to
5db4e8e
Compare
@mapleFU Could you also take a look at this PR when you get a chance? |
Ooops seems CI failed. I'm a little busy this week, but I'll take a glance later this week :-) Maybe you can call other for review if it's hurry |
Sorry! Will do. |
Notes | ||
----- | ||
|
||
Column indices are zero-based, refer only to leaf fields, and are in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, in current implemention, parquet itself will not check the sorting_order
is satisfied in user input. So we'd better:
- Maybe only enable writing for test?
- Or tell the user this is so dangerous and user should set this order if he/she cannot not gurantee ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is made clear in the docstrings of the Parquet Writer:
arrow/python/pyarrow/parquet/core.py
Lines 885 to 888 in 5db4e8e
sorting_columns : Sequence of SortingColumn, default None | |
Specify the sort order of the data being written. The writer does not sort | |
the data nor does it verify that the data is sorted. The sort order is | |
written to the row group metadata, which can then be used by readers. |
5db4e8e
to
65a5798
Compare
A rebase seems to have fixed the failing tests and now it is just AppVeyor failing (which I think is unrelated to these changes). Could someone please review this when they get the chance? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
65a5798
to
c162438
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question otherwise LGTM 👍
python/pyarrow/_parquet.pyx
Outdated
return out | ||
|
||
|
||
def _sort_keys_to_sorting_columns(sort_keys, null_placement, Schema schema): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this is not used anywhere - do you know, is it meant to be as a possible helper function? Looks like a duplicate of from_sort_order
to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They do look to do exactly the same thing so almost certainly just accidentially left in when the classmethod was added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AppVeyor failure is not connected. Thanks again @judahrand !
@jorisvandenbossche would you like to have a view before we merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good! I am bit hesitant about the custom to_sort_order / from_sort_order API ..
out = dict() | ||
|
||
cdef SchemaDescriptor* parquet_schema = sp_parquet_schema.get() | ||
|
||
for i in range(parquet_schema.num_columns()): | ||
name = frombytes(parquet_schema.Column(i).path().get().ToDotString()) | ||
out[name] = i | ||
|
||
return out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could also be done with existing ParquetSchema cython class (parquet_schema[idx].path
instead of creating the dict + lookup), but I suppose the problem is that we currently don't allow to create a ParquetSchema from scratch from a pyarrow schema, but only from an actual FileMetaData object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. A from_arrow_schema
classmethod could be introduced on ParquetSchema
but that feels a bit out of scope of these changes.
e76aeaa
to
fcfd65b
Compare
6b86bcd
to
7c8849f
Compare
Ping @jorisvandenbossche |
Gentle ping @jorisvandenbossche |
@AlenkaF would you like to move it forward? Or lets wait after January |
Will wait till tomorrow for any comments otherwise I think this could get merged 👍 |
Just one small comment before we merge: #37665 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! +1
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit cc9e649. There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them. |
### Rationale for this change Picking up where apache#35453 left off. Closes apache#35331 This PR builds on top of apache#37469 ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? * Closes: apache#35331 Lead-authored-by: Judah Rand <17158624+judahrand@users.noreply.github.com> Co-authored-by: Will Jones <willjones127@gmail.com> Signed-off-by: AlenkaF <frim.alenka@gmail.com>
### Rationale for this change Picking up where apache#35453 left off. Closes apache#35331 This PR builds on top of apache#37469 ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? * Closes: apache#35331 Lead-authored-by: Judah Rand <17158624+judahrand@users.noreply.github.com> Co-authored-by: Will Jones <willjones127@gmail.com> Signed-off-by: AlenkaF <frim.alenka@gmail.com>
Good |
Rationale for this change
Picking up where #35453 left off.
Closes #35331
This PR builds on top of #37469
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
sorting_columns
in RowGroupMetaData for Parquet files #35331