Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39336: [C++][Parquet] Minor: Style enhancement for parquet::FileMetaData #39337

Merged
merged 7 commits into from
Jan 17, 2024

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Dec 21, 2023

Rationale for this change

Enhance the style of parquet::FileMetaData::Subset.

Are these changes tested?

Already tested

Are there any user-facing changes?

no

@mapleFU mapleFU requested a review from wgtmac as a code owner December 21, 2023 13:10
@mapleFU
Copy link
Member Author

mapleFU commented Dec 21, 2023

cc @wgtmac @pitrou

Copy link

⚠️ GitHub issue #39336 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 21, 2023
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but see docstring suggestions

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@mapleFU
Copy link
Member Author

mapleFU commented Dec 21, 2023

comment fixed now

Comment on lines 799 to 801
// Discard row groups that are not in the subset
output_metadata->num_rows = 0;
output_metadata->row_groups.clear();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note:

Here will introduce a more round for copying metadata_->row_groups.

Should I:

bool isset_rowgroup = metadata->isset.row_groups;
metadata->isset.row_groups = false;
auto row_groups = std::move(metadata->row_groups);
// Copying thrift
// ...
// Assign row_groups back
metadata->row_groups = std::move(row_groups);
metadata->isset.row_groups = ...
// .. Handle RowGroups

? cc @wgtmac @pitrou

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me.

@@ -1839,7 +1843,7 @@ class FileMetaDataBuilder::FileMetaDataBuilderImpl {
std::unique_ptr<FileMetaData> Finish(
const std::shared_ptr<const KeyValueMetadata>& key_value_metadata) {
int64_t total_rows = 0;
for (auto row_group : row_groups_) {
for (const auto& row_group : row_groups_) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another problem I found during reading the code

@mapleFU
Copy link
Member Author

mapleFU commented Jan 8, 2024

@pitrou @wgtmac I've updated the logic for copying row-groups, would you mind take a look again?

{
// GH-39336: Copy the metadata, but only the row groups we need.
auto temp_row_groups = std::move(this->metadata_->row_groups);
try {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you add try-catch block?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think make_unique and copying might throw exception( like std::bad_alloc), and even when ex is thrown, we'd better keep this not changed (otherwise it would be hard to debugging?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it worths this kind of complexity to handle the exception just in order to avoid the copy.

@mapleFU mapleFU requested review from pitrou and wgtmac January 10, 2024 07:37
metadata->schema = metadata_->schema;
{
// GH-39336: Copy the metadata, but only the row groups we need.
auto temp_row_groups = std::move(this->metadata_->row_groups);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of doing this? Overall, are these changes actually bringing a tangible benefit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The origin code don't copying the previous row-group. If we have a large file with many row-groups (it's possible), the more around of copy waste a lot of cpu time. So I do this as an optimization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My problem is that you're temporarily mutating this in a method that's not supposed to mutate it. This is brittle (look at that try/except) and also thread-unsafe.

Copy link
Member Author

@mapleFU mapleFU Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the thread-safety is a problem here. Maybe keeping the current code is ok

@mapleFU mapleFU closed this Jan 17, 2024
@mapleFU mapleFU changed the title GH-39336: [C++][Parquet] Minor: Style enhancement for parquet::FileMetaData::Subset GH-39336: [C++][Parquet] Minor: Style enhancement for parquet::FileMetaData Jan 17, 2024
@mapleFU mapleFU reopened this Jan 17, 2024
@mapleFU
Copy link
Member Author

mapleFU commented Jan 17, 2024

@wgtmac @pitrou I remove the Subset change in pr and issue. This pr just change the style, and reduce some copy using reference or std::move

@mapleFU mapleFU merged commit f15a700 into apache:main Jan 17, 2024
34 checks passed
@mapleFU mapleFU removed the awaiting committer review Awaiting committer review label Jan 17, 2024
@mapleFU mapleFU deleted the tiny-metadata-style-enhancement branch January 17, 2024 12:37
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit f15a700.

There were 3 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

idailylife pushed a commit to idailylife/arrow that referenced this pull request Jan 18, 2024
…FileMetaData (apache#39337)

### Rationale for this change

Enhance the style of `parquet::FileMetaData::Subset`.

### Are these changes tested?

Already tested

### Are there any user-facing changes?

no

* Closes: apache#39336

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: mwish <maplewish117@gmail.com>
clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
…FileMetaData (apache#39337)

### Rationale for this change

Enhance the style of `parquet::FileMetaData::Subset`.

### Are these changes tested?

Already tested

### Are there any user-facing changes?

no

* Closes: apache#39336

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: mwish <maplewish117@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…FileMetaData (apache#39337)

### Rationale for this change

Enhance the style of `parquet::FileMetaData::Subset`.

### Are these changes tested?

Already tested

### Are there any user-facing changes?

no

* Closes: apache#39336

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: mwish <maplewish117@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…FileMetaData (apache#39337)

### Rationale for this change

Enhance the style of `parquet::FileMetaData::Subset`.

### Are these changes tested?

Already tested

### Are there any user-facing changes?

no

* Closes: apache#39336

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: mwish <maplewish117@gmail.com>
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
…FileMetaData (apache#39337)

### Rationale for this change

Enhance the style of `parquet::FileMetaData::Subset`.

### Are these changes tested?

Already tested

### Are there any user-facing changes?

no

* Closes: apache#39336

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: mwish <maplewish117@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Minor: Style enhancement for parquet::FileMetaData
3 participants