-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-39336: [C++][Parquet] Minor: Style enhancement for parquet::FileMetaData #39337
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but see docstring suggestions
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
comment fixed now |
cpp/src/parquet/metadata.cc
Outdated
// Discard row groups that are not in the subset | ||
output_metadata->num_rows = 0; | ||
output_metadata->row_groups.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note:
Here will introduce a more round for copying metadata_->row_groups
.
Should I:
bool isset_rowgroup = metadata->isset.row_groups;
metadata->isset.row_groups = false;
auto row_groups = std::move(metadata->row_groups);
// Copying thrift
// ...
// Assign row_groups back
metadata->row_groups = std::move(row_groups);
metadata->isset.row_groups = ...
// .. Handle RowGroups
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks reasonable to me.
@@ -1839,7 +1843,7 @@ class FileMetaDataBuilder::FileMetaDataBuilderImpl { | |||
std::unique_ptr<FileMetaData> Finish( | |||
const std::shared_ptr<const KeyValueMetadata>& key_value_metadata) { | |||
int64_t total_rows = 0; | |||
for (auto row_group : row_groups_) { | |||
for (const auto& row_group : row_groups_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another problem I found during reading the code
cpp/src/parquet/metadata.cc
Outdated
{ | ||
// GH-39336: Copy the metadata, but only the row groups we need. | ||
auto temp_row_groups = std::move(this->metadata_->row_groups); | ||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you add try-catch block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think make_unique
and copying might throw exception( like std::bad_alloc
), and even when ex is thrown, we'd better keep this
not changed (otherwise it would be hard to debugging?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it worths this kind of complexity to handle the exception just in order to avoid the copy.
cpp/src/parquet/metadata.cc
Outdated
metadata->schema = metadata_->schema; | ||
{ | ||
// GH-39336: Copy the metadata, but only the row groups we need. | ||
auto temp_row_groups = std::move(this->metadata_->row_groups); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the point of doing this? Overall, are these changes actually bringing a tangible benefit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The origin code don't copying the previous row-group. If we have a large file with many row-groups (it's possible), the more around of copy waste a lot of cpu time. So I do this as an optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My problem is that you're temporarily mutating this
in a method that's not supposed to mutate it. This is brittle (look at that try/except) and also thread-unsafe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, the thread-safety is a problem here. Maybe keeping the current code is ok
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit f15a700. There were 3 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
…FileMetaData (apache#39337) ### Rationale for this change Enhance the style of `parquet::FileMetaData::Subset`. ### Are these changes tested? Already tested ### Are there any user-facing changes? no * Closes: apache#39336 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: mwish <maplewish117@gmail.com>
…FileMetaData (apache#39337) ### Rationale for this change Enhance the style of `parquet::FileMetaData::Subset`. ### Are these changes tested? Already tested ### Are there any user-facing changes? no * Closes: apache#39336 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: mwish <maplewish117@gmail.com>
…FileMetaData (apache#39337) ### Rationale for this change Enhance the style of `parquet::FileMetaData::Subset`. ### Are these changes tested? Already tested ### Are there any user-facing changes? no * Closes: apache#39336 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: mwish <maplewish117@gmail.com>
…FileMetaData (apache#39337) ### Rationale for this change Enhance the style of `parquet::FileMetaData::Subset`. ### Are these changes tested? Already tested ### Are there any user-facing changes? no * Closes: apache#39336 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: mwish <maplewish117@gmail.com>
…FileMetaData (apache#39337) ### Rationale for this change Enhance the style of `parquet::FileMetaData::Subset`. ### Are these changes tested? Already tested ### Are there any user-facing changes? no * Closes: apache#39336 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: mwish <maplewish117@gmail.com>
Rationale for this change
Enhance the style of
parquet::FileMetaData::Subset
.Are these changes tested?
Already tested
Are there any user-facing changes?
no