Optimization of PushRowPage for high number of cpu cores #11182
Conversation
Thank you for the optimization! Does the existing test cover both cases? If not, could you please consider extracting the code into small unit tests? Also, would you like to share your benchmark results?

One particular caution about similar optimizations: thread-local buffers can cost a significant amount of memory when the core count is high. For large server CPUs with hundreds of cores this can be severe. We partially reverted a similar optimization in the CPU predictor, which used thread-local blocks, for this reason.

Ref #6659
…om/razdoburdin/xgboost into dev/cpu/push_row_page_optimisation
Sorry for the long delay. I have reorganized the optimization to reduce the memory overhead (see the PR description for details). I also added some tests to verify the changes.
Thank you for continuing the optimization work. Could you please share the latest profiling results, including performance and memory usage, on the datasets you are targeting?
Hmm, I don't quite understand the code. Do you think it's necessary to create this level of complexity for the gain?
```cpp
size_t n_entries = entries_per_columns[column_idx];

if (n_entries > 0) {
  size_t n_splits = std::min(nthreads * n_entries / total_entries, n_entries);
```
Why is `n_splits` calculated in this way?
The goal is to assign each column a number of threads proportional to its number of entries.
```cpp
std::vector<ThreadWorkLoad> baskets;
std::vector<bool> is_column_splited;
bool has_splitted = false;
```
`is_splitted`/`has_been_xxx`?
```cpp
typename WQSketch::SummaryContainer main_summary;
main_summary.Reserve(sketches_[column_idx].limit_size);
typename WQSketch::SummaryContainer split_summary;
split_summary.Reserve(2 * sketches_[column_idx].limit_size);
typename WQSketch::SummaryContainer comb_summary;
comb_summary.Reserve(3 * sketches_[column_idx].limit_size);
```
How are these `2 *` and `3 *` sizes calculated?
The size of `split_summary` guarantees the absence of intermediate pruning while calculating the summary, while the size of `comb_summary` is the sum of the sizes of `main_summary` and `split_summary`.
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
I suggest we postpone this PR and revisit it once other hotspots are resolved; at that point, the importance of this fix will be more significant.
Thank you for sharing. Yes, please take your time, and feel free to open refactoring PRs before optimization PRs. This way we can (hopefully) build some abstractions and utilities/helpers before the optimization.
This PR adds optimal row-wise parallelization for `PushRowPage`. The current implementation is inefficient on multithreaded CPUs when a dataset has a large number of rows and a small number of columns; this PR fixes that bottleneck. I observed up to 15% performance improvement for large datasets like airline on a system with 2x56-core CPUs.
Update:
Some illustrations of changes in parallel processing:
We have a sparse input matrix. Each cell corresponds to a feature, and we process each cell individually, placing the results into the output array.

The original implementation processes data column by column. This approach is suboptimal when there are many CPU cores but only a few columns.

In the initial optimization, each thread processed a subset of rows across all columns. This required significant memory overhead, because each thread needed a buffer for intermediate results.

In the current version, I have reorganized the logic. Now the input matrix is analyzed by the `LoadBalance` function. If a column is too large, it is split across multiple threads. This reduces buffer sizes, since each thread now processes only a single feature.

