
Optimization of PushRowPage for high number of CPU cores #11182

Open · razdoburdin wants to merge 22 commits into master

Conversation

@razdoburdin (Contributor) commented Jan 23, 2025:

This PR adds optimal row-wise parallelization for PushRowPage.
The current implementation is inefficient on multithreaded CPUs when a dataset has a large number of rows and a small number of columns. This PR fixes that bottleneck: I observed up to 15% performance improvement for large datasets such as airline on a system with a 2x56-core CPU.

Update:
Some illustrations of the changes in parallel processing:

We have a sparse input matrix. Each cell corresponds to a feature value, and we process each cell individually, placing the result into the output array.
[figure omitted]

The original implementation processes data column by column. This approach is suboptimal when there are many CPU cores but only a few columns.
[figure omitted]

In the initial optimization, each thread processed a subset of rows but touched all columns. This required significant memory overhead, because buffers for intermediate results had to be used.
[figure omitted]

In the current version, I have reorganized the logic. The input matrix is now analyzed by the LoadBalance function: if a column is too large, it is split across multiple threads. This reduces buffer sizes, since each thread now processes only a single feature (see the sketch below).
[figure omitted]
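For illustration only, here is a minimal, self-contained sketch of the splitting strategy described above. The names (ThreadWorkload, LoadBalance, entries_per_column) are assumptions for this sketch and only loosely follow the PR's actual code.

// Sketch: split columns into per-thread work units, proportionally to size.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// One unit of work: a contiguous slice of a single column.
struct ThreadWorkload {
  std::size_t column_idx;
  std::size_t begin_entry;  // first entry of the column to process
  std::size_t end_entry;    // one past the last entry
};

// Assign threads to columns proportionally to column sizes; a column that
// receives more than one thread is split into equal slices.
std::vector<ThreadWorkload> LoadBalance(
    const std::vector<std::size_t>& entries_per_column, std::size_t n_threads) {
  std::size_t total_entries = 0;
  for (auto n : entries_per_column) total_entries += n;

  std::vector<ThreadWorkload> baskets;
  for (std::size_t col = 0; col < entries_per_column.size(); ++col) {
    std::size_t n_entries = entries_per_column[col];
    if (n_entries == 0) continue;
    // Proportional share of the thread pool, at least 1, at most n_entries.
    std::size_t n_splits = std::max<std::size_t>(
        1, std::min(n_threads * n_entries / total_entries, n_entries));
    std::size_t chunk = (n_entries + n_splits - 1) / n_splits;  // ceiling
    for (std::size_t begin = 0; begin < n_entries; begin += chunk) {
      baskets.push_back({col, begin, std::min(begin + chunk, n_entries)});
    }
  }
  return baskets;
}

int main() {
  // Three columns of very different sizes, eight threads.
  std::vector<std::size_t> entries_per_column{1000, 10, 200};
  for (const auto& w : LoadBalance(entries_per_column, 8)) {
    std::cout << "column " << w.column_idx << ": [" << w.begin_entry << ", "
              << w.end_entry << ")\n";
  }
  return 0;
}

With this input, the 1000-entry column is split into six slices while the two small columns each get a single slice, so no thread is left waiting on one oversized column.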

@trivialfis (Member) left a comment:

Thank you for the optimization! Does the existing test cover both cases? If not could you please consider extracting the code for small unit tests? Also, would you like to share your benchmark results?

@razdoburdin (Contributor, Author):

> Thank you for the optimization! Does the existing test cover both cases? If not could you please consider extracting the code for small unit tests? Also, would you like to share your benchmark results?

Here is the speed-up in comparison with the current master branch.
[speed-up figure omitted]

As for unit tests, it is not easy to cover such a case. I will try to figure it out.

@trivialfis (Member) commented Jan 24, 2025:

One particular caution about similar optimizations: thread-local buffers can cost a significant amount of memory when the core count is high. For large server CPUs with hundreds of cores this can be severe. We partially reverted a similar optimization in the CPU predictor, which used thread-local blocks, for this reason.
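As a rough illustration of the scale (numbers invented for the example): with 224 hardware threads, even a modest 10 MB thread-local buffer per thread adds more than 2 GB of memory.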

@trivialfis (Member): Ref #6659

razdoburdin marked this pull request as draft on February 4, 2025.
@razdoburdin (Contributor, Author) commented Feb 18, 2025:

Sorry for the long delay. I have reorganized the optimization to reduce memory overhead (see the PR description for details). I also added some tests to verify the changes.

razdoburdin marked this pull request as ready for review on February 18, 2025.
@trivialfis (Member):
Thank you for continuing the optimization work. Could you please share the latest profiling result including performance and memory usage on the datasets that you are targeting?

@razdoburdin (Contributor, Author):

> Thank you for continuing the optimization work. Could you please share the latest profiling result including performance and memory usage on the datasets that you are targeting?

Here are the benchmark results for the latest version of the PR. In the table, a number greater than 1 means this PR is faster than the master branch (for the speed columns) or consumes more memory than it (for the memory columns).
[benchmark figure omitted]

@trivialfis (Member) left a comment:

Hmm, I don't quite understand the code. Do you think it's necessary to create this level of complexity for the gain?

size_t n_entries = entries_per_columns[column_idx];

if (n_entries > 0) {
  size_t n_splits = std::min(nthreads * n_entries / total_entries, n_entries);
@trivialfis (Member): Why is n_splits calculated in this way?

@razdoburdin (Contributor, Author): The goal is to assign each column a number of threads proportional to the number of entries in the column.
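To illustrate the formula with made-up numbers: with nthreads = 8, total_entries = 1000, and a column holding n_entries = 500, the column is assigned min(8 * 500 / 1000, 500) = 4 splits, i.e. a column with half of the entries receives half of the threads.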


std::vector<ThreadWorkLoad> baskets;
std::vector<bool> is_column_splited;
bool has_splitted = false;
@trivialfis (Member): Naming: is_split / has_been_xxx?

Comment on lines +982 to +987
typename WQSketch::SummaryContainer main_summary;
main_summary.Reserve(sketches_[column_idx].limit_size);
typename WQSketch::SummaryContainer split_summary;
split_summary.Reserve(2 * sketches_[column_idx].limit_size);
typename WQSketch::SummaryContainer comb_summary;
comb_summary.Reserve(3 * sketches_[column_idx].limit_size);
@trivialfis (Member): How are these 2 * and 3 * sizes calculated?

@razdoburdin (Contributor, Author): The size of split_summary guarantees the absence of intermediate pruning while calculating the summary, while the size of comb_summary is the sum of the sizes of main_summary and split_summary.
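(That is, merging two summaries of at most limit_size entries each can produce up to 2 * limit_size entries, so reserving 2 * limit_size avoids pruning during the merge; comb_summary then needs limit_size + 2 * limit_size = 3 * limit_size.)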

@razdoburdin (Contributor, Author):

> Hmm, I don't quite understand the code. Do you think it's necessary to create this level of complexity for the gain?
The level of complexity required for this optimization has increased since I started working on it. 😞
Overall, I'm trying to improve scalability for modern high-end CPUs, but it's still not good enough.

[scalability figure omitted]

I suggest we postpone this PR and revisit it once other hotspots are resolved; at that point, the impact of this fix will be more significant.

@trivialfis (Member) commented Feb 25, 2025:

Thank you for sharing. Yes, please take your time and feel free to open refactoring PRs before optimization PRs. This way we can (hopefully) build some abstractions and utilities/helpers before optimization.
