
Reduce memory use when writing tables with very short columns to ORC #18136

Open
wants to merge 10 commits into
base: branch-25.04

vuule
Contributor

@vuule vuule commented Feb 28, 2025

Description

To avoid estimating the maximum compressed size for each actual block in the file, the ORC writer uses a single estimate computed for the (uncompressed) block size limit, which defaults to 256KB. However, when the file consists of many small blocks, this estimate is much larger than needed, leading to high memory use for wide, short tables.

This PR takes actual block sizes into account: the estimate is now based on the largest block actually present in the file rather than the largest possible block. This reduces memory usage by orders of magnitude in some tests.
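
To illustrate the size of the effect, here is a minimal sketch and not the code in this PR. It assumes one output buffer per block, all sized by the same estimate. The bound in `max_compressed_size` is made up for illustration, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical worst-case bound: input size plus a proportional and a fixed
// overhead. Real compressors provide their own bound; this one is an assumption.
std::size_t max_compressed_size(std::size_t uncompressed_bytes)
{
  return uncompressed_bytes + uncompressed_bytes / 16 + 64;
}

// Previous behavior as described above: every block's buffer is sized for the
// block size limit, regardless of how small the blocks actually are.
std::size_t scratch_bytes_from_limit(std::size_t num_blocks, std::size_t block_size_limit)
{
  return num_blocks * max_compressed_size(block_size_limit);
}

// Behavior described in this PR: every block's buffer is sized for the largest
// block that is actually present.
std::size_t scratch_bytes_from_actual(std::vector<std::size_t> const& block_sizes,
                                      std::size_t block_size_limit)
{
  if (block_sizes.empty()) { return 0; }
  auto const largest = *std::max_element(block_sizes.begin(), block_sizes.end());
  assert(largest <= block_size_limit);  // blocks never exceed the configured limit
  return block_sizes.size() * max_compressed_size(largest);
}
```

With 1000 blocks of 100 bytes each and a 256KB limit, this made-up bound gives 278,592,000 bytes when sizing from the limit and 170,000 bytes when sizing from the largest actual block.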

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Feb 28, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@vuule vuule added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Feb 28, 2025
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 28, 2025
@github-actions github-actions bot added the CMake CMake build issue label Mar 1, 2025
@vuule vuule marked this pull request as ready for review March 1, 2025 00:56
@vuule vuule requested review from a team as code owners March 1, 2025 00:56
@vuule vuule requested review from vyasr and shrshi March 1, 2025 00:56
@vuule vuule changed the title from "Fix for failed nightly tests on 11.4" to "Reduce memory use when writing tables with very short columns to ORC" Mar 1, 2025
Contributor

@vyasr vyasr left a comment

The logic looks right to me, although I don't know the stripe stream bits well enough to understand why we start off with a 2d span with a dimension that is always of size one. However, this does seem to be reducing the memory usage enough to get CI to pass on the test we've been having problems with, so I'm approving this in the hopes that we can get more data from nightlies over the weekend. Thanks @vuule!

Labels
bug: Something isn't working; CMake: CMake build issue; libcudf: Affects libcudf (C++/CUDA) code; non-breaking: Non-breaking change
3 participants