Reduce memory use when writing tables with very short columns to ORC #18136
base: branch-25.04
Conversation
The logic looks right to me, although I don't know the stripe stream bits well enough to understand why we start off with a 2d span with a dimension that is always of size one. However, this does seem to be reducing the memory usage enough to get CI to pass on the test we've been having problems with, so I'm approving this in the hopes that we can get more data from nightlies over the weekend. Thanks @vuule!
Description
To avoid estimating the maximum compressed size for each actual block in the file, the ORC writer uses the estimate for the (uncompressed) block size limit, which defaults to 256 KB. However, when we write many small blocks, this compressed block size estimate is much larger than needed, leading to high memory use for wide/short tables.
This PR adds logic to take the actual block sizes into account and to use the size of the largest block actually present in the file, not the largest possible block. This reduces memory usage by orders of magnitude in some tests.
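As a rough illustration of why this matters (not the libcudf implementation), the sketch below compares the two sizing strategies in plain Python. It assumes one worst-case-sized output slot is reserved per block, and it uses Snappy's documented worst-case bound (`32 + n + n/6`) as a stand-in for whatever bound the writer's compression backend reports. All function names and the example block sizes are hypothetical.

```python
# Illustrative sketch only; names and numbers are assumptions, not libcudf code.

BLOCK_SIZE_LIMIT = 256 * 1024  # default uncompressed block size limit (256 KB)


def max_compressed_size(uncompressed_bytes: int) -> int:
    """Worst-case compressed size for one block.

    Uses Snappy's documented bound as a stand-in; the real writer would
    query its compression backend instead.
    """
    return 32 + uncompressed_bytes + uncompressed_bytes // 6


def buffer_bytes_using_limit(block_sizes: list[int]) -> int:
    """Old approach: size every slot for the largest *possible* block."""
    return len(block_sizes) * max_compressed_size(BLOCK_SIZE_LIMIT)


def buffer_bytes_using_largest_block(block_sizes: list[int]) -> int:
    """New approach: size every slot for the largest block actually present."""
    largest = min(max(block_sizes, default=0), BLOCK_SIZE_LIMIT)
    return len(block_sizes) * max_compressed_size(largest)


# A wide/short table: many tiny blocks (hypothetical sizes).
block_sizes = [64] * 10_000

old_bytes = buffer_bytes_using_limit(block_sizes)
new_bytes = buffer_bytes_using_largest_block(block_sizes)
print(f"limit-based estimate:   {old_bytes:,} bytes")
print(f"largest-block estimate: {new_bytes:,} bytes")
print(f"reduction factor:       {old_bytes / new_bytes:.0f}x")
```

Under these assumptions, the per-slot estimate scales with the largest real block rather than with the 256 KB limit, so tables made of many tiny blocks see the largest savings, while tables whose blocks already reach the limit see none.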
Checklist