Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance concern: fillHoles() method and read buffer expansion efficiency. #599

Open
sherman opened this issue Feb 10, 2025 · 9 comments
Labels
help wanted Extra attention is needed Type: enhancement New feature or request

Comments

@sherman
Copy link

sherman commented Feb 10, 2025

Hi everyone,

I have the following use case: I’m benchmarking the read throughput performance when dealing with a large number of non-dictionary string columns (300 columns). Based on the profiler output (see the attached picture), I’ve noticed that a significant amount of time is spent in the fillHoles() method, which is part of the read buffer expansion process.

My question is: why is the buffer filled one element at a time instead of using a bulk operation? Wouldn’t a batch approach be more efficient?

Looking forward to your insights. Thanks!

Image

@sherman sherman added the Type: enhancement New feature or request label Feb 10, 2025
@sherman sherman changed the title Performance Concern: fillHoles() method and read buffer expansion efficiency. Performance concern: fillHoles() method and read buffer expansion efficiency. Feb 10, 2025
@orchestr7
Copy link

@lidavidm @liyafan82 @tianchen92 @kou may be you have any insights about that?

@lidavidm
Copy link
Member

The offset buffer needs to remain valid, and it needs to be nondecreasing, so it can't just be set to 0.

The data buffer could probably be filled via MemoryUtil#setBytes rather than repeatedly copying from a zeroed buffer.

@lidavidm
Copy link
Member

Or really the data buffer doesn't need to be expanded at all (so that can be omitted). But the offset buffer does need to be filled and it can't just be memset barring special cases.

@lidavidm lidavidm added the help wanted Extra attention is needed label Feb 10, 2025
@lidavidm
Copy link
Member

(BTW, we do try to keep an eye on things, even if maintainer bandwidth is low. If you need lots of eyes on something, user@arrow.apache.org or dev@arrow.apache.org is likely faster than guessing who to ping.)

@lidavidm
Copy link
Member

In this case I think you would be better served by the new string view vector, which is unfortunately not fully implemented in Java (contributions are welcome). In this case the view buffer could be easily preallocated and memset up front (setting it all to 0 is perfectly fine) and then only slots with string data need to be touched.

(BTW, I am assuming the data is a bit sparse? Otherwise I think fillHoles should not be doing much work. It's just trying to ensure that if you set slot N that all the slots before N are valid.)

@sherman
Copy link
Author

sherman commented Feb 10, 2025

In this case I think you would be better served by the new string view vector, which is unfortunately not fully implemented in Java (contributions are welcome). In this case the view buffer could be easily preallocated and memset up front (setting it all to 0 is perfectly fine) and then only slots with string data need to be touched.

(BTW, I am assuming the data is a bit sparse? Otherwise I think fillHoles should not be doing much work. It's just trying to ensure that if you set slot N that all the slots before N are valid.)

Yeah, I’ve seen and tried this implementation in data fusion (string views on top of data pages), and it outperforms Java quite well.

That was my second question (is there string view impl. in java) ;-)

@lidavidm
Copy link
Member

There's a number of issues but #85 and #86 are the important ones, I believe, plus #106 to do some benchmarking of different cases and likely we should expand the docs too (#468 (comment))

@lidavidm
Copy link
Member

string views on top of data pages

As in, adding existing buffers to a StringView vector? I don't think the current API supports that but it would be good to figure out a way to do so.

@sherman
Copy link
Author

sherman commented Feb 11, 2025

string views on top of data pages

As in, adding existing buffers to a StringView vector? I don't think the current API supports that but it would be good to figure out a way to do so.

Yes, as far as I understand from the code and explanation.
Check out this blog post by the author:

For example, in Figure 4, the StringViewArray directly references the buffer with the decoded Parquet page.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed Type: enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants