-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance concern: fillHoles() method and read buffer expansion efficiency. #599
Comments
@lidavidm @liyafan82 @tianchen92 @kou may be you have any insights about that? |
The offset buffer needs to remain valid, and it needs to be nondecreasing, so it can't just be set to 0. The data buffer could probably be filled via |
Or really the data buffer doesn't need to be expanded at all (so that can be omitted). But the offset buffer does need to be filled and it can't just be |
(BTW, we do try to keep an eye on things, even if maintainer bandwidth is low. If you need lots of eyes on something, user@arrow.apache.org or dev@arrow.apache.org is likely faster than guessing who to ping.) |
In this case I think you would be better served by the new string view vector, which is unfortunately not fully implemented in Java (contributions are welcome). In this case the view buffer could be easily preallocated and memset up front (setting it all to 0 is perfectly fine) and then only slots with string data need to be touched. (BTW, I am assuming the data is a bit sparse? Otherwise I think fillHoles should not be doing much work. It's just trying to ensure that if you set slot |
Yeah, I’ve seen and tried this implementation in data fusion (string views on top of data pages), and it outperforms Java quite well. That was my second question (is there string view impl. in java) ;-) |
There's a number of issues but #85 and #86 are the important ones, I believe, plus #106 to do some benchmarking of different cases and likely we should expand the docs too (#468 (comment)) |
As in, adding existing buffers to a StringView vector? I don't think the current API supports that but it would be good to figure out a way to do so. |
Yes, as far as I understand from the code and explanation.
|
Hi everyone,
I have the following use case: I’m benchmarking the read throughput performance when dealing with a large number of non-dictionary string columns (300 columns). Based on the profiler output (see the attached picture), I’ve noticed that a significant amount of time is spent in the fillHoles() method, which is part of the read buffer expansion process.
My question is: why is the buffer filled one element at a time instead of using a bulk operation? Wouldn’t a batch approach be more efficient?
Looking forward to your insights. Thanks!
The text was updated successfully, but these errors were encountered: