How to best optimize reading from S3? #278
Comments
Hi @stevbear, optimizing reading from S3 has been on my list for a while and I just hadn't gotten around to it. It didn't get prioritized because no one had filed any issues concerning it, so thank you for filing this!
This is likely because the default buffer size is 16KB; have you tried increasing the `BufferSize`? I have a few ideas to further optimize via pre-buffering (which all have different trade-offs), so can you give me a bit more context to make sure that your use case would be helped and to identify which idea would work best for you? Specifically:

- If reading an entire column for a single row group gives you OOM, you either have a significantly large row group, or I'm guessing it's string data with a lot of large strings? That leads to the question of what you're doing with the column data after you read it from the row group.
- If you can't hold the entire column in memory from a single row group, are you streaming the data somewhere?
- Are you reading only a single column at a time, or multiple columns from the row group?
- Can you give me more of an idea of the sizes of the columns / row groups in the file and the memory limitations of your system?
- Is the issue the copy that happens when decoding/decompressing the column data?
- etc.

The more information the better, so we can figure out a good solution here. It gives me the opportunity to improve the memory usage of the parquet package like I've been wanting to! 😄
Hi @zeroshade, thanks for getting back to me.
The `BufferSize` matters because it controls the underlying buffered reads: with `BufferedStreamEnabled`, each read from the source fetches up to `BufferSize` bytes at a time.
Assuming the buffer is larger than a single page, one read will pull in multiple pages at once rather than going page by page. Does that make sense?
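For illustration, here is a minimal sketch of opening a file with buffered streams enabled and a larger per-column buffer, assuming the arrow-go parquet `parquet` and `file` packages (the import version and the 4 MiB value are placeholders to tune for your setup):

```go
package main

import (
	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet"
	"github.com/apache/arrow-go/v18/parquet/file"
)

// openBuffered opens a parquet file with buffered column streams and a
// larger per-column buffer, so each buffered read pulls several pages
// from the source in one request instead of roughly one page at a time.
func openBuffered(src parquet.ReaderAtSeeker) (*file.Reader, error) {
	props := parquet.NewReaderProperties(memory.DefaultAllocator)
	props.BufferedStreamEnabled = true
	// Illustrative value: 4 MiB instead of the 16KB default; tune to
	// your page sizes and memory budget.
	props.BufferSize = 4 * 1024 * 1024

	return file.NewParquetReader(src, file.WithReadProps(props))
}
```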
Thanks! That makes complete sense!
That's where the current problem is. As it currently stands, there isn't a good way to actually use the column and offset indexes when reading pages: right now there isn't a way to tell the page reader to skip an entire page, or to use the `PageLocation`/offset information to jump to a specific page. It would be helpful if you had an example of your use-case scenario, and we could potentially work together to figure out what a good new API would look like for leveraging the indexes to skip pages, etc.
I see. Thanks for that explanation.
I started implementing this a bit and realized that when you start dealing with repeated columns, a plain value-based skip isn't sufficient, since skipping values is not the same as skipping rows. It makes more sense to handle this with the Arrow column and record readers in the `pqarrow` package.
### Rationale for this change

Building towards proper support for skipping rows to address #278, we need to be able to efficiently discard values from decoders rather than having to allocate a buffer and read into it. This makes skipping/seeking within a page more efficient.

### What changes are included in this PR?

Adding a new `Discard` method to the `Decoder` interface, along with implementations for all the various decoders.

### Are these changes tested?

Yes, tests are added for the various decoders.

### Are there any user-facing changes?

No, this is in an internal package.
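For readers following the thread, here is a rough sketch of the shape such a `Discard` method takes. This is a conceptual illustration only; the real decoder interface lives in an internal encoding package and its names and signatures may differ:

```go
package sketch

// Decoder is a conceptual stand-in for the internal typed decoder
// interface; the names and signatures here are illustrative, not the
// package's actual API.
type Decoder interface {
	// ValuesLeft reports how many encoded values remain in the page.
	ValuesLeft() int

	// Discard drops the next n values from the encoded input without
	// materializing them into a caller-supplied buffer, returning the
	// number of values actually discarded. Skipping within a page then
	// avoids the "decode into a scratch buffer and throw it away" copy.
	Discard(n int) (int, error)
}
```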
@stevbear can you take a look at the linked PR and let me know if this works for you or if we need more work on this front?
Thank you @zeroshade, this looks good to me!
### Rationale for this change

Addressing the comments in #278 (comment) to allow for optimizing reads by skipping entire pages and leveraging the offset index if it exists.

### What changes are included in this PR?

Deprecating the old `NewColumnChunkReader` and `NewPageReader` methods, as they really aren't safe to use outside of the package and have proved difficult to evolve without breaking changes. Instead, users should rely on the `RowGroupReader` to create the column readers and page readers, which is generally what consumers do already.

Adding a `SeekToRow` method on the `ColumnChunkReader` to allow skipping to a particular row in the column chunk (which also allows quickly resetting back to the beginning of a column!), along with a `SeekToPageWithRow` method on the page reader. Also updates the `Skip` method to properly skip *rows* in a repeated column, not just values.

### Are these changes tested?

Yes, tests are included.

### Are there any user-facing changes?

Just the new methods. The deprecated methods are not removed currently.
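As a rough usage sketch of the API described above: the method name follows the PR description, but the exact signature of `SeekToRow` is an assumption here, so check the released package.

```go
package main

import (
	"fmt"

	"github.com/apache/arrow-go/v18/parquet/file"
)

// readFromRow positions a column chunk reader at startRow using the new
// seeking API, letting the reader use the offset index (when present) to
// jump to the right page instead of decoding every prior page.
func readFromRow(pr *file.Reader, rowGroup, col int, startRow int64) error {
	rgr := pr.RowGroup(rowGroup)

	cr, err := rgr.Column(col)
	if err != nil {
		return err
	}

	// Assumed signature per the PR description: SeekToRow(int64) error.
	if err := cr.SeekToRow(startRow); err != nil {
		return err
	}

	fmt.Printf("column %d positioned at row %d\n", col, startRow)
	return nil
}
```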
Describe the usage question you have. Please include as many useful details as possible.
Hi!
I have a use case of reading certain row groups from S3.
I see that there is an option BufferedStreamEnabled.
When I set BufferedStreamEnabled to false, the library seems to read all of a column's data for a row group at once, which will, unfortunately, result in OOM for us.
When I set BufferedStreamEnabled to true, the library seems to read the row group page by page, which is not optimal for cloud usage.
How can I improve this? I imagine the best way to improve this would be to read multiple pages in a single read() syscall?
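To make the use case concrete, here is a hedged sketch of feeding the parquet reader from S3 via ranged GetObject requests. `s3ReaderAt` is a hypothetical adapter (not part of the parquet package), and the AWS SDK for Go v2 is assumed:

```go
package main

import (
	"context"
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// s3ReaderAt is a hypothetical adapter that serves ReadAt calls with
// ranged S3 GetObject requests, so the parquet reader only downloads the
// byte ranges (footer, row groups, pages) it actually needs.
type s3ReaderAt struct {
	client *s3.Client
	bucket string
	key    string
	size   int64 // object size, e.g. from a prior HeadObject call
}

func (r *s3ReaderAt) ReadAt(p []byte, off int64) (int, error) {
	rng := fmt.Sprintf("bytes=%d-%d", off, off+int64(len(p))-1)
	out, err := r.client.GetObject(context.TODO(), &s3.GetObjectInput{
		Bucket: aws.String(r.bucket),
		Key:    aws.String(r.key),
		Range:  aws.String(rng),
	})
	if err != nil {
		return 0, err
	}
	defer out.Body.Close()

	n, err := io.ReadFull(out.Body, p)
	if err == io.ErrUnexpectedEOF {
		err = io.EOF // requested range ran past the end of the object
	}
	return n, err
}

// Wrapping the adapter in an io.SectionReader yields the ReaderAt+Seeker
// the parquet file reader expects, for example:
//   ra := &s3ReaderAt{client: client, bucket: "my-bucket", key: "my.parquet", size: objSize}
//   src := io.NewSectionReader(ra, 0, ra.size)
//   pr, err := file.NewParquetReader(src, file.WithReadProps(props))
```

With BufferedStreamEnabled and a larger BufferSize (as in the earlier sketch), each buffered read then translates into one larger ranged request rather than roughly one request per page.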
Component(s)
Parquet