Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1580: Change default DataBuffer constructor to use reserve instead of resize #1739

Closed
wants to merge 7 commits into from

Conversation

XinyuZeng
Copy link
Contributor

@XinyuZeng XinyuZeng commented Jan 10, 2024

What changes were proposed in this pull request?

See #1430 - default DataBuffer constructor now uses reserve.

I also add a zeroOut function to manually memset the buffer. For example, MapColumnWrite::add may use uninitialized offsets value (at uint64_t elemOffset = static_cast<uint64_t>(offsets[0]);). i.e., not memset the offsets buffer will result in segfault because the code implicitly assume the value should be 0. Therefore I still zeroOut the buffer for Map, List and Union, even though List and Union pass the unit tests without memset.

Why are the changes needed?

Reduce overhead in ColumnVectorBatch construction and compression/decompression buffer construction

How was this patch tested?

Correctness via make test-out. I further use orc-scan utility, setting batchSize to 1 million and then scan a file with 1 million rows and 20 columns. After this PR, the time reduces from 2.9s to 2.2s. The compression/decompression buffer overhead reported in #1430 can be verified by the reporter if necessary.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CPP label Jan 10, 2024
@@ -51,6 +51,7 @@ namespace orc {

public:
DataBuffer(MemoryPool& pool, uint64_t _size = 0);
DataBuffer(MemoryPool& pool, uint64_t _size, bool noMemSet);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we use noMemSet=false?

Copy link
Contributor Author

@XinyuZeng XinyuZeng Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never use. The problem is that it is not possible to provide explicit template <bool noMemSet> to a constructor. Perhaps we can allow this branch in constructor? That makes the code better but introduce a branch that should be resolved at compile time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the new commit I tried to use an empty class tag type to differentiate the constructor

@wgtmac
Copy link
Member

wgtmac commented Jan 13, 2024

cc @stiga-huang @ffacs @coderex2522

Copy link
Contributor

@stiga-huang stiga-huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. There are some other places that we can replace resize() with reserve(), e.g. in BlockDecompressionStream::NextDecompress()

orc/c++/src/Compression.cc

Lines 762 to 765 in ffb0c87

if (inputDataBuffer.capacity() < remainingLength) {
inputDataBuffer.resize(remainingLength);
}
::memcpy(inputDataBuffer.data(), inputBuffer, availableSize);

We can further optimize them in follow-up tasks.

@ffacs
Copy link
Contributor

ffacs commented Jan 15, 2024

+1, LGTM

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XinyuZeng , could you revise the PR title and description? For example, PR title seems to become inconsistent from the AS-IS code.

@XinyuZeng XinyuZeng changed the title ORC-1580: Remove unnecessary memset in (some) DataBuffer constructor ORC-1580: Change default DataBuffer constructor to use reserve instead of resize Jan 17, 2024
@XinyuZeng
Copy link
Contributor Author

@XinyuZeng , could you revise the PR title and description? For example, PR title seems to become inconsistent from the AS-IS code.

Done

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @XinyuZeng and all!

@dongjoon-hyun dongjoon-hyun added this to the 2.0.0 milestone Jan 17, 2024
dongjoon-hyun pushed a commit that referenced this pull request Jan 17, 2024
…d of resize

### What changes were proposed in this pull request?

See #1430 - default DataBuffer constructor now uses `reserve`.

I also add a `zeroOut` function to manually memset the buffer. For example, `MapColumnWrite::add` may use uninitialized `offsets` value (at `uint64_t elemOffset = static_cast<uint64_t>(offsets[0]);`). i.e., not memset the `offsets` buffer will result in segfault because the code implicitly assume the value should be 0. Therefore I still `zeroOut` the buffer for Map, List and Union, even though List and Union pass the unit tests without memset.

### Why are the changes needed?

Reduce overhead in `ColumnVectorBatch` construction and compression/decompression buffer construction

### How was this patch tested?

Correctness via `make test-out`. I further use `orc-scan` utility, setting `batchSize` to 1 million and then scan a file with 1 million rows and 20 columns. After this PR, the time reduces from 2.9s to 2.2s. The compression/decompression buffer overhead reported in #1430 can be verified by the reporter if necessary.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #1739 from XinyuZeng/gh-1430.

Authored-by: Xinyu Zeng <30414224+XinyuZeng@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 790a3e5)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Copy link
Member

Merged to main/2.0 for Apache ORC 2.0.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants