SIGSEGV during join on large table #39332
A variation is to define `right_keys = pa.array([0, 0], pa.int64())`. Instead of a SIGSEGV, this results in an infinite loop with the following stack trace:
We're looping through the following three instructions:
Here I believe this corresponds to https://github.com/apache/arrow/blob/apache-arrow-12.0.1/cpp/src/arrow/compute/light_array.cc#L329-L331.
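For concreteness, the variation described above amounts to the following; everything except the `right_keys` line is an assumption, since the full script is not captured in this rendering:

```python
import pyarrow as pa

# Per the comment above: two identical join keys on the right side.
# In the otherwise-unchanged reproducer, this reportedly turns the
# SIGSEGV into an infinite loop inside the hash-join code.
right_keys = pa.array([0, 0], pa.int64())
```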
I took a look at this; here are my findings. First, some quick answers to related concerns:
Because only …
The essence of this failure is an overflow of the offsets in the resulting variable-length buffer (detailed below). So the right table must contain enough duplicate keys to explode the join result, which is what triggers the overflow. For example, the left table has 128 rows of 8 MB strings, so if the right table contains 4 (or more) matches per left row, the result will carry 128 * 8 MB * 4 = 4 GB of string data, which overflows to zero when accumulated in the 32-bit offsets used here.

Now the detailed explanation of the segmentation fault (I didn't look much at the infinite-loop case, but I believe it has essentially the same cause). The join result contains 128 * 4 = 512 rows. Each row of the string column coming from the left table holds an 8 MB string, so the result has a string column with 8 MB * 512 = 4 GB of data. When the offsets are accumulated at `arrow/cpp/src/arrow/compute/light_array.cc` line 594 (commit `bcaeaa8`), `sum` overflows to 0 once it reaches 4 GB, because it is a `uint32_t`.

Then, when the buffer for the string data is allocated, the code at `arrow/cpp/src/arrow/compute/light_array.cc` line 598 (commit `bcaeaa8`) does essentially nothing: it sees `offsets[512] == 0`, concludes the buffer needs no resizing, and leaves the data buffer at its initial size of `align(8 B + 64 B, 64 B) = 128 B`.

Later, when string data is appended to it, a segmentation fault occurs when the write goes past its 128th byte. I'm not sure the API contract is upheld for this particular case in the first place, because for a string array whose elements are 8 MB each, having 512 or more rows already exceeds the valid range of the 32-bit offsets used here. But I think we should at least detect this kind of overflow and report an explicit error in this case.
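To make the overflow arithmetic above concrete, here is a small plain-Python sketch (not Arrow code; taking the 8 MB row size as 8 MiB is an assumption) of how a `uint32_t` running sum of offsets wraps to zero after 512 rows:

```python
# Sketch of the offset arithmetic described above (not Arrow code).
ROW_SIZE = 8 * 1024 * 1024      # 8 MB string per row (assumed to be 8 MiB)
NUM_ROWS = 512                  # 128 left rows x 4 right matches

offsets = [0]
total = 0
for _ in range(NUM_ROWS):
    total = (total + ROW_SIZE) & 0xFFFFFFFF   # emulate uint32_t wrap-around
    offsets.append(total)

print(offsets[-1])          # 0 -> the resize step thinks the data buffer needs no growth
print(NUM_ROWS * ROW_SIZE)  # 4294967296 (4 GiB), the real amount of string data
print(2**31 - 1)            # int32 max; already exceeded after 256 rows
```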
PR #39383 has been drafted to give an explicit error for the cases in this issue. I would appreciate any reviewer helping to check whether this is the right thing to do. Thanks.
@tmaxwell-anthropic As you mentioned:
This is another problem, independent of what I discovered for the segmentation fault. I found it while debugging the unit test for my fix. I've now included the fix for the infinite loop in my PR as well.
Great, thank you for looking into this! I think switching to …
I've finalized the PR and hope someone can review it soon.
… length data exceeds offset limit (int32 max) (#39383)

### Rationale for this change

When appending variable-length data in `ExecBatchBuilder`, the offset can overflow if the batch contains 4 GB of data or more. This can further result in a segmentation fault during the subsequent copying of the data contents. For details, please refer to this comment: #39332 (comment). The solution is for the user to use the "large" counterpart data type to avoid the overflow, but we still need an explicit error when such an overflow happens.

### What changes are included in this PR?

1. Detect the offset overflow when appending data in `ExecBatchBuilder` and throw explicitly.
2. Change the offset type in `ExecBatchBuilder` from `uint32_t` to `int32_t`, and respect `BinaryBuilder::memory_limit()`, which is `2GB - 2B`, like the rest of the codebase.

### Are these changes tested?

UT included.

### Are there any user-facing changes?

No.

* Closes: #39332

Lead-authored-by: zanmato <zanmato1984@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Rossi(Ruoxi) Sun <zanmato1984@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
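From a user's perspective, the practical takeaway remains the workaround noted earlier in the issue: columns that may carry more string data than 32-bit offsets allow (roughly 2 GiB per batch) should use `large_string`, which has 64-bit offsets. A minimal sketch, with made-up table contents:

```python
import pyarrow as pa

# Hypothetical example table; the point is only the type of the "value" column.
left = pa.table({"key": pa.array([0], pa.int64()),
                 "value": pa.array(["payload"], pa.string())})

# string() offsets are 32-bit, so a single chunk of string data is capped at
# roughly 2 GiB; casting to large_string() switches to 64-bit offsets.
left_large = left.cast(pa.schema([("key", pa.int64()),
                                  ("value", pa.large_string())]))
```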
This Python script produces a segmentation fault in the `join()` call (a sketch of such a reproducer is included after the component list below). The C++ call stack is:
Software versions: PyArrow 12.0.1, Python 3.11.6, Ubuntu 22.04.3.
If I change the `pa.string()` to a `pa.large_string()`, then it works fine.

Component(s)
C++, Python
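Since the original script was not captured in this rendering, below is a minimal sketch of a reproducer along the lines described in the comments (128 left rows of 8 MB strings, 4 matching right rows per key). The column names, exact sizes, and the use of `Table.join` are assumptions, and running it on an affected version may crash the process:

```python
import pyarrow as pa

ROW_SIZE = 8 * 1024 * 1024  # 8 MB per string, per the analysis in the comments

# Left table: 128 rows, each carrying an 8 MB string.
left = pa.table({
    "key": pa.array(range(128), pa.int64()),
    "value": pa.array(["x" * ROW_SIZE] * 128, pa.string()),
})

# Right table: 4 rows per key, so each left row matches 4 times and the joined
# string column holds 128 * 4 * 8 MB = 4 GiB of data, overflowing 32-bit offsets.
right = pa.table({
    "key": pa.array([k for k in range(128) for _ in range(4)], pa.int64()),
})

# On PyArrow 12.0.1 this reportedly segfaults; with the fix from #39383 it should
# raise an explicit error instead. Declaring "value" as pa.large_string() avoids
# the 32-bit offset limit altogether.
result = left.join(right, keys="key")
print(result.num_rows)
```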