[BUG] ROLLING_TEST branch 25.02 fails when built with static cuda runtime #18079
Comments
Title changed from "ROLLING_TEST branch 25.02 fails on GB100" to "ROLLING_TEST branch 25.02 fails on GB100 with CUDA 12.8"
Note that I didn't run the tests with other CUDA versions on GB100, nor on other similar chips with CUDA 12.8.
Can you provide the CUDA driver version?
Here it is:
That looks like the same version that shipped in CUDA 12.8, according to the release notes.
I should mention: the window-functions kernels that service the floating-point-range window functions are the same ones that deal with the integral types. Those have been unchanged for years. (Note that these are being reworked in #17807.)
Can you run these under compute-sanitizer?
Title changed from "ROLLING_TEST branch 25.02 fails on GB100 with CUDA 12.8" to "ROLLING_TEST branch 25.02 fails when built with static cuda runtime"
This isn't a Blackwell issue. I can reproduce the failure on any hardware when using a static cudart with CUDA 12.8 or 12.6. Current stack trace of the error from compute-sanitizer:
Updating here on what @robertmaynard uncovered: It looks like an issue with the tests. @robertmaynard found that the failing tests are using static initialization for device vectors used as inputs for the test. This causes the CUDA runtime context to be used before it is constructed (or after it is destroyed). I'll go over the window-function tests to see where (all) this might be the case.
The root issue is the Since those are static members, they get initialized at library/executable load time, which happens before the CUDA runtime is fully set up. Since they are My initial thought is to move these two member variables to
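To make the initialization-order problem described above concrete, here is a minimal sketch that does not touch CUDA at all. Everything in it is hypothetical: `fake_runtime` stands in for "the CUDA runtime is set up", `fake_column` stands in for a device-backed column, and `eager_inputs` / `lazy_inputs` are illustrative names, not the types from the cudf tests. It only shows that a static data member is built at load time, while a function-local static is built on first use.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for the CUDA runtime: a flag that a program would
// flip once setup has completed (an assumption for illustration only).
struct fake_runtime {
  static bool& ready()
  {
    static bool is_ready = false;  // constant-initialized, safe to read at load time
    return is_ready;
  }
};

// Hypothetical stand-in for a device-backed column. It records whether the
// "runtime" was ready at the moment the object was constructed.
struct fake_column {
  std::vector<int> data;
  bool built_while_ready;

  explicit fake_column(std::vector<int> v)
    : data(std::move(v)), built_while_ready(fake_runtime::ready())
  {
  }
};

// Hazard: a static data member is constructed during static initialization,
// before any code has had a chance to set up the runtime.
struct eager_inputs {
  static inline const fake_column values{std::vector<int>{1, 2, 3}};
};

// One possible alternative: a function-local static is constructed the first
// time the function is called, so construction can be delayed until setup is done.
struct lazy_inputs {
  static const fake_column& values()
  {
    static const fake_column v{std::vector<int>{1, 2, 3}};
    return v;
  }
};
```

In this sketch, reading `eager_inputs::values.built_while_ready` before the flag is set reports that the object was built too early, whereas `lazy_inputs::values()` called after `fake_runtime::ready() = true` reports a correctly timed construction. With a real CUDA runtime the early construction is not merely recorded; per the discussion above, it can surface as an error at test time.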
Fixes rapidsai#18079.

This commit fixes the failures reported in rapidsai#18079, where the use of static column vector objects in the tests causes the use of a CUDA runtime context before it's been initialized, causing the tests to fail with:

```
parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle
```

Signed-off-by: MithunR <mithunr@nvidia.com>
Fixes #18079.

This commit fixes the failures reported in #18079, where the use of static column vector objects in the tests causes the use of a CUDA runtime context before it's been initialized, causing the tests to fail with:

```
parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle
```

The solution is to switch the static column vectors to runtime, as a member of the test utility class `rolling_runner`.

Authors:
- MithunR (https://github.com/mythrocks)

Approvers:
- Bradley Dice (https://github.com/bdice)
- David Wendt (https://github.com/davidwendt)

URL: #18099
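The shape of that fix, holding the inputs as ordinary members of a helper object that is created inside the test body, can be sketched roughly as follows. This is not the actual `rolling_runner` from cudf: `runner_sketch`, `column_sketch`, and `runtime_ready` are hypothetical names, and a plain `std::vector<int>` stands in for device memory so the example compiles without CUDA.

```cpp
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for "the CUDA runtime is usable right now".
inline bool& runtime_ready()
{
  static bool ready = false;
  return ready;
}

// Hypothetical stand-in for a column wrapper backed by device memory.
// It is only "valid" if the runtime was ready when it was constructed.
class column_sketch {
 public:
  explicit column_sketch(std::vector<int> values)
    : values_(std::move(values)), valid_(runtime_ready())
  {
  }

  bool valid() const { return valid_; }
  const std::vector<int>& values() const { return values_; }

 private:
  std::vector<int> values_;
  bool valid_;
};

// Illustrative test helper that owns its inputs as non-static members.
// The members are constructed when the helper is constructed, i.e. inside
// the test body, after runtime setup. (Assumed design, not the cudf class.)
class runner_sketch {
 public:
  runner_sketch()
    : grouping_keys_(std::vector<int>{0, 0, 1, 1}),
      agg_values_(std::vector<int>{10, 20, 30, 40})
  {
  }

  bool inputs_valid() const { return grouping_keys_.valid() && agg_values_.valid(); }

  int sum_values() const
  {
    auto const& v = agg_values_.values();
    return std::accumulate(v.begin(), v.end(), 0);
  }

 private:
  column_sketch grouping_keys_;
  column_sketch agg_values_;
};
```

A test would then create a `runner_sketch` as a local variable once the runtime is up, so the inputs' lifetime is bounded by the test itself rather than by program load and unload. The trade-off is that the inputs are rebuilt for each helper instance instead of being shared across tests.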
I ran the cudf tests from branch 25.02 on a GB100 node in a CUDA 12.8 container and got these failures: