[BUG] ROLLING_TEST branch 25.02 fails when built with static cuda runtime #18079

Closed
ttnghia opened this issue Feb 24, 2025 · 9 comments · Fixed by #18099

@ttnghia
Contributor

ttnghia commented Feb 24, 2025

I ran the cudf tests from branch 25.02 on a GB100 node in a CUDA 12.8 container and got these failures:

[----------] 2 tests from GroupedRollingRangeOrderByFloatingPointTest/0, where TypeParam = float
[ RUN      ] GroupedRollingRangeOrderByFloatingPointTest/0.BoundedRanges
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByFloatingPointTest/0.BoundedRanges, where TypeParam = float (0 ms)
[ RUN      ] GroupedRollingRangeOrderByFloatingPointTest/0.UnboundedRanges
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByFloatingPointTest/0.UnboundedRanges, where TypeParam = float (0 ms)
[----------] 2 tests from GroupedRollingRangeOrderByFloatingPointTest/0 (0 ms total)

[----------] 2 tests from GroupedRollingRangeOrderByFloatingPointTest/1, where TypeParam = double
[ RUN      ] GroupedRollingRangeOrderByFloatingPointTest/1.BoundedRanges
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByFloatingPointTest/1.BoundedRanges, where TypeParam = double (0 ms)
[ RUN      ] GroupedRollingRangeOrderByFloatingPointTest/1.UnboundedRanges
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByFloatingPointTest/1.UnboundedRanges, where TypeParam = double (0 ms)
[----------] 2 tests from GroupedRollingRangeOrderByFloatingPointTest/1 (0 ms total)

[----------] 2 tests from GroupedRollingRangeOrderByDecimalTypedTest/0, where TypeParam = numeric::fixed_point<int, (numeric::Radix)10>
[ RUN      ] GroupedRollingRangeOrderByDecimalTypedTest/0.BoundedRanges
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByDecimalTypedTest/0.BoundedRanges, where TypeParam = numeric::fixed_point<int, (numeric::Radix)10> (0 ms)
[ RUN      ] GroupedRollingRangeOrderByDecimalTypedTest/0.UnboundedRanges
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByDecimalTypedTest/0.UnboundedRanges, where TypeParam = numeric::fixed_point<int, (numeric::Radix)10> (0 ms)
[----------] 2 tests from GroupedRollingRangeOrderByDecimalTypedTest/0 (0 ms total)

[----------] 2 tests from GroupedRollingRangeOrderByDecimalTypedTest/1, where TypeParam = numeric::fixed_point<long, (numeric::Radix)10>
[ RUN      ] GroupedRollingRangeOrderByDecimalTypedTest/1.BoundedRanges
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByDecimalTypedTest/1.BoundedRanges, where TypeParam = numeric::fixed_point<long, (numeric::Radix)10> (0 ms)
[ RUN      ] GroupedRollingRangeOrderByDecimalTypedTest/1.UnboundedRanges
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByDecimalTypedTest/1.UnboundedRanges, where TypeParam = numeric::fixed_point<long, (numeric::Radix)10> (0 ms)
[----------] 2 tests from GroupedRollingRangeOrderByDecimalTypedTest/1 (0 ms total)

[----------] 2 tests from GroupedRollingRangeOrderByDecimalTypedTest/2, where TypeParam = numeric::fixed_point<__int128, (numeric::Radix)10>
[ RUN      ] GroupedRollingRangeOrderByDecimalTypedTest/2.BoundedRanges
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByDecimalTypedTest/2.BoundedRanges, where TypeParam = numeric::fixed_point<__int128, (numeric::Radix)10> (0 ms)
[ RUN      ] GroupedRollingRangeOrderByDecimalTypedTest/2.UnboundedRanges
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByDecimalTypedTest/2.UnboundedRanges, where TypeParam = numeric::fixed_point<__int128, (numeric::Radix)10> (0 ms)
[----------] 2 tests from GroupedRollingRangeOrderByDecimalTypedTest/2 (0 ms total)

[----------] 8 tests from GroupedRollingRangeOrderByStringTest
[ RUN      ] GroupedRollingRangeOrderByStringTest.Ascending_Partitioned_NoNulls
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByStringTest.Ascending_Partitioned_NoNulls (0 ms)
[ RUN      ] GroupedRollingRangeOrderByStringTest.Ascending_NoParts_NoNulls
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByStringTest.Ascending_NoParts_NoNulls (0 ms)
[ RUN      ] GroupedRollingRangeOrderByStringTest.Ascending_Partitioned_WithNulls
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByStringTest.Ascending_Partitioned_WithNulls (0 ms)
[ RUN      ] GroupedRollingRangeOrderByStringTest.Ascending_NoParts_WithNulls
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByStringTest.Ascending_NoParts_WithNulls (0 ms)
[ RUN      ] GroupedRollingRangeOrderByStringTest.Descending_Partitioned_NoNulls
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByStringTest.Descending_Partitioned_NoNulls (0 ms)
[ RUN      ] GroupedRollingRangeOrderByStringTest.Descending_NoParts_NoNulls
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByStringTest.Descending_NoParts_NoNulls (0 ms)
[ RUN      ] GroupedRollingRangeOrderByStringTest.Descending_Partitioned_WithNulls
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByStringTest.Descending_Partitioned_WithNulls (0 ms)
[ RUN      ] GroupedRollingRangeOrderByStringTest.Descending_NoParts_WithNulls
unknown file: Failure
C++ exception with description "parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal" thrown in the test body.
[  FAILED  ] GroupedRollingRangeOrderByStringTest.Descending_NoParts_WithNulls (0 ms)
[----------] 8 tests from GroupedRollingRangeOrderByStringTest (0 ms total)

[----------] 3 tests from GroupedRollingErrorTest
[ RUN      ] GroupedRollingErrorTest.NegativeMinPeriods
[       OK ] GroupedRollingErrorTest.NegativeMinPeriods (0 ms)
[ RUN      ] GroupedRollingErrorTest.EmptyInput
[       OK ] GroupedRollingErrorTest.EmptyInput (0 ms)
[ RUN      ] GroupedRollingErrorTest.SumTimestampNotSupported
/rapids/cudf/cpp/tests/rolling/grouped_rolling_test.cpp:510: Failure
Expected: cudf::grouped_rolling_window( grouping_keys, input_D, 2, 2, 0, *cudf::make_sum_aggregation<cudf::rolling_aggregation>()) throws an exception of type cudf::logic_error.
  Actual: it throws thrust::system::system_error with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle".
/rapids/cudf/cpp/tests/rolling/grouped_rolling_test.cpp:514: Failure
Expected: cudf::grouped_rolling_window( grouping_keys, input_s, 2, 2, 0, *cudf::make_sum_aggregation<cudf::rolling_aggregation>()) throws an exception of type cudf::logic_error.
  Actual: it throws thrust::system::system_error with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle".
/rapids/cudf/cpp/tests/rolling/grouped_rolling_test.cpp:518: Failure
Expected: cudf::grouped_rolling_window( grouping_keys, input_ms, 2, 2, 0, *cudf::make_sum_aggregation<cudf::rolling_aggregation>()) throws an exception of type cudf::logic_error.
  Actual: it throws thrust::system::system_error with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle".
/rapids/cudf/cpp/tests/rolling/grouped_rolling_test.cpp:522: Failure
Expected: cudf::grouped_rolling_window( grouping_keys, input_us, 2, 2, 0, *cudf::make_sum_aggregation<cudf::rolling_aggregation>()) throws an exception of type cudf::logic_error.
  Actual: it throws thrust::system::system_error with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle".
/rapids/cudf/cpp/tests/rolling/grouped_rolling_test.cpp:526: Failure
Expected: cudf::grouped_rolling_window( grouping_keys, input_ns, 2, 2, 0, *cudf::make_sum_aggregation<cudf::rolling_aggregation>()) throws an exception of type cudf::logic_error.
  Actual: it throws thrust::system::system_error with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle".
[  FAILED  ] GroupedRollingErrorTest.SumTimestampNotSupported (0 ms)
[----------] 3 tests from GroupedRollingErrorTest (0 ms total)

[----------] 4 tests from GroupedRollingTest/0, where TypeParam = signed char
[ RUN      ] GroupedRollingTest/0.SimplePartitionedStaticWindowsWithGroupKeys
/rapids/cudf/cpp/tests/rolling/grouped_rolling_test.cpp:131: Failure
Expected: output = cudf::grouped_rolling_window( keys, input, preceding_window, following_window, min_periods, op) doesn't throw an exception.
  Actual: it throws thrust::system::system_error with description "parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle".
CMake Error at /rapids/cudf/cpp/build/rapids-cmake/run_gpu_test.cmake:34 (execute_process):
  execute_process failed command indexes:

    1: "Abnormal exit with child return code: Segmentation fault"
@ttnghia ttnghia added the bug label Feb 24, 2025
@ttnghia ttnghia changed the title from "[BUG] ROLLING_TEST branch 25.02 fails on GB100" to "[BUG] ROLLING_TEST branch 25.02 fails on GB100 with CUDA 12.8" Feb 24, 2025
@ttnghia
Contributor Author

ttnghia commented Feb 24, 2025

Note that I didn't run the tests with other CUDA versions on GB100, nor on other similar chips with CUDA 12.8.

@davidwendt
Contributor

Can you provide the CUDA driver version?
Some googling makes this sound like a mismatched CUDA driver.

@ttnghia
Contributor Author

ttnghia commented Feb 24, 2025

Here it is: Driver Version: 570.86.15 CUDA Version: 12.8.

@jakirkham
Member

That looks like the same version that shipped with CUDA 12.8, according to the release notes.

@mythrocks
Contributor

mythrocks commented Feb 25, 2025

I should mention: the window-functions kernels that service the floating-point-range window functions are the same ones that deal with the integral types. Those have been unchanged for years.

(Note that these are being reworked in #17807.)

@wence-
Contributor

wence- commented Feb 25, 2025

Can you run these under compute-sanitizer?

@robertmaynard robertmaynard changed the title from "[BUG] ROLLING_TEST branch 25.02 fails on GB100 with CUDA 12.8" to "[BUG] ROLLING_TEST branch 25.02 fails when built with static cuda runtime" Feb 25, 2025
@robertmaynard
Contributor

This isn't a Blackwell issue. I can reproduce the failure on any hardware when using a static cudart with CUDA 12.8 or 12.6.

Current stack trace of the error from compute-sanitizer:

[ RUN      ] TypedCollectListTest/0.BasicRollingWindow
========= Program hit cudaErrorInvalidResourceHandle (error 400) due to "invalid resource handle" on CUDA API call to cudaFuncGetAttributes.
=========     Saved host backtrace up to driver entry point at error
=========         Host Frame: cub::CUB_200700_750_NS::PtxVersion(int&, int) [0xfbfd06] in ROLLING_TEST
=========         Host Frame: cudf::test::(anonymous namespace)::generate_all_row_indices(int) [0xf66727] in ROLLING_TEST
=========         Host Frame: cudf::test::detail::expect_columns_equivalent(cudf::column_view const&, cudf::column_view const&, cudf::test::debug_output_level, int) [0xfb518d] in ROLLING_TEST
=========         Host Frame: TypedCollectListTest_BasicRollingWindow_Test<signed char>::TestBody() [0x462fbe] in ROLLING_TEST
=========         Host Frame: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) [0x1012550] in ROLLING_TEST
=========         Host Frame: testing::Test::Run() [0xffa925] in ROLLING_TEST
=========         Host Frame: testing::TestInfo::Run() [0xffaae4] in ROLLING_TEST
=========         Host Frame: testing::TestSuite::Run() [0xffac04] in ROLLING_TEST
=========         Host Frame: testing::internal::UnitTestImpl::RunAllTests() [0x1007f0b] in ROLLING_TEST
=========         Host Frame: testing::UnitTest::Run() [0xffadce] in ROLLING_TEST
=========         Host Frame: main [0x17c6de] in ROLLING_TEST
========= 

@mythrocks mythrocks self-assigned this Feb 25, 2025
@mythrocks
Contributor

Updating here on what @robertmaynard uncovered: It looks like an issue with the tests.

@robertmaynard found that the failing tests are using static initialization for device vectors used as inputs for the test. This causes the CUDA runtime context to be used before it is constructed (or after it is destroyed).

I'll go over the window-function tests to see where (all) this might be the case.
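
To make the hazard concrete, here is a minimal host-only sketch of the pattern described above. It does not use CUDA or cudf: `fake_runtime`, `fake_column`, `fixture_sketch`, and `make_column_after_init` are hypothetical stand-ins, and the real cudart initialization is not modeled. The only assumption is that the "runtime" becomes usable at some point after static initialization has already run.

```cpp
#include <cassert>

// Hypothetical stand-in for the CUDA runtime: "ready" only after init() runs.
struct fake_runtime {
  static bool& ready() {
    static bool flag = false;  // function-local static, safe to touch at any time
    return flag;
  }
  static void init() { ready() = true; }
};

// Hypothetical stand-in for a device-backed column wrapper. It records whether
// the runtime was ready at the moment the object was constructed.
struct fake_column {
  bool built_with_runtime;
  fake_column() : built_with_runtime(fake_runtime::ready()) {}
};

// The problematic pattern: a static data member is constructed during static
// initialization, before main() has had a chance to call fake_runtime::init().
struct fixture_sketch {
  static fake_column keys;
};
fake_column fixture_sketch::keys{};

// A column created after init() sees a ready runtime.
inline fake_column make_column_after_init() {
  fake_runtime::init();
  fake_column c;
  assert(c.built_with_runtime);
  return c;
}
```

In this sketch `fixture_sketch::keys` is always built before the runtime is ready, while a column built inside the test body (after `fake_runtime::init()`) is not. In the real failure the ordering is not guaranteed either way, which is why it only surfaced in some builds.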

@robertmaynard
Contributor

The root issue is the OffsetRowWindowTest class in offset_row_window_test.cpp, specifically the static _keys and _values member variables.

Since those are static members, they are initialized at library/executable load time, which happens before the CUDA runtime is fully set up. Because they are cudf::test::fixed_width_column_wrapper types, they require the CUDA runtime to be fully set up before construction. Since static initialization order across translation units is not defined in C++, this issue went undetected until the compiler/linker/whatever changed the order of loads.

My initial thought is to move these two member variables to rolling_runner as non-static members.
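
A rough sketch of that idea, using host-only stand-ins rather than the actual cudf test code: `fake_runtime`, `fake_column`, `runner_sketch`, and `sum_for_key` are hypothetical names, and the column contents are made up for illustration. The point is only that non-static members are constructed when the runner object is created inside the test body, by which time the runtime is assumed to be up.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the CUDA runtime.
struct fake_runtime {
  static bool& ready() {
    static bool flag = false;
    return flag;
  }
  static void init() { ready() = true; }
};

// Hypothetical stand-in for a column wrapper that needs the runtime at
// construction time; the assert mirrors the failure seen in the tests.
struct fake_column {
  std::vector<int> data;
  explicit fake_column(std::vector<int> values) : data(std::move(values)) {
    assert(fake_runtime::ready() && "column built before runtime init");
  }
};

// Sketch of the fix: the inputs are ordinary (non-static) members, so they are
// constructed only when a runner is instantiated inside a test body.
struct runner_sketch {
  fake_column keys{std::vector<int>{0, 0, 1, 1}};
  fake_column values{std::vector<int>{10, 20, 30, 40}};

  // Toy grouped aggregation so the members are actually used.
  int sum_for_key(int key) const {
    int total = 0;
    for (std::size_t i = 0; i < keys.data.size(); ++i) {
      if (keys.data[i] == key) { total += values.data[i]; }
    }
    return total;
  }
};
```

A test body would call `fake_runtime::init()` (standing in for whatever brings up the real runtime), then create a `runner_sketch` locally and use it. The trade-off is that the inputs are rebuilt for every runner instead of once per process.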

mythrocks added a commit to mythrocks/cudf that referenced this issue Feb 26, 2025
Fixes rapidsai#18079.

This commit fixes the failures reported in rapidsai#18079, where the use of static
column vector objects in the tests causes the use of a CUDA runtime context
before it's been initialized, causing the tests to fail with:
```
parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle
```

Signed-off-by: MithunR <mithunr@nvidia.com>
rapids-bot bot pushed a commit that referenced this issue Feb 26, 2025
Fixes #18079.

This commit fixes the failures reported in #18079, where the use of static column vector objects in the tests causes the use of a CUDA runtime context before it's been initialized, causing the tests to fail with:
```
parallel_for failed: cudaErrorInvalidResourceHandle: invalid resource handle
```

The solution is to replace the static column vectors with ones constructed at runtime, as members of the test utility class `rolling_runner`.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)

URL: #18099