ndrange provided in KernelAbstractions kernels is broken #283

Open
VarLad opened this issue Jan 24, 2025 · 27 comments

VarLad (Contributor) commented Jan 24, 2025:

I'm unable to pinpoint where the error is occurring, as debugging KernelAbstractions kernels is different from debugging normal Julia, so some guidance here would be helpful.
The issue is that some elements of the output array are 0.

I can't test this with any driver other than PoCL, since NVIDIA doesn't support SPIR-V, so it would be nice if someone could check whether it behaves differently with Intel drivers.
This probably points to a problem somewhere in our current code. The behavior is the same both before and after the USM PR.

Edit (most recent):

Here's a reproducer:

using OpenCL, pocl_jll, KernelAbstractions

@kernel inbounds=true function _mwe!(@Const(v))
    temp = @localmem Int8 (1,)
    i = @index(Global, Linear)
    @print i "\n"
    @synchronize()
end

v = CLArray(rand(Float32, 10))

_mwe!(OpenCLBackend(), 256)(v, ndrange=length(v))

This prints 1...256.

The CUDA version of the same code prints 1...10.

using CUDA

b = CuArray(rand(Float32, 10))

_mwe!(CUDABackend(false, false), 256)(b, ndrange=length(b))

The issue is probably that ndrange is not working.

Consequently, creating a CLArray whose size is a multiple of 256 works without any issues for the any and all functions, as well as for the merge_sort function.

maleadt (Member) commented Jan 24, 2025:

debugging KernelAbstractions kernels is different from normal Julia

printf debugging? You can always do Main.OpenCL.@printf or something from KA.jl/AK.jl when you have OpenCL imported in your main environment (even though KA.jl/AK.jl doesn't depend on it, hence going through Main).
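
For instance, a minimal sketch of what that could look like from inside a KA kernel (the exact macro name and format-string behavior of OpenCL.@printf are assumptions here, per the "or something" above; adjust to whatever OpenCL.jl actually provides):

using OpenCL, pocl_jll, KernelAbstractions

@kernel function _debug!(@Const(v))
    i = @index(Global, Linear)
    # go through Main because the package defining the kernel (KA.jl/AK.jl)
    # does not itself depend on OpenCL.jl:
    Main.OpenCL.@printf("global index = %d\n", Int32(i))
end

v = CLArray(rand(Float32, 10))
_debug!(OpenCLBackend(), 256)(v, ndrange=length(v))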

vchuravy (Member):

KA also has an @print

VarLad (Contributor, Author) commented Jan 27, 2025:

I think the issue comes from me trying to use the CPU as a GPU backend in AcceleratedKernels.jl.
Also, I just saw JuliaGPU/KernelAbstractions.jl#556.
I'll look into the issue again when the above PR lands :)

vchuravy (Member):

I think the issue comes from me trying to use CPU as a GPU backend in AcceleratedKernels.jl

Are you using CPU or OpenCLBackend?

VarLad (Contributor, Author) commented Jan 27, 2025:

OpenCLBackend with PoCL on the CPU

vchuravy (Member):

Okay, so that will be an issue for the default backend once JuliaGPU/KernelAbstractions.jl#556 lands.

maleadt (Member) commented Jan 27, 2025:

I think the issue comes from me trying to use CPU as a GPU backend in AcceleratedKernels.jl

I'm not sure what you mean by this. Every GPU construct in KA.jl should be supported by the OpenCL back-end, even when used on the CPU.

VarLad (Contributor, Author) commented Jan 27, 2025:

I meant that some algorithms are written with GPU-oriented defaults in mind.
For example, I was able to make the functions in this file work by setting the cooperation flag to false, which selects a less optimized implementation but one that gives correct results for PoCL on the CPU: https://github.com/JuliaGPU/AcceleratedKernels.jl/blob/main/src/truth.jl#L51
Without setting it to false, the any and all functions give incorrect outputs.
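
For reference, a sketch of that workaround (the keyword name cooperative and the exact call signature are assumptions based on the linked docstring; check truth.jl for the real API):

using OpenCL, pocl_jll, AcceleratedKernels

v = CLArray(rand(Float32, 1000))

# assumed keyword, per the docstring linked above; selects the less optimized
# but correct code path for PoCL on the CPU:
AcceleratedKernels.any(x -> x > 0.5f0, v; cooperative=false)
AcceleratedKernels.all(x -> x >= 0.0f0, v; cooperative=false)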

maleadt (Member) commented Jan 27, 2025:

That's unexpected; without looking into the implementation, non-UB code should execute equally well on an actual GPU as it does on PoCL's CPU back-end.

VarLad (Contributor, Author) commented Jan 27, 2025:

That's unexpected for me too, but the implementation also doesn't work on specific GPUs (Intel UHD 620 integrated graphics and some other Intel GPUs, as stated in the docstring I linked above), which is why the flag exists. The flag is set to false for oneAPI because global writes apparently hang on some Intel GPUs.

VarLad (Contributor, Author) commented Jan 27, 2025:

Okay, I think I know the issue. It's probably in OpenCL.jl.
Here's a reproducer:

using OpenCL, pocl_jll, KernelAbstractions

@kernel inbounds=true function _mwe!(@Const(v))
    temp = @localmem Int8 (1,)
    i = @index(Global, Linear)
    @print i "\n"
    @synchronize()
end

v = CLArray(rand(Float32, 10))

_mwe!(OpenCLBackend(), 256)(v, ndrange=length(v))

This prints 1...256.

The CUDA version of the same code prints 1...10.

using CUDA

b = CuArray(rand(Float32, 10))

_mwe!(CUDABackend(false, false), 256)(b, ndrange=length(b))

The issue is probably that ndrange is not working.

Consequently, creating a CLArray of size 256 works without any issues for the any and all functions, as well as for the merge_sort function.

VarLad changed the title from "Broken when used with AcceleratedKernels.jl" to "ndrange in KernelAbstractions is broken" on Jan 27, 2025
VarLad changed the title from "ndrange in KernelAbstractions is broken" to "ndrange provided in KernelAbstractions kernels is broken" on Jan 27, 2025
VarLad (Contributor, Author) commented Jan 28, 2025:

I think it's multiple issues:

  • merge_sort actually works just fine on Floats of all sizes, but Ints are an issue: some elements get replaced by 0. In the case of a CLArray of Integers of size 256, I notice that the elements are not 0, but the sorting is still wrong. The specific issue still needs to be pinpointed, although I'm guessing it is something similar to the second point (a minimal check is sketched after this comment).
  • ndrange not working as it should is still an issue, irrespective of types, and leads to wrong output (in the case of the any and all functions). Also, irrespective of types, sizes of 256 "just work" for the above two functions.

If the provided MWE doesn't make sense, or I misidentified the second issue, please let me know.
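
For the first point, a minimal check might look like this (a sketch, not taken from the issue; it assumes AcceleratedKernels exports merge_sort/merge_sort! as used elsewhere in this thread):

using OpenCL, pocl_jll, AcceleratedKernels

x = CLArray(rand(Int32, 100))           # integer input is what reportedly gets corrupted
y = AcceleratedKernels.merge_sort(x)    # copying variant; merge_sort!(x) sorts in place
@assert Array(y) == sort(Array(x))      # fails (zeros / wrong order) on the broken versions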

maleadt (Member) commented Jan 28, 2025:

Reduced to the common mistake of having barriers in divergent code:

function foobar(n)
    if get_local_id() <= n
        OpenCL.@show get_global_id()
        barrier()
    end
    return
end

@opencl local_size=16 global_size=16 foobar(10)

The problem is that the barrier is placed in a divergent section of code, which is undefined behavior. According to the OpenCL™ 1.2 specification, section 3.4.3:

[T]he work-group barrier must be encountered by all workitems of a work-group executing the kernel or by none at all.

The same applies to CUDA's __syncthreads/bar:

In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge).

@vchuravy This looks like a potential design flaw with KA.jl's implicit bounds checking?
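
For contrast, a version where every work-item reaches the barrier is well-defined (a sketch using the same OpenCL.jl intrinsics as the reduction above):

function foobar_fixed(n)
    if get_local_id() <= n
        OpenCL.@show get_global_id()
    end
    barrier()   # executed by all 16 work-items, so no UB
    return
end

@opencl local_size=16 global_size=16 foobar_fixed(10)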

vchuravy (Member):

Urgh, yeah...

Does that also mean:

cond && return
synchronize

is verboten? The funny thing is that KA only supports a top-level synchronize (it also has a conditional synchronize that is rarely used).

So I could actually lower this correctly...

maleadt (Member) commented Jan 28, 2025:

Does that also mean:

cond && return
synchronize

is verboten?

Correct, unless cond is warp-uniform.

If we want to keep the implicit bounds checking and fix this in lowering, I guess this means @kernel is to stay...
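
To make that concrete, here is a rough, hypothetical sketch of the two lowerings being discussed (not KA.jl's actual generated code): the current form places the barrier behind the implicit bounds check, while a fixed lowering of a top-level @synchronize would predicate only the user statements and keep the barrier unconditional.

# roughly what goes wrong today (hypothetical pseudo-lowering, 1-based indices for simplicity):
function lowered_bad(ndrange)
    i = get_global_id()
    if i <= ndrange            # implicit bounds check
        # ... user code ...
        barrier()              # divergent barrier: UB
    end
    return
end

# what a corrected lowering could look like:
function lowered_good(ndrange)
    i = get_global_id()
    if i <= ndrange
        # ... user code before @synchronize ...
    end
    barrier()                  # reached by every work-item
    if i <= ndrange
        # ... user code after @synchronize ...
    end
    return
end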

vchuravy (Member):

If we want to keep the implicit bounds checking and fix this in lowering, I guess this means @kernel is to stay...

sigh yeah. Let me fix this on 0.9 for now, and then we have to take a hard look at the semantics for v1.0.

maleadt (Member) commented Jan 28, 2025:

The same applies to CUDA's __syncthreads/bar:

In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge).

Actually, this has been fine since Volta: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#independent-thread-scheduling-7-x

Although __syncthreads() has been consistently documented as synchronizing all threads in the thread block, Pascal and prior architectures could only enforce synchronization at the warp level. In certain cases, this allowed a barrier to succeed without being executed by every thread as long as at least some thread in every warp reached the barrier. Starting with Volta, the CUDA built-in __syncthreads() and PTX instruction bar.sync (and their derivatives) are enforced per thread and thus will not succeed until reached by all non-exited threads in the block. Code exploiting the previous behavior will likely deadlock and must be modified to ensure that all non-exited threads reach the barrier.

Emphasis mine. Bailing out at the start of the kernel would exit the thread, meaning divergent synchronization on CUDA is fine.
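
A small CUDA.jl sketch of that early-exit pattern (not from the issue; it assumes an sm_70+ device, per the quoted documentation):

using CUDA

function early_exit!(out, n)
    i = threadIdx().x
    i > n && return        # excess threads exit the kernel here
    out[i] = Float32(i)
    sync_threads()         # only non-exited threads remain: fine on Volta and newer
    return
end

a = CUDA.zeros(Float32, 10)
@cuda threads=256 early_exit!(a, 10)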

vchuravy (Member) commented Feb 3, 2025:

@VarLad could you try JuliaGPU/KernelAbstractions.jl#558?

VarLad (Contributor, Author) commented Feb 3, 2025:

@vchuravy Trying with the latest commit on that PR: one of the issues has been solved; the AcceleratedKernels.any and AcceleratedKernels.all functions work, and the ndrange parameter works.

VarLad (Contributor, Author) commented Feb 3, 2025:

On the other hand, it makes the other issue worse.
Before this PR:

AcceleratedKernels.merge_sort actually works just fine on Floats of all sizes, but Ints are an issue: some elements get replaced by 0. In the case of a CLArray of Integers of size 256, I notice that the elements are not 0, but the sorting is still wrong. The specific issue still needs to be pinpointed, although I'm guessing it is something similar to the second point.

After the latest commit:

  • AcceleratedKernels.merge_sort just hangs indefinitely no matter the input type (Float, Int). I can't Ctrl+C Julia either, so I now have to kill the process. It seems that the issue happens when the result is printed. This is the MWE:
using KernelAbstractions, AcceleratedKernels, OpenCL

v = CLArray(rand(Float32, 10))
AcceleratedKernels.merge_sort!(v);
v
# or even this reproduces it:
AcceleratedKernels.merge_sort(v)

vchuravy (Member) commented Feb 3, 2025:

Do you have some time to reduce this to an MWE, so that I don't have to dig into AcceleratedKernels?

VarLad (Contributor, Author) commented Feb 3, 2025:

Do you have some time to reduce this to an MWE, so that I don't have to dig into AcceleratedKernels?

Would the weekend do? I'm sorry for the delay, but this week is a bit packed...

By the way, this probably isn't an OpenCL.jl-specific issue.
Running the same with CUDA.jl:

ERROR: a undefined variable error was thrown during kernel execution on thread (129, 1, 1) in block (1, 1, 1).
Stacktrace not available, run Julia on debug level 2 for more details (by passing -g2 to the executable).

Error showing value of type CuArray{Float32, 1, CUDA.DeviceMemory}:
ERROR: KernelException: exception thrown during kernel execution on device NVIDIA GeForce RTX 3050 Laptop GPU

vchuravy (Member) commented Feb 3, 2025:

Ah, that's a good hint. I think the error reporting on CUDA is more robust; that I can at least try.

vchuravy (Member) commented Feb 3, 2025:

If you have a second, can you run CUDA.@device_code dir="debug" AcceleratedKernels.merge_sort!(v); and send me a tarball of the "debug" directory on Slack?

VarLad (Contributor, Author) commented Feb 3, 2025:

Sent!

vchuravy (Member) commented Feb 4, 2025:

Thanks. This led me to JuliaGPU/AcceleratedKernels.jl#19.

VarLad (Contributor, Author) commented Feb 5, 2025:

@vchuravy Unfortunately, testing with the above AcceleratedKernels PR (and the latest commit in the KernelAbstractions PR) still doesn't fix the issue. I get the same error.
