`ndrange` provided in KernelAbstractions kernels is broken #283
printf debugging? You can always do |
KA also has an @print |
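As an aside, a minimal sketch of that device-side printing (hypothetical kernel name, reusing the backend and launch style that appears later in this thread; not code from the issue itself):

```julia
using KernelAbstractions, OpenCL, pocl_jll

# Hypothetical debug kernel: print the global linear index of each work-item.
@kernel function _debug_print(@Const(v))
    i = @index(Global, Linear)
    @print "work-item " i "\n"
end

v = CLArray(rand(Float32, 10))
_debug_print(OpenCLBackend(), 256)(v, ndrange=length(v))
```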
I think the issue comes from me trying to use CPU as a GPU backend in AcceleratedKernels.jl |
Are you using |
Okay, so that will be an issue for the default backend once JuliaGPU/KernelAbstractions.jl#556 lands. |
I'm not sure what you mean by this. Every GPU construct in KA.jl should be supported by the OpenCL back-end, even when used on the CPU. |
I meant it as in, some algorithms are written with defaults according to GPU specifications in mind. |
That's unexpected; without looking into the implementation, non-UB code should execute equally well on an actual GPU as it does on PoCL's CPU back-end. |
That's unexpected for me too, but the implementation isn't working on specific GPUs either (
Okay, I think I know the issue. It's probably in OpenCL.jl:

```julia
using OpenCL, pocl_jll, KernelAbstractions

@kernel inbounds=true function _mwe!(@Const(v))
    temp = @localmem Int8 (1,)
    i = @index(Global, Linear)
    @print i "\n"
    @synchronize()
end

v = CLArray(rand(Float32, 10))
_mwe!(OpenCLBackend(), 256)(v, ndrange=length(v))
```

This prints 1...256. The CUDA version of the same code prints 1...10:

```julia
using CUDA

b = CuArray(rand(Float32, 10))
_mwe!(CUDABackend(false, false), 256)(b, ndrange=length(b))
```

The issue is probably that `ndrange` is not working. Consequently, creating a CLArray of size 256 works without any issues, for the `any` and `all` functions, as well as the `merge_sort` function. |
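As an illustration only (my own sketch, not from the thread): guarding the body manually on the array length keeps the padding work-items of the 256-wide group silent regardless of whether ndrange is honored. Note the guard deliberately contains no @synchronize, since barriers in divergent code are undefined behavior, as discussed further down.

```julia
using OpenCL, pocl_jll, KernelAbstractions

# Hypothetical guarded variant of the MWE above.
@kernel function _mwe_guarded(@Const(v))
    i = @index(Global, Linear)
    if i <= length(v)      # only the first length(v) work-items do any work
        @print i "\n"
    end
end

v = CLArray(rand(Float32, 10))
_mwe_guarded(OpenCLBackend(), 256)(v, ndrange=length(v))
```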
I think it's multiple issues:
If the provided MWE doesn't make sense, or I misidentified the second issue, please let me know. |
Reduced to the common mistake of having barriers in divergent code:

```julia
function foobar(n)
    if get_local_id() <= n
        OpenCL.@show get_global_id()
        barrier()
    end
    return
end

@opencl local_size=16 global_size=16 foobar(10)
```

The problem is that the barrier is placed in a divergent section of code, which is undefined behavior. According to the OpenCL™ 1.2 specification, section 3.4.3:
Same applies to CUDA's
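For illustration, a minimal sketch of the non-divergent variant of the reducer above (my own rewrite, not code from the thread): the barrier is hoisted out of the branch so every work-item in the work-group reaches it uniformly.

```julia
using OpenCL, pocl_jll

function foobar_fixed(n)
    if get_local_id() <= n
        OpenCL.@show get_global_id()
    end
    barrier()   # reached uniformly by the whole work-group: no divergence
    return
end

@opencl local_size=16 global_size=16 foobar_fixed(10)
```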
@vchuravy This looks like a potential design flaw with KA.jl's implicit bounds checking? |
Urgh, yeah... Does that also mean:
Is verboten? The funny thing is that KA only supports top-level synchronize (it also has a conditional synchronize that is rarely used), so I could actually lower this correctly... |
Correct, unless
If we want to keep the implicit bounds checking and fix this in lowering, I guess this means |
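For context, a hedged sketch of what a "top-level" @synchronize looks like in a KA kernel (hypothetical example): the barrier sits directly in the kernel body rather than inside divergent control flow, which is what would let the lowering hoist it outside the implicit bounds check.

```julia
using KernelAbstractions

# Hypothetical kernel: the @synchronize is at the top level of the body, so it
# is uniform across the work-group; only the loads/stores around it would need
# the implicit bounds check.
@kernel function _staged_copy!(out, @Const(v))
    tmp = @localmem Float32 (256,)   # assumes a 256-wide workgroup
    li  = @index(Local, Linear)
    i   = @index(Global, Linear)
    tmp[li] = v[i]
    @synchronize()                   # top-level barrier
    out[i] = tmp[li]
end
```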
sigh yeah. Let me fix this on |
Actually, this has been fine since Volta: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#independent-thread-scheduling-7-x
Emphasis mine. Bailing out at the start of the kernel would exit the thread, meaning divergent synchronization on CUDA is fine. |
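A hedged CUDA.jl sketch of the "bail out at the start" pattern being described (hypothetical kernel, my own illustration): threads outside the range exit immediately and, per the linked guide, exited threads are not counted by sync_threads on Volta and later.

```julia
using CUDA

# Hypothetical kernel: out-of-range threads return right away; the remaining
# threads still hit the barrier, which is legal on Volta+.
function bail_out!(v, n)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    i > n && return
    v[i] += 1f0
    sync_threads()
    return
end

v = CUDA.rand(Float32, 10)
@cuda threads=256 bail_out!(v, length(v))
```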
@VarLad could you try JuliaGPU/KernelAbstractions.jl#558 |
@vchuravy Trying with the latest commit on that PR, |
On the other hand, it makes the other issue worse.
After the latest commit:

```julia
using KernelAbstractions, AcceleratedKernels, OpenCL

v = CLArray(rand(Float32, 10))
AcceleratedKernels.merge_sort!(v);
v
# or even this works:
AcceleratedKernels.merge_sort(v)
```
|
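A hedged way to sanity-check that result against the host (my own addition, assuming the same setup as the snippet above): compare with Base.sort and count zeros, since spurious zeros are the symptom described in the issue body.

```julia
using AcceleratedKernels, OpenCL, pocl_jll

v_host = rand(Float32, 10)
v = CLArray(copy(v_host))
AcceleratedKernels.merge_sort!(v)

@show issorted(Array(v))          # should be true
@show Array(v) == sort(v_host)    # false if elements were clobbered
@show count(iszero, Array(v))     # the issue reports spurious zeros
```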
Do you have some time to reduce this to a MWE, so that I don't have to dig into AcceleratedKernels? |
Does the weekend work? I'm sorry for the delay, but this week is a bit packed... By the way, this probably isn't an OpenCL.jl-specific issue.
|
Ah, that's a good hint. I think the error reporting on CUDA is more robust; that I can at least try, |
If you have a second, can you run with |
sent! |
Thanks. This led me to JuliaGPU/AcceleratedKernels.jl#19 |
@vchuravy unfortunately, testing with the above AcceleratedKernels PR (and latest commit in the KernelAbstractions PR) still doesn't fix the issue. I get the same error. |
I'm unable to pinpoint at what point the error is occurring, as debugging KernelAbstractions kernels is different from normal Julia, so some guidance here would be helpful.
The issue is that some elements of the output array are 0.
I can't test this with any driver other than PoCL either, since NVIDIA doesn't support SPIR-V, so it would be nice if someone could test whether it gives different behavior with Intel drivers.
This probably points to a problem in our current code somewhere. Behavior is the same both before and after the USM PR.
Edit:
Most recently:
Here's a reproducer:
This prints 1...256.
The CUDA version of the same code prints 1...10.
The issue is probably that ndrange is not working.
Consequently, creating a CLArray of size (multiple of) 256 works without any issues, for the `any` and `all` functions, as well as the `merge_sort` function.
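To make the size dependence concrete, a hedged sketch (my own, assuming AcceleratedKernels.any/all follow the Base-style (predicate, array) signature): the same calls on a 10-element and on a 256-element CLArray should behave differently if only multiples of 256 work.

```julia
using AcceleratedKernels, OpenCL, pocl_jll

for n in (10, 256)
    v = CLArray(rand(Float32, n))
    sorted_ok = issorted(Array(AcceleratedKernels.merge_sort(v)))
    any_ok    = AcceleratedKernels.any(x -> x > 0.5f0, v)   # assumed signature
    all_ok    = AcceleratedKernels.all(x -> x >= 0f0, v)    # assumed signature
    @show n sorted_ok any_ok all_ok
end
```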