Illegal Instruction vgatherpd from parallel_for #17767

Open
will-saunders-ukaea opened this issue Apr 1, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@will-saunders-ukaea

will-saunders-ukaea commented Apr 1, 2025

Describe the bug

Hello,

I am not sure whether this is the correct place to raise this issue; please feel free to point me to a more appropriate place if required. For example, is [1] also the repository for the Intel OpenCL CPU runtime? (Only GPU is mentioned there.)

We have SYCL parallel_for loops that produce illegal instruction errors with oneAPI. Unfortunately they are hard to extract from our code base into an MFE, because we have our own templated C++ loop abstraction on top of SYCL (see [2] for an example). I am happy to extract some SPIR-V and provide it if someone can point me at the incantation or a rough guide for doing so (on the assumption that the issue is in Intel OpenCL and that the SPIR-V is sufficient to reproduce it).

These loops run correctly with:

  • AdaptiveCpp, OpenMP library mode, OpenMP llvm accelerated, CUDA through LLVM, CUDA through nvcxx and generic backends.
  • oneAPI 2024.2.1 (and, I think, 2025.0.0) with Intel OpenCL (OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]) and CL_CONFIG_CPU_TARGET_ARCH=corei7-avx or older.

These loops fail when:

  • oneAPI 2024.2.1 (and, I think, 2025.0.0) with Intel OpenCL (OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]) and CL_CONFIG_CPU_TARGET_ARCH=core-avx2 or newer.
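
The pass/fail split above can be checked mechanically by running the same binary once per CL_CONFIG_CPU_TARGET_ARCH value and recording how the process ended. The following is a minimal sketch, not part of the original report: the helper names run_with_arch, describe and compare are hypothetical, and it assumes a POSIX system where a process killed by a signal reports a negative return code.

```python
import os
import signal
import subprocess
import sys


def run_with_arch(cmd, arch):
    """Run cmd on the Intel OpenCL CPU device with the given target arch."""
    env = dict(
        os.environ,
        ONEAPI_DEVICE_SELECTOR="opencl:cpu",
        CL_CONFIG_CPU_TARGET_ARCH=arch,
    )
    return subprocess.run(cmd, env=env).returncode


def describe(returncode):
    """Turn a subprocess return code into 'exit N' or a signal name."""
    if returncode < 0:
        # On POSIX a negative return code is the terminating signal number.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def compare(cmd, archs=("corei7-avx", "core-avx2")):
    """Run cmd once per arch value and collect the outcomes."""
    return {arch: describe(run_with_arch(cmd, arch)) for arch in archs}
```

Calling compare(["test/test_external_common"]) would be expected to report SIGILL for core-avx2 if the behaviour matches the lists above; the two arch values are the ones named in this report, and any others would need to be checked against the runtime's documentation.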

The symptom is an illegal instruction fault, which gdb-oneapi attributes to a vpgatherqq/vpgatherpqd instruction. Screenshots are attached from two completely different machines, one an Intel Raptor Lake and the other an AMD Zen4. On both machines the instruction's operands contain a duplicated register, which googling suggests is invalid. I do not know enough AVX2/AVX-512 assembly to verify that the duplicated register is actually the problem.

[1] https://github.com/intel/compute-runtime
[2] https://excalibur-neptune.github.io/NESO-Particles/dev/sphinx/html/concept/particle_loop.html#advection-example

Image
Image
Image

To reproduce

Extracting an MFE has been challenging. I am happy to run tools and provide more output if someone can point me at what to run. For example, is there a way I can dump the SPIR-V for the failing kernels?

Environment

$ icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/spack-v0.23/linux-ubuntu24.04-zen4/gcc-13.2.0/intel-oneapi-compilers-2024.2.1-j6iedmjftsc72gnksdvipiphofyez3ut/compiler/2024.2/bin/compiler
Configuration file: /opt/spack-v0.23/linux-ubuntu24.04-zen4/gcc-13.2.0/intel-oneapi-compilers-2024.2.1-j6iedmjftsc72gnksdvipiphofyez3ut/compiler/2024.2/bin/compiler/../icpx.cfg

$ ONEAPI_DEVICE_SELECTOR=opencl:cpu sycl-ls --verbose
INFO: Output filtered by ONEAPI_DEVICE_SELECTOR environment variable, which is set to opencl:cpu.
To see device ids, use the --ignore-device-selectors CLI option.

[opencl:cpu] Intel(R) OpenCL, AMD Ryzen Threadripper 7970X 32-Cores           OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]

Platforms: 1
Platform [#1]:
    Version  : OpenCL 3.0 LINUX
    Name     : Intel(R) OpenCL
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Type       : cpu
        Version    : OpenCL 3.0 (Build 0)
        Name       : AMD Ryzen Threadripper 7970X 32-Cores
        Vendor     : Intel(R) Corporation
        Driver     : 2024.18.7.0.11_160000
        Aspects    : cpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_oneapi_srgb ext_oneapi_native_assert ext_intel_legacy_image ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group
        info::device::sub_group_sizes: 4 8 16 32 64
default_selector()      : cpu, Intel(R) OpenCL, AMD Ryzen Threadripper 7970X 32-Cores           OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]
accelerator_selector()  : No device of requested type available. Please chec...
cpu_selector()          : cpu, Intel(R) OpenCL, AMD Ryzen Threadripper 7970X 32-Cores           OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]
gpu_selector()          : No device of requested type available. Please chec...
custom_selector(gpu)    : No device of requested type available. Please chec...
custom_selector(cpu)    : cpu, Intel(R) OpenCL, AMD Ryzen Threadripper 7970X 32-Cores           OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]
custom_selector(acc)    : No device of requested type available. Please chec...

Additional context

No response

@will-saunders-ukaea will-saunders-ukaea added the bug Something isn't working label Apr 1, 2025
@aelovikov-intel
Contributor

aelovikov-intel commented Apr 4, 2025

@qichaogu , @wenju-he , do you know who would be the best person to look into this? I think OCL CPU RT should start this investigation.

@0x12CC
Contributor

0x12CC commented Apr 9, 2025

Extracting an MFE has been challenging. I am happy to run tools and provide more output if someone can point me at what to run. For example, is there a way I can dump the SPIR-V for the failing kernels?

You can try compiling with the -fsycl-device-only flag to get the kernel sources:

clang++ -fsycl -fsycl-device-only <source.cpp>

This should produce an LLVM bitcode file for each kernel. You can then run them through llvm-dis to get the textual representation.

@will-saunders-ukaea
Author

will-saunders-ukaea commented Apr 10, 2025

Maybe I have done something useful...

I found the "Intercept Layer for OpenCL" project and ran

$ CLI_OpenCLFileName=/opt/spack-v0.23/linux-ubuntu24.04-zen4/gcc-13.2.0/intel-oneapi-compilers-2024.2.1-j6iedmjftsc72gnksdvipiphofyez3ut/compiler/2024.2/lib/libOpenCL.so LD_PRELOAD=/home/js0259/git-ukaea/opencl-intercept-layer/build/intercept/libOpenCL.so.1.2 gdb-oneapi -ex run --args /home/js0259/git-ukaea/opencl-intercept-layer/build/cliloader/cliloader --dump-source --dump-spirv --dump-kernel-isa-binaries --dump-output-binaries test/test_external_common

where test_external_common is my program that fails.

In gdb-oneapi I observe the illegal instruction (sorry for the screenshots):

Image

I then took the dumped objects and ran them through the disassembler provided with the intercept layer:

$ /home/js0259/git-ukaea/opencl-intercept-layer/scripts/disassemble_cpu.sh CLI_0000_1611CE8F_0000_9942277F_CPU.bin CLI_0000_1611CE8F_0000_9942277F_CPU.isabin CLI_0000_1611CE8F_0000_9942277F_CPU.isa

I observe in CLI_0000_1611CE8F_0000_9942277F_CPU.isa, on exactly line 18000, the assembly seen in gdb:

   13da4:	62 d1 fe 4c 7f 50 02 	vmovdqu64 %zmm2,0x80(%r8){%k4}
   13dab:	62 d1 fe 4a 7f 50 03 	vmovdqu64 %zmm2,0xc0(%r8){%k2}
   13db2:	62 f1 ed 48 72 e1 20 	vpsraq $0x20,%zmm1,%zmm2
   13db9:	c5 f1 ef c9          	vpxor  %xmm1,%xmm1,%xmm1
   13dbd:	62 f2 fd 49 93 04 05 	vgatherqpd 0x0(,%zmm0,1)/(bad),%zmm0{%k1}
   13dc4:	00 00 00 00
   13dc8:	c4 e1 f8 90 ca       	kmovq  %k2,%k1
   13dcd:	62 f2 fd 49 93 0c 05 	vgatherqpd 0x0(,%zmm0,1),%zmm1{%k1}
   13dd4:	00 00 00 00
   13dd8:	62 f1 fd 4a 11 4c 24 	vmovupd %zmm1,0x340(%rsp){%k2}
   13ddf:	0d
   13de0:	62 f1 fd 4c 11 44 24 	vmovupd %zmm0,0x300(%rsp){%k4}
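
To check the duplicated-register suspicion without reading AVX-512 assembly by eye, the raw bytes of the two gathers in the listing above can be decoded directly. This is a minimal sketch under stated assumptions: decode_evex_gather is a hypothetical helper that handles only the encoding shape seen here (EVEX prefix, memory operand through a SIB byte), and it relies on the standard EVEX layout in which R, R', X and V' are stored inverted. As far as I can tell from the Intel SDM entry for VGATHERQPS/VGATHERQPD, the EVEX form raises #UD when the destination register is the same as the index register.

```python
def decode_evex_gather(code: bytes) -> dict:
    """Extract destination, index and mask register numbers from an
    EVEX-encoded gather (sketch: SIB-addressed memory operand only)."""
    if code[0] != 0x62:
        raise ValueError("not an EVEX-encoded instruction")
    p0, p1, p2, opcode, modrm, sib = code[1:7]
    if (modrm >> 6) != 0 or (modrm & 7) != 4:
        raise ValueError("expected mod=00 with a SIB byte")
    # EVEX stores these extension bits inverted.
    r = ((p0 >> 7) & 1) ^ 1
    x = ((p0 >> 6) & 1) ^ 1
    r_hi = ((p0 >> 4) & 1) ^ 1
    v_hi = ((p2 >> 3) & 1) ^ 1
    return {
        "opcode": opcode,
        "w": (p1 >> 7) & 1,
        "mask": p2 & 7,
        # Destination: R':R:ModRM.reg
        "dest": (r_hi << 4) | (r << 3) | ((modrm >> 3) & 7),
        # Vector index: V':X:SIB.index
        "index": (v_hi << 4) | (x << 3) | ((sib >> 3) & 7),
    }


# Byte sequences copied from addresses 13dbd and 13dcd in the listing.
GATHERS = {
    "13dbd": bytes.fromhex("62 f2 fd 49 93 04 05 00 00 00 00"),
    "13dcd": bytes.fromhex("62 f2 fd 49 93 0c 05 00 00 00 00"),
}

for addr, raw in GATHERS.items():
    f = decode_evex_gather(raw)
    clash = "dest == index" if f["dest"] == f["index"] else "ok"
    print(f"{addr}: dest=zmm{f['dest']} index=zmm{f['index']} "
          f"mask=k{f['mask']} ({clash})")
```

For the instruction at 13dbd the destination and the index both decode to zmm0, which matches the operand that objdump flags with (bad); the one at 13dcd writes zmm1 from index zmm0 and does not have the clash.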

Hopefully the following files are attached in the zip:

-rw-rw-r-- 1 js0259 js0259 6.8M Apr 10 19:49 CLI_0000_1611CE8F_0000_9942277F_CPU.bin
-rw-rw-r-- 1 js0259 js0259 1.5M Apr 10 19:53 CLI_0000_1611CE8F_0000_9942277F_CPU.isa
-rw-rw-r-- 1 js0259 js0259 3.6M Apr 10 19:53 CLI_0000_1611CE8F_0000_9942277F_CPU.isabin
-rw-rw-r-- 1 js0259 js0259    3 Apr 10 19:49 CLI_0000_1611CE8F_0000_9942277F_options.txt
-rw-rw-r-- 1 js0259 js0259 5.2M Apr 10 19:49 CLI_0000_1611CE8F_0000.spv

test_external_common.zip

Edit: I realize I had CL_CONFIG_CPU_TARGET_ARCH unset, so I don't actually know which architecture is used for my machine (zen4). Here is the cpuinfo for reference:

processor	: 63
vendor_id	: AuthenticAMD
cpu family	: 25
model		: 24
model name	: AMD Ryzen Threadripper 7970X 32-Cores
stepping	: 1
microcode	: 0xa108105
cpu MHz		: 1720.012
cache size	: 1024 KB
physical id	: 0
siblings	: 64
core id		: 31
cpu cores	: 32
apicid		: 63
initial apicid	: 63
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc amd_ibpb_ret arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips	: 7987.36
TLB size	: 3584 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 52 bits physical, 57 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
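
The flags line above is long enough that the relevant entries are easy to miss. A minimal sketch for pulling out the vector-extension flags follows; cpu_flags and summarize are hypothetical helper names, and the list of flags checked is an illustrative choice. It only shows what the CPU advertises and does not tell us which target architecture the OpenCL CPU runtime selects when CL_CONFIG_CPU_TARGET_ARCH is unset.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the flags of the first processor entry in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def summarize(flags: set) -> dict:
    """Report which vector-extension flags are present."""
    wanted = ("avx", "avx2", "avx512f", "avx512dq", "avx512bw", "avx512vl")
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    for name, present in summarize(cpu_flags(text)).items():
        print(f"{name:10s} {'yes' if present else 'no'}")
```

The cpuinfo above lists avx2 as well as avx512f, avx512dq, avx512bw and avx512vl, which is consistent with the zmm registers seen in the disassembly.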

Please let me know if providing more or different output is useful.
Will
