@michel2323
Collaborator

No description provided.

@michel2323
Collaborator Author

@maleadt ocloc still segfaults, like in your past build #9631. Is that related to your open issue intel/compute-runtime#708? I don't see anyone else running into it, so I'll debug...

@michel2323
Collaborator Author

Created issue intel/compute-runtime#838. I think there's a device ID mismatch, even though I believe I used the recommended versions for the stack.

@maleadt
Member

maleadt commented Jul 24, 2025

> @maleadt ocloc still segfaults, like in your past build #9631. Is that related to your open issue intel/compute-runtime#708? I don't see anyone else running into it, so I'll debug...

Presumably. According to upstream it may be related to our build, which isn't unthinkable...

@giordano
Member

But I'd hope @michel2323 isn't using a system with the C++03 ABI.

@michel2323
Collaborator Author

Calling ocloc compiled with GCC 11, as currently committed.

From outside the container:

╭─michel@intel-G501 ~/git/Yggdrasil/N/NEO/build/x86_64-linux-gnu-cxx11-debug+true/dllflCJx ‹ms/NEO●› 
╰─$ LD_LIBRARY_PATH=/home/michel/git/Yggdrasil/N/NEO/build/x86_64-linux-gnu-cxx11-debug+true/dllflCJx/x86_64-linux-gnu-libgfortran5-cxx11-debug+true/destdir/lib:/usr/lib/csl-glibc-x86_64:/home/michel/git/Yggdrasil/N/NEO/build/x86_64-linux-gnu-cxx11-debug+true/dllflCJx/srcdir/compute-runtime/build/bin /home/michel/git/Yggdrasil/N/NEO/build/x86_64-linux-gnu-cxx11-debug+true/dllflCJx/srcdir/compute-runtime/build/bin/ocloc-25.27.1 -q -file /home/michel/git/Yggdrasil/N/NEO/build/x86_64-linux-gnu-cxx11-debug+true/dllflCJx/srcdir/compute-runtime/build/bin/built_ins/x64/xe3_core/bindless_copy_kernel_timestamps_ptl.spv -spirv_input -device 30.3.0 -heapless_mode disable -force_stos_opt -64 -stateful_address_mode bindless -output bindless_copy_kernel_timestamps_30_3_0 -output_no_suffix -out_dir /home/michel/git/Yggdrasil/N/NEO/build/x86_64-linux-gnu-cxx11-debug+true/dllflCJx/srcdir/compute-runtime/build/bin/built_ins/x64/xe3_core -internal_options -cl-intel-use-bindless-mode\ -cl-intel-use-bindless-advanced-mode -options -cl-kernel-arg-info
╭─michel@intel-G501 ~/git/Yggdrasil/N/NEO/build/x86_64-linux-gnu-cxx11-debug+true/dllflCJx ‹ms/NEO●› 

ocloc does not crash.

From inside the container (as here in buildkite):

sandbox:${WORKSPACE}/srcdir/compute-runtime/shared/source/built_ins/kernels # LD_LIBRARY_PATH=/workspace/x86_64-linux-gnu-libgfortran5-cxx11-debug+true/destdir/lib:/usr/lib/csl-glibc-x86_64:/workspace/srcdir/compute-runtime/build/bin /workspace/srcdir/compute-runtime/build/bin/ocloc-25.27.1 -q -file /workspace/srcdir/compute-runtime/build/bin/built_ins/x64/xe3_core/bindless_copy_kernel_timestamps_ptl.spv -spirv_input -device 30.3.0 -heapless_mode disable -force_stos_opt -64 -stateful_address_mode bindless -output bindless_copy_kernel_timestamps_30_3_0 -output_no_suffix -out_dir /workspace/srcdir/compute-runtime/build/bin/built_ins/x64/xe3_core -internal_options -cl-intel-use-bindless-mode\ -cl-intel-use-bindless-advanced-mode -options -cl-kernel-arg-info
Segmentation fault (core dumped)

Segfault.

I also tried GCC 14. There I don't even have to pass arguments.

Outside container:

╭─michel@intel-G501 ~/git/Yggdrasil/N/NEO/build/x86_64-linux-gnu-cxx11-debug+true ‹ms/NEO●› 
╰─$ export LD_LIBRARY_PATH="$PWD/VJJJt7yX/srcdir/compute-runtime/build/bin"
╭─michel@intel-G501 ~/git/Yggdrasil/N/NEO/build/x86_64-linux-gnu-cxx11-debug+true ‹ms/NEO●› 
╰─$ ./VJJJt7yX/srcdir/compute-runtime/build/bin/ocloc-25.27.1     

... normal output

Inside the container:

sandbox:${WORKSPACE}/srcdir/compute-runtime # LD_LIBRARY_PATH=/usr/lib/csl-musl-x86_64:/usr/lib/csl-glibc-x86_64:/usr/local/lib64:/usr/local/lib:/usr/lib64:/usr/lib:/lib64:/lib:/workspace/x86_64-linux-musl-cxx11/destdir/lib:/workspace/x86_64-linux-musl-cxx11/destdir/lib64:/opt/x86_64-linux-musl/x86_64-linux-musl/lib64:/opt/x86_64-linux-musl/x86_64-linux-musl/lib:/opt/x86_64-linux-gnu/x86_64-linux-gnu/lib64:/opt/x86_64-linux-gnu/x86_64-linux-gnu/lib:/workspace/destdir/lib64:/workspace/destdir/lib::/workspace/srcdir/compute-runtime/build/bin /workspace/srcdir/compute-runtime/build/bin/ocloc-25.27.1
/workspace/srcdir/compute-runtime/build/bin/ocloc-25.27.1: /usr/lib/csl-glibc-x86_64/libstdc++.so.6: version `GLIBCXX_3.4.31' not found (required by /workspace/srcdir/compute-runtime/build/bin/libocloc.so)
/workspace/srcdir/compute-runtime/build/bin/ocloc-25.27.1: /usr/lib/csl-glibc-x86_64/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /workspace/srcdir/compute-runtime/build/bin/libocloc.so)

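glibc mismatch (strictly, a libstdc++ one: the CSL's libstdc++.so.6 lacks GLIBCXX_3.4.31 and GLIBCXX_3.4.32).

How should I proceed here, @maleadt @giordano? I have to use at least GCC 11 because of the C++20 features they use, and I'm not sure I want to backport that.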

@maleadt
Member

maleadt commented Jul 25, 2025

The glibc error when using GCC 14 is weird; @giordano I take it our CSL is too outdated to handle GCC 14 binaries? It's probably not common that a recipe executes binaries during the build, but we have to, and NEO does not support musl (so we can't bootstrap).

Simply bumping GCC is not going to help because even the GCC 11 binaries crash within the container while they work outside. I guess that bumping glibc would work; can we do that by installing Glibc_jll.jl?

@giordano
Member

Updating the compiler libraries in the RootFS is what I was trying to do in JuliaPackaging/BinaryBuilderBase.jl#423, but got stuck with musl/glibc madness.

@maleadt
Member

maleadt commented Jul 25, 2025

@michel2323 Can you try reverting to GCC 11 while replacing the glibc that's used for execution (i.e. the one in /lib I think, see ldd ocloc) with one from https://github.com/JuliaBinaryWrappers/Glibc_jll.jl/releases? This is assuming the libc difference is the culprit, of course.

@michel2323
Collaborator Author

michel2323 commented Jul 25, 2025

I've started to play with it, but got stuck. Changing libc seems more involved than changing $LD_LIBRARY_PATH. I'll stop for today, but I think we're close. Where I got stuck is that the C++ libraries are not part of Glibc_jll. Could that be? If so, it would still load the CSL's C++ library and then go down that chain.
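
One way to swap the libc used for execution without touching the system loader is to start the program through the replacement glibc's own dynamic loader, which accepts a `--library-path` option. The sketch below is only an illustration: the helper name `run_with_glibc` is hypothetical, and it assumes the directory passed as the first argument contains both `ld-linux-x86-64.so.2` and the matching `libc.so.6`, which has not been verified for the Glibc_jll layout.

```shell
# Hypothetical helper: run a program through a specific glibc dynamic loader.
# $1: directory assumed to hold ld-linux-x86-64.so.2 and libc.so.6
# $2: colon-separated extra library directories (e.g. the CSL directory
#     that provides libstdc++.so.6 and libgcc_s.so.1)
# remaining arguments: path to the program, followed by its arguments
run_with_glibc() {
    glibc_dir=$1
    extra_libs=$2
    shift 2
    # The loader does not search PATH, so the program must be given as a path.
    "$glibc_dir/ld-linux-x86-64.so.2" \
        --library-path "$glibc_dir:$extra_libs" \
        "$@"
}
```

With the paths that appear in this thread, a call could look like `run_with_glibc /workspace/destdir/lib64 /usr/lib/csl-glibc-x86_64:/workspace/srcdir/compute-runtime/build/bin /workspace/srcdir/compute-runtime/build/bin/ocloc-25.27.1`. Listing the replacement glibc first keeps libc, libm, libdl and libpthread from one consistent build, while libstdc++ still comes from the CSL directory.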

@michel2323
Collaborator Author

@maleadt I don't know why one segfaults and the other doesn't. I've installed libc from Glibc_jll.

On my system, ocloc works fine with this linking.

╰─$ ldd /home/michel/git/Yggdrasil/N/NEO/build/x86_64-linux-gnu-cxx11-debug+true/kZY5Kf8C/srcdir/compute-runtime/build/bin/libocloc.so
        linux-vdso.so.1 (0x00007bc12cf88000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007bc12cf6b000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007bc12cf66000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007bc12c600000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007bc12ce6f000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007bc12ce42000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007bc12c200000)
        /lib64/ld-linux-x86-64.so.2 (0x00007bc12cf8a000)

In the container it fails:

sandbox:${WORKSPACE}/srcdir/compute-runtime/shared/source/built_ins/kernels # ldd /workspace/x86_64-linux-gnu-libgfortran5-cxx11-debug+true/artifacts/7b33755535302f3bb1a50dcc7443244d7d36c5bd/lib/libigc.so.2.14.0+0
        linux-vdso.so.1 (0x000073797d0f1000)
        librt.so.1 => /workspace/destdir/lib64/librt.so.1 (0x000073797d0e6000)
        libpthread.so.0 => /workspace/destdir/lib64/libpthread.so.0 (0x000073797d0e1000)
        libdl.so.2 => /workspace/destdir/lib64/libdl.so.2 (0x000073797d0dc000)
        libstdc++.so.6 => /usr/lib/csl-glibc-x86_64/libstdc++.so.6 (0x00007379723ec000)
        libm.so.6 => /workspace/destdir/lib64/libm.so.6 (0x000073797d002000)
        libgcc_s.so.1 => /usr/lib/csl-glibc-x86_64/libgcc_s.so.1 (0x000073797cfe5000)
        libc.so.6 => /workspace/destdir/lib64/libc.so.6 (0x00007379721f0000)
        ldd (0x000073797d0f3000)

I've checked objdump -p /usr/lib/csl-glibc-x86_64/libstdc++.so.6 | grep 'GLIBCXX' and I see no glaring incompatibility.
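
The manual `objdump -p ... | grep 'GLIBCXX'` check can be turned into a direct comparison between what a binary requires and what a given libstdc++ defines. This is a minimal sketch; the function name `missing_glibcxx` is made up for illustration, and it relies on `objdump -p` listing both version references and version definitions, which it does for ELF files.

```shell
# Hypothetical helper: print every GLIBCXX_* symbol version that a binary
# references but that the given libstdc++.so.6 does not define.
# $1: binary or shared library to inspect (e.g. libocloc.so)
# $2: the libstdc++.so.6 it will be loaded against
missing_glibcxx() {
    binary=$1
    libstdcxx=$2
    # Version references of the binary (what it needs).
    needed=$(objdump -p "$binary" | grep -o 'GLIBCXX_[0-9.]*' | sort -u)
    # Version definitions of libstdc++ (what it provides).
    provided=$(objdump -p "$libstdcxx" | grep -o 'GLIBCXX_[0-9.]*' | sort -u)
    for v in $needed; do
        # -x matches the whole line, -F treats the dots literally.
        printf '%s\n' "$provided" | grep -qxF "$v" || echo "$v"
    done
}
```

For example, `missing_glibcxx /workspace/srcdir/compute-runtime/build/bin/libocloc.so /usr/lib/csl-glibc-x86_64/libstdc++.so.6` prints one line per missing version and nothing when the library covers everything the binary asks for.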

@amontoison added the "oneAPI 1️⃣" label (Builders related to the oneAPI toolkit.) on Jul 30, 2025
Comment on lines +42 to +43
# Need C++20
CMAKE_FLAGS+=(-DCMAKE_CXX_STANDARD=20)
Member

This also sounds like a patch for upstream. Downstream packagers should never set the C++ standard, that's the developers' business.

Collaborator Author

michel2323 commented Jul 31, 2025

It is actually set here: https://github.com/intel/compute-runtime/blob/d0fdeb0339afaa6db37411e10c41f291945aa727/CMakeLists.txt#L323. So I'm a bit confused. I definitely get an error without setting it explicitly.

Member

That is indeed confusing 😕
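
One way to narrow down this kind of confusion is to look at the `-std=` flags the compiler actually receives. The sketch below assumes the build was configured with `-DCMAKE_EXPORT_COMPILE_COMMANDS=ON`, so that CMake writes a `compile_commands.json` into the build directory; the function name `count_std_flags` is hypothetical.

```shell
# Hypothetical helper: count each -std=... flag recorded in a
# compile_commands.json file, most frequent first.
# $1: path to compile_commands.json
count_std_flags() {
    grep -o -- '-std=[A-Za-z0-9+]*' "$1" | sort | uniq -c | sort -rn
}
```

Running `count_std_flags build/compile_commands.json` once with and once without the explicit `-DCMAKE_CXX_STANDARD=20` would show whether some targets fall back to an older standard when the flag is not passed.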

@michel2323
Collaborator Author

@maleadt @giordano Again some interesting developments. I still get the segfault, no matter which libc and libstdc++ I use. However, upon further investigation, it seems that ocloc finishes compiling successfully and only crashes when exiting. I'm starting to believe that this is a legitimate segfault, and that this system catches it.

@michel2323
Collaborator Author

michel2323 commented Aug 5, 2025

@giordano @maleadt

ocloc segfaults in the container no matter what. I tried every combination, certainly including the right one.

ocloc seems to crash while exiting. I checked, and it creates the required object files just fine. So I wrote a wrapper script that checks whether ocloc produced its output and, if so, ignores the segfault. It's dirty, but I don't know any other way at this point. I think the cause is a double free, and they might just be lucky that they never run into this issue on Ubuntu 22.04 and 24.04, which are the only platforms they support. I tried gdb. It crashes in an anonymous function, in a free, and I have no useful stack.
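
The workaround described here can be sketched as a small shell function. This is not the script from the PR, only an illustration of the idea: the name `run_tolerating_exit_crash` and the convention of passing the expected output file as the first argument are assumptions.

```shell
# Hypothetical wrapper: run a command and tolerate a crash at exit, but only
# if the expected output file was written.
# $1: file the command is expected to produce
# remaining arguments: the command and its arguments
run_tolerating_exit_crash() {
    expected_output=$1
    shift
    # Remove stale output so a file from an earlier run cannot mask a failure.
    rm -f "$expected_output"
    status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        return 0
    fi
    # A status above 128 means the process was killed by a signal
    # (139 = 128 + 11 for SIGSEGV).
    if [ "$status" -gt 128 ] && [ -s "$expected_output" ]; then
        echo "warning: '$1' exited with status $status after writing $expected_output; ignoring" >&2
        return 0
    fi
    return "$status"
}
```

Ordinary non-zero exit codes and crashes that leave no output still fail the build, so the wrapper only hides the specific case reported in this thread: the result is complete and the process dies afterwards.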

@maleadt
Member

maleadt commented Aug 5, 2025

That's really annoying... Thanks for looking into it though. If we do end up with the required files, I guess the workaround is fine as we never use ocloc at run time anyway.

@maleadt maleadt merged commit 4329292 into JuliaPackaging:master Aug 5, 2025
7 checks passed