Fix segfaults when using CUDA #1397
Summary: switch from using xxd to bin2c when generating the .ptx.c files so that the PTX data can be null-terminated.

In newer drivers or CUDA versions, vmaf now segfaults when trying to do anything from the GPU. The coredumps indicate that the crash happens somewhere inside the cuModuleLoadData calls in init_fex_cuda.

Documentation for cuModuleLoadData states that its `image` argument can be "obtained by mapping a cubin or PTX or fatbin file, [or] passing a cubin or PTX or fatbin file as a NULL-terminated text string...". It looks like VMAF is trying to do the latter, encoding PTX text files as an ASCII string using xxd, but there's no null terminator in the data because nothing asked for one.

I'm a CUDA noob and don't know how this ever worked on older driver versions, but I tried editing the .ptx.c files by hand to add 0x00 bytes at the end and it worked!

Switch from xxd to bin2c (distributed with the cuda-nvcc package), which supports a `--padd` option to add a null byte to the PTX data, eliminating the segfaults. The arrays got renamed slightly to remove the src_ prefix, since bin2c doesn't do any automatic naming of the output array.
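To make the missing-terminator problem concrete, here is a minimal sketch of what a generator has to emit for the embedded PTX to be usable as a NULL-terminated string. This is not the tool the PR uses (the PR switches to bin2c). The function name `to_c_array`, its parameters, and the array name `example_ptx` are hypothetical and exist only for illustration.

```python
def to_c_array(data: bytes, name: str, null_terminate: bool = True) -> str:
    """Render raw bytes as a C `unsigned char` array definition.

    Hypothetical helper for illustration only. With null_terminate=True a
    single 0x00 byte is appended, which is what a C API expecting a
    NULL-terminated text string needs. With null_terminate=False the output
    contains only the file's bytes, as in the arrays described above.
    """
    if null_terminate and not data.endswith(b"\x00"):
        data += b"\x00"

    # Emit 12 bytes per row as hex literals.
    rows = []
    for i in range(0, len(data), 12):
        chunk = data[i:i + 12]
        rows.append("  " + ", ".join(f"0x{b:02x}" for b in chunk))

    body = ",\n".join(rows)
    return f"const unsigned char {name}[] = {{\n{body}\n}};\n"


if __name__ == "__main__":
    # A tiny stand-in for PTX text; real PTX files are much larger.
    ptx = b".version 8.0\n"
    print(to_c_array(ptx, "example_ptx"))
```

With `null_terminate=False`, the final byte of the array is the last byte of the PTX text (a newline here), so anything that scans for a terminator reads past the end of the array.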
|
Thanks for the contribution! @kylophone is this something you could easily test? |
|
I have the same issues as described in #1357 using the latest Nvidia driver and CUDA, and this fix is working for me. If testing is a blocker for this PR, I'm sharing the tests I've done to move this forward. On master, both running vmaf_cuda using ffmpeg and vmaf's CUDA unit tests are crashing due to Tested this on: NVIDIA GeForce RTX 3060, Driver Version: 570.86.16, CUDA Version: 12.8 With the fix (rebased to upstream master) the unit tests are passing: ffmpeg: Without the fix ffmpeg is crashing at With the fix: |
|
@aswild just to make sure, is |
|
Actually, nvm, it's looking like it's not since that bin2c does not support a lot of arguments. |
|
It seems there's a lot of implementation of |
|
@1480c1 yeah This part of the meson configuration step is only enabled when the user explicitly builds with |
This should resolve #1357