Fix segfaults when using CUDA #1397

aswild · 2024-12-09T18:01:54Z

Summary: switch from using xxd to bin2c when generating the .ptx.c files so that the PTX data can be null-terminated.

With newer drivers or CUDA versions, vmaf now segfaults when trying to do anything on the GPU. The coredumps indicate that the crash happens somewhere inside the cuModuleLoadData calls in init_fex_cuda.

Documentation for cuModuleLoadData states that its image argument can be "obtained by mapping a cubin or PTX or fatbin file, [or] passing a cubin or PTX or fatbin file as a NULL-terminated text string...". It looks like VMAF is trying to do the latter, encoding PTX text files as an ASCII string using xxd, but there's no null-terminator in the data because nothing asked for one.

I'm a CUDA noob and don't know how this ever worked on older driver versions, but I tried editing the .ptx.c files by hand to add 0x00 bytes at the end and it worked!
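To illustrate the difference, here is a minimal sketch in plain C with no CUDA dependency. The array names and contents are made up for the example and are not the real generated .ptx.c files. The first array mimics an embedded text file with no terminator, the second mimics the same bytes with one 0x00 appended, and the helper is a hypothetical check, not anything in VMAF.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an embedded PTX array without a terminator:
 * just the bytes of the text ".version". */
static const unsigned char ptx_unterminated[] = {
    0x2e, 0x76, 0x65, 0x72, 0x73, 0x69, 0x6f, 0x6e
};

/* Same bytes with a single 0x00 appended, which is what the hand edit
 * described above amounts to. */
static const unsigned char ptx_terminated[] = {
    0x2e, 0x76, 0x65, 0x72, 0x73, 0x69, 0x6f, 0x6e, 0x00
};

/* Returns 1 if a 0x00 byte occurs within the first len bytes of buf,
 * 0 otherwise. Never reads past len, so it is safe on either array. */
static int is_null_terminated(const unsigned char *buf, size_t len)
{
    return memchr(buf, 0, len) != NULL;
}
```

Calling `is_null_terminated(ptx_unterminated, sizeof ptx_unterminated)` returns 0, so anything that scans that array for a terminator (as a consumer of a NULL-terminated text string would) can run past the end of the array. The padded array returns 1 and is safe to treat as a C string.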

Switch from xxd to bin2c (distributed with the cuda-nvcc package), which supports a --padd option to append a null byte to the PTX data, eliminating the segfaults. The arrays are renamed slightly to remove the src_ prefix, since bin2c doesn't automatically name the output array.

This should resolve #1357

nilfm99 · 2024-12-09T19:05:09Z

Thanks for the contribution! @kylophone is this something you could easily test?

nilfm99 requested a review from kylophone December 9, 2024 19:05
