PTX backend crash on larger (?) input programs #70
Comments
Could be fixed in #97
This is hard to test; both on jizo (4090) and on my desktop (1050), the occasional cuda exceptions in the reproducer in the issue description above are absolutely drowned out by segfaults inside libcuda.so:
I can, very occasionally, reproduce a different error with
and I haven't seen that yet with the scan-syncthreads branch, but due to the low frequency that doesn't prove much.
I just tried again. After a whole bunch of libcuda.so crashes, I tried running the executable under It does seem that running the program under The syncthreads call probably fixes a bug, but it clearly doesn't fix everything.
I am submitting a...
Description
The PTX backend sometimes crashes or gives other unrecoverable errors (like a "CUDA Exception: Invalid argument") for larger (?) input programs.
In the repository https://github.com/tomsmeding/acc-gpu-crash (commit as of the time of writing: https://github.com/tomsmeding/acc-gpu-crash/tree/d3df383c685f19c5bba8d8a1959ee048c51b0361) there are two programs in the modules N1 (smaller) and N2 (larger) that produce various kinds of crashes on different machines. N2 reproducibly crashes, while N1 seems to run fine on certain machines while crashing on others. In the linked commit, Main.hs runs the program from N2. Both N1 and N2 run fine in the interpreter.
The repository includes a script test.sh that builds the program using stack and runs it under cuda-memcheck until it returns with a non-zero exit code. (Environment variables for Jizo have been included.)
For posterity, the source files are included here in spoilers.
N1.hs
N2.hs
Expected behaviour
No crash.
Current behaviour
Crash (non-deterministically on some machines).
Steps to reproduce (for bugs)
git clone https://github.com/tomsmeding/acc-gpu-crash
cd acc-gpu-crash
./test.sh
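For readers who want to adapt the reproduction to another machine, the "run until it returns with a non-zero exit code" behaviour described for test.sh can be sketched roughly as below. This is not the actual contents of test.sh; the helper name run_until_failure and the variables RUNS and LAST_STATUS are made up for illustration, and the program name in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT the real test.sh): run a command repeatedly
# until it exits with a non-zero status, then record how many runs it
# took and what the failing exit code was.
run_until_failure() {
    local runs=0 status=0
    while true; do
        runs=$((runs + 1))
        if "$@"; then
            continue            # success: try again
        else
            status=$?           # exit status of the failed command
            break
        fi
    done
    echo "failed on run $runs with exit code $status" >&2
    RUNS=$runs
    LAST_STATUS=$status
}

# Assumed usage; ./my-program stands in for the built executable:
#   run_until_failure cuda-memcheck ./my-program
```

Whether a memory checker reports detected errors through its own exit code depends on how it is invoked, so the wrapper only catches whatever non-zero status the wrapped command actually returns.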
Your environment
nvidia-device-query: see below
nvidia-device-query on my Arch Linux machine
nvidia-device-query on Jizo
nvidia-device-query on Robbert's Arch Linux (Manjaro, really) machine