gfx1103 (7840U): HW Exception by GPU node-1 #141
Some ideas. There are a bunch of environment variables you can set to enable debugging.

I'm not familiar with how the setup on mobile works, but try to ensure that the GPU you're doing compute on is not also the one driving the primary display. This should work fine (generally the worst that can happen is running out of memory, not a crash), but it's still noise you're better off without. If some fancy desktop compositor is tickling the card in a way that's not appropriately combined with compute, this could cause the driver to choke. That would still generally be a driver bug, as that should never trigger a reset, but such things do happen.

Lastly, although this doesn't really sound like a hardware problem given the random nature of the crash, you can still try to rule out the hardware itself.
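As a concrete starting point, here is a minimal sketch of how such debug settings might be wired into a PyTorch run. The specific variables shown (AMD_LOG_LEVEL, AMD_SERIALIZE_KERNEL) and their values are assumptions about what the commenter had in mind, not taken from this thread; the main point is that they must be set before the HIP runtime initializes, i.e. before the first torch import.

```python
# Hypothetical debug setup: variable choices and values are assumptions,
# not quoted from the thread. Set them before torch is imported so the
# HIP runtime picks them up.
import os

os.environ.setdefault("AMD_LOG_LEVEL", "4")         # more verbose runtime logging (assumed meaning of level 4)
os.environ.setdefault("AMD_SERIALIZE_KERNEL", "3")  # serialize kernel launches to help localize faults

import torch  # import only after the environment is prepared

device = torch.device("cuda")  # ROCm builds of PyTorch expose the GPU through the "cuda" API
x = torch.randn(1024, 1024, device=device)
y = x @ x
torch.cuda.synchronize()
print(y.sum().item())
```

Running with the logging enabled and stderr redirected to a file keeps the (potentially very large) runtime log separate from the program's own output.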
Thanks so much for the reply. I will definitely dive in with the tools you mentioned tomorrow. As for the primary display, the machine is running in server mode; I'm SSH-ing in to run console commands. And while trying to make a minimum demonstrating script, I was able to crank the compute to the max. I even used the same model and FFT functions that are at the heart of my application and couldn't get it to crash. But the rest of the code is much more complex. So while I'm coming at it from both directions, it's still taking me a long time to unravel and isolate pieces of code to test.
It sounds like one or the other piece might be introducing some kind of memory corruption or resource exhaustion that then catches up with the other operations. Unfortunately such things are notoriously hard to debug, since the offending operation isn't necessarily the one that crashes. However, if you have cases where it crashes right up front, those should at least minimize the amount of logging/tracing you have to trawl through; it's just a matter of retrying.
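One way to act on this is to force a device synchronization after every suspect call, so the failure surfaces near the operation that actually submitted the bad work rather than at some later, unrelated call. This is only a sketch; the helper name and the placeholder pipeline below are illustrative, not the original code.

```python
# Sketch: because GPU work is launched asynchronously, the Python call that
# raises is not always the one that submitted the faulty work. Synchronizing
# after each step makes the error surface closer to its source.
import torch

def checked(name, fn):
    out = fn()
    torch.cuda.synchronize()   # flush pending GPU work before moving on
    print(f"ok: {name}")
    return out

def run_pipeline(device):
    # placeholder steps standing in for the real application code
    x = checked("randn", lambda: torch.randn(4, 1 << 20, device=device))
    y = checked("fft",   lambda: torch.fft.rfft(x))
    z = checked("ifft",  lambda: torch.fft.irfft(y, n=x.shape[-1]))
    return z

if __name__ == "__main__":
    run_pipeline(torch.device("cuda"))
```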
One thing to try is to build the very latest kernel from git (6.11-rc4), as there are quite a few fixes. If you have some code you could share that will very likely trigger the problem, that would help the testing. I have a feeling that if the problem persists even with the latest kernel, the problem is either on the kernel side or in the userspace code that communicates with the kernel to send work there and receive responses. I may have somewhere some old notes for tracing this kind of issue, from when I traced a similar type of problem a long time ago with a 2400G/Vega APU.
Installed the latest kernel. No luck. Turned on logging; this is the error-level log. The GPU hang doesn't appear in the error log when it happens. I'm still parsing through the "everything" log, but maybe something jumps out at you.
I'm still having trouble isolating the problem even just to collect a log from a single command that hangs (otherwise it's megabytes of text). But I'm still working on it. I'll send some code when I finally get it down to a reasonable enough length to be readable.
Well, it's good to know that the fix is not already there in the new kernel.
I've had kernels and other Ubuntu versions fully lock up. On the versions I'm using now, the GPU is able to recover, though of course the full Python process is killed.
Ok, here are two Level 4 logs: one in which the crash occurs almost immediately, and another which gets past the crash point (without the stuff past the crash point). I'm looking through them now. Let me know if you think it would be useful for me to go through and match the two logs line by line.
I ended up going through and matching the logs anyway. Here's the Google Sheet with the comparison. The two match up pretty substantially, apart from a few discrepancies.
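For anyone attempting the same comparison, a small script along these lines can line up two logs automatically instead of matching them by hand; the file names here are placeholders, not the actual attachments from the thread.

```python
# Hypothetical log-diff helper; crash_level4.log and success_level4.log
# are placeholder names for the two Level 4 captures.
import difflib

with open("success_level4.log") as f:
    success = f.readlines()
with open("crash_level4.log") as f:
    crash = f.readlines()

diff = difflib.unified_diff(success, crash,
                            fromfile="success", tofile="crash", n=3)
with open("level4_diff.txt", "w") as out:
    out.writelines(diff)
```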
EDIT: I eliminated more code. Ok, sorry, I got a bit sidetracked on this. Here is the minimum code to cause the crash (files):
Hopefully this makes it easy to diagnose the issue.
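The attached files themselves are not reproduced in this thread. As a rough, hypothetical sketch of the kind of workload described in the next comment (a numpy.random call interleaved with STFT/ISTFT round-trips on the ROCm device), it might look something like this; the shapes, FFT parameters, and iteration count are made up.

```python
# Hypothetical reconstruction of the reproducer's shape, not the original files.
import numpy as np
import torch

device = torch.device("cuda")
n_fft, hop = 1024, 256
window = torch.hann_window(n_fft, device=device)

for i in range(10_000):
    # host-side random data, analogous to the numpy.random line discussed below
    host = np.random.rand(1, n_fft * 64).astype(np.float32)
    x = torch.from_numpy(host).to(device)

    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                      center=True, return_complex=True)
    y = torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window, center=True)

    torch.cuda.synchronize()
    if i % 100 == 0:
        print(i, float(y.abs().mean()))
```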
First off, no matter how long I run it, if that numpy.random line isn't in there, the script doesn't crash. What could that possibly mean?

Also, it looks like there seem to be two separate crashes. One comes on a malloc; the other one seems to come on some sort of synchronization/lock/barrier.
Just worth mentioning: it seems there are major AMDGPU changes happening in recent Linux kernel updates, so it's probably best to wait before trying any more diagnosing of such issues.
Thanks, I agree. I have not really had much time to test this directly, except by building the 6.11-rc4 through rc6 and final kernels. Indirectly I did some work on this by adding omnitrace to the builds in the hope it could be useful. At the moment I have done some basic tracing tests with it on some test apps and have been able to generate trace files that work in the Perfetto UI. (Our omnitrace uses the latest version of Perfetto, and that resolved the trace-viewing problems that the upstream ROCm SDK release has with the Perfetto UI.) But it could take some time to figure out how to use omnitrace in a way that can catch this bug. That tool really takes some time to learn to configure and use properly.
So here's another clue. These are the errors I get when running PyTorch:
@jrl290 Thanks for the great test cases and traces. I think I now have a fix for this; your test case has been running in a loop for multiple hundred rounds without crashing, while earlier it usually got stuck within the first 30-40 rounds. Unfortunately my fix requires patching the kernel, and I still need to investigate a little more whether it has side effects, or whether I could do it in some other way. It had been some years since I last looked at the amdkfd code before this weekend, so I need to study this a little more for testing before pushing the fix out. I also received an older gfx1010 card which is suffering from a somewhat similar type of problem, so hopefully I can get that one fixed as well. (I have not yet tested the fix on that GPU.)
Wow, very cool! I actually ended up offloading the major AI processing to one of the new M4 Mac Minis. It is a good 2-3 times faster. The other machine is still a part of the process, just doing more CPU work while the M4 is dedicated to the AI side. I am very curious to know what you found the problem to be, and I'll be happy to test when it's ready.
@jrl290 Attached is the new version of your test case. It's basically the same, just with small helper changes, without modifying your original logic.
@jrl290 Here is the link to the kernel fix. It took a while, as I tried a couple of different ways to fix it, but this was basically the only one I could get to work. https://github.com/lamikr/linux/tree/release/rocm_612_gfx1102_fix I use this script and kernel config in my own testing. I also submitted the patch to the kernel mailing list and put your ID there for credit for the good test case. https://lists.freedesktop.org/archives/amd-gfx/2024-November/117242.html
That is very cool! I've never had any part in contributing to such a project before. My Linux kung-fu is not that strong, so it'll take me a while to figure out building and patching the kernel (v6.12 doesn't have an amd64 build available for some reason). I will report back when I have figured it out.
These should be easy steps:
That should handle everything from building to installing. The script will create the ../b_6_12_0 directory for storing build files. If the build is successful, it will ask for the sudo password before installing the kernel modules under the /lib/modules directory and the kernel itself to the /boot directory. Then just reboot and select the 6.12+ kernel from the list of kernels to boot.
Ran all of my use cases a few times and it is looking good! Way to go! I'll let you know if anything weird pops up. Cheers!
Thanks for confirming that things work.
I'm still having this random GPU Hang on my 7840U (gfx1103) and not on my 6800U (forced to gfx1030):
HW Exception by GPU node-1 (Agent handle: 0x5ab48bbcc960) reason :GPU Hang
I've been racking my brain to figure out what's causing it: deleting sections of my code, trying to build a minimum crashing sample to provide. But sometimes it takes running many iterations of the processing I'm doing, and sometimes it crashes right up front. There's a lot of code to go through, so I'm still trying to narrow things down. But my guess is that the crash occurs as a result of the state of the GPU rather than the actual instruction, which makes things much trickier.
Maybe there's something much more obvious to you, or an easier way to track down the issue.
Some commands it has crashed on:
torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length, window=window, center=True,return_complex=False).to(device)
torch.zeros([*batch_dims, c, n - f, t]).to(device)
torch.istft(x, n_fft=self.n_fft, hop_length=self.hop_length, window=window, center=True)
torch.cuda.synchronize()
Here's the kernel log with a few of these crashes.