HLT crashes in Run 380399 #44923
Comments
A new Issue was created by @trtomei. @makortel, @smuzaffar, @sextonkennedy, @rappoccio, @Dr15Jones, @antoniovilela can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Nota bene: the EcalRawToDigi may be a red herring; the original ELOG here mentions something completely different: The stack traces that I posted above are the ones I got from F3mon, though. |
For reference, here are the log files of the 2 jobs. As reported in the elog above, they both contain the following.
|
assign heterogeneous, reconstruction |
New categories assigned: heterogeneous,reconstruction @fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks |
So in both cases the assertion in cmssw/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h Lines 76 to 80 in 8391245
fails, after which the processes get stuck until they are killed 6-7 hours later. (Because of the external termination, the stack trace of only "the crashed thread" is meaningless on its own.) |
The state of |
FYI @AdrianoDee |
@trtomei, all circumstantial evidence points to vertex reconstruction. Can you edit the title to remove the |
For emphasis: external termination request means it timed out, likely due to a deadlock. The timeout signal will be delivered to thread 1, but to diagnose the deadlock we need to see the stack traces of all the threads. |
Looks like this happened again. One LS (run 380624, LS 411) could not be "closed" because one process was stuck and had to be terminated manually hours later. The full log (see below) shows again
similarly to #44923 (comment). |
There should be no assert on GPU in the first place. |
Maybe I am missing something
|
The asserts in CUDA were compiled out because they are slow. Of course the assert is still there on CPU but it will not print |
Currently we do have asserts in Alpaka code, because the code is new and IMHO it's better to get meaningful errors with a well-defined location than random failures. Within alpaka code:
The plan is to change all The fact that asserts may be causing deadlocks is a different matter: in principle, a failed Do we have a way to reproduce the problem offline? |
here is the old issue on CPU |
Did anybody reproduce the assert offline? |
This crash seems to have appeared again in collision run 381067. Attaching the error log to this comment. We were unable to reproduce the error after running the HLT config again on the error stream files. |
I do not see any sign of an assert in the log from run 381067: |
it's at line 36196 in the log attached:
|
@gparida if you tried to reproduce offline, can you post the script you used here? |
Note that (also) in this case, the job was stuck not processing data for a while, and was eventually killed. |
I tried reproducing it (
Also, tried to rerun the job with
|
as discussed at the last TSG meeting, I think that's a possible way forward. |
Thanks @mmusich ! |
instrument `fitVertices` to output more information when failing assert (issue #44923)
[14.0.X] instrument `fitVertices` to output more information when failing assert (issue #44923)
Looking at the two lines of
Adding
diff --git a/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h b/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h
index e2ba0b46b8be..be3b20563663 100644
--- a/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h
+++ b/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h
@@ -150,6 +150,8 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE::vertexFinder {
iv[it[k]] = igv;
}
+ // synchronise the threads before starting the next iteration of the loop of the groups
+ alpaka::syncBlockThreads(acc);
} // loop on vertices
}
before the end of the loop seems to make |
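The change in the diff above adds a block-wide barrier at the end of each loop iteration. As a rough host-side illustration of why such a barrier can matter, the sketch below mimics a block of cooperating threads with `std::thread`. It is not the CMSSW or alpaka code: `BlockBarrier`, `countMismatches`, and the shared-sum workload are hypothetical names invented for this example, and the barrier class only stands in for the role that `alpaka::syncBlockThreads(acc)` plays on the device.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical reusable barrier for a fixed number of host threads.
// It is only a stand-in for a device-side block barrier.
class BlockBarrier {
public:
  explicit BlockBarrier(int nThreads) : nThreads_(nThreads) {}

  // Blocks until all nThreads_ threads have called sync().
  void sync() {
    std::unique_lock<std::mutex> lock(mutex_);
    const unsigned long generation = generation_;
    if (++waiting_ == nThreads_) {
      // Last thread to arrive: open the barrier for this generation.
      waiting_ = 0;
      ++generation_;
      cv_.notify_all();
    } else {
      cv_.wait(lock, [&] { return generation_ != generation; });
    }
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  const int nThreads_;
  int waiting_ = 0;
  unsigned long generation_ = 0;
};

// Runs nThreads workers over nGroups iterations that share one scratch
// variable, and returns how many times a worker read an unexpected total.
int countMismatches(int nThreads, int nGroups) {
  assert(nThreads > 0 && nGroups >= 0);
  BlockBarrier barrier(nThreads);
  std::mutex sumMutex;
  int sharedSum = 0;  // shared scratch, reused by every iteration
  std::vector<int> mismatches(nThreads, 0);

  auto worker = [&](int tid) {
    for (int g = 0; g < nGroups; ++g) {
      if (tid == 0)
        sharedSum = 0;  // one thread resets the scratch for this group
      barrier.sync();   // make the reset visible before anyone accumulates
      {
        std::lock_guard<std::mutex> guard(sumMutex);
        sharedSum += (tid + 1) * (g + 1);
      }
      barrier.sync();  // all contributions are in before anyone reads
      const int expected = (g + 1) * nThreads * (nThreads + 1) / 2;
      if (sharedSum != expected)
        ++mismatches[tid];
      // End-of-iteration barrier: without it, thread 0 could reset
      // sharedSum for group g + 1 while other threads still read group g.
      barrier.sync();
    }
  };

  std::vector<std::thread> threads;
  for (int tid = 0; tid < nThreads; ++tid)
    threads.emplace_back(worker, tid);
  for (auto& t : threads)
    t.join();

  int total = 0;
  for (int m : mismatches)
    total += m;
  return total;
}
```

With the final `barrier.sync()` in place, a call such as `countMismatches(8, 50)` returns 0. Dropping that last barrier would turn the read of `sharedSum` into a data race with the next iteration's reset; whether this is the exact failure mode in `splitVertices.h` is not established in this thread.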
assign hlt
|
New categories assigned: hlt @Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Proposed fixes:
#45655 was included in |
+hlt
|
+heterogeneous |
@cms-sw/reconstruction-l2 please consider signing this if there is no other follow up from your area, such that we could close this issue. |
+1 |
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
Crashes observed in collisions Run 380399. Stack traces:
We tried to reproduce the crash with the following recipe, but it did not reproduce.
Message #8 in the first stack trace seems to point to the alpaka_cuda_async::EcalRawToDigiPortable::produce() method.
@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI
Best regards,
Thiago (for FOG)