-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Segfault in HGCalCassette::getShift() in workflow 25234.911 #41927
Comments
assign geometry |
New categories assigned: geometry @mdhildreth,@Dr15Jones,@makortel,@bsunanda,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks |
A new Issue was created by @makortel Matti Kortelainen. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Please do not link this issue with the PR #41943 |
There is a new occurrence of segmentation fault in |
Occurred again in CMSSW_13_3_X_2023-09-21-2300:
|
Occured again in CMSSW_14_0_ROOT628_X_2023-11-22-2300:
|
@iarspider Could you add the stack trace of the thread that crashed (i.e. with |
@makortel I have updated my comment. Sometimes copying from "fancy" relval log fails like that... |
Hi @iarspider |
cms-bot internal usage |
@srimanob last segfault in |
I found it is a bit wired on how to produce the issue: On lxplus906.cern.ch (lxplusgpu with Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz)
Fail with
However, sometimes, it works. I also try to checkout the package Geometry/HGCalCommonData to debug, but never reproduce the issue. Note that, I never face issue with Intel Xeon Processor (Icelake), Intel Xeon Processor (Cascadelake), but not confirm. I ran several times, with no crash at all. Any suggestion on how to go @makortel @Dr15Jones @civanch @bsunanda ? This is a showstopper of DD4hep migration in Phase-2. Thx very much. |
Since the sanitizers (ASAN, UBSAN) do not seem to report anything related, the next step would be to run the job with valgrind, e.g. valgrind --tool=memcheck $(cmsvgsupp) --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$ROOTSYS/etc/valgrind-root-python.supp --num-callers=20 cmsRunGlibC DigiTrigger_DD4hep_2026D110.py When run through valgrind the job will be slow. There will also be many printouts (as the suppressions do not remove all false positives e.g. from python). Alternatively, or in addition, one could try what happens with export MALLOC_CONF=junk:true
cmsRun DigiTrigger_DD4hep_2026D110.py (see #44962) |
Thanks for suggestions during the meeting, and the way to go next. I have run valgrind, and the log can be found at /eos/cms/store/group/offcomp_upgrade-sw/srimanob/Phase2DD4hep/Log/Log-Valgrind-1410pre5DD4hep.log Before the first event starts, I see
After the event starts, the issue of "Conditional jump or move depends on uninitialised value(s)" comes in mainly two places including
Note to @cms-sw/geometry-l2 @cms-sw/l1-l2 @ptcox It is unclear they related to the issue we found, but maybe it is good to fix (if possible) anyways. Thx. |
So the first report from valgrind appears to come from here: cmssw/Geometry/HGCalCommonData/src/HGCalTileIndex.cc Lines 27 to 37 in e232a9b
which is just a static function with all data coming in from the arguments. The call to that function appears to come from
where the values passed in are coming from the class which comes from the call
where the topology comes from the EventSetup
on a cursory look, the only place I could find that sets
which I THINK only gets called for the old DDD based geometry. So my suggestion is to change these lines cmssw/Geometry/HGCalCommonData/interface/HGCalParameters.h Lines 197 to 199 in e232a9b
to int cassettes_ = 0;
int nphiCassette_ = 0;
int phiOffset_ = 0; and then retry the valgrind job. |
So I ran the job in the debugger and had it stop new the HGCalParameters object was constructed. It happened here
which does use
|
So it looks like HGCalParametersESModule does initialize all the values in the call which valgrind complained about. That means the uninitialized value would have had to have come from the DD4Hep values that were then used to reset the values. Looking at the configuration, I find three versions of HGCalParametersESModule
and the all appear to get the DD4Hep info from the unnamed source
|
As a first guess, I'd say the DD4Hep information comes from the following three module types
looking at the configuration for each of them, they do not take any arguments. |
The DDCompatViewESProducer just reads the cms::DDDetector from the EventSetup and that appears to come from cms.ESSource("DDDetectorESProducer",
appendToDataLabel = cms.string(''),
confGeomXMLFiles = cms.FileInPath('Geometry/CMSCommonData/data/dd4hep/cmsExtendedGeometry2026D110.xml')
) so the guess is the generated DDDetector information is what is leading to the uninitialized data. |
I think the question is can statements like cmssw/Geometry/HGCalCommonData/src/HGCalParametersFromDD.cc Lines 452 to 453 in e232a9b
lead to the set variable getting an undefined value? where cmssw/Geometry/HGCalCommonData/src/HGCalParametersFromDD.cc Lines 252 to 253 in e232a9b
and the DDCompactView is just
and the get is specialized as cmssw/DetectorDescription/DDCMS/src/DDFilteredView.cc Lines 597 to 603 in e232a9b
Here's a weird question, why is the value retrieved as a |
Should this kind of code be changed to e.g. throw an exception if |
Exception in geometry construction step? It seems strange to me that we can't construct geometry properly. |
Yes.
If the At minimum the |
Not sure it is related to what we are investigating, so I just post here. We see an instability of DD4hep workflow in the trigger result when we only add missing collection to the output. See #45425 |
To complete the test, I tried to runValgrind with Phase2-DDD, and Run3-DD4hep. Both, I don't see issue of geometry initialization before the first event. The full log can be access at Phase-2:
Run-3: One thing still unclear to me. If this is a cause of the issue, why do we see it only in step-2? I mean we should see the issue in step-1 (GEN-SIM) as geometry is reconstructed in the same way, I think. Not confirm, just what I see when we have a report on failure. |
So I recompiled the code in question with debugging symbols and ran valgrind. The message is now
which corresponds to
|
The only place
and cmssw/Geometry/HGCalCommonData/src/HGCalGeomParameters.cc Lines 1668 to 1669 in e232a9b
while
|
So I turned on the
|
Testing of #45442 by running this PR on top of CMSSW_14_1_0_pre5, no issues of #41927 and #41451 have been observed. Thanks @Dr15Jones |
I guess with the merge of #45442 we could close the issue? |
assign upgrade |
New categories assigned: upgrade @srimanob,@subirsarkar you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@cms-sw/geometry-l2 @cms-sw/upgrade-l2 Could at least one of you sign the issue to denote formally it has been resolved? |
+Upgrade |
+1 |
This issue is fully signed and ready to be closed. |
The segfault in HGCalCassette::getShift() (reported previously in #41376) is back, now in CMSSW_13_2_X_2023-06-11-2300 on el9_amd64_gcc11 in workflow 25234.911 step 2
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el9_amd64_gcc11/CMSSW_13_2_X_2023-06-11-2300/pyRelValMatrixLogs/run/25234.911_TTbar_14TeV+2026D99_DD4hep/step2_TTbar_14TeV+2026D99_DD4hep.log#/
The text was updated successfully, but these errors were encountered: