HLT crashes when export MALLOC_CONF=junk:true is set.
#44956
A new Issue was created by @VinInn. @makortel, @Dr15Jones, @smuzaffar, @antoniovilela, @rappoccio, @sextonkennedy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Does not seem to happen in MC Relvals |
running "single thread" this is the stack-trace
|
assign hlt |
New categories assigned: hlt @Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks |
reproduced with:
#!/bin/bash -ex
scram p CMSSW CMSSW_14_0_5_patch1
cd CMSSW_14_0_5_patch1/src
eval `scramv1 runtime -sh`
export MALLOC_CONF=junk:true
https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 380115 > hlt_run380115.py
cat <<@EOF >> hlt_run380115.py
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
buBaseDir = '/eos/cms/store/group/tsg/FOG/error_stream/',
runNumber = 380115
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
fileListMode = True,
fileNames = (
'/eos/cms/store/group/tsg/FOG/error_stream/run380115/run380115_ls0338_index000079_fu-c2b03-28-01_pid1451372.raw',
'/eos/cms/store/group/tsg/FOG/error_stream/run380115/run380115_ls0338_index000104_fu-c2b03-28-01_pid1451372.raw'
)
)
process.options.wantSummary = True
process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
@EOF
mkdir run380115
cmsRun hlt_run380115.py &> crash_run380115.log
@cms-sw/ecal-dpg-l2 please take a look. |
At least it does reproduce on HLT addOnTests on data, see log, preview of cms-sw/cms-bot#2228 output. |
What is the purpose of |
Setting |
https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Find-a-memory-corruption-bug |
Another reproducer (with a more recent release):
#!/bin/bash -ex
cd CMSSW_14_0_6_MULTIARCHS/src
eval `scramv1 runtime -sh`
export MALLOC_CONF=junk:true
https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 380466 > hlt_run380466.py
cat <<@EOF >> hlt_run380466.py
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
buBaseDir = '/eos/cms/store/group/tsg/FOG/error_stream/',
runNumber = 380466
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
fileListMode = True,
fileNames = (
'/eos/cms/store/group/tsg/FOG/error_stream/run380466/run380466_ls0276_index000212_fu-c2b03-09-01_pid672001.raw',
'/eos/cms/store/group/tsg/FOG/error_stream/run380466/run380466_ls0276_index000232_fu-c2b03-09-01_pid672001.raw',
'/eos/cms/store/group/tsg/FOG/error_stream/run380466/run380466_ls0276_index000246_fu-c2b03-09-01_pid672001.raw'
)
)
process.options.accelerators = ['cpu']
process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
directory="run380466"
# Check if the directory exists
if [ -d "$directory" ]; then
# If it exists, remove it
rm -rf "$directory"
fi
# Create the directory
mkdir "$directory"
cmsRun hlt_run380466.py &> crash_run380466.log
The crash described in the issue happens here:
Adding a check on the existence of the cell
+ if(this_cell==nullptr)
+ continue;
we get past there, but then there is an exception:
which matches the exception seen in the relval tests of cms-sw/cms-bot#2228 at cms-sw/cms-bot#2228 (comment). The main question I have is if |
2779096485 is THE junk:
printf("%x", 2779096485);
a5a5a5a5
|
ECAL detIDs all start with an 8. So junk it is indeed. |
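The check above can be sketched in a few lines. This is only an illustration: it assumes the usual CMSSW DetId bit layout (detector in bits 28-31, subdetector in bits 25-27, with ECAL = 3, barrel = 1, endcap = 2) and the 0xa5 byte that jemalloc documents for junk-filling freshly allocated memory. The helper names are hypothetical and not part of CMSSW.

```python
# Byte that jemalloc writes into newly allocated memory when junk filling is on.
JUNK_ALLOC_BYTE = 0xA5

# Assumed DetId conventions (not taken from this thread).
ECAL_DET = 3
ECAL_SUBDETS = {1: "EB", 2: "EE"}


def is_junk_word(raw_id: int) -> bool:
    """True if a 32-bit id consists only of the junk byte."""
    return raw_id == int.from_bytes(bytes([JUNK_ALLOC_BYTE]) * 4, "little")


def decode_det_id(raw_id: int) -> tuple:
    """Split a raw id into (detector, subdetector) under the assumed layout."""
    det = (raw_id >> 28) & 0xF
    subdet = (raw_id >> 25) & 0x7
    return det, subdet


def is_ecal_det_id(raw_id: int) -> bool:
    """True if the raw id carries the ECAL detector and an EB or EE subdetector."""
    det, subdet = decode_det_id(raw_id)
    return det == ECAL_DET and subdet in ECAL_SUBDETS


print(format(2779096485, "x"), is_junk_word(2779096485), is_ecal_det_id(2779096485))
```

Under these assumptions the value decodes to detector 10, which is not ECAL, consistent with it being junk rather than a real channel.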
adding:
diff --git a/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc b/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc
index 56a9292da36..ff352a772f8 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc
@@ -261,6 +261,18 @@ void EcalRecHitProducer::produce(edm::Event& evt, const edm::EventSetup& es) {
LogInfo("EcalRecHitInfo") << "total # EB calibrated rechits: " << ebRecHits->size();
LogInfo("EcalRecHitInfo") << "total # EE calibrated rechits: " << eeRecHits->size();
+ // Loop over EBRecHitCollection
+ for (const auto& ebRecHit : *ebRecHits) {
+ DetId detId = ebRecHit.detid(); // Get the DetId
+ std::cout << "EB DetId: " << detId.rawId() << std::endl; // Print the rawId of the DetId
+ }
+
+ // Loop over EERecHitCollection
+ for (const auto& eeRecHit : *eeRecHits) {
+ DetId detId = eeRecHit.detid(); // Get the DetId
+ std::cout << "EE DetId: " << detId.rawId() << std::endl; // Print the rawId of the DetId
+ }
+
evt.put(ebRecHitToken_, std::move(ebRecHits));
evt.put(eeRecHitToken_, std::move(eeRecHits));
}
I get:
so it looks like |
is most probably leaving memory uninitialized or copying from uninitialized memory, as it is jemalloc that is filling it with "junk" |
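A toy model of this mechanism, as a sketch only: a buffer is junk-filled at allocation time, only some of its slots are written, and a reader that trusts the full size sees the junk pattern in the rest. The function names and the two example ids are hypothetical, and this does not reproduce the actual ECAL code path.

```python
import struct

JUNK_BYTE = 0xA5  # jemalloc junk byte for freshly allocated memory


def junk_allocate(n_words: int) -> bytearray:
    """Mimic a junk-filling allocator: every byte of a fresh buffer is 0xa5."""
    return bytearray([JUNK_BYTE]) * (4 * n_words)


def write_ids(buffer: bytearray, ids) -> None:
    """Store 32-bit ids at the start of the buffer, leaving the rest untouched."""
    for i, raw_id in enumerate(ids):
        struct.pack_into("<I", buffer, 4 * i, raw_id)


def read_ids(buffer: bytearray, n_words: int) -> list:
    """Read back n_words 32-bit ids, whether or not they were ever written."""
    return list(struct.unpack_from(f"<{n_words}I", buffer, 0))


buffer = junk_allocate(4)
write_ids(buffer, [0x34000001, 0x34000002])  # only two of the four slots are filled
ids = read_ids(buffer, 4)                    # the reader assumes all four are valid
print(ids)
```

Without junk filling the unwritten slots would hold whatever the allocator returned, often zeros, which is why the problem can stay hidden until MALLOC_CONF=junk:true is set.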
I need to check but I suspect that these extra junk RecHits are already in the input collection of the |
They are:
diff --git a/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc b/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc
index 56a9292da36..b7e5fd4ed66 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc
@@ -146,11 +146,31 @@ void EcalRecHitProducer::produce(edm::Event& evt, const edm::EventSetup& es) {
const auto& eeUncalibRecHits = evt.get(eeUncalibRecHitToken_);
LogDebug("EcalRecHitDebug") << "total # EE uncalibrated rechits: " << eeUncalibRecHits.size();
+ // Loop over uncalib EERecHitCollection
+ for (const auto& eeRecHit : eeUncalibRecHits) {
+ DetId detId = eeRecHit.id(); // Get the DetId
+
+ // Check if the rawId corresponds to 2779096485
+ if (detId.rawId() == 2779096485) {
+ std::cout << "EE Uncalib -DetId: " << detId.rawId() << " - Line: " << __LINE__ << std::endl;
+ }
+ }
+
// loop over uncalibrated rechits to make calibrated ones
for (const auto& uncalibRecHit : eeUncalibRecHits) {
worker_->run(evt, uncalibRecHit, *eeRecHits);
}
yields:
|
This is NOT enough to crash
|
Is |
If the copy uses a host queue the wait is not needed, because the host queues are blocking by default. But it also does not harm, as it shouldn't do anything. For a host-to-device copy using a device queue, it's needed before the data can be accessed on the device using a different stream. For a device-to-host copy using device queue, it's needed before the data can be accessed on the host. That said, we have seen that for small memory copies, the GPU runtime seems to work OK also without the wait... |
Just to clarify: if the "different queue" arises because the second device-side access is in a different EDModule, the framework adds the necessary synchronization so that explicit |
@thomreis @cms-sw/ecal-dpg-l2 please clarify if there is progress on this issue and if there is someone actively working on solving it. |
I'm not sure what has already been verified; anyhow, if I add
I get
so it is junk already in the digis |
and the origin is here (please note the FIXME)
cmsRun: src/EventFilter/EcalRawToDigi/plugins/alpaka/UnpackPortable.dev.cc:205: void alpaka_serial_sync::ecal::raw::Kernel_unpack::operator()(const TAcc&, const unsigned char*, const uint32_t*, const int*, PortableHostCollection<EcalDigiSoALayout<> >::View, PortableHostCollection<EcalDigiSoALayout<> >::View, PortableHostCollection<EcalElectronicsMappingSoALayout<> >::ConstView, uint32_t) const [with TAcc = alpaka::AccCpuSerial<std::integral_constant<long unsigned int, 1>, unsigned int>; = void; uint32_t = unsigned int; PortableHostCollection<EcalDigiSoALayout<> >::View = EcalDigiSoALayout<>::ViewTemplateFreeParams<128, false, true, false>; PortableHostCollection<EcalElectronicsMappingSoALayout<> >::ConstView = EcalElectronicsMappingSoALayout<>::ConstViewTemplateFreeParams<128, false, true, false>]: Assertion `didraw != 2779096485' failed. |
Needless to say that
"cure" the symptom (not the CAUSE!) It remains that |
@thomreis this is a list of invalid EE detid that I found running the HLT over 10k events:
On average there are more than 4 invalid detid per event, but it seems to be the same ones repeating. |
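One way to produce such a tally from a log, as a rough sketch: it assumes the log contains lines in the `EB DetId: <rawId>` / `EE DetId: <rawId>` form printed by the debugging patch earlier in this thread, and it reuses the assumed DetId layout (detector in bits 28-31, subdetector in bits 25-27, ECAL = 3, EB = 1, EE = 2). The function name is hypothetical.

```python
import re
from collections import Counter

# Matches the debug printout format "EB DetId: <rawId>" or "EE DetId: <rawId>".
LINE_RE = re.compile(r"^(EB|EE) DetId: (\d+)$")

ECAL_DET = 3                       # assumed DetId detector code for ECAL
EXPECTED_SUBDET = {"EB": 1, "EE": 2}  # assumed subdetector codes


def tally_invalid(lines) -> Counter:
    """Count how often each raw id appears that does not decode as the expected ECAL part."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # unrelated log line
        part, raw_id = match.group(1), int(match.group(2))
        det = (raw_id >> 28) & 0xF
        subdet = (raw_id >> 25) & 0x7
        if det != ECAL_DET or subdet != EXPECTED_SUBDET[part]:
            counts[raw_id] += 1
    return counts


# Possible usage on a log written by the reproducer:
# with open("crash_run380466.log") as log:
#     print(tally_invalid(log).most_common(10))
```

Sorting by count makes it easy to see whether the same few invalid ids repeat from event to event.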
+hlt
tested explicitly with the following recipe:
cmsrel CMSSW_14_0_9_MULTIARCHS
cd CMSSW_14_0_9_MULTIARCHS/src
cmsenv
git cms-addpkg EventFilter/EcalRawToDigi
git cms-merge-topic 45210
scram b -j 20
and using the reproducer at #44956 (comment) [*] that the segmentation fault (and all the preceding error messages) are gone.
[*]
#!/bin/bash -ex
export MALLOC_CONF=junk:true
https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 380466 > hlt_run380466.py
cat <<@EOF >> hlt_run380466.py
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
buBaseDir = '/eos/cms/store/group/tsg/FOG/error_stream/',
runNumber = 380466
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
fileListMode = True,
fileNames = (
'/eos/cms/store/group/tsg/FOG/error_stream/run380466/run380466_ls0276_index000212_fu-c2b03-09-01_pid672001.raw',
)
)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.wantSummary = True
process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
@EOF
mkdir run380466
cmsRun hlt_run380466.py &> crash_run380466.log |
This issue is fully signed and ready to be closed. |
As reported in #44940 (comment)
reproduced on lxplus8-gpu.cern.ch.
use the script in #44940 and just add
export MALLOC_CONF=junk:true
one will get a long list of
and then a segfault