-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
move pat::XGBooster
out of PhysicsTools/PatAlgos
into its own package and use it for PhotonXGBoostEstimator
#45085
Conversation
cms-bot internal usage |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45085/40404
|
A new Pull Request was created by @mmusich for master. It involves the following packages:
The following packages do not have a category, yet: PhysicsTools/XGBoost @mandrenguyen, @jfernan2, @cmsbuild, @vlimant, @ftorrresd, @hqucms can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
@cmsbuild, please test |
Pull request #45085 was updated. @jfernan2, @wpmccormack, @ftorrresd, @valsdav, @vlimant, @hqucms, @mandrenguyen can you please check and sign again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto ret = XGBoosterPredictFromDMatrix(booster_, dvalues, json, &out_shape, &out_len, &score);
XGDMatrixFree(dvalues);
if (ret == 0) {
assert(out_len == 1 && "Unexpected prediction format");
result = score[0];
}
I can't find particular details about the API, but I suppose that score
, a pointer, points to something in dvalues
handle (and we might see problems only with multi-threaded running), so it is safer to move XGDMatrixFree(dvalues);
to the end of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @smorovic OK, I can implement that.
For the record, I am not the original author of this code (just moved it in order to get it accessible to other consumers), let me tag @drkovalskyi that introduced it in #43622 in case he has comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Also, for the record, XGB photon producer was freeing it only after accessing output score.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++ interface to XGBoost is voodoo, so don't ask me how I made it work. All this was developed with trials and errors over years following changes in the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed at e6df909
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-80df63/39594/summary.html Comparison SummarySummary:
|
@cms-sw/ml-l2 any objections to move forward? |
ping @cms-sw/ml-l2 |
+1 |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @antoniovilela, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
@cms-sw/orp-l2 kind ping on this PR |
+1 |
I suspect this PR caused instabilities in the TSG IB integration tests.
the pattern seems to be erratic though because this PR was merged in |
Someone needs to look at this at high level. I can help if the issue is with XGBoost and it's C API itself. It's not fun, but I had to debug it in the past. |
is that reproducible in a given IB and configuration ; and what is that so one can reproduce ?
|
I am afraid it is not reproducible given the failure pattern, but I haven't checked yet. that requires a rather long execution time (few hours on a free machine). I'll try to craft a faster reproducer slimmed down of the unnecessary bits. |
here's a minimal attempt to reproduce (or at least to test reproducibility) - unfortunately as I was expecting this doesn't reproduce the failures observed in IBs. #!/bin/bash
#
# cmsrel CMSSW_14_1_X_2024-06-13-1100
# cd CMSSW_14_1_X_2024-06-13-1100/src
# cmsenv
# git cms-addpkg HLTrigger/Configuration
# scram b -j 20
#
addOnTests.py -t hlt_mc_GRun
jobTag=threads4
hltMenu=/dev/CMSSW_14_0_0/GRun/V141
check_log () {
grep '0 HLT_DiphotonMVA14p25_Tight_Mass90_v' $1 | grep TrigReport
}
run(){
echo $2
cp $1 $2.py
cat <<EOF >> $2.py
process.options.numberOfThreads = 4
process.options.numberOfStreams = 4
process.hltOutputMinimal.fileName = '${2}.root'
EOF
cmsRun "${2}".py &> "${2}".log
check_log "${2}".log
}
hltGetCmd="hltGetConfiguration ${hltMenu}"
hltGetCmd+=" --globaltag auto:run3_mc_GRun --mc --unprescale --output minimal --max-events -1"
hltGetCmd+=" --input file:addOnTests/hlt_mc_GRun/RelVal_Raw_GRun_MC.root"
#echo $hltGetCmd
configLabel=hlt_"${jobTag}"_onlyDiphotonMVA14p25_Tight_Mass90
#echo "${configLabel}".py
${hltGetCmd} --paths HLT_DiphotonMVA14p25_Tight_Mass90_v1 > "${configLabel}".py
for job_i in {0..9}; do run "${configLabel}".py "${configLabel}"_"${job_i}"; done; unset job_i;
configLabel=hlt_"${jobTag}"_full
${hltGetCmd} > "${configLabel}".py
for job_i in {0..9}; do run "${configLabel}".py "${configLabel}"_"${job_i}"; done; unset job_i; |
I suspect there is a problem exposed by multithreading. With 1 thread and 1 cmssw stream, i can't get crashes. When using 4 threads and 4 cmssw streams, a reproducer similar to #45085 (comment) fails at least 10% of the time for me. I didn't investigate much. I can only guess that maybe
inside predict there is a reset
and then right after that another event tries to call predict again and runs into nan s.
After converting diff --git a/RecoEgamma/PhotonIdentification/plugins/PhotonXGBoostProducer.cc b/RecoEgamma/PhotonIdentification/plugins/PhotonXGBoostProducer.cc
index 3a22274edb4..15875c31581 100644
--- a/RecoEgamma/PhotonIdentification/plugins/PhotonXGBoostProducer.cc
+++ b/RecoEgamma/PhotonIdentification/plugins/PhotonXGBoostProducer.cc
@@ -3,8 +3,7 @@
#include "DataFormats/EgammaReco/interface/SuperClusterFwd.h"
#include "DataFormats/RecoCandidate/interface/RecoEcalCandidate.h"
#include "DataFormats/RecoCandidate/interface/RecoEcalCandidateIsolation.h"
-#include "FWCore/Framework/interface/global/EDProducer.h"
-#include "FWCore/Framework/interface/one/EDProducer.h"
+#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/MessageLogger/interface/MessageLogger.h"
#include "FWCore/ParameterSet/interface/ConfigurationDescriptions.h"
@@ -17,7 +16,7 @@
#include <memory>
#include <vector>
-class PhotonXGBoostProducer : public edm::global::EDProducer<> {
+class PhotonXGBoostProducer : public edm::stream::EDProducer<> {
public:
explicit PhotonXGBoostProducer(edm::ParameterSet const&);
~PhotonXGBoostProducer() = default;
@@ -25,7 +24,7 @@ public:
static void fillDescriptions(edm::ConfigurationDescriptions& descriptions);
private:
- void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override;
+ void produce(edm::Event&, edm::EventSetup const&) override;
const edm::EDGetTokenT<reco::RecoEcalCandidateCollection> candToken_;
const edm::EDGetTokenT<reco::RecoEcalCandidateIsolationMap> tokenR9_;
@@ -79,7 +78,7 @@ void PhotonXGBoostProducer::fillDescriptions(edm::ConfigurationDescriptions& des
descriptions.addWithDefaultLabel(desc);
}
-void PhotonXGBoostProducer::produce(edm::StreamID, edm::Event& event, edm::EventSetup const& setup) const {
+void PhotonXGBoostProducer::produce(edm::Event& event, edm::EventSetup const& setup) {
const auto& recCollection = event.getHandle(candToken_);
//get hold of r9 association map |
thanks. In my initial attempt I only tested 10 times in the loop (and didn't get failures). Running 30 times I indeed get 3 failures out of 30 attempts at index 17,19 and 29. Based on the failure rate in IBs I was assuming 10 trials would be enough. |
@mmusich XGBooosterPRedictFromDMatrix is thread safe (as claimed by XGBoost documentation), so we could also provide |
thanks @smorovic . FWIW I tested that also this variant solves the crashes with the test setup at #45232 (comment). I am wondering if it possible to reduce code duplication (e.g. by changing the other clients to use the new version of |
to track better the discussion I opened #45235. Let's follow-up there. |
PR description:
Title says it all, implements #45040 (comment):
PR validation:
scram b runtests_RecoEgammaPhotonIdentificationTest
runs.If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:
N/A
@smorovic please take a look in case I overlooked something.