
move pat::XGBooster out of PhysicsTools/PatAlgos into its own package and use it for PhotonXGBoostEstimator #45085

Merged (3 commits) on Jun 5, 2024


mmusich
Contributor

@mmusich mmusich commented May 28, 2024

PR description:

Title says it all, implements #45040 (comment):

There seems to exist a pat::XGBooster abstraction already in PhysicsTools/PatAlgos, that does the proper XGBoosterSetParam(..., "nthread", "1") call

  • The abstraction should really be moved to its own package, to decouple it from PhysicsTools/PatAlgos dependencies
  • One option for PhotonXGBoostEstimator would be to use this XGBooster abstraction

PR validation:

scram b runtests_RecoEgammaPhotonIdentificationTest runs.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

N/A

@smorovic please take a look in case I overlooked something.

@cmsbuild
Contributor

cmsbuild commented May 28, 2024

cms-bot internal usage

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45085/40404

  • This PR adds an extra 36KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @mmusich for master.

It involves the following packages:

  • PhysicsTools/PatAlgos (xpog, reconstruction)
  • PhysicsTools/XGBoost (****)
  • RecoEgamma/PhotonIdentification (reconstruction)

The following packages do not have a category, yet:

PhysicsTools/XGBoost
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@mandrenguyen, @jfernan2, @cmsbuild, @vlimant, @ftorrresd, @hqucms can you please review it and eventually sign? Thanks.
@jdamgov, @seemasharmafnal, @sameasy, @Senphy, @mariadalfonso, @gkasieczka, @Ming-Yan, @nhanvtran, @varuns23, @mmarionncern, @jainshilpi, @a-kapoor, @lgray, @castaned, @hatakeyamak, @valsdav, @jdolen, @Sam-Harper, @Prasant1993, @gpetruc, @gouskos, @rappoccio, @ahinzmann, @andrzejnovak, @azotz, @afiqaize, @AlexDeMoor, @schoef, @demuller, @sobhatta, @missirol, @mbluj, @ram1123 this is something you requested to watch as well.
@rappoccio, @sextonkennedy, @antoniovilela you are the release manager for this.

cms-bot commands are listed here

@mmusich
Contributor Author

mmusich commented May 29, 2024

@cmsbuild, please test

@cmsbuild
Contributor

Pull request #45085 was updated. @jfernan2, @wpmccormack, @ftorrresd, @valsdav, @vlimant, @hqucms, @mandrenguyen can you please check and sign again.

Contributor


@mmusich,

  auto ret = XGBoosterPredictFromDMatrix(booster_, dvalues, json, &out_shape, &out_len, &score);
  XGDMatrixFree(dvalues);
  if (ret == 0) {
    assert(out_len == 1 && "Unexpected prediction format");
    result = score[0];
  }

I can't find specific details about the API, but I suppose that score, a pointer, points to something owned by the dvalues handle (and we might see problems only when running multi-threaded), so it is safer to move XGDMatrixFree(dvalues); to the end of this block.

Contributor Author


Hi @smorovic OK, I can implement that.
For the record, I am not the original author of this code (I just moved it to make it accessible to other consumers); let me tag @drkovalskyi, who introduced it in #43622, in case he has comments.

Contributor


OK. Also, for the record, the XGB photon producer was freeing it only after accessing the output score.

Contributor


The C++ interface to XGBoost is voodoo, so don't ask me how I made it work. All of this was developed by trial and error over the years, following changes in the API.

Contributor Author


addressed at e6df909
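
The ordering concern raised in this thread can be illustrated with a minimal, self-contained sketch. It does not use the XGBoost C API: `FakeMatrix`, `fakeCreate`, `fakePredict`, `fakeFree` and `predictOnce` are hypothetical stand-ins, and the mock deliberately models the pessimistic case assumed above, where the returned score pointer aliases memory owned by the matrix handle. Under that assumption, tying the release of the handle to scope exit guarantees that the score is copied before the handle is freed.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <memory>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a C-style handle (NOT the XGBoost API).
struct FakeMatrix {
  std::vector<float> data;
  float score = 0.0f;
};
using FakeHandle = FakeMatrix*;

inline int fakeCreate(const float* values, std::size_t n, FakeHandle* out) {
  *out = new FakeMatrix{std::vector<float>(values, values + n)};
  return 0;
}

inline int fakeFree(FakeHandle h) {
  delete h;
  return 0;
}

// Pessimistic mock: the result pointer aliases memory owned by the handle.
inline int fakePredict(FakeHandle h, const float** out, std::size_t* len) {
  h->score = std::accumulate(h->data.begin(), h->data.end(), 0.0f);
  *out = &h->score;
  *len = 1;
  return 0;
}

inline float predictOnce(const std::vector<float>& features) {
  float result = std::numeric_limits<float>::quiet_NaN();
  FakeHandle raw = nullptr;
  if (fakeCreate(features.data(), features.size(), &raw) != 0) {
    return result;
  }
  // The guard calls fakeFree at scope exit, i.e. after the score is copied.
  std::unique_ptr<FakeMatrix, decltype(&fakeFree)> guard(raw, &fakeFree);

  const float* score = nullptr;
  std::size_t len = 0;
  if (fakePredict(guard.get(), &score, &len) == 0) {
    assert(len == 1 && "Unexpected prediction format");
    result = score[0];  // copy while the handle is still alive
  }
  return result;
}
```

With this mock, `predictOnce({1.0f, 2.0f, 3.0f})` sums the inputs and returns the value before the handle is released. Whether the real prediction buffer is owned by the matrix or by the booster is not established in this thread, so the sketch only shows why freeing last is the conservative choice.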

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-80df63/39594/summary.html
COMMIT: a2294b4
CMSSW: CMSSW_14_1_X_2024-05-28-2300/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/45085/39594/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 1 lines to the logs
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3338862
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3338839
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@mmusich
Contributor Author

mmusich commented May 31, 2024

@cms-sw/ml-l2 any objections to moving forward?

@mmusich
Contributor Author

mmusich commented Jun 4, 2024

ping @cms-sw/ml-l2

@valsdav
Contributor

valsdav commented Jun 4, 2024

+1

@cmsbuild
Contributor

cmsbuild commented Jun 4, 2024

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @antoniovilela, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@mmusich
Contributor Author

mmusich commented Jun 5, 2024

@cms-sw/orp-l2 kind ping on this PR

@antoniovilela
Contributor

+1

@mmusich
Contributor Author

mmusich commented Jun 14, 2024

I suspect this PR caused instabilities in the TSG IB integration tests.
We have already had three failures concerning HLT_DiphotonMVA14p25_Tight_Mass90_v1:

  • in CMSSW_14_1_X_2024-06-08-1100: log, with a segmentation fault:
Thread 1 (Thread 0x151142b06680 (LWP 3442432) "cmsRun"):
#0  0x0000151140bbaac1 in poll () from /lib64/libc.so.6
#1  0x000015113cbc0657 in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-06-09-0000/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x000015113cbc0854 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-06-09-0000/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00001510e252886b in PhotonXGBoostProducer::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-06-09-0000/lib/el8_amd64_gcc12/pluginRecoEgammaPhotonIdentificationPlugins.so
  • in CMSSW_14_1_X_2024-06-12-1900: log, the trigger path, when run standalone, fired on a different number of events than when run within the whole menu;
  • in CMSSW_14_1_X_2024-06-13-1100: log, there is a crash with:
----- Begin Fatal Exception 14-Jun-2024 03:24:48 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing  Event run: 1 lumi: 49 event: 4820 stream: 1
[1] Running path 'HLT_DiphotonMVA14p25_Tight_Mass90_v1'
[2] Calling method for module PhotonXGBoostProducer/'hltPhotonXGBoostProducer'
Exception Message:
A std::exception was thrown.
Feature is not set: rawEnergy
----- End Fatal Exception -------------------------------------------------   

The pattern seems erratic, though: this PR was merged in CMSSW_14_1_X_2024-06-05-2300, and after that the HLT integration tests succeeded several times before starting to fail (CMSSW_14_1_X_2024-06-05-2300, CMSSW_14_1_X_2024-06-06-2300, CMSSW_14_1_X_2024-06-07-1100, CMSSW_14_1_X_2024-06-07-2300, CMSSW_14_1_X_2024-06-08-0600 and CMSSW_14_1_X_2024-06-08-1100). After that the tests succeeded on a seemingly random basis. I'll keep monitoring the situation in CMSSW_14_0_X to evaluate whether a roll-back is needed.

@drkovalskyi
Contributor

Someone needs to look at this at a high level. I can help if the issue is with XGBoost and its C API itself. It's not fun, but I have had to debug it in the past.

@vlimant
Contributor

vlimant commented Jun 14, 2024

Is that reproducible in a given IB and configuration? If so, which ones, so that one can reproduce it?

The "Feature is not set: rawEnergy" exception is also triggered by a NaN value, BTW.

@mmusich
Contributor Author

mmusich commented Jun 14, 2024

> Is that reproducible in a given IB and configuration? If so, which ones, so that one can reproduce it?

I am afraid it is not reproducible given the failure pattern, but I haven't checked yet.
To reproduce, the command that the bot runs in the IB is this one:

https://github.com/cms-sw/cms-bot/blob/2d05c7e2b0dcb33d8aa9a512bc9e61623d3ba482/run-hlt-validation#L37

which requires a rather long execution time (a few hours on a free machine). I'll try to craft a faster reproducer slimmed down of the unnecessary bits.

@mmusich
Contributor Author

mmusich commented Jun 15, 2024

> I'll try to craft a faster reproducer slimmed down of the unnecessary bits.

Here's a minimal attempt to reproduce (or at least to test reproducibility). Unfortunately, as I was expecting, this doesn't reproduce the failures observed in IBs.

#!/bin/bash

#
# cmsrel CMSSW_14_1_X_2024-06-13-1100
# cd CMSSW_14_1_X_2024-06-13-1100/src
# cmsenv
# git cms-addpkg HLTrigger/Configuration
# scram b -j 20
# 

addOnTests.py -t hlt_mc_GRun

jobTag=threads4
hltMenu=/dev/CMSSW_14_0_0/GRun/V141

check_log () {
  grep '0 HLT_DiphotonMVA14p25_Tight_Mass90_v' $1 | grep TrigReport
}

run(){
  echo $2
  cp $1 $2.py
  cat <<EOF >> $2.py

process.options.numberOfThreads = 4
process.options.numberOfStreams = 4

process.hltOutputMinimal.fileName = '${2}.root'
EOF
  cmsRun "${2}".py &> "${2}".log
  check_log "${2}".log
}

hltGetCmd="hltGetConfiguration ${hltMenu}"
hltGetCmd+=" --globaltag auto:run3_mc_GRun --mc --unprescale --output minimal --max-events -1"
hltGetCmd+=" --input file:addOnTests/hlt_mc_GRun/RelVal_Raw_GRun_MC.root"

#echo $hltGetCmd

configLabel=hlt_"${jobTag}"_onlyDiphotonMVA14p25_Tight_Mass90
#echo "${configLabel}".py
${hltGetCmd} --paths HLT_DiphotonMVA14p25_Tight_Mass90_v1 > "${configLabel}".py
for job_i in {0..9}; do run "${configLabel}".py "${configLabel}"_"${job_i}"; done; unset job_i;

configLabel=hlt_"${jobTag}"_full
${hltGetCmd} > "${configLabel}".py
for job_i in {0..9}; do run "${configLabel}".py "${configLabel}"_"${job_i}"; done; unset job_i;

@missirol
Contributor

@smorovic @mmusich

I suspect there is a problem exposed by multithreading. With 1 thread and 1 cmssw stream, I can't get crashes. When using 4 threads and 4 cmssw streams, a reproducer similar to #45085 (comment) fails at least 10% of the time for me.

I didn't investigate much. I can only guess that maybe predict gets called

float XGBooster::predict(const int iterationEnd) {

inside predict there is a reset
void XGBooster::reset() { std::fill(features_.begin(), features_.end(), std::nan("")); }

and then right after that another event tries to call predict again and runs into nans.
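
To make the suspected failure mode concrete, here is a minimal sketch that replays one bad interleaving sequentially, so no real threads are needed. It is not the actual XGBooster code: `StatefulScorer`, `predictStateless` and `interleavingFails` are hypothetical names, and the "score" is just a sum of the inputs. The only thing it models is a buffer of features shared between callers, filled with NaN by the reset at the end of predict.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical scorer that keeps its inputs in a member buffer (set -> predict -> reset).
class StatefulScorer {
public:
  explicit StatefulScorer(std::size_t n)
      : features_(n, std::numeric_limits<float>::quiet_NaN()) {}

  void set(std::size_t i, float value) { features_.at(i) = value; }

  float predict() {
    for (float f : features_) {
      if (std::isnan(f)) {
        throw std::runtime_error("Feature is not set");
      }
    }
    const float score = std::accumulate(features_.begin(), features_.end(), 0.0f);
    // Reset the shared buffer for the next caller.
    std::fill(features_.begin(), features_.end(), std::numeric_limits<float>::quiet_NaN());
    return score;
  }

private:
  std::vector<float> features_;
};

// Alternative: the caller owns the features, so nothing is shared between events.
inline float predictStateless(const std::vector<float>& features) {
  assert(!features.empty());
  return std::accumulate(features.begin(), features.end(), 0.0f);
}

// Sequential replay of one problematic interleaving of two events, A and B,
// that share a single StatefulScorer instance.
inline bool interleavingFails() {
  StatefulScorer shared(2);
  shared.set(0, 1.0f);  // event A fills its features
  shared.set(1, 2.0f);
  shared.set(0, 3.0f);  // event B overwrites them
  shared.set(1, 4.0f);
  shared.predict();     // event B predicts, then the buffer is reset to NaN
  try {
    shared.predict();   // event A now finds only NaNs
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}
```

In this toy model `interleavingFails()` returns true, mirroring the "Feature is not set" exception seen in the IB, while `predictStateless` cannot be affected by other callers. With real threads the shared buffer would additionally be a data race, which could also explain the segmentation fault and the changing trigger counts.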

After converting PhotonXGBoostProducer from a global module to a stream one, I don't get crashes. Maybe that's a minimal fix.

diff --git a/RecoEgamma/PhotonIdentification/plugins/PhotonXGBoostProducer.cc b/RecoEgamma/PhotonIdentification/plugins/PhotonXGBoostProducer.cc
index 3a22274edb4..15875c31581 100644
--- a/RecoEgamma/PhotonIdentification/plugins/PhotonXGBoostProducer.cc
+++ b/RecoEgamma/PhotonIdentification/plugins/PhotonXGBoostProducer.cc
@@ -3,8 +3,7 @@
 #include "DataFormats/EgammaReco/interface/SuperClusterFwd.h"
 #include "DataFormats/RecoCandidate/interface/RecoEcalCandidate.h"
 #include "DataFormats/RecoCandidate/interface/RecoEcalCandidateIsolation.h"
-#include "FWCore/Framework/interface/global/EDProducer.h"
-#include "FWCore/Framework/interface/one/EDProducer.h"
+#include "FWCore/Framework/interface/stream/EDProducer.h"
 #include "FWCore/Framework/interface/Event.h"
 #include "FWCore/MessageLogger/interface/MessageLogger.h"
 #include "FWCore/ParameterSet/interface/ConfigurationDescriptions.h"
@@ -17,7 +16,7 @@
 #include <memory>
 #include <vector>
 
-class PhotonXGBoostProducer : public edm::global::EDProducer<> {
+class PhotonXGBoostProducer : public edm::stream::EDProducer<> {
 public:
   explicit PhotonXGBoostProducer(edm::ParameterSet const&);
   ~PhotonXGBoostProducer() = default;
@@ -25,7 +24,7 @@ public:
   static void fillDescriptions(edm::ConfigurationDescriptions& descriptions);
 
 private:
-  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override;
+  void produce(edm::Event&, edm::EventSetup const&) override;
 
   const edm::EDGetTokenT<reco::RecoEcalCandidateCollection> candToken_;
   const edm::EDGetTokenT<reco::RecoEcalCandidateIsolationMap> tokenR9_;
@@ -79,7 +78,7 @@ void PhotonXGBoostProducer::fillDescriptions(edm::ConfigurationDescriptions& des
   descriptions.addWithDefaultLabel(desc);
 }
 
-void PhotonXGBoostProducer::produce(edm::StreamID, edm::Event& event, edm::EventSetup const& setup) const {
+void PhotonXGBoostProducer::produce(edm::Event& event, edm::EventSetup const& setup) {
   const auto& recCollection = event.getHandle(candToken_);
 
   //get hold of r9 association map

@mmusich
Contributor Author

mmusich commented Jun 15, 2024

> When using 4 threads and 4 cmssw streams, a reproducer similar to #45085 (comment) fails at least 10% of the time for me.

Thanks. In my initial attempt I only tested 10 times in the loop (and didn't get failures). Running 30 times, I indeed get 3 failures out of 30 attempts, at indices 17, 19 and 29. Based on the failure rate in IBs I was assuming 10 trials would be enough.
In any case I propose your minimal fix here: #45232. I guess it might have a cost in terms of memory consumption (hopefully negligible).
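
As a rough check on how many repetitions such a reproducer needs, the following sketch assumes independent runs and a per-run failure probability of about 10% (the figure quoted above; both are assumptions, and the function names are made up for illustration).

```cpp
#include <cassert>
#include <cmath>

// Probability that n independent runs all pass when each fails with probability p.
inline double probAllPass(double p, int n) {
  assert(p >= 0.0 && p <= 1.0 && n >= 0);
  return std::pow(1.0 - p, n);
}

// Smallest number of runs for which at least one failure is seen
// with probability >= confidence.
inline int trialsToSeeFailure(double p, double confidence) {
  assert(p > 0.0 && p < 1.0 && confidence > 0.0 && confidence < 1.0);
  return static_cast<int>(std::ceil(std::log(1.0 - confidence) / std::log(1.0 - p)));
}
```

Under these assumptions `probAllPass(0.1, 10)` is about 0.35, so ten clean runs in a row are not surprising, and `trialsToSeeFailure(0.1, 0.95)` gives 29, which is in line with the 30 attempts that did expose the problem.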

@smorovic
Contributor

> > When using 4 threads and 4 cmssw streams, a reproducer similar to #45085 (comment) fails at least 10% of the time for me.
>
> Thanks. In my initial attempt I only tested 10 times in the loop (and didn't get failures). Running 30 times, I indeed get 3 failures out of 30 attempts, at indices 17, 19 and 29. Based on the failure rate in IBs I was assuming 10 trials would be enough. In any case I propose your minimal fix here: #45232. I guess it might have a cost in terms of memory consumption (hopefully negligible).

@mmusich
Currently we load two files, of 1.4 and 1.1 MB respectively. If the module is turned into a "stream" type, memory usage from the files will increase 24 times in HLT (which could still be acceptable, but is not negligible).
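
A back-of-the-envelope version of that estimate, as a sketch only: it assumes the factor of 24 means one copy of each model per stream, and that the in-memory footprint is comparable to the file size (neither is verified here; `replicatedFootprintMB` is a made-up helper).

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Total footprint in MB when every file is loaded once per copy.
inline double replicatedFootprintMB(const std::vector<double>& fileSizesMB, unsigned copies) {
  assert(copies > 0);
  const double perCopy = std::accumulate(fileSizesMB.begin(), fileSizesMB.end(), 0.0);
  return perCopy * copies;
}
```

For the two files above, `replicatedFootprintMB({1.4, 1.1}, 24)` comes to roughly 60 MB, compared with 2.5 MB for a single shared copy.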

XGBoosterPredictFromDMatrix is thread safe (as claimed by the XGBoost documentation), so we could also pass the features vector to predict from the photon estimator module:
master...smorovic:cmssw:xgboost-features

@mmusich
Contributor Author

mmusich commented Jun 16, 2024

> XGBoosterPredictFromDMatrix is thread safe (as claimed by the XGBoost documentation), so we could also pass the features vector to predict from the photon estimator module:
> master...smorovic:cmssw:xgboost-features

Thanks @smorovic. FWIW, I tested that this variant also solves the crashes with the test setup at #45232 (comment). I am wondering if it is possible to reduce code duplication (e.g. by changing the other clients to use the new version of predict that takes the input features vector), but I'll let @cms-sw/ml-l2 comment on that.

@mmusich
Contributor Author

mmusich commented Jun 16, 2024

To better track the discussion I opened #45235. Let's follow up there.
