Make it easier to run recovery tasks #6540
Plan
What we need to implement
Wish list, i.e. would be nice to have
as always, comments are welcome! |
Start from the basics
Scope: implement
Prerequisites
Design choices
I think we can implement the new command in many ways; I still have to decide which one is the best.
|
Option a)
I am currently exploring option a). It felt like the natural solution, and I wanted to have a strong reason before deciding to move the logic from crab report to TW. How to replicate: after submitting a failing task (with the config described above), source crab client such that it runs code from [1], then run
At the moment,
Next steps are:
|
Promising result: starting from a failing task [1], I managed to submit a proper recovery task [2] that processed all the lumis that have been left behind. I still have many details to sort out, many rough edges to smooth, but the approach of keeping all the logic in crabclient looks promising. Replicate the result with:
[1] https://cmsweb-test11.cern.ch/crabserver/ui/task/230724_110848%3Admapelli_crab_recover_20230724_130846 (type Analysis, input from DBS, no publication.) Lumis, from crab report
[2] https://cmsweb-test11.cern.ch/crabserver/ui/task/230728_175235%3Admapelli_crab_test_20230728_195233 Lumis, from crab report
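Roughly, the flow being automated here looks like the following sketch (illustration only, not the code in the branch); it assumes the crabCommand API from CRABAPI.RawCommand and that crab report writes the missing lumis to results/notFinishedLumis.json inside the project directory:

```python
# Illustration only, not the actual client code: sketch of the
# report -> lumi mask -> submit flow that the recovery command automates.
import copy
import os
from CRABAPI.RawCommand import crabCommand

def submit_recovery(failing_projdir, original_config):
    # ask 'crab report' which lumis the failing task did not process
    crabCommand('report', dir=failing_projdir)
    not_finished = os.path.join(failing_projdir, 'results', 'notFinishedLumis.json')

    # clone the original config, rename it, and restrict it to the missing lumis
    recovery_config = copy.deepcopy(original_config)
    recovery_config.General.requestName += '_recovery'
    recovery_config.Data.lumiMask = not_finished

    # submit the recovery task
    return crabCommand('submit', config=recovery_config)
```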
|
option a)
There are still a couple of things that I do not like:
[1] https://github.com/dmwm/CRABClient/blob/00d5e3ced6389fe328fe429eb41941a2a6a7f1d6/src/python/CRABClient/Commands/SubCommand.py#L312 |
@mapellidario it will take me a bit to go through all of this, thanks for the careful description. So I'd focus on "we only recover a task for which the project (working) directory is still around and which was submitted less than 30 days ago". And leave as a future project recovering "any task" from info obtainable via |
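To make that restriction concrete, here is a hypothetical eligibility check (helper name and exact policy are illustrative, not from the actual client code): task names start with a YYMMDD_HHMMSS timestamp, so both conditions can be tested cheaply on the client side.

```python
# Hypothetical sketch of the "project dir still around and < 30 days old" check.
import os
from datetime import datetime, timedelta

def is_recoverable(taskname, projdir, max_age_days=30):
    # the working directory must still exist on the user's machine
    if not os.path.isdir(projdir):
        return False
    # task names look like '230724_110848:username_requestname'
    submitted = datetime.strptime(taskname.split(':')[0], '%y%m%d_%H%M%S')
    return datetime.utcnow() - submitted < timedelta(days=max_age_days)
```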
Using a modified version of Wa's "copytask" [1], I managed to submit a recovery task that uses the lumimask with unprocessed lumis from "crab report". Replicate the result
My code is still at [2], but it is not ready for review yet. I will need to add some major improvements:
and some small improvements
[1] dmwm/CRABClient@master...novicecpp:new_jobtype_copyoftask |
It is now possible to submit the recover task to a different crabserver instance. ACHTUNG! This use case will likely only be necessary for crab developers, not for regular users. We will not support linking crab tasks across different crabserver instances, so some features will not be available, such as automatically aggregating
Example: recover the task
crab recover \
--task=230925_142750:dmapelli_crab_recover_20230925_162750 \
--instance test11 \
--destinstance preprod |
I am amazed that this is necessary, anyhow... congratulations! |
(replying to "not for regular users") However, Wa already implemented this functionality in his version of the "copycat" jobtype, so I needed to support it with my version of the "copycat" jobtype if we want to merge the two versions :) |
problem
I noticed that
solution
In order to recover a task with filebased splitting, we can use info from [1]

[1] https://cmsweb.cern.ch:8443/scheddmon/0107/vsedighz/231013_143854:vsedighz_crab_FBCMDIGIPU200/
|
you should find in the SPOOL_DIR/debug folder the tar with the list of files for each job, and from |
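To illustrate one way to turn those per-job file lists back into an input list for a recovery task, a sketch along these lines could work (the tarball and member names follow the files linked in the next comment; the JSON format of each list is an assumption):

```python
# Sketch only: rebuild the input files of the jobs to recover from the tarball
# with per-job file lists found in the schedd SPOOL_DIR/debug area.
# The member names and the JSON format of each list are assumptions.
import json
import tarfile

def files_for_jobs(tarball_path, job_ids):
    """Collect the input files of the given (e.g. failed) jobs, e.g. to build Data.userInputFiles."""
    recovered = []
    with tarfile.open(tarball_path, 'r:gz') as tar:
        for jobid in job_ids:
            member = tar.extractfile('job_input_file_list_%s.txt' % jobid)
            recovered.extend(json.load(member))
    return recovered
```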
aaaah I see, thanks Stefano! I can use the
My previous comment was misled by a task that I found, which we can define as a bit pathological [2] :) All good now, I know what to do :)

[1] https://cmsweb.cern.ch:8443/scheddmon/059/dmapelli/231013_153540:dmapelli_crab_recoverfile_20231013_173538/SPOOL_DIR/input_files.tar.gz
[2] 231013_143854:vsedighz_crab_FBCMDIGIPU200, SPOOL_DIR/input_files.tar.gz/job_input_file_list_1.txt
config:
# [...]
config.JobType.psetName = 'new_cfg.py'
config.JobType.pluginName = 'Analysis'
config.JobType.pyCfgParams = ['pu=200']
config.Data.inputDataset = ''
config.Data.unitsPerJob = 1
config.Data.userInputFiles = ['file:/dev/null']
config.Data.splitting = 'FileBased'
# [...]

pset:
# [...]
process.maxEvents = cms.untracked.PSet( input = cms.untracked.int32(100) )
# [...] |
yeah... that must have been some test or experiment or whatever... !
still, I think we should put all those tarfiles in S3 |
I added a few commits to my crab recover branch [1] to make it compatible with recent crab client changes, in particular
Keep in mind that you can run |
I added a new feature: a recovery task publishes into the same dataset as the original failing task, even if the failing task did not specify an output dataset tag.
example 0: output dataset specified (this was already done, since the output dataset tag is inherited from the parent task config)
example 1: no output dataset specified
this required an ugly "hack" to remove the zeros as described in #4947
failingTaskPublishName = failingTaskPublishName.replace("-00000000000000000000000000000000", "")
maybe there are better ways of doing this, maybe we already save the same name without the zeros somewhere, I am open to suggestions! |
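For clarity, this is what the hack effectively does to the inherited publish name before it is reused (illustration only, not the actual client code; as far as I understand, the placeholder is the all-zero pset hash that CRAB stores when no output dataset tag was given):

```python
# Illustration of the effect of the hack above; example strings are invented.
PLACEHOLDER_HASH = "-00000000000000000000000000000000"

def recovery_publish_name(failing_task_publish_name):
    # drop the all-zero placeholder so the name can be reused as the
    # recovery task's output dataset tag
    return failing_task_publish_name.replace(PLACEHOLDER_HASH, "")

# e.g. 'crab_recovernotag_20231026_130815-00000000000000000000000000000000'
#   -> 'crab_recovernotag_20231026_130815'
```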
Do we need this? Because at the DBS and Rucio ASO level, the files belong to the same dataset if both PSets are the same. |
@novicecpp I am afraid we need something like this in order to pass to ASO the information it needs. Then I do not see why replacing the zeroes with
I confess that this part of CRAB always confuses me. |
Because otherwise in the taskdb we set
Files end up in the same dataset if the workflow name and pset hash are the same [2], but making the recovery task have the same workflow name as the original failing task can create collisions on the user's local filesystem, where the directory of the cache of the original failing task would be the same as the one of the new recovery task. Or maybe I am missing something; it is true that both of you have definitely more experience than me with publication :)
[1] https://cmsweb-test11.cern.ch/crabserver/ui/task/231027_094754%3Admapelli_crab_20231027_114753
[2] compare
|
OK, I think I see the problem with 00000000000000000000000000000000. Ugly, but... let's live with it for now; I may look for a better solution at some time, possibly we need to revisit our handling of the publish dataset name. But I am confused about the problem with duplicate file names. |
what is set by
if I understand correctly:
[1] https://cmsweb-test11.cern.ch/crabserver/ui/task/231026_110823%3Admapelli_crab_recovernotag_20231026_130815 published to /GenericTTbar/dmapelli-crab_recovernotag_20231026_130815-94ba0e06145abd65ccb1d21786dc7e1d/USER |
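For reference, the published dataset name in [1] decomposes as /primary/username-publishname-psethash/USER; a tiny illustration (the helper is purely illustrative) of why the publish name and the pset hash together decide which dataset the recovery output joins:

```python
# Illustration only: how a CRAB USER dataset name is assembled from its pieces,
# matching the example published dataset in [1] above.
def user_dataset_name(primary_dataset, username, publish_name, pset_hash):
    return "/%s/%s-%s-%s/USER" % (primary_dataset, username, publish_name, pset_hash)

# Two tasks publish into the same dataset only when all of these pieces coincide,
# which is why the recovery task inherits the publish name (and pset) of the parent.
print(user_dataset_name("GenericTTbar", "dmapelli",
                        "crab_recovernotag_20231026_130815",
                        "94ba0e06145abd65ccb1d21786dc7e1d"))
# -> /GenericTTbar/dmapelli-crab_recovernotag_20231026_130815-94ba0e06145abd65ccb1d21786dc7e1d/USER
```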
checkpoint
I am ready to submit the first pull request: dmwm/CRABClient#5250
Let me write down where we stand at the moment and what my plans are for the future.
nomenclature:
example
crab recover \
--task=231104_120158:kivanov_crab_Bfinder_2018_RNF_signal_test_UL_2018D-UL2018_-BPH5 \
--strategy=notFinished \
--instance=prod \
--destinstance=test11 \
--proxy=$X509_USER_PROXY

which tasks can we recover with the current implementation
open questions:
plans for the future
(as I add them here, I will cross them out in previous comments to keep this one as the single source of truth)
functional
technical
|
@mapellidario I believe that, with the last bit of work now captured in dmwm/CRABClient#5330, this can be closed. Do you agree? Am I forgetting anything? |
yessss we can close! (after quite a while :) ) |
users complain that creating and running recovery tasks is annoying
While improvements like those indicated in #6539 (if ever done) may lessen the need for this, there is no way around the basic design of CRAB: workflows come in well-defined tasks, i.e. units of work which are not extendable by adding data and cannot be kept around for a very long time. We currently allow one month to complete a task; then it is scratched. Recovery tasks will always be needed, just like P&R relies on ACDC.
The recovery task procedure is well documented:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3Troubleshoot
https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3FAQ#Recovery_task
https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3AdvancedTutorial#Exercise_3_recovery_task
and reports of problems with it are really exceptional, but it requires a bit of manual work which could probably be automated more.
Ideas:
crab recovery --task <taskname>
command which does all the legwork automatically
crab report
make sure that recovery information is saved somewhere before the task expires and is purged, so that recovery can also be started for tasks close to deadline or "freshly expired". -> we decided not to do this :)