
Make it easier to run recovery tasks #6540

Closed
1 of 7 tasks
belforte opened this issue Apr 16, 2021 · 24 comments

@belforte
Member

belforte commented Apr 16, 2021

Users complain that creating and running recovery tasks is annoying.

While improvements like the ones indicated in #6539 (if ever done) may lessen the need for this, there is no way around the basic design of CRAB: workflows come in well-defined tasks, i.e. units of work which cannot be extended by adding data and cannot be kept around for a very long time. We currently allow one month to complete a task, after which it is scratched. Recovery tasks will always be needed, just like P&R relies on ACDC.

The recovery task procedure is well documented:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3Troubleshoot
https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3FAQ#Recovery_task
https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3AdvancedTutorial#Exercise_3_recovery_task
and reports of problems with it are really exceptional, but it requires a bit of manual work which could probably be automated further.

Ideas:

  • implement a crab recovery --task <taskname> command which does all the legwork automatically
  • make sure that we properly recover tasks w/o DBS publication -> covered in new issue, filebased splitting
  • add Rucio to the "bookkeeping of outputs" strategy for recovery -> covered in new issue, filebased splitting
  • even if the user does not run crab report, make sure that recovery information is saved somewhere before the task expires and is purged, so that recovery can also be started for tasks close to the deadline or "freshly expired" -> we decided not to do this :)
  • reuse automatic splitting technology to run tail subdags also for tasks which did not use automatic splitting to begin with (e.g. nanoAOD processing in file split mode) -> nope, we do not do this
  • have a persistent wrapper around crab (à la ganga of old times) which tracks submitted tasks locally on the user's disk and allows manipulating them, e.g. grouping them "per campaign". Ideally this could also be offered as a remote service. -> nope, we use a crab client command instead
  • tell/teach users to use REANA or similar modern tools which do all of this for them -> again, we decided to have a crab client command to help :)
@belforte belforte added Area: zDesired Feature and removed Area: Needed Feature flagged as such by Physics Coordination or similar labels Apr 5, 2022
@mapellidario mapellidario self-assigned this Mar 10, 2023
@mapellidario
Member

mapellidario commented Jul 25, 2023

Plan

What we need to implement

  • automatic recovery of an Analysis task, with input in DBS, without output in DBS
  • automatic recovery of an Analysis task, with input in DBS, with output in DBS
    • propagate option --recovery= from crab recover to crab report
  • check: automatic recovery of an Analysis task, with input in DBS and automatic splitting is handled correctly
  • check: automatic recovery of an Analysis task, with input in DBS and config.Data.secondaryInputDataset, is handled correctly
  • disable automatic recovery on PrivateMC tasks
  • filebased splitting (can not use crab report, will likely need to use status_cache) (moved to Make it easier to run recovery tasks #6540 (comment))
  • automatic recovery of an Analysis task, with input not in DBS (config.Data.userInputFiles ? rucio?) (moved to Make it easier to run recovery tasks #6540 (comment))

Wish list, i.e. would be nice to have


as always, comments are welcome!

@mapellidario
Member

Start from the basics

Scope: implement crab recover --task $taskname for an "Analysis" task whose input is in DBS

Prerequisites

  • failing task, whose jobs randomly fail: recover
  • task to complete the failing task: recover2

Design choices

I think we can implement the new command in many ways; I still have to decide which one is best.

  • a) keep the logic in the client.

    • since the current instructions in the twiki for submitting a recovery task work OK (they heavily rely on crab report), we can re-use the client code from report.py in a new crab recover subcommand
    • advantage: do not duplicate crab report code into TW
    • cons:
      • crab client has not been designed to reuse code from one command in another. It can be done, but it feels hacky.
      • crab client requires the same scram_arch and cmssw_version as the ones used to submit the original task
  • b) move the logic to TW, in a clean way for the user

    • create a new row in the DB, without any detail, just a reference to an old task.
      • maybe set status "NEW" on command "RECOVER"
      • maybe create a new REST endpoint, that skips all the validation of the many fields required for crab submit
    • TW reads the new "empty" row and builds the recovery task from the reference to the old task
    • advantage: no need for intrusive changes in the client, which can be executed from whichever cmssw version
    • con:
      • duplicates crab report's code for computing notFinished lumis in TW
      • new value for "command" column in task DB
      • not sure where to put the reference to the old task. maybe in an already existing CLOB?
  • c) move the logic to the TW, but the user needs to have the original config

    • create new option for task config, for example config.Data.recoverFromTaskname
    • use the same config as the original task, just adding an option in the crab task config that references the failing task; the user runs crab submit as usual (see the config sketch after this list)
    • TW processes the task as usual, then, after DBSDataDiscovery, uses the same logic as crab report to compute the list of lumis that have not been processed.
    • task submission goes on as usual
    • pro: no need for intrusive changes in crabclient, nor in crab rest and DB
    • cons:
      • user experience is not the best
      • still requires copying some logic from crab report to TW
      • crab recover needs to be executed with the same scram_arch and cmssw version as the original failing task.
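For concreteness, a minimal sketch of what the user-facing config for option c) could look like; config.Data.recoverFromTaskname is the proposed (not yet existing) option, and the dataset, pset and task names are purely illustrative:

from CRABClient.UserUtilities import config
config = config()

# same content as the original, failing crabConfig.py ...
config.General.requestName = 'recover2'
config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'pset.py'
config.Data.inputDataset = '/GenericTTbar/HC-CMSSW_9_2_6_91X_mcRun1_realistic_v2-v2/AODSIM'
config.Data.splitting = 'LumiBased'
config.Data.unitsPerJob = 10
config.Site.storageSite = 'T2_CH_CERN'

# ... plus one extra option referencing the failing task, so that TW (after
# DBSDataDiscovery) can restrict the new task to the lumis never processed
config.Data.recoverFromTaskname = '230724_110848:dmapelli_crab_recover_20230724_130846'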

@mapellidario
Member

mapellidario commented Jul 25, 2023

Option a)

I am currently exploring option a). It felt like the natural solution, and I wanted to have a strong reason before deciding to move the logic from crab report to TW.

How to replicate: after submitting a failing task (with the config described above), source the crab client such that it runs the code from [1], then run:

rm -rf crab_recover_20230724_130846
crab recover --task=230724_110848:dmapelli_crab_recover_20230724_130846 --instance test11

At the moment, crab recover is such that

  • it calls crab remake to create a cache for the given failing task
  • it calls crab kill to avoid any change in the failing task
  • it calls crab report to get the list of lumis that have not been processed in the failing task

Next steps are:

  • check SCRAM_ARCH and CMSSW version, they'd better match the ones from the original failing task.
  • get the configuration of original failing task
    • option 1) just re-use the same config used for the original submission. This feels easier to implement (it does not require implementing a new crab client command), but requires the user to have the original config at hand. It is a tradeoff that I am willing to make at this early stage
    • option 2) clone the config of the failing task, with something similar to dmwm/CRABClient@master...novicecpp:new_jobtype_copyoftask (thanks @novicecpp for the idea! :) )
  • change the config of the failing task, setting as a lumi mask the content of the file notFinished.json from crab report
  • run crab submit, to submit a new task with the edited config (same as the original, just with a new config.Data.lumiMask); a rough sketch of the whole sequence follows below

[1] dmwm/CRABClient@master...mapellidario:20230724-recover
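This is only a rough, standalone sketch of the sequence above (remake, kill, report, then resubmit with the not-finished lumi mask), using the public CRABAPI entry point rather than the actual crab recover code. The task name, project directory and crabConfig module are illustrative, and it assumes crabCommand forwards keyword arguments as the corresponding command-line options:

import os
from CRABAPI.RawCommand import crabCommand

FAILING_TASK = '230724_110848:dmapelli_crab_recover_20230724_130846'
PROJDIR = 'crab_recover_20230724_130846'   # recreated locally by 'remake'

crabCommand('remake', task=FAILING_TASK)   # rebuild the local cache of the failing task
crabCommand('kill', dir=PROJDIR)           # freeze it: no more resubmissions/transfers
crabCommand('report', dir=PROJDIR)         # writes results/notFinishedLumis.json

# reuse the original configuration, restricted to the lumis left behind
import crabConfig                          # the user's original crabConfig.py, assumed importable
cfg = crabConfig.config
cfg.General.requestName += '_recovery'     # avoid clashing with the old work area
cfg.Data.lumiMask = os.path.join(PROJDIR, 'results', 'notFinishedLumis.json')
crabCommand('submit', config=cfg)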

@mapellidario
Member

mapellidario commented Jul 28, 2023

Promising result: starting from a failing task [1], I managed to submit a proper recovery task [2] that processed all the lumis that had been left behind.

I still have many details to sort out, many rough edges to smooth, but the approach of keeping all the logic in crabclient looks promising.

Replicate the result with:

  • submit an analysis task, some of whose jobs fail
  • clone the git client code from [3]
  • cmsenv in the same env that was used to submit the original failing task
  • make sure you have at hand the same crab config, pset, ... used to submit the original failing task.
  • crab recover --task=$original_taskname --instance $instance1 -s crabConfig.py
    • instead of -c crabConfig.py use -s crabConfig.py (I will try to fix this)

[1] https://cmsweb-test11.cern.ch/crabserver/ui/task/230724_110848%3Admapelli_crab_recover_20230724_130846 (type Analysis, input from DBS, no publication.)

Lumis, from crab report
> cat lumisToProcess.json
{"1": [[419, 419], [592, 592], [652, 652], [1261, 1261], [1849, 1849], [1858, 1858], [2083, 2083], [2465, 2465], [2702, 2702], [2748, 2748]]}
> cat processedLumis.json
{"1": [[419, 419], [592, 592], [652, 652], [1261, 1261], [1849, 1849], [1858, 1858], [2702, 2702], [2748, 2748]]}
> cat notFinishedLumis.json
{"1": [[2083, 2083], [2465, 2465]]}

[2] https://cmsweb-test11.cern.ch/crabserver/ui/task/230728_175235%3Admapelli_crab_test_20230728_195233

Lumis, from crab report
> cat processedLumis.json
{"1": [[2083, 2083], [2465, 2465]]}

[3] dmwm/CRABClient@master...mapellidario:20230724-recover
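Illustration only, not the crab report implementation: the notFinishedLumis.json above is simply the per-run set difference between lumisToProcess.json and processedLumis.json, e.g.:

import json

def expand(mask):
    # {"run": [[first, last], ...]} -> {run: set of lumis}
    return {run: {lumi for first, last in ranges for lumi in range(first, last + 1)}
            for run, ranges in mask.items()}

def compress(lumis_by_run):
    # {run: set of lumis} -> {"run": [[first, last], ...]}
    out = {}
    for run, lumis in sorted(lumis_by_run.items()):
        ranges = []
        for lumi in sorted(lumis):
            if ranges and lumi == ranges[-1][1] + 1:
                ranges[-1][1] = lumi
            else:
                ranges.append([lumi, lumi])
        if ranges:
            out[run] = ranges
    return out

with open('lumisToProcess.json') as f:
    to_process = expand(json.load(f))
with open('processedLumis.json') as f:
    processed = expand(json.load(f))

not_finished = {run: lumis - processed.get(run, set())
                for run, lumis in to_process.items()}
print(json.dumps(compress(not_finished)))   # {"1": [[2083, 2083], [2465, 2465]]}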

@mapellidario
Member

mapellidario commented Jul 31, 2023

option a)

There are still a couple of things that I do not like:

  • remove the need for the option -s / --recover-config; one should be able to pass -c / --config

    • using -c / --config is not supported at the moment: as soon as the recover object is created, SubCommand parses the -c option even though I did not specify anywhere that recover needed it; it then creates a local directory with a new taskname for the recovery task, and eventually submit fails because it wants to create a new directory for the new task, but the directory is already there
    • This would require some refactoring to how the option -c / --config is treated in crab client
    • The offending code is here [1]; maybe it could be executed only when crab submit is called, for example in the submit constructor [2], after this function [3] runs
  • it should be possible to recover a task without having the config of the original task at hand

  • if the notFinishedLumis.json does not exist, exit!


[1] https://github.com/dmwm/CRABClient/blob/00d5e3ced6389fe328fe429eb41941a2a6a7f1d6/src/python/CRABClient/Commands/SubCommand.py#L312
[2] https://github.com/dmwm/CRABClient/blob/00d5e3ced6389fe328fe429eb41941a2a6a7f1d6/src/python/CRABClient/Commands/submit.py#L35
[3] https://github.com/dmwm/CRABClient/blob/00d5e3ced6389fe328fe429eb41941a2a6a7f1d6/src/python/CRABClient/ClientUtilities.py#L594C11-L594C11

@belforte
Member Author

@mapellidario it will take me a bit to go through all of this, thanks for the careful description.
Maybe we can have a chat?
But as a start, be aware that the original config and pset are currently not saved in S3; AFAIK copycat and copyoftask pick them from the WEB_DIR/debug directory on the schedd, i.e. they are only available for 30 days from submission.
OTOH, if the user still has the work directory of the task to be recovered, those will be available in <projdir>/inputs/debugFiles.tgz

So I'd focus on "we only recover a task for which the project (working) directory is still around and which was submitted less than 30 days ago".

And leave as a future project recovering "any task" from info obtainable via crab remake, which I think will require putting debugFiles.tgz in S3. Even then, there is a time limit (currently 3 months). We can surely increase the 3 months, but not make it infinite.
We should also review crab report to check whether it uses info from the DB or from the scheduler's WEB_DIR; both have a finite lifetime anyhow, 3 and 1 months respectively.

@mapellidario
Member

mapellidario commented Oct 2, 2023

Using a modified version of Wa's "copytask" [1], I managed to submit a recovery task that uses the lumimask with unprocessed lumis from "crab report".

Replicate the result

  • I submitted an "analysis" task with 5 jobs, three of which failed.
  • then, from a clean environment, without even the crab task directory, I ran crab recover --task=230925_142750:dmapelli_crab_recover_20230925_162750 --instance test11 --proxy=$X509_USER_PROXY, which submitted this recovery task, automatically running behind the scenes:
    • crab remake
    • crab kill
    • crab report
    • crab getsandbox (new command)
    • crab submit (with new Report.py jobtype, modification of wa's copycat)

My code is still at [2], but it is not ready for review yet. I will need to add some major improvements:

  • add some checks to avoid submitting a recovery task when the original task still has some jobs running
  • support failing tasks of type private-mc -> we will not support this
  • link the recovery task to the original task, so that "crab report" on the recovery task can take into account the original task as well. (moved to Make it easier to run recovery tasks #6540 (comment))

and some small improvements


[1] dmwm/CRABClient@master...novicecpp:new_jobtype_copyoftask
[2] dmwm/CRABClient@master...mapellidario:20230724-recover

@mapellidario
Member

It is now possible to submit the recovery task to a different crabserver instance.

ACHTUNG! This use case will likely only be necessary for crab developers, not for regular users. We will not support linking crab tasks across different crabserver instances, so some features will not be available, such as automatically aggregating crab report results across a failing task and its recovery tasks.

Example: recover the task 230925_142750:dmapelli_crab_recover_20230925_162750 on test11 with a recovery task on preprod

crab recover \
  --task=230925_142750:dmapelli_crab_recover_20230925_162750 \
  --instance test11 \
  --destinstance preprod

@belforte
Member Author

I am amazed that this is necessary, anyhow... congratulations!

@mapellidario
Member

I am amazed that this is necessary

not for crab recover per se, you are right.

However, Wa already implemented this functionality in his version of the "copycat" jobtype, so I needed to support it in my version of the "copycat" jobtype if we want to merge the two versions :)

@mapellidario
Member

problem

I noticed that crab report does not support tasks that use FileBased splitting [1], since it requires, for example, input_dataset_lumis.json, which is not guaranteed to be present and non-empty when filebased splitting is used.

solution

In order to recover a task with filebased splitting, we cannot use info from crab report, so I will need to use status_cache or some other file on the schedd.


[1]

https://cmsweb.cern.ch:8443/scheddmon/0107/vsedighz/231013_143854:vsedighz_crab_FBCMDIGIPU200/

Singularity> crab report -d crab_FBCMDIGIPU200 --proxy=$X509_USER_PROXY
Running crab status first to fetch necessary information.
Error: Cannot get all the needed information for the report. Notice, if your task has been submitted more than 30 days ago, then everything has been cleaned.
Log file is /home/dario/crab/local/z-submitted/crab_FBCMDIGIPU200/crab.log

@belforte
Member Author

You should find in the SPOOL_DIR/debug folder the tar with the list of files for each job, and from crab status the list of successful jobs. I never remember whether the debug folder is also in S3 now. status_cache will not help, it is the basis for crab status!
I am not convinced that the list of input lumis is not available; at worst it can be picked from DBS, and the list of processed ones should be in FILEMETADATA, but that table is so heavy to process... if you need lumis, we can always get per-file lumi info from DBS.
But I would simply start with the list of files still to be processed!
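A rough sketch of that last idea (this is not existing CRAB code; the tarball path and the set of finished job ids are illustrative): take the per-job input file lists from input_files.tar.gz and keep only the files belonging to jobs that crab status does not report as finished.

import json
import re
import tarfile

finished_jobs = {'1', '4'}        # hypothetical: job ids reported as 'finished' by crab status

missing_files = set()
with tarfile.open('input_files.tar.gz') as tar:
    for member in tar.getmembers():
        match = re.search(r'job_input_file_list_(\d+)\.txt$', member.name)
        if match is None or match.group(1) in finished_jobs:
            continue
        fobj = tar.extractfile(member)
        if fobj is None:
            continue
        # each member is a JSON list of input files, e.g. ["/store/.../file.root"]
        missing_files.update(json.load(fobj))

# a FileBased recovery task could then set
#   config.Data.userInputFiles = sorted(missing_files)
print(sorted(missing_files))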

@mapellidario
Member

aaaah i see, thanks Stefano! I can use the SPOOL_DIR/input_files.tar.gz file.

My previous comment was misled by a task that I found, which we can define as a bit pathological [2] :)

All good now, I know what to do :)

[1] https://cmsweb.cern.ch:8443/scheddmon/059/dmapelli/231013_153540:dmapelli_crab_recoverfile_20231013_173538/SPOOL_DIR/input_files.tar.gz
[2] https://cmsweb.cern.ch/crabserver/ui/task/231013_143854%3Avsedighz_crab_FBCMDIGIPU200

231013_143854:vsedighz_crab_FBCMDIGIPU200

SPOOL_DIR/input_files.tar.gz/job_input_file_list_1.txt

["file:/dev/null"]

config:

# [...]
config.JobType.psetName = 'new_cfg.py'
config.JobType.pluginName = 'Analysis'
config.JobType.pyCfgParams = ['pu=200']
config.Data.inputDataset = ''
config.Data.unitsPerJob = 1
config.Data.userInputFiles = ['file:/dev/null']
config.Data.splitting = 'FileBased'
# [...]

pset

# [...]
process.maxEvents = cms.untracked.PSet( input = cms.untracked.int32(100) )
# [...]

@belforte
Member Author

yeah... that must have been some test or experiment or whatever... !

== CMSSW: 13-Oct-2023 16:40:20 CEST  Initiating request to open file file:/dev/null
== CMSSW: terminate called after throwing an instance of 'edm::Exception'
== CMSSW:   what():  An exception of category 'FatalRootError' occurred.

still, I think we should put all those tarfiles in S3

@mapellidario
Member

I added a few commits to my crab recover branch [1] to make it compatible with recent crab client changes.

Keep in mind that you can run crab recover from whatever cmssw version you want: the scram arch and cmssw version of the recovery task will be taken from the original task config, not from the environment crab recover is run from. So, although it is not strictly necessary, I opted for making the crab recover code py2-compatible, so that for consistency it can also be run with cmssw <= 9.

[1] dmwm/CRABClient@master...mapellidario:20230724-recover

@mapellidario
Member

I added a new feature: a recovery task publishes in the same dataset as the original failing task, even if the failing task did not specify an output dataset tag:

example 0: output dataset specified (this was already done, since the output dataset tag is inherited from the parent task config)

example 1: no output dataset specified

This required an ugly "hack" to remove the zeros, as described in #4947:

failingTaskPublishName = failingTaskPublishName.replace("-00000000000000000000000000000000", "")

Maybe there are better ways of doing this, or maybe we already save the same value without the zeros somewhere; I am open to suggestions!

@novicecpp
Contributor

Do we need this? At the DBS and Rucio ASO level, the files belong to the same dataset if both PSets are the same.

@belforte
Member Author

@novicecpp I am afraid we need something like this in order to pass to ASO the information it needs.
The problem is that the PSet hash needs to be computed on the output file, after cmsRun has run.
So it is known to PostJob and used in ASO, but not stored in TaskDB.
IIUC, Dario needs a way to set a PublishName for the new task such that, when it runs, the ASO scripts will find the same output-dataset identifier as for the task to be recovered.

Then I do not see why replacing the zeroes with "" is important, but I guess I need to look at the code at some point.

I confess that this part of CRAB always confuses me.

@mapellidario
Member

Then I do not see why replacing the zeroes with "" is important

Because otherwise in the taskdb we set tm_publish_name = crab_recovernotag_20231026_130815-00000000000000000000000000000000-00000000000000000000000000000000, as in [1] :)

the files are belong to the same dataset if both PSets are the same

They do if the workflow name and pset hash are the same [2], but making the recovery task have the same workflow name as the original failing task can create collisions on the user's local filesystem, where the cache directory of the original failing task would be the same as that of the new recovery task. Or maybe I am missing something; it is true that both of you definitely have more experience with publication than me :)


[1] https://cmsweb-test11.cern.ch/crabserver/ui/task/231027_094754%3Admapelli_crab_20231027_114753

[2] compare

@belforte
Member Author

OK, I think I see the problem with 00000000000000000000000000000000. Ugly, but... let's live with it for now; I may look for a better solution at some point, possibly we need to revisit our handling of the publish dataset name.

But I am confused about the problem with duplicate file names.
Even if you submit the same crabConfig twice on the same data, you end up with files output_1.root etc. in two separate disk directories and separate disk blocks. Maybe I do not understand what you mean by workflow name. A time stamp is inserted into the task name by the server at submit time.

@mapellidario
Member

what you mean by workflow name

what is set by General.requestName.

if I understand correctly:

  • given General.requestName, crab client creates the directory {General.workArea}/{General.requestName}
    • If I want to submit a recovery task, since I use crab remake, either I use two different "workAreas", or, if I do not change the request name, I get an error like:
      Working area '/afs/cern.ch/user/d/dmapelli/crab/submit/0-dev/../z-submitted/crab_test_20231006_114236' already exists
      Please change the requestName in the config file
      
    • so, I decided to change the requestName of the recovery task, to avoid dealing with directories in users' local work areas (wherever they submit crab tasks from).
  • If the user does not set Data.outputDatasetTag, we set it to {username}-crab_{General.requestName}-{pset hash}-0[...]0, see [1]
    • if the user does not set the outputDatasetTag, then the files of the recovery task are not guaranteed to end up in the same DBS dataset. In order for them to do so, either
      • a. the recovery task has the same requestName as the original task (the pset hash is already the same), but I want to avoid that, as explained above
      • b. or I need to set Data.outputDatasetTag explicitly in the recovery task to match the one of the original failing task (removing those zeros); a minimal config sketch follows below

[1]

https://cmsweb-test11.cern.ch/crabserver/ui/task/231026_110823%3Admapelli_crab_recovernotag_20231026_130815 published to /GenericTTbar/dmapelli-crab_recovernotag_20231026_130815-94ba0e06145abd65ccb1d21786dc7e1d/USER
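A minimal sketch of option b. above; the publish name and request name are taken from the example task in this thread and are illustrative, not the final implementation:

from CRABClient.UserUtilities import config
config = config()

# publish name of the original failing task, as stored in the task DB
failing_publish_name = 'crab_recovernotag_20231026_130815-00000000000000000000000000000000'

# fresh requestName, so the local work area does not clash with the old one
config.General.requestName = 'recovernotag_20231026_130815_recovery'

# pin the output dataset tag to the original publish name, stripped of the
# zero placeholder, so that the recovery outputs land in the same DBS dataset
config.Data.outputDatasetTag = failing_publish_name.replace(
    '-00000000000000000000000000000000', '')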

@mapellidario
Member

mapellidario commented Nov 10, 2023

checkpoint

I am ready to submit the first pull request: dmwm/CRABClient#5250

Let me write down where do we stand at the moment and what are my plans for the future.

nomenclature:

  • "original failing task": a task with some failed jobs, where we think that all automatic and manual resubmission will not help anymore, that we wish to kill, and that should be recovered with instructions at https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3FAQ#Recovery_task
  • "recovery task": the new task that continues on the same endeavour of an "original failing task"

example

crab recover \
--task=231104_120158:kivanov_crab_Bfinder_2018_RNF_signal_test_UL_2018D-UL2018_-BPH5 \
--strategy=notFinished \
--instance=prod \
--destinstance=test11 \
--proxy=$X509_USER_PROXY

which tasks can we recover with current implementation

  • analysis
  • splittings aware of lumisections: "automatic", "lumibased", "eventawarelumibased"
    • motivation: crab report should produce a meaningful output
  • task needs to be "static":
    • no jobs running, no transfers ongoing, no publications ongoing.
  • input: needs to be in DBS, due to crab report implementation
  • output:
    • DBS
      • --strategy (notFinished|notPublished)
      • consider notPublished
      • recovery task publishes in the same dbs dataset
    • not in DBS
      • --strategy notFinished
  • at least one job is successful
    • otherwise the user is better off submitting the same original task from scratch
    • motivation: no info in the filemetadata DB, so crab report fails. If we want, this is an easy fix.
  • it is possible to recover the task of another user, but this is still experimental!
    • however, the original task may not be killed if whoever runs crab recover is not a crab operator
    • outputLFNDirBase is reset
    • I did not disable publication in this case, yet
    • in general, I did not check that we can recover in all cases (the only conflict I noticed so far was outLFNDirBase, but there may be others) or that we avoid all negative side effects

open questions:

  • double check if we can properly identify a task with automatic splitting to be "static"
  • I do not know how to treat the job statuses "cooloff", "held", "killed"; should I consider these "final" or "transient"?
  • can we recover tasks with "non trivial" sandbox? user compiled libraries, ... and such

plans for the future

(as I add them here, I will cross them out in previous comments to keep this one as the single source of truth).

functional

  • crab report should aggregate results across recovery tasks
    • check what happens with crab report when two tasks publish in the same dbs dataset
    • in task db, recovery task should point to original failing task
    • decided not to do this: it is tricky to get right, we do not want to make promises, and users should do their own bookkeeping
  • recover tasks whose input does not use lumisection information
    • input: from DBS, but using filebased splitting
    • input: list of lfns (not using DBS) (Data.userInputFiles)
    • input: rucio container
    • input: list of local files
    • moved to another issue

technical

  • crab recover should be available in crabapi
  • download from s3 as much as possible, not from schedd
    • TW uploads run_and_lumis.tar.gz, input_files, ... to S3
  • reduce code duplication with Wa's "copycat" jobtype -> moved to a new issue

@belforte
Member Author

@mapellidario I believe that, with the last bit of work now captured in dmwm/CRABClient#5330, this can be closed. Do you agree? Am I forgetting anything?

@mapellidario
Member

yessss we can close! (after quite a while :) )
