
resubmission of successful jobs only works while Dagman is running #8089

Closed · belforte opened this issue Dec 6, 2023 · 3 comments

belforte commented Dec 6, 2023

Properly verify this. Why can failed jobs also be resubmitted later?

If possible, fix it. Otherwise, document it.
Connects to dmwm/CRABClient#5273

belforte self-assigned this Dec 6, 2023

belforte commented Dec 8, 2023

maybe "as simple" as: if some jobs fail there's a Rescue DAG which we can trigger into action. But if will only contain failed nodes. ANd possibly the DAG stays in HOLD.

Once the dagman_bootstrap job has exited the HTCondor queue, there is no way to restart it. A new condor_submit is needed.
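A quick way to tell which situation a task is in is to check whether its DAGMan job is still in the schedd queue. A minimal sketch, not CRAB code, assuming the DAGMan bootstrap job advertises CRAB_ReqName as in the snippets later in this thread:

import classad
import htcondor

def dagman_still_queued(task_name):
    # DAGMan runs as a scheduler-universe job (JobUniverse == 7)
    schedd = htcondor.Schedd()
    const = 'CRAB_ReqName =?= %s && JobUniverse == 7' % classad.quote(task_name)
    return bool(schedd.query(constraint=const, projection=['ClusterId', 'JobStatus']))

If this returns False, editing the task ad and holding/releasing it cannot bring anything back: only a fresh condor_submit of the (rescue) DAG will restart the task.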

It would be nice to complement this with pointers to the code.
But maybe for the time being it is enough to close this by documenting that successful jobs can only be resubmitted while the task is running. After that... there is crab recovery and/or crab clone.

Anyhow, I am not sure that there are really important use cases where successful jobs need to be resubmitted!

The user report in https://cms-talk.web.cern.ch/t/crab-resubmit-force-not-working/32482/6 was something like "I can't believe that those jobs were successful and I want to try again". Nope!

belforte commented

In the meanwhile, here's a pointer to the DAGMan documentation
https://htcondor.readthedocs.io/en/latest/automated-workflows/dagman-resubmit-failed.html
and to the DagmanResubmitter code:

if ('resubmit_jobids' in task) and task['resubmit_jobids']:
    with HTCondorUtils.AuthenticatedSubprocess(proxy, tokenDir,
                                               logger=self.logger) as (parent, rpipe):
        if not parent:
            schedd.edit(rootConst, "HoldKillSig", 'SIGKILL')
            ## Overwrite parameters in the os.environ[_CONDOR_JOB_AD] file. This will affect
            ## all the jobs, not only the ones we want to resubmit. That's why the pre-job
            ## is saving the values of the parameters for each job retry in text files (the
            ## files are in the directory resubmit_info in the schedd).
            for adparam, taskparam in params.items():
                if taskparam in ad:
                    schedd.edit(rootConst, adparam, ad.lookup(taskparam))
                elif task['resubmit_' + taskparam] != None:
                    schedd.edit(rootConst, adparam, str(task['resubmit_' + taskparam]))
            schedd.act(htcondor.JobAction.Hold, rootConst)
            schedd.edit(rootConst, "HoldKillSig", 'SIGUSR1')
            schedd.act(htcondor.JobAction.Release, rootConst)
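The inline comment above is worth a gloss: since schedd.edit on rootConst rewrites the ad for every job in the task, the pre-job has to snapshot the per-retry values somewhere so they are not lost on the next resubmission. A hypothetical illustration of that bookkeeping (file name and layout invented here; the real pre-job is more involved):

import json
import os

def save_resubmit_info(job_id, retry, params, directory='resubmit_info'):
    # one JSON file per job, keyed by retry number, kept on the schedd
    fname = os.path.join(directory, 'job.%s.txt' % job_id)
    info = {}
    if os.path.exists(fname):
        with open(fname, encoding='utf-8') as fd:
            info = json.load(fd)
    info[str(retry)] = params
    with open(fname, 'w', encoding='utf-8') as fd:
        json.dump(info, fd)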

and to AdjustSites.py, which edits the DAGMan logs:
# Hold and release processing and tail DAGs here so that modifications
# to the submission and log files will be picked up.
schedd = htcondor.Schedd()
tailconst = "TaskType =?= \"TAIL\" && CRAB_ReqName =?= %s" % classad.quote(ad.get("CRAB_ReqName"))
if resubmitJobIds and ad.get('CRAB_SplitAlgo') == 'Automatic':
    printLog("Holding processing and tail DAGs")
    schedd.edit(tailconst, "HoldKillSig", 'SIGKILL')
    schedd.act(htcondor.JobAction.Hold, tailconst)
if resubmitJobIds:
    adjustedJobIds = []
    filenames = getGlob(ad, "RunJobs.dag.nodes.log", "RunJobs[1-9]*.subdag.nodes.log")
    for fn in filenames:
        if hasattr(htcondor, 'lock'):
            # While dagman is not running at this point, the schedd may be writing events to this
            # file; hence, we only edit the file while holding an appropriate lock.
            # Note this lock method didn't exist until 8.1.6; prior to this, we simply
            # run dangerously.
            with htcondor.lock(open(fn, 'a'), htcondor.LockType.WriteLock):
                adjustedJobIds.extend(adjustPostScriptExitStatus(resubmitJobIds, fn))
        else:
            adjustedJobIds.extend(adjustPostScriptExitStatus(resubmitJobIds, fn))
    ## Adjust the maximum allowed number of retries only for the job ids for which
    ## the POST script exit status was adjusted. Why only for these job ids and not
    ## for all job ids in resubmitJobIds? Because if resubmitJobIds = True, which as
    ## a general rule means "all failed job ids", we don't have a way to know if a
    ## job is in failed status or not just from the RunJobs.dag file, while job ids
    ## in adjustedJobIds correspond only to failed jobs.
    adjustMaxRetries(adjustedJobIds, ad)
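The interesting part is adjustPostScriptExitStatus. As I understand it, DAGMan replays the .nodes.log when restarted in recovery mode, so rewriting a node's "POST Script terminated" event changes whether the node counts as done. A much-simplified, hypothetical version to show the idea (the real CRAB code is more careful about the event format; the parsing here is invented for illustration):

import re

def flip_post_exit_status(job_ids, filename):
    """Rewrite '(return value 0)' to '(return value 1)' in the POST Script
    terminated events of the given DAG nodes (job_ids like ['1', '5']),
    so DAGMan recovery sees them as failed and eligible for retry."""
    with open(filename, encoding='utf-8') as fd:
        lines = fd.readlines()
    adjusted = []
    in_event = False   # inside a 'POST Script terminated' event
    ret_idx = None     # index of its '(return value 0)' line, if any
    for i, line in enumerate(lines):
        if 'POST Script terminated' in line:
            in_event, ret_idx = True, None
        elif in_event and '(return value 0)' in line:
            ret_idx = i
        elif in_event:
            m = re.search(r'DAG Node: Job(\d+)', line)
            if m:
                if ret_idx is not None and m.group(1) in job_ids:
                    lines[ret_idx] = lines[ret_idx].replace('(return value 0)',
                                                            '(return value 1)')
                    adjusted.append(m.group(1))
                in_event = False
    with open(filename, 'w', encoding='utf-8') as fd:
        fd.writelines(lines)
    return adjusted

Together with adjustMaxRetries, this is what makes a successful node rerunnable at all; but it only takes effect if DAGMan is subsequently restarted via the hold/release cycle, which is exactly why the whole thing only works while the task is still in the queue.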

belforte commented

There's already a follow-up in dmwm/CRABClient#5273, where this was resolved by adding a warning.
As a definitive solution, we will deprecate resubmission of successful jobs: dmwm/CRABClient#5285
