
resubmissions when using Rucio for ASO #7404

Closed · belforte opened this issue Sep 27, 2022 · 3 comments
Labels: Priority: High, RUCIO, CRAB-Rucio integration
@belforte (Member)

Below is a very technical issue which we skipped today; it needed to be moved to GitHub.
We discussed the resubmission problem. We may have found a way out which does not require changes in Rucio. Katy volunteered to summarize.

We currently have a problem with Rucio stageout in case of job resubmission after a file was produced. When a job runs again, the transfer of the output fails immediately in Rucio due to a checksum mismatch between the original file and the new one (the checksum changes every time, even when one writes the same file, most likely because EDM/ROOT files contain timestamps) [2]. Diego is following up with the Rucio developers. The checksum is a property of the replica, and one would need to edit or remove the replica, but current policy only allows users to add DIDs and replicas in their scope, not to remove them; only Rucio admins can. The worst-case scenario is that we need to go back to the Crab2 model of a unique file name per job (a pain for users), which would need a wider discussion.
[1] https://twiki.cern.ch/twiki/bin/view/CMSPublic/RucioUserDocsData
[2] detailed problem description:

  • Let’s imagine job #10 in a CRAB task runs.
  • It leaves out_10.root on local /store/temp/user at site S1.
  • The file “name” is inserted in Rucio as a DID.
  • The file “physical attributes”, like checksum and RSE (site), are inserted in Rucio as a Replica.
  • Rucio will try to make another replica at the user-specified destination RSE.
  • Let’s imagine that this last step fails; the current criterion is as simple as “if it does not work in 24h”.
  • CRAB marks job_10 as failed.
  • The user types “crab resubmit” (they often do, without even looking at crab status first…).
  • CRAB resubmits job_10 and a new out_10.root is created on /store/temp/user at site S2.
  • This file will have the same size as the previous one, but since ROOT files contain timestamps, the checksum will be different.
  • When CRAB tries to insert the new file Replica in Rucio for the existing DID (a DID looks like user.belfo:out_10.root) to indicate the new physical file at the new location, Rucio will refuse to insert it (or to transfer it later, I am not sure) since the checksum does not match.
  • The solution is to remove the initial Replica from Rucio before inserting the new one. The CRAB code knows that this is a resubmission of job_10, but does not have enough privileges in Rucio to remove replicas.
  • There are details to be clarified, like: what if S2 = S1? Or: imagine there is no resubmission; when will Rucio stop trying to replicate out_10.root from S1 to the destination? (Note: S1 will be physically gone after a couple of weeks at most.)
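The checksum mismatch at the heart of this can be illustrated in a few lines of Python. This is an illustration only, not CRAB code: Rucio records an adler32 checksum per file, and since EDM/ROOT files embed timestamps, two writes of the same physics content differ in a few bytes and hence in checksum (the timestamp strings below are made up).

```python
import zlib

def adler32_hex(data: bytes) -> str:
    """Hex-encoded adler32, the kind of checksum Rucio records per replica."""
    return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")

payload = b"identical physics content"
first_write = payload + b" written 2022-09-27T10:00:00"   # original out_10.root
second_write = payload + b" written 2022-09-27T11:00:00"  # resubmitted out_10.root

# Same size, different checksum: Rucio refuses the second replica
# for the existing DID (e.g. user.belfo:out_10.root).
assert len(first_write) == len(second_write)
assert adler32_hex(first_write) != adler32_hex(second_write)
```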
    Katy - possible option: allow CRAB/Rucio to try to write the output file to the ‘home RSE’ for longer than the current time period, to give more time to fix issues at the site. If not successful, consider appending a ‘retry number’ to the CRAB output file name so that the name is unique.
    The alternative of allowing CRAB to do deletions seems risky?
    Stefano: there are two different problems here:
    1.) files in tmp may be removed by some automated cleanup (aka crontab) before Rucio manages to replicate them to the home RSE. Not very likely, since it means failing for > 2 weeks.
    2.) how to deal with resubmissions done after the output has been registered in Rucio (which may be a consequence of giving up on the transfer because of 1.): one idea to be explored is to change the job resubmission code to identify this event and, in that case, remove the current LFN from the Rucio dataset (so it is left w/o rules, i.e. garbage-collectible) and create a new LFN for the current job (e.g. myoutput_1_1.root instead of myoutput_1.root), so that there is no need to touch existing DIDs in Rucio (neither remove them nor change their checksum).
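The renaming idea in 2.) could look like the following sketch; the function name and the retry-number convention are hypothetical, not actual CRAB code.

```python
def resubmission_lfn(lfn: str, retry: int) -> str:
    """Give a resubmitted job a unique output LFN by appending a retry
    number before the extension, so the old DID is never touched."""
    stem, dot, ext = lfn.rpartition(".")
    return f"{stem}_{retry}.{ext}" if dot else f"{lfn}_{retry}"
```

For example, `resubmission_lfn("myoutput_1.root", 1)` yields `myoutput_1_1.root`.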
    Note after the meeting: one major cause of current transfer failures is users running over quota, which currently leads to files getting lost and needing resubmission. But if Stefano understands correctly, in the new model it will be enough to make room at the destination (free quota) and Rucio will automatically retry and transfer all files, with nothing getting lost.
    Back to this, we can proceed in steps:
    1. Increase the CRAB PJ timeout for Rucio stageout from 24h to 7 days (on Stefano).
    2. See if this is enough to make most jobs successful.
    3. Do not worry about quota (yet), assuming that the user will notice and make space before the 3d timeout.
    4. Look into the code change above, i.e. if we give up on Rucio moving out_1.root we do:
       a. remove out_1.root from the dataset (and the rule);
       b. ask Rucio to purge replicas (suggestion from Katy; to be investigated what exactly it does and how to use it);
       c. change the LFN name for the resubmitted job to out_1_1.root;
       d. verify that b. will prevent having both out_1.root and out_1_1.root on the user RSE, which would be very confusing.
    It would be great if we could make 4. a long-term project for edge, rare use cases, since the changes are deep in the current machinery and will take time to properly validate. Also a chance to rewrite cmscp.py.
    Jobs which hit the ASO timeout should be tagged with exit code 60317 (very rare indeed; let’s verify that things work and that this is properly set when needed).
    This plan has the good point that it requires no changes to Rucio.
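A minimal sketch of the stepwise plan above, with hypothetical names (this is not the actual CRAB PostJob code; the Rucio cleanup actions are only described in comments):

```python
from datetime import timedelta

ASO_RUCIO_TIMEOUT = timedelta(days=7)   # step 1: raised from 24h
ASO_TIMEOUT_EXIT_CODE = 60317           # tags jobs that hit the ASO timeout

def handle_pending_transfer(age: timedelta, lfn: str, retry: int):
    """Decide what to do with a transfer Rucio has not completed yet.

    Returns (action, lfn_for_next_attempt, exit_code).
    """
    if age < ASO_RUCIO_TIMEOUT:
        # Rucio keeps retrying on its own; nothing for CRAB to do yet.
        return ("wait", lfn, None)
    # Give up (step 4): here the real code would remove `lfn` from the
    # Rucio dataset, delete its rule asking Rucio to purge replicas, and
    # resubmit the job under a fresh LFN so that no existing DID or
    # checksum is ever touched.
    stem, _, ext = lfn.rpartition(".")
    return ("resubmit", f"{stem}_{retry + 1}.{ext}", ASO_TIMEOUT_EXIT_CODE)
```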
@belforte belforte self-assigned this Sep 27, 2022
@belforte belforte added Priority: High RUCIO CRAB-Rucio integration labels Sep 27, 2022
@belforte (Member Author)

This should be closed by pointing to the current plan, as soon as I find where we wrote that up.

@belforte (Member Author)

belforte commented Aug 28, 2023

In the end it is not possible to overwrite a file replica, since the initial checksum is stored as Rucio metadata attached to that DID and it can never change, not even if the DID were removed and created again.
See discussion with @dynamic-entropy in dmwm/CMSRucio#343
Therefore we decided not to support resubmission of jobs that failed during the ASO-via-Rucio phase. If Rucio cannot make it within 7 days, we will clean up and ask the user to run a recovery task, which guarantees new file names and hence no Rucio metadata conflict.

The important point is that ASO is halted after a crab kill.
We need to ensure that crab kill stops all balls from rolling also in the scope of #6540

Let's follow up in #6540

@belforte (Member Author)

So we close this with:

When using Rucio for ASO, jobs which completed successfully on the grid cannot be resubmitted any more.
