
resubmissions when using Rucio for ASO #7404

Closed · belforte opened this issue Sep 27, 2022 · 3 comments
Labels: Priority: High, RUCIO, CRAB-Rucio integration
@belforte (Member)

Below is a very technical issue which we skipped today; it needed to be moved to GitHub.
We discussed the resubmission problem. We may have found a way out which does not require changes in Rucio. Katy volunteered to summarize.

We currently have a problem with Rucio stageout in case of job resubmission after a file was produced. When a job runs again, the transfer of the output fails immediately in Rucio due to a checksum mismatch between the original file and the new one (the checksum changes every time, even when one writes the same file, most likely because EDM/ROOT files contain timestamps) [2]. Diego is following up with the Rucio developers. The checksum is a property of the replica, and one would need to edit or remove the replica, but current policy only allows users to add DIDs and replicas in their scope, not to remove them; only Rucio admins can. The worst-case scenario is that we need to go back to the Crab2 model of a unique file name per job (a pain for users), which would need a wider discussion.
[1] https://twiki.cern.ch/twiki/bin/view/CMSPublic/RucioUserDocsData
[2] detailed problem description:

  • Let’s imagine job #10 in a CRAB task runs.
  • It leaves out_10.root on local /store/temp/user at site S1.
  • The file “name” is inserted in Rucio as a DID.
  • The file “physical attributes”, like checksum and RSE (site), are inserted in Rucio as a Replica.
  • Rucio will try to make another replica at the user-specified destination RSE.
  • Let’s imagine that this last step fails; the current criterion is as simple as “if it does not work in 24h”.
  • CRAB marks job_10 as failed.
  • The user types “crab resubmit” (they often do, without even looking at crab status first…).
  • CRAB resubmits job_10 and a new out_10.root is created on /store/temp/user at site S2.
  • This file will have the same size as the previous one, but since ROOT files contain timestamps, the checksum will be different.
  • When CRAB tries to insert the new file Replica in Rucio for the existing DID (a DID looks like user.belfo:out_10.root) to indicate the new physical file at the new location, Rucio will refuse to insert it (or to transfer it later, I am not sure) since the checksum does not match.
  • The solution is to remove the initial Replica from Rucio before inserting the new one. The CRAB code knows that this is a resubmission of job_10, but does not have enough privileges in Rucio to remove replicas.
  • There are details to be clarified, like: what if S2 = S1? Or: imagine there is no resubmission; when will Rucio stop trying to replicate out_10.root from S1 to the destination? (Note: S1 will be physically gone after a couple of weeks at most.)
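The checksum mismatch at the heart of this can be illustrated in a few lines of Python. This is an illustration only, not CRAB code: Rucio records an adler32 checksum per file, and since EDM/ROOT files embed timestamps, two writes of the same physics content differ in a few bytes and hence in checksum (the timestamp strings below are made up).

```python
import zlib

def adler32_hex(data: bytes) -> str:
    """Hex-encoded adler32, the kind of checksum Rucio records per replica."""
    return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")

payload = b"identical physics content"
first_write = payload + b" written 2022-09-27T10:00:00"   # original out_10.root
second_write = payload + b" written 2022-09-27T11:00:00"  # resubmitted out_10.root

# Same size, different checksum: Rucio refuses the second replica
# for the existing DID (e.g. user.belfo:out_10.root).
assert len(first_write) == len(second_write)
assert adler32_hex(first_write) != adler32_hex(second_write)
```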
    Katy - possible option: allow CRAB/Rucio to try to write the output file to the ‘home RSE’ for longer than the current time period, to give more time to fix issues at the site. If not successful, consider appending a ‘retry number’ to the CRAB output file name so that the name is unique.
    The alternative of allowing CRAB to do deletions seems risky?
    Stefano: there are two different problems here:
    1.) files in tmp may be removed by some automated cleanup (aka crontab) before Rucio manages to replicate them to the home RSE. Not very likely, since it means failing for > 2 weeks.
    2.) how to deal with resubmissions done after the output has been registered in Rucio (which may be a consequence of giving up on the transfer because of 1.): one idea to be explored is to change the job resubmission code to identify this event and, in that case, remove the current LFN from the Rucio dataset (so it is left w/o rules, i.e. garbage-collectible) and create a new LFN for the current job (e.g. myoutput_1_1.root instead of myoutput_1.root), so that there is no need to touch existing DIDs in Rucio (neither remove them nor change their checksum).
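The renaming idea in 2.) could look like the following sketch; the function name and the retry-number convention are hypothetical, not actual CRAB code.

```python
def resubmission_lfn(lfn: str, retry: int) -> str:
    """Give a resubmitted job a unique output LFN by appending a retry
    number before the extension, so the old DID is never touched."""
    stem, dot, ext = lfn.rpartition(".")
    return f"{stem}_{retry}.{ext}" if dot else f"{lfn}_{retry}"
```

For example, `resubmission_lfn("myoutput_1.root", 1)` yields `myoutput_1_1.root`.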
    Note after the meeting: one major cause of current transfer failures is users running over quota, which currently leads to files getting lost and needing resubmission. But if Stefano understands correctly, in the new model it will be enough to make room at the destination (free quota) and Rucio will automatically retry and transfer all files, with nothing getting lost.
    Back to this, we can proceed in steps:
    1. Increase the CRAB PJ timeout for Rucio stageout from 24h to 7 days (on Stefano).
    2. See if this is enough to make most jobs successful.
    3. Do not worry about quota (yet), assuming that the user will notice and make space before the 3d timeout.
    4. Look into the code change above, i.e. if we give up on Rucio moving out_1.root we do:
       a. remove out_1.root from the dataset (and the rule);
       b. ask Rucio to purge replicas (suggestion from Katy; to be investigated what exactly it does and how to use it);
       c. change the LFN name for the resubmitted job to out_1_1.root;
       d. verify that b. will prevent having both out_1.root and out_1_1.root on the user RSE, which would be very confusing.
    It would be great if we could make 4. a long-term project for edge, rare use cases, since the changes are deep in the current machinery and will take time to properly validate. Also a chance to rewrite cmscp.py.
    Jobs which hit the ASO timeout should be tagged with exit code 60317 (very rare indeed; let’s verify that things work and that this is properly set when needed).
    This plan has the good point that it requires no changes to Rucio.
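A minimal sketch of the stepwise plan above, with hypothetical names (this is not the actual CRAB PostJob code; the Rucio cleanup actions are only described in comments):

```python
from datetime import timedelta

ASO_RUCIO_TIMEOUT = timedelta(days=7)   # step 1: raised from 24h
ASO_TIMEOUT_EXIT_CODE = 60317           # tags jobs that hit the ASO timeout

def handle_pending_transfer(age: timedelta, lfn: str, retry: int):
    """Decide what to do with a transfer Rucio has not completed yet.

    Returns (action, lfn_for_next_attempt, exit_code).
    """
    if age < ASO_RUCIO_TIMEOUT:
        # Rucio keeps retrying on its own; nothing for CRAB to do yet.
        return ("wait", lfn, None)
    # Give up (step 4): here the real code would remove `lfn` from the
    # Rucio dataset, delete its rule asking Rucio to purge replicas, and
    # resubmit the job under a fresh LFN so that no existing DID or
    # checksum is ever touched.
    stem, _, ext = lfn.rpartition(".")
    return ("resubmit", f"{stem}_{retry + 1}.{ext}", ASO_TIMEOUT_EXIT_CODE)
```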
@belforte belforte self-assigned this Sep 27, 2022
@belforte belforte added Priority: High RUCIO CRAB-Rucio integration labels Sep 27, 2022
@belforte (Member Author)

This should be closed by pointing to the current plan, as soon as I find where we wrote that up.

@belforte (Member Author)

belforte commented Aug 28, 2023

In the end it is not possible to overwrite a file replica, since the initial checksum is stored as Rucio metadata attached to that DID and it can never change, not even if the DID were removed and created again.
See discussion with @dynamic-entropy in dmwm/CMSRucio#343
Therefore we decided not to support resubmission of jobs that failed during the ASO-via-Rucio phase. If Rucio cannot make it within 7 days, we will clean up and ask the user to run a recovery task, which guarantees new file names and hence no Rucio metadata conflict.

The important point is that ASO is halted after a crab kill.
We need to ensure that crab kill stops all balls from rolling also in the scope of #6540

Let's follow up in #6540

@belforte (Member Author)

So we close this with:

When using Rucio for ASO, jobs which completed successfully on the grid cannot be resubmitted any more.
