Rucio for CRAB
*Note: this should be written at the end or in a separate doc, not take space at the top here.*
- Reduce/remove the dependency on direct communication with FTS; hand off to DM operators the worry about Rucio rule completion.
- No need for FTS knowledge in CRAB Team
- Simple, faster, more solid ASO code.
- Make it easy for users to have data on Rucio-managed storage, for sharing and managing as a dataset.
- Support for /store/group/rucio
- Need to keep support for CERNBOX
- /store/user/ will never go away
- Not every user will have a sizeable quota on their “local RSE”
P1) Rucio DID metadata (e.g. file checksum) can never be changed. Files can get lost, but cannot be created again unless with identical binary content, i.e. exact copies
P2) CRAB tasks have a finite lifetime, O(1 month); we can’t let ASO scripts run longer
C1) Once a job output has been registered in Rucio, the job cannot be resubmitted “as is”. The name of the output file needs to change so that it ends up as a new DID
C2) Successful jobs cannot simply be resubmitted, since they will either fail (produce an output with the same name) or lead to confusion (add to the output a file with a different name but identical content).
C3) At some point a “crab kill” may be issued, by the user or by task-lifetime expiration, and all ongoing transfers must be aborted.
No resubmissions! Implement ASO via Rucio as detailed in https://github.com/dmwm/CRABServer/wiki/ASO-via-Rucio, disallowing resubmission of jobs which fail or are killed in the ASO part. When a job is killed while in ASO, declare the output file replica in the Temp_RSE as bad and remove the DID from the dataset. If some data are missing in the output of a task, users will create a recovery task; we will make sure that it is easy to create one. We keep the current FTS-based ASO as well and “see how it goes” before deciding whether to restrict it to e.g. only CERNBOX and T3’s.
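The kill-time cleanup above can be sketched as follows. This is a hypothetical illustration, not the actual ASO code: it only builds the payloads that would be handed to the Rucio Python client (`ReplicaClient.declare_bad_file_replicas` and `DIDClient.detach_dids`); the function name, example PFN, scope, and dataset name are all made up for the example.

```python
# Sketch of the cleanup a "crab kill" would trigger for one job output still
# in ASO: declare the Temp_RSE replica bad, then detach the DID from the
# task's output dataset. Real calls would go through the Rucio Python client;
# here we only assemble the arguments those calls would receive.

def kill_cleanup_actions(scope, lfn, temp_rse_pfn, dataset):
    """Return the Rucio operations needed to abandon one job output."""
    return {
        # the replica in Temp_RSE is declared bad so Rucio can forget it
        "declare_bad_file_replicas": {
            "pfns": [temp_rse_pfn],
            "reason": "CRAB job killed while in ASO",
        },
        # the file DID is removed from the task's output dataset
        "detach_dids": {
            "scope": scope,
            "name": dataset,
            "dids": [{"scope": scope, "name": lfn}],
        },
    }

# all identifiers below are illustrative placeholders
actions = kill_cleanup_actions(
    scope="user.jdoe",
    lfn="/store/user/rucio/jdoe/out/output_12.root",
    temp_rse_pfn="davs://tempse.example.org:2880/temp/output_12.root",
    dataset="/store/user/rucio/jdoe/out#taskname",
)
```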
Support resubmission of jobs failed/killed during the ASO step by changing the output file name. When resubmission is done:
- Mark all replicas of the job’s output as bad and remove them from the dataset
- Change _jobId in the file name to _jobId_retryCount (the retry count is already tracked and e.g. used for naming job stdout and FJR)
- This will also work for jobs which completed!
- The crucial point is to guarantee that the output container, the /store/user/rucio/… directory, and the DBS dataset with the output data all contain exactly one file for each job, even if oddly named. There is no guarantee in Rucio that stale files will not be there; we may need to enforce it ourselves (gfal-rm in the RUCIO_Transfer.py scripts?).
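The rename step can be sketched as a small helper, assuming output file names embed the job id as a `_<jobId>` suffix before the extension (e.g. `output_12.root` for job 12); the exact naming convention and the helper itself are illustrative, not existing CRAB code. On a retry the count is appended so the file registers as a brand-new DID, which is what P1 requires.

```python
# Sketch of the proposed rename on resubmission: appending the retry count
# gives Rucio a new DID, since existing DID metadata is immutable (P1).
import os

def renamed_output(filename, job_id, retry_count):
    """E.g. output_12.root, job 12, retry 2 -> output_12_2.root"""
    base, ext = os.path.splitext(filename)
    suffix = f"_{job_id}"
    if not base.endswith(suffix):
        raise ValueError(f"{filename} does not carry job id {job_id}")
    if retry_count == 0:
        # first submission keeps the plain name
        return filename
    return f"{base}_{retry_count}{ext}"
```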
Implement the ATLAS scratch-RSE model (with improvements?):
- Job output is copied from the WN to the local RSE using the “Rucio” credential, already with the final LFN /store/user/rucio/… . “We just put the files in datadisk, in the user’s scope, and with a 14-day lifetime rule.” This rule will be for the CRAB account, so we can give that account a quota on that RSE, effectively implementing a scratch RSE.
- Those files are then immediately available for reading and can be registered in Rucio container(s) and DBS
- We could call things “DONE” at that point and limit the ASO step to registration in Rucio. At registration time we may also create a rule to replicate to the user-specified RSE (à la ATLAS, ToBeVerified) and leave it to the user to follow up on it.
- The replication rule will be with user account. Replicas at destination will be charged to user quota on that RSE.
- Originals in local RSE will be locked by Rucio until replication is completed, as long as it takes.
- We could give users a command (or a CRAB option) to eventually copy files to their local RSE in /store/user/
- In a way this could remove all “file movement work” in ASO, and replace it with a few crab commands which manipulate Rucio rules in user account.
- And we get rid of Temp_RSE’s (maybe the only real gain?)
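The two rules in the list above can be sketched as argument sets for Rucio's `add_replication_rule` (which takes `dids`, `copies`, `rse_expression`, plus keyword options such as `account` and `lifetime`). The 14-day lifetime and the CRAB-account vs user-account split come from the text; the account name, scopes, and RSE names are placeholders for illustration.

```python
# Hedged sketch of the two replication rules in the scratch-RSE model,
# built as argument dicts rather than live Rucio client calls.

FOURTEEN_DAYS = 14 * 24 * 3600  # scratch rule lifetime in seconds

def scratch_rse_rules(scope, output_dataset, local_rse, dest_rse, username):
    did = [{"scope": scope, "name": output_dataset}]
    return [
        # 1) temporary rule on the local RSE, owned by the CRAB service
        #    account ("crab_server" is a placeholder name), so the scratch
        #    space is charged to CRAB's quota there
        {"dids": did, "copies": 1, "rse_expression": local_rse,
         "account": "crab_server", "lifetime": FOURTEEN_DAYS},
        # 2) optional replication to the user-chosen RSE, owned by the
        #    user and charged to the user's quota on that RSE
        {"dids": did, "copies": 1, "rse_expression": dest_rse,
         "account": username, "lifetime": None},
    ]
```

While rule 2 is incomplete, Rucio keeps the source replicas under rule 1 locked, which matches the "originals locked until replication completes" point above.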
Yet there are so many tweaks and details to be worked out that eventually it may not be any simpler than the baseline! The main benefit here would be that we can put everybody’s output on /store/user/rucio first and have the same data flow for everybody. But putting output on /store/user/ on the local RSE will still require FTS, unless we leave it to users to run “rucio download”. Of course, if files are on /store/user/rucio/ in the same RSE as /store/user/, copying should happen swiftly and safely, and an FTS job should get done in very little time… To be seen.
In other words, we could easily reproduce the ATLAS “scratch RSE”, but it is not trivial to give users the functionality they are used to, where files appear automatically in their favorite local directory.