
CRABCache replacement with S3


Action items and details are tracked as a GH project

Here is a link to the Initial requirement list prepared to review our use case with the CERN IT Storage people (Dan Van der Ster and Enrico Bocchi)

General ideas for using S3 in CRAB (possibly beyond CRABCache replacement) and notes about S3 use are in S3 for CRAB

TO DO

  • Need to think carefully about how we will switch this to production and make sure there are the needed hooks to do it gradually and transparently
  • debug files are currently saved "like a sandbox" with a horrible hash etc. Shall we keep that, or take this chance to move to a humanly named, task-specific file? They are currently uploaded before the task name is available, so changing this takes a bit of work. Decided to keep it as it was: too dangerous to have a race with the TW. To make access easier we can possibly add a call, after the task has been submitted, to copy the object to a different S3 object, or simply upload it again as a debugfiles object

Design

  1. we create a few buckets called "CRABCacheProd", "CRABCachePreprod" etc. All objects will have an expiration time of 31 days.
    • Each CRABServer REST instance (host) will point to one such bucket, selected via the mode switch in RESTConfig json
  2. we keep the current limit at 120 MB per sandbox
  3. there will be no user quota and we will deprecate the "crab purge" command
  4. we will monitor (at low rate) usage by user and will take actions if needed, but expect that with a large enough storage container the system will self-regulate
  5. All objects will be private
  6. CRABServer will hold the keys via a Kubernetes secret
  7. operators can use something like Ansible Vault to access keys in a safe way for testing/debugging/devel. from lxplus.
  8. All operations will be done via CRABServer REST APIs so we can use them from a browser with CMSWEB auth.
  9. CRABClient and TaskWorker will ask the REST server for a pre-signed URL whenever they want to upload a file and will then do an HTTP POST (e.g. via curl); see the sketch after this list
  10. Downloads will also be done via CRABServer REST APIs so we have CMSWEB authentication and username/role validation at hand.
  11. We will support two kinds of file access:
    1. "retrieve" : CRABServer will fetch the object into memory and serve it to the client
    2. "download" : CRABServer will return to the client a PreSigned URL to fetch the file
  12. download PreSignedURLs will have a short expiration for sandboxes (and will only be handed to the owner or operators)
  13. download PreSignedURLs for logs can last 1 month and be shared around by users as they see fit
  14. read-only access to non-sensitive files: since those files would be accessible to the whole world, it is unclear if we have anything that is safe to expose this way. To be reviewed if and when CERN puts SSO in front of S3.
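
A minimal sketch of how the REST server could generate the pre-signed URLs mentioned in items 9 and 11-13, using boto3. The endpoint, credentials, bucket and object names below are illustrative placeholders, not the actual configuration:

```python
import boto3
from botocore.client import Config

# credentials would come from the Kubernetes secret (item 6); values here are placeholders
s3 = boto3.client(
    's3',
    endpoint_url='https://s3.cern.ch',          # assumed CERN S3 endpoint
    aws_access_key_id='<access key>',
    aws_secret_access_key='<secret key>',
    config=Config(signature_version='s3v4'),
)

# upload (item 9): pre-signed POST handed to CRABClient/TaskWorker,
# which then does the actual HTTP POST (e.g. via curl or requests)
post = s3.generate_presigned_post(
    Bucket='CRABCacheProd',
    Key='someuser/sandboxes/somesandboxhash.tar.gz',   # hypothetical object key
    ExpiresIn=3600,                                    # short-lived, upload only
)

# download (items 11.2, 12, 13): pre-signed GET URL, expiration depends on object type
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'CRABCacheProd', 'Key': 'someuser/sometaskname/TWLog.txt'},
    ExpiresIn=30 * 24 * 3600,                          # e.g. ~1 month for logs
)
```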

Bucket organization

see S3 guide Organizing Objects

we will have this structure:

CRABCacheProd/<username>/sandboxes/<sandboxid>
CRABCacheProd/<username>/<taskname>/[ClientLog.txt|TWLog.txt|DebugFiles.tgz]
  • sandboxes will be uploaded as before, using as name a hash which identifies unique tarballs. CRABServer REST will check if the object already exists in the bucket before handing a PreSignedURL to the Client, so that useless uploads can be skipped (as happens now); see the sketch after this list
  • the debugfiles tarball is unique to each task and will be uploaded once
  • client and taskworker logs are unique to each task and will be overwritten every time
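
A rough illustration of that existence check; the helper name is hypothetical, only the <username>/sandboxes/<sandboxid> layout comes from this page:

```python
import botocore.exceptions

def sandbox_exists(s3client, bucket, username, sandboxid):
    """Return True if a sandbox tarball with this hash is already in the bucket,
    so the REST can tell the client to skip the upload (hypothetical helper)."""
    key = f"{username}/sandboxes/{sandboxid}"
    try:
        s3client.head_object(Bucket=bucket, Key=key)
        return True
    except botocore.exceptions.ClientError as ex:
        if ex.response['Error']['Code'] == '404':
            return False
        raise
```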

Implementation

https://github.com/dmwm/CRABServer/blob/bfad557db2d7174b3499098af0848681beb180de/src/python/CRABInterface/RESTCache.py#L30-L71

Utility functions

https://github.com/dmwm/CRABServer/blob/2555e84a4dd39c4e292b4cf76362ad60f5a00a2f/src/python/ServerUtilities.py#L596-L608
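
For completeness, the client side of the pre-signed POST upload described in the Design section could look roughly like this; the helper name and arguments are illustrative and do not reflect the linked ServerUtilities code:

```python
import requests

def upload_to_presigned_post(presigned, filepath):
    """Upload a local file using the pre-signed POST data (url + fields)
    obtained from the CRABServer REST. Hypothetical helper for illustration."""
    with open(filepath, 'rb') as fh:
        resp = requests.post(
            presigned['url'],
            data=presigned['fields'],   # policy, signature, key, ...
            files={'file': fh},         # the actual payload
        )
    resp.raise_for_status()
    return resp.status_code
```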