
Facilitated storage compute access #29

Open
wants to merge 9 commits into
base: main

Conversation

volodymyrss

No description provided.

@volodymyrss
Author

volodymyrss commented May 15, 2023

Hi @rokroskar @Panaetius! As agreed, here is the proposal. I did not fill in some fields since they will be clarified later, after a meeting. Does it make at least a little bit of sense? Should we also discuss it at our monthly meeting this Wednesday?

@rokroskar
Member

Thanks @volodymyrss for contributing this RFC! From my PoV it's still a bit too vague - can you add some details about specific services or point to existing implementations that you are thinking of? In principle something like the JH services you mention could be possible, i.e. we already run proxies in this sort of mode.

@volodymyrss
Author

volodymyrss commented May 15, 2023

Thanks @volodymyrss for contributing this RFC! From my PoV it's still a bit too vague - can you add some details about specific services or point to existing implementations that you are thinking of?

I listed the services very briefly, in brackets, in the first sentence of the first paragraph. I have now added them as a list in this section. I avoid putting in actual endpoints to reduce exposure.

In principle something like the JH services you mention could be possible, i.e. we already run proxies in this sort of mode.

Ok, good to hear. How would they be selected, during session start? Would there be a catalog of these services? Core and contributed ones?

What about another possible solution, where renku just sets some variables specifying the external service endpoint and credentials? This could be easier.

edit: it is also the case that, since the services might be both restricted and domain-specific, they should probably be visible only to some limited community. I think you mentioned you were thinking about making some domain/project-specific resource allocation? Would specialized services visible only to some projects/domains work in the same way?

These jupyterhub services differ from ordinary processes running in the session in that they have higher privileges, so they are managed by the hub administrator. We currently use one service like that for finding and downloading some data.
I wonder how this would work in renkulab: would it be managed by the renku admin but contributed by "us"? Is this even feasible?

An example is an ARC cluster. It uses some specialized client software which can be installed in the service container but may be tricky to keep in all user sessions. A "side-kick" service would receive a simple HTTP request from the user in the session and transform it into an ARC job.
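
To make this concrete, here is a minimal sketch of such a side-kick, assuming the ARC client tools (e.g. arcsub) are installed only in the service container; the endpoint, the payload shape, and the exact arcsub invocation are illustrative, not a fixed interface:

```python
# Minimal sketch of a "side-kick" service: it takes a simple HTTP request from
# the user in the session and turns it into an ARC job submission. Endpoint
# name, payload shape and the arcsub invocation are illustrative assumptions.
import json
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer


class SideKickHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/submit":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")

        # Write the (hypothetical) job description received from the session
        # to a file and hand it to the ARC client installed in this container.
        with tempfile.NamedTemporaryFile("w", suffix=".xrsl", delete=False) as f:
            f.write(payload.get("job_description", ""))
            job_file = f.name

        result = subprocess.run(
            ["arcsub", "-c", payload.get("cluster", "example-arc.local"), job_file],
            capture_output=True, text=True,
        )
        self.send_response(200 if result.returncode == 0 else 500)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"stdout": result.stdout,
                                     "stderr": result.stderr}).encode())


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), SideKickHandler).serve_forever()
```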

For cases like WebDAV, an extra "side-kick" service is useful only to transform credentials somehow, but the credentials could also be provided to the session in environment variables.
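
For the environment-variable option, the code in the session would only need something like the following sketch; the variable names and the use of basic auth are assumptions, a token header would work the same way:

```python
# Minimal sketch: the session reads an endpoint and credentials that renku
# would inject as environment variables, and lists a WebDAV collection.
# Variable names (WEBDAV_URL, WEBDAV_USER, WEBDAV_PASSWORD) are hypothetical.
import os
import requests

url = os.environ["WEBDAV_URL"]
auth = (os.environ["WEBDAV_USER"], os.environ["WEBDAV_PASSWORD"])

# PROPFIND with Depth: 1 returns the immediate members of the collection.
response = requests.request("PROPFIND", url, auth=auth, headers={"Depth": "1"})
response.raise_for_status()
print(response.text)
```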

@Panaetius
Member

I think both solutions/use-cases, credentials storage and custom sidecar services, have merit and are feasible, and I could see us implementing both to support different use-cases.

We are already exploring using Vault to store credentials and we should be able to inject them into sessions as e.g. environment variables. But I think what would be nice is to have a dynamic egress proxy that allows for credential injection. So you could set up rules per project/user like "requests to example.com should get the token from secret my-token injected in the Authorization: Bearer ... header, with the secret coming from Vault". I think this could be done in a way that works for most uses, with users (or admins?) being able to define the rules per project. This would also be nice in that it could allow anonymous sessions access to restricted data, if set up by the project owner, without exposing secrets. This would differ a bit from what is proposed here in that we'd have a single, dedicated sidecar container that handles proxying for all kinds of requests, instead of just injecting secrets or having a sidecar per request type.
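
To illustrate, a minimal sketch of what such a per-project rule and its application could look like (the rule schema, field names and the Vault path are made up):

```python
# Illustrative rule schema for a credential-injecting egress proxy.
# The proxy matches outgoing requests by host and adds a header whose value
# is fetched from Vault. Field names and the vault path are hypothetical.
from dataclasses import dataclass


@dataclass
class InjectionRule:
    host: str              # e.g. "example.com"
    header: str            # e.g. "Authorization"
    header_template: str   # e.g. "Bearer {secret}"
    vault_secret: str      # e.g. "projects/my-project/my-token"


def apply_rules(request_headers: dict, host: str, rules: list[InjectionRule],
                read_vault_secret) -> dict:
    """Return headers with credentials injected for matching rules."""
    headers = dict(request_headers)
    for rule in rules:
        if rule.host == host:
            secret = read_vault_secret(rule.vault_secret)
            headers[rule.header] = rule.header_template.format(secret=secret)
    return headers


# Example: a project owner defines one rule; the proxy applies it to every
# outgoing request to example.com, so the session never sees the secret.
rules = [InjectionRule("example.com", "Authorization", "Bearer {secret}",
                       "projects/my-project/my-token")]
print(apply_rules({}, "example.com", rules, lambda path: "dummy-token"))
```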

For more generic sidecar containers that actually perform actions (the thing similar to jupyterhub services), we probably don't want users to be able to roll their own, but having something platform-wide like we have with project templates would also just end up being noise for users, so having it defined by admins/superusers makes sense to me. The resource access control service we're currently working on could be a nice fit; it has some very similar behavior (admins define what resources a user has access to and the user can pick from those when launching a session). So we could extend that with custom sidecars, or have a separate service with essentially the same functionality for custom sidecars. Then an admin could say "User X has access to custom sidecars Y and Z" and the user can pick on session launch (or just by default?) whether to start those.

We would need to define some API that custom sidecars need to follow: at the most basic, a healthcheck endpoint for amalthea to watch, plus whatever is needed by the persistent sessions changes currently being worked on (so the sidecar can be shut down/started alongside the session as appropriate). But it would be up to communities to write these custom sidecars. I would limit these to just being able to specify a Dockerfile and maybe some fixed settings for a sidecar.
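
A rough sketch of the access-control part, just to illustrate the shape (all names and fields are invented; the real data would live in the resource access control service):

```python
# Sketch of admin-defined custom sidecars and per-user access to them.
# All names are invented; this only illustrates the shape of the data the
# service would hold and the question it answers at session launch.
from dataclasses import dataclass


@dataclass
class CustomSidecar:
    name: str
    image: str               # built from an admin-provided Dockerfile
    health_endpoint: str     # watched by amalthea, e.g. "/health"


SIDECARS = {
    "arc-submitter": CustomSidecar("arc-submitter",
                                   "registry.example.org/arc-sidekick:latest",
                                   "/health"),
    "webdav-proxy": CustomSidecar("webdav-proxy",
                                  "registry.example.org/webdav-proxy:latest",
                                  "/health"),
}

# The admin decides who may use which sidecar.
ACCESS = {"user-x": {"arc-submitter", "webdav-proxy"}}


def sidecars_for_launch(user: str, requested: set[str]) -> list[CustomSidecar]:
    """Return the sidecars this user may start with their session."""
    allowed = ACCESS.get(user, set())
    return [SIDECARS[name] for name in requested if name in allowed]


print(sidecars_for_launch("user-x", {"arc-submitter"}))
```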

There is probably a third class of use-cases that can't be solved by the above, like having to install some plugin in the cluster to mount some specific, not officially supported storage in a session. We don't want administrators/users to be able to make these kinds of customizations, for platform stability reasons. So these we'd have to check on a case-by-case basis.

But I think the generic proxy and admin-defined custom sidecars are both feasible and both useful. I'd probably go for implementing the proxy first since we have a lot of the parts already.

@volodymyrss
Author

Thank you @Panaetius for the analysis, it makes sense to me.

I wonder about this concept of renku superusers: does it exist already?

How do we proceed to assess the effort and possible timeline? We'll discuss formal aspects on Thursday, so it's very good that we have this technical basis progressing.

Just a comment on mounting: this is an option some people like to see since it is familiar. But in practice it is possible to get a similar experience by exploring storage through an API; even in a shell we sometimes use a kind of pseudo-ls. Sometimes this can even be advantageous, since it requires more purposeful data transfers.
Nevertheless, if there is a way to provide an (at least read-only) mount, it could be seen very positively by some users.
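
Coming back to the API exploration: a pseudo-ls over an rclone-compatible remote can be as simple as the sketch below; the remote name is hypothetical and it assumes rclone's lsjson command is available in the session:

```python
# Sketch of a "pseudo-ls": explore remote storage through an API call instead
# of a mount. Uses rclone's lsjson output; the remote name is hypothetical.
import json
import subprocess


def pseudo_ls(remote_path: str) -> list[str]:
    out = subprocess.run(["rclone", "lsjson", remote_path],
                         capture_output=True, text=True, check=True)
    return [entry["Path"] for entry in json.loads(out.stdout)]


print(pseudo_ls("my-webdav:project-data/"))
```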

@volodymyrss volodymyrss marked this pull request as ready for review July 13, 2023 13:11
@volodymyrss
Author

I wonder if this should be adapted, given that #31 will provide additional features which can be relied upon. Or should it remain as it is, since it first of all explains the use case, which remains the same?

@olevski
Member

olevski commented Nov 13, 2023

@volodymyrss I read through the RFC but I still have a lot of questions I want to clarify.

Here is a list of user stories I extracted from the RFC and our meetings. I hope you can review and answer the questions, and also let me know if the limitations I posted here are acceptable.

As a SmartSky user I want to:

  • Add different types of rclone-compatible storage to my Renku project
  • Browse or access data from different rclone-compatible storage types within a Renku session
  • Store credentials for accessing different rclone-compatible storage types within my Renku project
  • Launch from my Renku project session into an HPC cluster
  • Easily create new projects where I will be able to launch into an HPC cluster

Questions and limitations:

  • We will only support one type of HPC cluster / environment (which one?)
  • We will only support rclone-compatible storage
  • Mounting storage will require restarting the session
  • The workloads that run on the HPC cluster have to take care of the following themselves:
    • Downloading data from different storages
    • Authenticating with different storages
    • What is Renku's responsibility here and what is the responsibility of SmartSky users?
    • How will the environment for these workloads be defined and packaged? Whose responsibility is this?
    • Does Renku have to track the metadata from these workloads?
    • What happens to the results from these workloads, who is responsible for saving the results and where and how?

@volodymyrss
Author

Questions and limitations:

* We will only support one type of HPC cluster / environment (which one?)

We want to support several, with a plugin interface: several kinds of clusters, and also multiple actual clusters.

This is pointed out here.

Please also feel free to make comments on the text, with a PR or however you like, if you find that something is missing or unclear!

* We will only support rclone-compatible storage

Most of the cases quoted in the text will be; only rucio does not fit rclone at this time. We have to think about whether it is ok to ignore rucio; it might be. Maybe we can expect/ask rucio to develop a suitable interface.

* Mounting storage will require restarting the session

I think this is understandable, if there is no other choice.

* The workloads that run on the HPC cluster have to take care themselves for:
  * Downloading data from different storages

Some compute backends (e.g. ARC) will fetch the data from any compatible remote storage by URL,
while others (e.g. FireCREST) will only accept certain (local) storage (which can still be identified by some URL).
In both cases, the compute interface (in renku) should be aware of which storage can provide input data to which compute backend.
If necessary, renku should initiate a transfer to a storage compatible with the given compute backend.
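
A sketch of the kind of decision the compute interface would have to make; the backend capability table and the transfer step are placeholders:

```python
# Sketch: decide whether a compute backend can read an input directly, or
# whether renku has to stage the data to a storage the backend accepts first.
# Backend descriptions and the transfer function are placeholders.
from urllib.parse import urlparse

# Which URL schemes each backend can fetch by itself (illustrative).
BACKEND_SCHEMES = {
    "arc": {"https", "davs", "srm"},        # fetches remote data by URL
    "firecrest": {"file"},                  # only local/staged storage
}


def plan_input(backend: str, input_url: str, staging_url: str) -> str:
    """Return the URL the workload should use, staging the data if needed."""
    scheme = urlparse(input_url).scheme or "file"
    if scheme in BACKEND_SCHEMES[backend]:
        return input_url
    # Otherwise renku initiates a transfer to storage the backend can see.
    return initiate_transfer(input_url, staging_url)


def initiate_transfer(src: str, dst: str) -> str:
    # Placeholder: in practice this would be an rclone copy or similar.
    print(f"staging {src} -> {dst}")
    return dst


print(plan_input("firecrest", "https://data.example.org/obs1.fits",
                 "file:///scratch/renku/obs1.fits"))
```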

  * Authenticating with different storages

When a user authorizes renku to access a comprehensive compute backend like ARC, the backend can also access storage on the user's behalf.
When accessing FireCREST, the compute backend has access to the local storage where the data will be staged.
So in both cases renku does not need to take care of authentication between compute workloads and storage.

  * What is Renku's responsibility here and what is the responsibility of SmartSky users?

Responsibility where? Are you referring to the previous two items in this list?

  * How will the environment for these workloads be defined and packaged? Whose responsibility is this?

The environment would be defined in a container. By default, this would be the same container as used in the renku session, which is already built on renku, but with a modified entrypoint (we do something similar already, and I know from the UG meeting that other users do too; maybe we can reach out to them).
This container can be converted to singularity in gitlab CI and uploaded to the storage (one of the storages we address here).
Specifying another container should be an option.
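
As a sketch of that CI step (the image name and target remote are made up; it assumes singularity/apptainer and rclone are available on the runner):

```python
# Sketch of a CI step: convert the session's Docker image to a Singularity
# image and upload it to one of the project storages. Image name and remote
# are hypothetical; assumes singularity and rclone exist on the runner.
import subprocess

docker_image = "registry.renkulab.io/my-project/session:latest"
sif_file = "session.sif"

# Build a SIF from the Docker image (apptainer uses the same syntax).
subprocess.run(["singularity", "build", sif_file, f"docker://{docker_image}"],
               check=True)

# Upload the image to storage reachable from the HPC side.
subprocess.run(["rclone", "copy", sif_file, "hpc-storage:containers/"],
               check=True)
```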

  * Does Renku have to track the metadata from these workloads?

It would be nice, if feasible.

  * What happens to the results from these workloads, who is responsible for saving the results and where and how?

The results will be stored in one of the storages at the end of the execution.
They can then be explored in a renku session like any other compatible storage.

@volodymyrss
Author

Hi @olevski, did you have a chance to consider my responses? Should I incorporate them as further changes to the RFC?
