Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature] Make managing the artifact storage and cache available in the UI #8104

Closed
TobiasGoerke opened this issue Aug 4, 2022 · 8 comments
Labels
area/backend area/components area/frontend kind/feature lifecycle/stale The issue / pull request is stale, any activities remove this label.

Comments

@TobiasGoerke
Copy link
Contributor

I'd like users to be able to manage the artifacts their pipelines have created. This involves disabling or invalidating the cache for certain artifacts so that a pipeline can be re-executed without having to touch code.

I've first mentioned this feature here(#7939) and posted the following GIF, exemplifying how such a feature could look like.
kubeflow_cache

Also, I've created a design document that addresses all details. If you like the feature (or don't), please leave comments here on directly in the document.

Feature Area

/area frontend
/area backend
/area components


Love this idea? Give it a 👍.

@juliusvonkohout
Copy link
Member

juliusvonkohout commented Aug 10, 2022

Is it also possible to do this from the runs page? E.g. if you delete a run it tries to delete the artifacts and if you delete an experiment it deletes all runs and therefore all corresponding artifacts. This is what most datascientists expect when they delete a run. You are using ml-metadata which is not namespace isolated yet and therefore 100 % insecure #4790. I am also to open to have it in the insecure ml-metadata part as an addition, but it should be at least available in the currently secure parts, that are structured by runs and experiments.

The API belongs in the apiserver and as suggested above we could implement this without UI or API schema changes. We just have to modify the delete run API call to call a new function delete_cache(run) at the end that deletes the corresponding artifacts from the database and or minio as shown in my PR. This is what most datascientists expect when they delete a run.

If desired we can also expose in addition a proper AUTHENTICATED #7819 API delete_cache(run), invalidate_cache(run) for more specialized approaches and the invalidate/delete button in the UI . Then everybody can also automate it as desired. And last but not least maybe environmant variables DELETE_CACHE_ON_RUN_DELETION=TRUE and or INVALIDATE_CACHE_ON_RUN_DELETION=TRUE in pipeline-install-config that are mapped to and respected by the apiserver.

This approach covers the basic needs for most users and provides flexibility for advanced usecases.

@chensun @difince @TobiasGoerke what do you think?

@TobiasGoerke
Copy link
Contributor Author

Any way users can delete artifacts and invalidate caches in the UI is fine for me. Your suggestion to not offer these features separately might even be more user-friendly. As you mentioned, your approach wouldn't require modifying the frontend and is more secure (for now, at least).

However, what about multiple runs sharing the same cache? Deleting any older run would force newer runs to lose their shared cache / artifacts, too

@juliusvonkohout
Copy link
Member

Any way users can delete artifacts and invalidate caches in the UI is fine for me. Your suggestion to not offer these features separately might even be more user-friendly. As you mentioned, your approach wouldn't require modifying the frontend and is more secure (for now, at least).

However, what about multiple runs sharing the same cache? Deleting any older run would force newer runs to lose their shared cache / artifacts, too

Is this a real problem? The same happens if you would invalidate a single artifact from the ml-metadata UI. It can always affect later runs that used the same artifact as cache.

You should definitely present it here Kubeflow Pipelines Community Meeting (PST AM)
https://meet.google.com/jkr-dupp-wwm and register it here https://docs.google.com/document/d/1cHAdK1FoGEbuQ-Rl6adBDL5W2YpDiUbnMLIwmoXBoAU/edit first

Maybe you can reach @chensun or @zijianjoy on slack to discuss it with them first.

@difince
Copy link
Member

difince commented Aug 17, 2022

E.g. if you delete a run it tries to delete the artifacts and if you delete an experiment it deletes all runs and therefore all corresponding artifacts.

This does not happen really, doesn't it?
I have opened a few issues you may find relevant in some ways to the current issue:

Your two approaches make sense to me, but I'm leaning more toward what Julius suggests because of the existing infrastructure - security /namespace isolation...
But just to add that for the first time, I took a look at the ml metadata so .. :)

@TobiasGoerke
Copy link
Contributor Author

E.g. if you delete a run it tries to delete the artifacts and if you delete an experiment it deletes all runs and therefore all corresponding artifacts.

This does not happen really, doesn't it? I have opened a few issues you may find relevant in some ways to the current issue:

Your two approaches make sense to me, but I'm leaning more toward what Julius suggests because of the existing infrastructure - security /namespace isolation... But just to add that for the first time, I took a look at the ml metadata so .. :)

Thanks for your reply, @difince. I've also been inclined to implement @juliusvonkohout's idea. However, turns out it is impossible to deduce cache entries in the db from runs. There simply is no information that could be used to match these. In case you have an idea how this could work, I'd be very glad to hear it!

Given this limitation, I've decided to take a different approach and simply make disabling caching entries available in the UI (see here. While this doesn't delete database entries once they were created, there are other open PRs, that should take care of overrunning databases.

@juliusvonkohout
Copy link
Member

juliusvonkohout commented Aug 23, 2022

I discussed with @TobiasGoerke that we go for a disable cache switch in the pipeline run UI and a slightly modified version of #7939 (comment) . If the maximum cache staleness leads to an empty list from

var executionCaches []*model.ExecutionCache
then we should scan the database and delete all entries older than max_cache_staleness. This ensures that it is still performant in very large installations and solves another long standing issue of an indefinitely growing cachedb.

So TWO environment variables for the cache-server that provide a default (if the pipeline does not have values) AND maximum (larger values in the pipeline are ignored) cache staleness. Then an administrator can set the expiration date on his Minio/S3/GCS storage backend to the same value as the maximum cache staleness and provide a sensible staleness default value for its users pipelines. We limit the user-set value int he pipeline definition to the maximum value from the administrator. The users also do not need to recompile existing pipelines anymore because they can disable the cache from the UI. I think setting the exact cache duration from the UI is overkill and a disable/enable switch is enough.

I think this covers most usecases, is independent of the storage backend and rather easy to implement. Tobias Goerke already has a POC master...TobiasGoerke:pipelines:master #8177 for the UI change.

@chensun this is also very much in line what google wants or do you think different?

Copy link

github-actions bot commented Mar 6, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Mar 6, 2024
Copy link

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/backend area/components area/frontend kind/feature lifecycle/stale The issue / pull request is stale, any activities remove this label.
Projects
Status: Closed
Development

No branches or pull requests

3 participants