Stopped model marked as invalid prevents running a new one #740
Comments
@jacekwachowiak: That's a strange thing. I think some timing issue might be causing your problem. Could you please share your scenario?
Basically I start the cluster, run a model, scale it to X replicas, and run different tests, usually sending different loads for a few minutes and then changing the parameters: adding more replicas, removing some, loading another predict function, and so on (a rough sketch of the workflow is below). I noticed that scaling down the replicas is risky; also, sometimes one or more of the pods get disconnected and are not restarted automatically, so I had to kill them in the console for them to reconnect.
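For reference, a minimal sketch of that workflow, assuming the clipper_admin Python client on Kubernetes; the application/model names, input type, predict function, and replica counts are placeholders:

```python
from clipper_admin import ClipperConnection, KubernetesContainerManager
from clipper_admin.deployers import python as python_deployer

clipper_conn = ClipperConnection(KubernetesContainerManager())
clipper_conn.connect()

# Register an application and deploy a simple predict function to it.
clipper_conn.register_application(name="my-app", input_type="doubles",
                                  default_output="-1.0", slo_micros=100000)
python_deployer.deploy_python_closure(
    clipper_conn, name="my-model", version=1, input_type="doubles",
    func=lambda xs: [str(sum(x)) for x in xs])
clipper_conn.link_model_to_app(app_name="my-app", model_name="my-model")

# Scale out, run load tests for a while, then scale back in (the risky step).
clipper_conn.set_num_replicas("my-model", num_replicas=20)
# ... send load for a few minutes ...
clipper_conn.set_num_replicas("my-model", num_replicas=2)
```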
You're right. It does seem to be risky to scale models in/out (I didn't do that in my case). I will check it this weekend.
I just confirmed that scaling to 10 replicas and back was OK; then I repeated that with 20, and when I scaled down, all my queries returned default values.
I will patch the problem where the metric information for removed replicas or models remains. I'm sorry, I already patched it in my own Clipper but didn't create a PR for it.
@jacekwachowiak
I merged it already (#711), so this is not the cause of the problem. I think that supporting scaling down is a difficult problem (it is relatively easy to support scaling up): if query_frontend stops hearing from a model container, it is difficult to determine whether the container was deleted or whether it is a temporary network problem. To handle this, Clipper checks the connection to each model container every 10 seconds and removes the associated resources if no ping arrives for 30 seconds. Your problem appears to be a timing issue in this process. I am looking for a scenario that can reproduce it.
clipper/src/libclipper/src/rpc_service.cpp, line 177 in 001f9d5
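To illustrate the liveness check described above (the real implementation is the C++ code in rpc_service.cpp; only the 10-second check interval and 30-second expiry come from the comment, everything else is an assumed sketch):

```python
import time

PING_CHECK_INTERVAL_SECS = 10   # how often query_frontend checks container liveness
CONNECTION_EXPIRY_SECS = 30     # drop a container's resources after this much silence

# container_id -> timestamp of the last ping received from that container
last_ping = {}

def on_ping(container_id):
    """Record a heartbeat from a model container."""
    last_ping[container_id] = time.time()

def check_connections(remove_container_resources):
    """Periodically remove containers that have been silent too long."""
    while True:
        now = time.time()
        for container_id, seen in list(last_ping.items()):
            if now - seen > CONNECTION_EXPIRY_SECS:
                # We cannot tell a deleted container from a network blip;
                # after the timeout, assume it is gone and clean up.
                remove_container_resources(container_id)
                del last_ping[container_id]
        time.sleep(PING_CHECK_INTERVAL_SECS)
```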
Is there a recommended way to gather metrics in that case? Leaving aside the problem of removing the pods, when I want to take the average latency across all pods, the average will not work, since there are past records for pods that are no longer there. And of course, to use the past data, the records for the now-removed pods must remain, yet they should not be picked up if we ask for the data from, for example, the last X minutes.
The current version of Clipper explicitly deletes all the metrics about removed model pods. If you'd like to keep them, please disable (comment out) the following code and rebuild the query_frontend image.
clipper/src/libclipper/include/clipper/task_executor.hpp, lines 659 to 663 in 001f9d5
I'm sorry if this is a stupid question, but this removes the entry in Redis, right? The past metrics sent to Prometheus will not be touched?
You're right.
Then it does not fix the problem where Prometheus tries to calculate percentiles/averages by summing up every pod with the same name, even though some were removed. I guess it's not Clipper's problem, but the example metrics queries taken from the Grafana dashboard use operations such as:
which will not show anything unless all pods are connected and feeding new data; even one removed or disconnected pod will make the results void. Again, I am getting more and more convinced that it may be Prometheus's responsibility (perhaps a more complicated query could be written to exclude the stale pods?), but filtering through the pods there is cumbersome.
@jacekwachowiak I also agree that it should be Prometheus's responsibility, although it would be useful to support it on our side. Personally, the best solution right now might be to write a cron job that cleans up the metrics of removed pods from Prometheus (a rough sketch is below). Similar things have been an issue in other open-source projects as well (it is slightly different, but shares a similar concern: fluent/fluent-plugin-prometheus#20).
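For illustration, a minimal sketch of such a cleanup job, assuming Prometheus was started with --web.enable-admin-api and that the set of live pods can be obtained from Kubernetes; the Prometheus address, metric name, and label name are placeholders, not Clipper's actual ones:

```python
import requests

PROMETHEUS_URL = "http://localhost:9090"  # placeholder address

def live_pod_names():
    """Return the set of pod names that currently exist (placeholder:
    would normally come from the Kubernetes API or kubectl)."""
    return {"my-model-0", "my-model-1"}

def cleanup_stale_pod_metrics(metric="clipper_latency_ms", label="pod"):
    # Find every pod name that Prometheus currently has series for.
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/series",
                        params={"match[]": metric})
    resp.raise_for_status()
    seen_pods = {s[label] for s in resp.json()["data"] if label in s}

    # Delete the series belonging to pods that no longer exist.
    for pod in seen_pods - live_pod_names():
        requests.post(f"{PROMETHEUS_URL}/api/v1/admin/tsdb/delete_series",
                      params={"match[]": f'{metric}{{{label}="{pod}"}}'})

    # Reclaim disk space for the deleted series.
    requests.post(f"{PROMETHEUS_URL}/api/v1/admin/tsdb/clean_tombstones")

if __name__ == "__main__":
    cleanup_stale_pod_metrics()
```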
Yes, that's the same kind of problem |
Sometimes when a model is removed, some information stays behind. When you try to run stop_all_model_containers(), an error is printed out:
The problem is that even after a longer wait (expecting the model to be under removal after that message), running get_all_models() does not return an empty list (or, in general, a list without this specific model). After that, trying to launch the same model with another version leads to all queries being defaulted. I tried multiple times to fix this, but nothing works, and I need to restart the cluster for the invalid model to go away. The pods are terminated correctly, so it seems that Clipper retains some information that it shouldn't. A sketch of the failing sequence is below.
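A minimal sketch of that sequence, assuming the clipper_admin Python client on Kubernetes; the length of the wait and the model details are placeholders:

```python
import time
from clipper_admin import ClipperConnection, KubernetesContainerManager

clipper_conn = ClipperConnection(KubernetesContainerManager())
clipper_conn.connect()

clipper_conn.stop_all_model_containers()  # prints an error for the affected model
time.sleep(60)                            # wait, expecting the removal to finish

# Expected: an empty list (or at least one without the stopped model).
# Observed: the stopped model is still listed, i.e. it stays behind as invalid.
print(clipper_conn.get_all_models())

# Re-deploying the same model under a new version at this point
# leads to all queries returning the application's default output.
```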
To clarify when it happens: I observed that scaling down from many replicas (20) to only a couple often leads to the model becoming disconnected and unresponsive. The natural step for me is to stop it and rerun it fresh with fewer replicas. Obviously, if there is another model in the system, restarting Clipper completely is not the correct approach, but it is the only thing that works for me.