
Stopped model marked as invalid prevents running a new one #740

Open
jacekwachowiak opened this issue Jul 5, 2019 · 13 comments

@jacekwachowiak
Contributor

jacekwachowiak commented Jul 5, 2019

Sometimes when a model is removed, some information stays behind. When you then try to run stop_all_model_containers(), an error is printed:

ClipperException: Received error status code: 400 and message: model with name 'sum-model' and version '1' is already marked as invalid'

The problem is that even after a longer wait (expecting the model to be removed some time after that message), get_all_models() does not return an empty list (or, more generally, a list without this specific model). After that, trying to launch the same model under another version leads to all queries returning the default value.
I have tried multiple times to fix this, but nothing works and I need to restart the cluster for the invalid model to go away. The pods are terminated correctly, so it seems that Clipper retains some information that it shouldn't.
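
A minimal sketch of the sequence, assuming a standard clipper_admin setup on Kubernetes (the model name matches the error message above; everything else is illustrative):

from clipper_admin import ClipperConnection, KubernetesContainerManager

clipper_conn = ClipperConnection(KubernetesContainerManager())
clipper_conn.connect()

try:
    clipper_conn.stop_all_model_containers()
except Exception as e:
    # ClipperException: "... model with name 'sum-model' and version '1'
    # is already marked as invalid"
    print(e)

# Even after waiting, the stale model is still listed:
print(clipper_conn.get_all_models())  # e.g. ['sum-model:1'] instead of []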

To clarify when it happens: I observed that scaling down from many replicas (20) to only a couple of them often leaves the model disconnected and unresponsive. The natural step for me is to stop it and rerun it fresh with fewer replicas. Obviously, if there is another model in the system, restarting Clipper completely is not the correct approach, but it is the only thing that works for me.

@withsmilo
Collaborator

@jacekwachowiak : That's strange. I think some timing issue might be causing your problem. Could you please share your scenario?

@jacekwachowiak
Contributor Author

Basically I start the cluster, run a model, scale it to X replicas, run different tests (usually sending different loads for a few minutes), and then change the parameters: add more replicas, remove some, load another predict function, and so on. I noticed that scaling down the replicas is risky; sometimes one or more of the pods gets disconnected and is not restarted automatically, and I need to kill it in the console so that it reconnects.
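
Roughly, the loop looks like this (a sketch only: the model name, replica counts and the Python closure deployer are placeholders, and app registration/linking is omitted):

from clipper_admin import ClipperConnection, KubernetesContainerManager
from clipper_admin.deployers import python as python_deployer

clipper_conn = ClipperConnection(KubernetesContainerManager())
clipper_conn.connect()

# Deploy a toy model; any deployer behaves the same for this purpose.
python_deployer.deploy_python_closure(
    clipper_conn, name="sum-model", version="1", input_type="doubles",
    func=lambda xs: [str(sum(x)) for x in xs])

# Scale up, run the load tests, then scale down.
clipper_conn.set_num_replicas("sum-model", num_replicas=20, version="1")
# ... send different loads for a few minutes ...
clipper_conn.set_num_replicas("sum-model", num_replicas=2, version="1")
# After the scale-down, replicas sometimes stay disconnected and queries
# start returning the default value.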

@withsmilo
Collaborator

You're right. It seems to be risky to scale models in/out. (I didn't do that in my case.) I will check it this weekend.

@jacekwachowiak
Contributor Author

jacekwachowiak commented Jul 5, 2019

I just confirmed that scaling to 10 replicas and back was OK; then I repeated that with 20, and when I scaled down, all my queries returned default values.
Update: the model could not be removed successfully.
There is another thing: when we scale down, the metrics for the removed replicas remain, which means that reading, for example, the latency buckets will fail unless we exclude the NaN values, since the average needs all values to be numeric. In my example I get 20 variables, of which 18 are blank and therefore useless.

@withsmilo
Collaborator

I will patch the problem where the metric information for removed replicas or models remains. I'm sorry, I had already patched this in my own Clipper but didn't create a PR for it.

@withsmilo
Collaborator

withsmilo commented Jul 9, 2019

@jacekwachowiak
Sorry about the late reply.

I will patch the problem where the metric information for removed replicas or models remains. I'm sorry, I had already patched this in my own Clipper but didn't create a PR for it.

I already merged that fix (#711), so this is not the cause of the problem.

I think supporting scale-down is a difficult problem. (Supporting scale-up is relatively easy.) If query_frontend cannot communicate with a model container, it is difficult to determine whether the container was deleted or whether there is a temporary network problem. To resolve this, Clipper checks the connection to each model container every 10 seconds and removes the associated resources if no ping arrives for 30 seconds. Your problem appears to be a timing issue in this process. I am looking for a scenario that can reproduce it.

The relevant check compares the time since the container's last activity against the timeout: .count() > CONTAINER_ACTIVITY_TIMEOUT_MILLS) {
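
For illustration only, the same liveness logic sketched in Python rather than the actual C++ query_frontend code (the names here are invented; only the 10 s / 30 s behaviour follows the description above):

import time

CONTAINER_CHECK_INTERVAL_SECS = 10     # the frontend checks every 10 seconds
CONTAINER_ACTIVITY_TIMEOUT_SECS = 30   # mirrors CONTAINER_ACTIVITY_TIMEOUT_MILLS

def reap_inactive_containers(last_activity, remove_container):
    # last_activity maps container id -> timestamp of its last ping
    now = time.time()
    for container_id, last_seen in list(last_activity.items()):
        if now - last_seen > CONTAINER_ACTIVITY_TIMEOUT_SECS:
            # No ping for 30 seconds: drop the container's queues, metrics, etc.
            remove_container(container_id)
            del last_activity[container_id]

def monitor_loop(last_activity, remove_container):
    # Runs the check on the same cadence as the frontend's connection check.
    while True:
        reap_inactive_containers(last_activity, remove_container)
        time.sleep(CONTAINER_CHECK_INTERVAL_SECS)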

@jacekwachowiak
Contributor Author

Is there a recommended way to gather metrics in that case? Leaving aside the problem of removing the pods, when I want to take the average latency across all pods, the average will not work because there are past records for pods that are no longer there. And of course, to use past data, the records for the now-removed pods must remain, yet they should not be picked up if we ask for data from, for example, the last x minutes.
Maybe this is a Prometheus design problem, though; I am not sure.

@withsmilo
Collaborator

The current version of Clipper explicitly deletes all the metrics for removed model pods. If you'd like to keep them, please comment out the following code and rebuild the query_frontend image.

if (delete_model_metric_if_necessary(model_id)) {
  log_info_formatted(LOGGING_TAG_TASK_EXECUTOR,
                     "Deleted metric for model: {} : {}",
                     model_id.get_name(), model_id.get_id());
}

@jacekwachowiak
Contributor Author

I'm sorry if this is a stupid question, but this removes the entry in Redis, right? The past metrics sent to Prometheus will not be touched?

@withsmilo
Collaborator

You're right.

@jacekwachowiak
Contributor Author

jacekwachowiak commented Jul 16, 2019

Then it does not fix the problem where Prometheus tries to calculate percentiles/averages by aggregating every pod with the same name, even though some were removed. I guess it's not Clipper's problem, but the example metrics taken from the Grafana dashboard use operations such as:

avg(rate(clipper_mc_end_to_end_latency_ms_sum[30s])/rate(clipper_mc_end_to_end_latency_ms_count[30s]))

which will not show anything unless all pods are connected and feeding new data; even one removed or disconnected pod will make the results void. Again, I am getting more and more convinced that it may be Prometheus' responsibility (some more complicated query could be written to exclude the stale pods?), but filtering through the pods there is cumbersome.
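
One client-side workaround might look like this (a sketch only: it assumes Prometheus is reachable at the placeholder address below and simply drops the NaN values reported for the removed replicas):

import math
import requests

PROM_URL = "http://localhost:9090"  # placeholder address
EXPR = ("rate(clipper_mc_end_to_end_latency_ms_sum[30s])"
        " / rate(clipper_mc_end_to_end_latency_ms_count[30s])")

# Ask Prometheus for the per-pod latency estimate, then average only the
# entries that are actually numeric.
resp = requests.get(PROM_URL + "/api/v1/query", params={"query": EXPR})
results = resp.json()["data"]["result"]

per_pod = [float(r["value"][1]) for r in results]
live = [v for v in per_pod if not math.isnan(v)]
print(sum(live) / len(live) if live else "no live replicas")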

@rkooo567
Collaborator

@jacekwachowiak I also agree it should be Prometheus' responsibility, although it would be useful to support it from our side. Personally, the best solution for now might be to write a cron job that cleans up the metrics of removed pods from Prometheus. Similar things seem to have been an issue in other open-source projects as well (it is slightly different, but shares a similar concern: fluent/fluent-plugin-prometheus#20).
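
A sketch of what such a cleanup could look like (assumptions: Prometheus was started with --web.enable-admin-api, it is reachable at the placeholder address below, and the series selectors for the removed replicas are filled in by hand; how to discover them is deployment-specific and not shown):

import requests

PROM_URL = "http://localhost:9090"  # placeholder address

# Series selectors for replicas that no longer exist. The metric name comes
# from the query above; the label used here is only a guess at how the
# removed replicas could be identified.
dead_series = [
    'clipper_mc_end_to_end_latency_ms_sum{model="sum-model:1"}',
]

# Delete the matching series, then compact away the resulting tombstones.
requests.post(PROM_URL + "/api/v1/admin/tsdb/delete_series",
              params={"match[]": dead_series})
requests.post(PROM_URL + "/api/v1/admin/tsdb/clean_tombstones")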

@jacekwachowiak
Contributor Author

Yes, that's the same kind of problem
