
Stopped model marked as invalid prevents running a new one #740

Open
jacekwachowiak opened this issue Jul 5, 2019 · 13 comments

@jacekwachowiak
Contributor

jacekwachowiak commented Jul 5, 2019

Sometimes when a model is removed, some information stays behind. When you then try to run stop_all_model_containers(), an error is printed:

ClipperException: Received error status code: 400 and message: model with name 'sum-model' and version '1' is already marked as invalid'

The problem is that even after a longer wait (expecting the model to be removed some time after that message), get_all_models() does not return an empty list (or, more generally, a list without this specific model). After that, trying to launch the same model under another version leads to all queries returning the default value.
I have tried multiple times to fix this, but nothing works and I need to restart the cluster for the invalid model to go away. The pods are terminated correctly, so it seems that Clipper retains some information that it shouldn't.
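
A minimal sketch of the sequence, assuming a standard clipper_admin setup on Kubernetes (the model name matches the error message above; everything else is illustrative):

from clipper_admin import ClipperConnection, KubernetesContainerManager

clipper_conn = ClipperConnection(KubernetesContainerManager())
clipper_conn.connect()

try:
    clipper_conn.stop_all_model_containers()
except Exception as e:
    # ClipperException: "... model with name 'sum-model' and version '1'
    # is already marked as invalid"
    print(e)

# Even after waiting, the stale model is still listed:
print(clipper_conn.get_all_models())  # e.g. ['sum-model:1'] instead of []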

To clarify when it happens: I observed that scaling down from many replicas (20) to only a couple of them often leaves the model disconnected and unresponsive. The natural step for me is to stop it and rerun it fresh with fewer replicas. Obviously, if there is another model in the system, restarting Clipper completely is not the correct approach, but it is the only thing that works for me.

@withsmilo
Collaborator

@jacekwachowiak : That's strange. I think some timing issue might be causing your problem. Could you please share your scenario?

@jacekwachowiak
Contributor Author

Basically I start the cluster, run a model, scale it to X replicas, run different tests (usually sending different loads for a few minutes), and then change the parameters: add more replicas, remove some, load another predict function, and so on. I noticed that scaling down the replicas is risky; sometimes one or more of the pods gets disconnected and is not restarted automatically, and I need to kill it in the console so that it reconnects.
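
Roughly, the loop looks like this (a sketch only: the model name, replica counts and the Python closure deployer are placeholders, and app registration/linking is omitted):

from clipper_admin import ClipperConnection, KubernetesContainerManager
from clipper_admin.deployers import python as python_deployer

clipper_conn = ClipperConnection(KubernetesContainerManager())
clipper_conn.connect()

# Deploy a toy model; any deployer behaves the same for this purpose.
python_deployer.deploy_python_closure(
    clipper_conn, name="sum-model", version="1", input_type="doubles",
    func=lambda xs: [str(sum(x)) for x in xs])

# Scale up, run the load tests, then scale down.
clipper_conn.set_num_replicas("sum-model", num_replicas=20, version="1")
# ... send different loads for a few minutes ...
clipper_conn.set_num_replicas("sum-model", num_replicas=2, version="1")
# After the scale-down, replicas sometimes stay disconnected and queries
# start returning the default value.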

@withsmilo
Collaborator

You're right. It seems to be risky to scale models in/out. (I didn't do that in my case.) I will check it this weekend.

@jacekwachowiak
Contributor Author

jacekwachowiak commented Jul 5, 2019

I just confirmed that scaling to 10 replicas and back was OK; then I repeated that with 20, and when I scaled down, all my queries returned default values.
Update: the model could not be removed successfully.
There is another thing: when we scale down, the metrics for the removed replicas remain, which means that reading, for example, the latency buckets will fail unless we exclude the NaN values, since the average needs all values to be numeric. In my example I get 20 variables, of which 18 are blank and therefore useless.

@withsmilo
Collaborator

I will patch the problem where the metric information for removed replicas or models remains. I'm sorry, I had already patched this in my own Clipper but didn't create a PR for it.

@withsmilo
Collaborator

withsmilo commented Jul 9, 2019

@jacekwachowiak
Sorry about the late reply.

I will patch the problem where the metric information for removed replicas or models remains. I'm sorry, I had already patched this in my own Clipper but didn't create a PR for it.

I already merged that fix (#711), so this is not the cause of the problem.

I think supporting scale-down is a difficult problem. (Supporting scale-up is relatively easy.) If query_frontend cannot communicate with a model container, it is difficult to determine whether the container was deleted or whether there is a temporary network problem. To resolve this, Clipper checks the connection to each model container every 10 seconds and removes the associated resources if no ping arrives for 30 seconds. Your problem appears to be a timing issue in this process. I am looking for a scenario that can reproduce it.

The relevant check compares the time since the container's last activity against the timeout: .count() > CONTAINER_ACTIVITY_TIMEOUT_MILLS) {
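
For illustration only, the same liveness logic sketched in Python rather than the actual C++ query_frontend code (the names here are invented; only the 10 s / 30 s behaviour follows the description above):

import time

CONTAINER_CHECK_INTERVAL_SECS = 10     # the frontend checks every 10 seconds
CONTAINER_ACTIVITY_TIMEOUT_SECS = 30   # mirrors CONTAINER_ACTIVITY_TIMEOUT_MILLS

def reap_inactive_containers(last_activity, remove_container):
    # last_activity maps container id -> timestamp of its last ping
    now = time.time()
    for container_id, last_seen in list(last_activity.items()):
        if now - last_seen > CONTAINER_ACTIVITY_TIMEOUT_SECS:
            # No ping for 30 seconds: drop the container's queues, metrics, etc.
            remove_container(container_id)
            del last_activity[container_id]

def monitor_loop(last_activity, remove_container):
    # Runs the check on the same cadence as the frontend's connection check.
    while True:
        reap_inactive_containers(last_activity, remove_container)
        time.sleep(CONTAINER_CHECK_INTERVAL_SECS)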

@jacekwachowiak
Contributor Author

Is there a recommended way to gather metrics in that case? Leaving aside the problem of removing the pods, when I want to take the average latency across all pods, the average will not work because there are past records for pods that are no longer there. And of course, to use past data, the records for the now-removed pods must remain, yet they should not be picked up if we ask for data from, for example, the last x minutes.
Maybe this is a Prometheus design problem, though; I am not sure.

@withsmilo
Collaborator

The current version of Clipper explicitly deletes all the metrics for removed model pods. If you'd like to keep them, please comment out the following code and rebuild the query_frontend image.

if (delete_model_metric_if_necessary(model_id)) {
  log_info_formatted(LOGGING_TAG_TASK_EXECUTOR,
                     "Deleted metric for model: {} : {}",
                     model_id.get_name(), model_id.get_id());
}

@jacekwachowiak
Contributor Author

I'm sorry if this is a stupid question, but this removes the entry in Redis, right? The past metrics sent to Prometheus will not be touched?

@withsmilo
Collaborator

You're right.

@jacekwachowiak
Contributor Author

jacekwachowiak commented Jul 16, 2019

Then it does not fix the problem where Prometheus tries to calculate percentiles/averages by aggregating every pod with the same name, even though some were removed. I guess it's not Clipper's problem, but the example metrics taken from the Grafana dashboard use operations such as:

avg(rate(clipper_mc_end_to_end_latency_ms_sum[30s])/rate(clipper_mc_end_to_end_latency_ms_count[30s]))

which will not show anything unless all pods are connected and feeding new data; even one removed or disconnected pod will make the results void. Again, I am getting more and more convinced that it may be Prometheus' responsibility (some more complicated query could be written to exclude the stale pods?), but filtering through the pods there is cumbersome.
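
One client-side workaround might look like this (a sketch only: it assumes Prometheus is reachable at the placeholder address below and simply drops the NaN values reported for the removed replicas):

import math
import requests

PROM_URL = "http://localhost:9090"  # placeholder address
EXPR = ("rate(clipper_mc_end_to_end_latency_ms_sum[30s])"
        " / rate(clipper_mc_end_to_end_latency_ms_count[30s])")

# Ask Prometheus for the per-pod latency estimate, then average only the
# entries that are actually numeric.
resp = requests.get(PROM_URL + "/api/v1/query", params={"query": EXPR})
results = resp.json()["data"]["result"]

per_pod = [float(r["value"][1]) for r in results]
live = [v for v in per_pod if not math.isnan(v)]
print(sum(live) / len(live) if live else "no live replicas")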

@rkooo567
Collaborator

@jacekwachowiak I also agree it should be Prometheus' responsibility, although it would be useful to support it from our side. Personally, the best solution for now might be to write a cron job that cleans up the metrics of removed pods from Prometheus. Similar things seem to have been an issue in other open-source projects as well (it is slightly different, but shares a similar concern: fluent/fluent-plugin-prometheus#20).
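
A sketch of what such a cleanup could look like (assumptions: Prometheus was started with --web.enable-admin-api, it is reachable at the placeholder address below, and the series selectors for the removed replicas are filled in by hand; how to discover them is deployment-specific and not shown):

import requests

PROM_URL = "http://localhost:9090"  # placeholder address

# Series selectors for replicas that no longer exist. The metric name comes
# from the query above; the label used here is only a guess at how the
# removed replicas could be identified.
dead_series = [
    'clipper_mc_end_to_end_latency_ms_sum{model="sum-model:1"}',
]

# Delete the matching series, then compact away the resulting tombstones.
requests.post(PROM_URL + "/api/v1/admin/tsdb/delete_series",
              params={"match[]": dead_series})
requests.post(PROM_URL + "/api/v1/admin/tsdb/clean_tombstones")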

@jacekwachowiak
Contributor Author

Yes, that's the same kind of problem
