During a production deploy of the http-frontend this week, we started getting service-down alerts. There were two new http-frontend pods, and Kubernetes showed both as unresponsive to liveness and readiness probes. As a result they were deemed unhealthy and restarted, which caused traffic routing to fail. Both of these pods showed nothing strange in their logs and seemed to be running fine apart from the k8s probe failure events. The deployment's rollout strategy isn't defined, which should mean that it uses the default "rolling update" strategy behind its HPA. So if these pods were failing their health checks, why did the originals ever get shut down?
Maybe we can adjust the deployment to ensure that the new pods are passing the health checks before they're promoted. It's also possible that we could use staging as a more robust test environment for ensuring that traffic can be pushed through the nginx service, though this assumes that this problem would manifest in staging.
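To help reason about why the originals could have been shut down, here is a minimal sketch of how a rolling update budgets pods. It assumes the deployment really is on the default RollingUpdate strategy with `maxSurge` and `maxUnavailable` both at 25% (surge rounded up, unavailable rounded down). The function name and the replica counts are hypothetical; I haven't checked the real replica count for http-frontend.

```python
def rolling_update_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Estimate pod-count bounds during a RollingUpdate (illustrative helper)."""
    # maxSurge percentages are rounded up (integer ceiling division).
    surge = -(-replicas * max_surge_pct // 100)
    # maxUnavailable percentages are rounded down.
    unavailable = replicas * max_unavailable_pct // 100
    # Both cannot be zero, otherwise the rollout could never make progress.
    if surge == 0 and unavailable == 0:
        unavailable = 1
    return {
        "surge": surge,
        "unavailable": unavailable,
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


if __name__ == "__main__":
    # Hypothetical replica counts, not taken from the real deployment.
    for replicas in (2, 4, 8):
        print(replicas, rolling_update_bounds(replicas))
```

If `min_available_pods` comes out lower than the replica count, some old pods can be terminated before any new pod has ever reported Ready, which might be part of what happened here. It may also be worth checking whether the new pods passed readiness briefly before failing; if so, setting `minReadySeconds` on the deployment is one option to evaluate.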