Production deploy can cause site-wide downtime #316

Open
zwolf opened this issue Sep 8, 2022 · 0 comments

zwolf commented Sep 8, 2022

During a production deploy of the http-frontend this week, we started getting service down alerts. There were two new http-frontend pods, both of which Kubernetes reported as unresponsive to liveness and readiness probes. That caused them to be marked unhealthy and restarted, which in turn broke traffic routing. Neither pod showed anything strange in its logs, and both seemed to be running fine apart from the k8s probe failure events. The deployment's rollout strategy isn't defined, which should mean it uses the default "rolling update" behavior behind the HPA. So if the new pods were failing their health checks, why did the originals ever get shut down?
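For reference, a Deployment with no `strategy` block falls back to the Kubernetes defaults sketched below. These are the documented defaults, not values copied from the http-frontend manifest:

```yaml
# Kubernetes defaults for a Deployment with no explicit rollout strategy;
# shown for reference only, not taken from our manifests.
spec:
  minReadySeconds: 0        # a pod counts as "available" as soon as it first reports Ready
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%         # extra new pods allowed above the desired replica count
      maxUnavailable: 25%   # old pods that may be taken down before replacements are Ready
```

With `minReadySeconds: 0`, a new pod that passes its first readiness check and only starts failing probes afterwards still counts as available, so the rollout can proceed and the old ReplicaSet gets scaled down. That's one plausible way the originals were shut down even though the replacements ended up unhealthy.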

Maybe we can adjust the deployment to ensure that the new pods are passing their health checks before they're promoted (see the sketch below). It's also possible that we could use staging as a more robust test environment for verifying that traffic can be pushed through the nginx service, though that assumes the problem would manifest in staging at all.
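Something along these lines would force the rollout to wait for healthy replacements before touching the old pods. This is a sketch only; the probe path, port, and timing values are placeholders rather than our current config:

```yaml
# Hypothetical tightened rollout for the http-frontend deployment;
# probe path/port and timings are placeholders, not the real config.
spec:
  minReadySeconds: 30            # new pod must stay Ready this long before it counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0          # never remove an old pod until a replacement is available
  template:
    spec:
      containers:
        - name: http-frontend
          readinessProbe:
            httpGet:
              path: /            # placeholder; should hit whatever nginx actually serves
              port: 80
            periodSeconds: 10
            failureThreshold: 3
```

With `maxUnavailable: 0`, the old pods stay in rotation until a new pod has been Ready for the full `minReadySeconds` window, so a replacement that flaps its probes shouldn't be able to take the originals down with it.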
