Production deploy can cause site-wide downtime #316

Open
zwolf opened this issue Sep 8, 2022 · 0 comments

zwolf commented Sep 8, 2022

During a production deploy of the http-frontend this week, we started getting service down alerts. There were two new http-frontend pods, both of which Kubernetes reported as unresponsive to liveness and readiness probes. That caused them to be marked unhealthy and restarted, which in turn broke traffic routing. Neither pod showed anything strange in its logs, and both seemed to be running fine apart from the k8s probe failure events. The deployment's rollout strategy isn't defined, which should mean it uses the default "rolling update" behavior behind the HPA. So if the new pods were failing their health checks, why did the originals ever get shut down?
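For reference, a Deployment with no `strategy` block falls back to the Kubernetes defaults sketched below. These are the documented defaults, not values copied from the http-frontend manifest:

```yaml
# Kubernetes defaults for a Deployment with no explicit rollout strategy;
# shown for reference only, not taken from our manifests.
spec:
  minReadySeconds: 0        # a pod counts as "available" as soon as it first reports Ready
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%         # extra new pods allowed above the desired replica count
      maxUnavailable: 25%   # old pods that may be taken down before replacements are Ready
```

With `minReadySeconds: 0`, a new pod that passes its first readiness check and only starts failing probes afterwards still counts as available, so the rollout can proceed and the old ReplicaSet gets scaled down. That's one plausible way the originals were shut down even though the replacements ended up unhealthy.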

Maybe we can adjust the deployment to ensure that the new pods are passing their health checks before they're promoted (see the sketch below). It's also possible that we could use staging as a more robust test environment for verifying that traffic can be pushed through the nginx service, though that assumes the problem would manifest in staging at all.
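Something along these lines would force the rollout to wait for healthy replacements before touching the old pods. This is a sketch only; the probe path, port, and timing values are placeholders rather than our current config:

```yaml
# Hypothetical tightened rollout for the http-frontend deployment;
# probe path/port and timings are placeholders, not the real config.
spec:
  minReadySeconds: 30            # new pod must stay Ready this long before it counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0          # never remove an old pod until a replacement is available
  template:
    spec:
      containers:
        - name: http-frontend
          readinessProbe:
            httpGet:
              path: /            # placeholder; should hit whatever nginx actually serves
              port: 80
            periodSeconds: 10
            failureThreshold: 3
```

With `maxUnavailable: 0`, the old pods stay in rotation until a new pod has been Ready for the full `minReadySeconds` window, so a replacement that flaps its probes shouldn't be able to take the originals down with it.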
