Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent/runner: Fix vm-monitor reconnect busyloop #552

Merged

Conversation

sharnoff
Copy link
Member

@sharnoff sharnoff commented Oct 9, 2023

Introduced by #506, noticed after deploying v0.18.0 on staging.

Basically, we just never set the start time, so "time since last start" was always equal to "time since unix epoch", which means we'd always think that the connection was running long enough to immediately retry.


Some logs demonstrating this:

2023-10-09 02:26:35.942	{"level":"error","ts":1696818395.9423525,"logger":"autoscaler-agent.runner.vm-monitor","caller":"agent/runner.go:518","msg":"Failed to connect to vm-monitor","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"taskName":"vm-monitor reconnection loop","addr":"ws://10.6.28.2:10301/monitor","error":"error establishing websocket connection to ws://10.6.28.2:10301/monitor: failed to WebSocket dial: failed to send handshake request: Get \"http://10.6.28.2:10301/monitor\": dial tcp 10.6.28.2:10301: connect: connection refused","stacktrace":"github.com/neondatabase/autoscaling/pkg/agent.(*Runner).connectToMonitorLoop\n\t/workspace/pkg/agent/runner.go:518\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).Run.func6\n\t/workspace/pkg/agent/runner.go:305\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:388"}
2023-10-09 02:26:35.942	{"level":"info","ts":1696818395.9423616,"logger":"autoscaler-agent.runner.vm-monitor","caller":"agent/runner.go:457","msg":"Reset previous vm-monitor connection","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"taskName":"vm-monitor reconnection loop"}
2023-10-09 02:26:35.942	{"level":"info","ts":1696818395.9423647,"logger":"autoscaler-agent.runner.vm-monitor","caller":"agent/runner.go:481","msg":"Immediately retrying connection to vm-monitor","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"taskName":"vm-monitor reconnection loop","addr":"ws://10.6.28.2:10301/monitor","totalRuntime":9223372036.854776}
2023-10-09 02:26:35.942	{"level":"info","ts":1696818395.9423707,"logger":"autoscaler-agent.runner.vm-monitor","caller":"agent/dispatcher.go:207","msg":"Connecting to vm-monitor via websocket","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"taskName":"vm-monitor reconnection loop","addr":"ws://10.6.28.2:10301/monitor"}
2023-10-09 02:26:35.942	{"level":"debug","ts":1696818395.9424143,"logger":"autoscaler-agent.runner.exec.core","caller":"executor/core.go:106","msg":"Recalculating ActionSet","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"now":1696818395.9424112,"state":{"config":{"DefaultScalingConfig":{"loadAverageFractionTarget":0.9,"memoryUsageFractionTarget":0.75},"PluginRequestTick":5000000000,"PluginDeniedRetryWait":2000000000,"MonitorDeniedDownscaleCooldown":5000000000,"MonitorRequestedUpscaleValidPeriod":10000000000,"MonitorRetryWait":3000000000},"vm":{"name":"compute-white-salad-01036121","namespace":"default","cpu":{"min":"250m","max":"250m","use":"250m"},"mem":{"min":1,"max":1,"use":1,"slotSize":"1Gi"},"alwaysMigrate":false,"scalingEnabled":true},"plugin":{"alive":true,"ongoingRequest":false,"computeUnit":{"vCPUs":"250m","mem":1},"lastRequest":{"time":"2023-10-09T02:26:35.736571142Z","resources":{"vCPUs":"250m","mem":1}},"permit":{"vCPUs":"250m","mem":1}},"monitor":{"ongoingRequest":null,"requestedUpscale":null,"deniedDownscale":null,"approved":null,"downscaleFailureAt":null,"upscaleFailureAt":null},"neonvm":{"lastSuccess":null,"ongoingRequested":null,"requestFailedAt":null},"metrics":{"loadAvg1M":0.046875,"loadAvg5M":0.024414062,"memoryUsageBytes":213125570}}}
2023-10-09 02:26:35.942	{"level":"info","ts":1696818395.9424286,"logger":"autoscaler-agent.runner.exec.core","caller":"core/state.go:221","msg":"Calculated desired resources","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"current":{"vCPU":"0.25","mem":1},"target":{"vCPU":"0.25","mem":1}}
2023-10-09 02:26:35.942	{"level":"debug","ts":1696818395.9424388,"logger":"autoscaler-agent.runner.exec.core","caller":"executor/core.go:109","msg":"New ActionSet","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"now":1696818395.9424112,"actions":{"wait":{"duration":4794159798}}}
2023-10-09 02:26:35.942	{"level":"error","ts":1696818395.9424558,"logger":"autoscaler-agent.runner.vm-monitor","caller":"agent/runner.go:518","msg":"Failed to connect to vm-monitor","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"taskName":"vm-monitor reconnection loop","addr":"ws://10.6.28.2:10301/monitor","error":"error establishing websocket connection to ws://10.6.28.2:10301/monitor: failed to WebSocket dial: failed to send handshake request: Get \"http://10.6.28.2:10301/monitor\": dial tcp 10.6.28.2:10301: connect: connection refused","stacktrace":"github.com/neondatabase/autoscaling/pkg/agent.(*Runner).connectToMonitorLoop\n\t/workspace/pkg/agent/runner.go:518\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).Run.func6\n\t/workspace/pkg/agent/runner.go:305\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:388"}
2023-10-09 02:26:35.942	{"level":"info","ts":1696818395.9424653,"logger":"autoscaler-agent.runner.vm-monitor","caller":"agent/runner.go:457","msg":"Reset previous vm-monitor connection","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"taskName":"vm-monitor reconnection loop"}
2023-10-09 02:26:35.942	{"level":"info","ts":1696818395.9424684,"logger":"autoscaler-agent.runner.vm-monitor","caller":"agent/runner.go:481","msg":"Immediately retrying connection to vm-monitor","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"taskName":"vm-monitor reconnection loop","addr":"ws://10.6.28.2:10301/monitor","totalRuntime":9223372036.854776}
2023-10-09 02:26:35.942	{"level":"info","ts":1696818395.9424741,"logger":"autoscaler-agent.runner.vm-monitor","caller":"agent/dispatcher.go:207","msg":"Connecting to vm-monitor via websocket","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"taskName":"vm-monitor reconnection loop","addr":"ws://10.6.28.2:10301/monitor"}
2023-10-09 02:26:35.942	{"level":"debug","ts":1696818395.9425175,"logger":"autoscaler-agent.runner.exec.core","caller":"executor/core.go:106","msg":"Recalculating ActionSet","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"now":1696818395.942517,"state":{"config":{"DefaultScalingConfig":{"loadAverageFractionTarget":0.9,"memoryUsageFractionTarget":0.75},"PluginRequestTick":5000000000,"PluginDeniedRetryWait":2000000000,"MonitorDeniedDownscaleCooldown":5000000000,"MonitorRequestedUpscaleValidPeriod":10000000000,"MonitorRetryWait":3000000000},"vm":{"name":"compute-white-salad-01036121","namespace":"default","cpu":{"min":"250m","max":"250m","use":"250m"},"mem":{"min":1,"max":1,"use":1,"slotSize":"1Gi"},"alwaysMigrate":false,"scalingEnabled":true},"plugin":{"alive":true,"ongoingRequest":false,"computeUnit":{"vCPUs":"250m","mem":1},"lastRequest":{"time":"2023-10-09T02:26:35.736571142Z","resources":{"vCPUs":"250m","mem":1}},"permit":{"vCPUs":"250m","mem":1}},"monitor":{"ongoingRequest":null,"requestedUpscale":null,"deniedDownscale":null,"approved":null,"downscaleFailureAt":null,"upscaleFailureAt":null},"neonvm":{"lastSuccess":null,"ongoingRequested":null,"requestFailedAt":null},"metrics":{"loadAvg1M":0.046875,"loadAvg5M":0.024414062,"memoryUsageBytes":213125570}}}
2023-10-09 02:26:35.942	{"level":"info","ts":1696818395.9425313,"logger":"autoscaler-agent.runner.exec.core","caller":"core/state.go:221","msg":"Calculated desired resources","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"current":{"vCPU":"0.25","mem":1},"target":{"vCPU":"0.25","mem":1}}
2023-10-09 02:26:35.942	{"level":"debug","ts":1696818395.9425418,"logger":"autoscaler-agent.runner.exec.core","caller":"executor/core.go:109","msg":"New ActionSet","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"now":1696818395.942517,"actions":{"wait":{"duration":4794054255}}}
2023-10-09 02:26:35.942	{"level":"error","ts":1696818395.94256,"logger":"autoscaler-agent.runner.vm-monitor","caller":"agent/runner.go:518","msg":"Failed to connect to vm-monitor","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"taskName":"vm-monitor reconnection loop","addr":"ws://10.6.28.2:10301/monitor","error":"error establishing websocket connection to ws://10.6.28.2:10301/monitor: failed to WebSocket dial: failed to send handshake request: Get \"http://10.6.28.2:10301/monitor\": dial tcp 10.6.28.2:10301: connect: connection refused","stacktrace":"github.com/neondatabase/autoscaling/pkg/agent.(*Runner).connectToMonitorLoop\n\t/workspace/pkg/agent/runner.go:518\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).Run.func6\n\t/workspace/pkg/agent/runner.go:305\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:388"}
2023-10-09 02:26:35.942	{"level":"info","ts":1696818395.9425693,"logger":"autoscaler-agent.runner.vm-monitor","caller":"agent/runner.go:457","msg":"Reset previous vm-monitor connection","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"taskName":"vm-monitor reconnection loop"}
2023-10-09 02:26:35.942	{"level":"info","ts":1696818395.9425724,"logger":"autoscaler-agent.runner.vm-monitor","caller":"agent/runner.go:481","msg":"Immediately retrying connection to vm-monitor","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"taskName":"vm-monitor reconnection loop","addr":"ws://10.6.28.2:10301/monitor","totalRuntime":9223372036.854776}
2023-10-09 02:26:35.942	{"level":"info","ts":1696818395.9425783,"logger":"autoscaler-agent.runner.vm-monitor","caller":"agent/dispatcher.go:207","msg":"Connecting to vm-monitor via websocket","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"taskName":"vm-monitor reconnection loop","addr":"ws://10.6.28.2:10301/monitor"}
2023-10-09 02:26:35.942	{"level":"debug","ts":1696818395.9426212,"logger":"autoscaler-agent.runner.exec.core","caller":"executor/core.go:106","msg":"Recalculating ActionSet","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"now":1696818395.9426205,"state":{"config":{"DefaultScalingConfig":{"loadAverageFractionTarget":0.9,"memoryUsageFractionTarget":0.75},"PluginRequestTick":5000000000,"PluginDeniedRetryWait":2000000000,"MonitorDeniedDownscaleCooldown":5000000000,"MonitorRequestedUpscaleValidPeriod":10000000000,"MonitorRetryWait":3000000000},"vm":{"name":"compute-white-salad-01036121","namespace":"default","cpu":{"min":"250m","max":"250m","use":"250m"},"mem":{"min":1,"max":1,"use":1,"slotSize":"1Gi"},"alwaysMigrate":false,"scalingEnabled":true},"plugin":{"alive":true,"ongoingRequest":false,"computeUnit":{"vCPUs":"250m","mem":1},"lastRequest":{"time":"2023-10-09T02:26:35.736571142Z","resources":{"vCPUs":"250m","mem":1}},"permit":{"vCPUs":"250m","mem":1}},"monitor":{"ongoingRequest":null,"requestedUpscale":null,"deniedDownscale":null,"approved":null,"downscaleFailureAt":null,"upscaleFailureAt":null},"neonvm":{"lastSuccess":null,"ongoingRequested":null,"requestFailedAt":null},"metrics":{"loadAvg1M":0.046875,"loadAvg5M":0.024414062,"memoryUsageBytes":213125570}}}
2023-10-09 02:26:35.942	{"level":"info","ts":1696818395.9426353,"logger":"autoscaler-agent.runner.exec.core","caller":"core/state.go:221","msg":"Calculated desired resources","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"current":{"vCPU":"0.25","mem":1},"target":{"vCPU":"0.25","mem":1}}
2023-10-09 02:26:35.942	{"level":"debug","ts":1696818395.9426455,"logger":"autoscaler-agent.runner.exec.core","caller":"executor/core.go:109","msg":"New ActionSet","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"now":1696818395.9426205,"actions":{"wait":{"duration":4793950764}}}
2023-10-09 02:26:35.942	{"level":"error","ts":1696818395.9426622,"logger":"autoscaler-agent.runner.vm-monitor","caller":"agent/runner.go:518","msg":"Failed to connect to vm-monitor","virtualmachine":{"namespace":"default","name":"compute-white-salad-01036121"},"pod":{"namespace":"default","name":"compute-white-salad-01036121-tfdnr"},"taskName":"vm-monitor reconnection loop","addr":"ws://10.6.28.2:10301/monitor","error":"error establishing websocket connection to ws://10.6.28.2:10301/monitor: failed to WebSocket dial: failed to send handshake request: Get \"http://10.6.28.2:10301/monitor\": dial tcp 10.6.28.2:10301: connect: connection refused","stacktrace":"github.com/neondatabase/autoscaling/pkg/agent.(*Runner).connectToMonitorLoop\n\t/workspace/pkg/agent/runner.go:518\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).Run.func6\n\t/workspace/pkg/agent/runner.go:305\ngithub.com/neondatabase/autoscaling/pkg/agent.(*Runner).spawnBackgroundWorker.func1\n\t/workspace/pkg/agent/runner.go:388"}

Introduced by #506, noticed after deploying v0.18.0 on staging.

Basically, we just never set the start time, so "time since last start"
was always equal to "time since unix epoch", which means we'd always
think that the connection was running long enough to immediately retry.
@sharnoff sharnoff merged commit 37e983e into main Oct 9, 2023
7 checks passed
@sharnoff sharnoff deleted the sharnoff/agent-runner-fix-monitor-reconnect-busyloop branch October 9, 2023 03:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant