GESIS BinderHub server was accumulating Running pods that were more than 1 day old #2686
OVH is seeing this, too. I suspect it's a recent update to jupyterhub/zero-to-jupyterhub that's causing something to get missed. Two categories of problem to track down:
Thanks for the information about OVH.
I've looked through some logs, and OVH definitely has quite a few orphaned pods. So I think a change in kubespawner is making it possible to leave orphaned pods, likely by failing to clean up after a failed start (hard to say precisely, because OVH has no log retention, so we can only look back into the very recent past).

OVH is also showing occasional reflector failure events, which may well be related, because deleting a pod that is not in the reflector will skip the deletion. Unfortunately, JupyterHub doesn't give Spawners a hook for finding orphaned resources.

Here's a notebook to collect, view, and (optionally) clean up orphaned pods on a cluster.
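The gist of such a cleanup is simple: list the singleuser pods, keep the ones whose start time is older than some cutoff, and optionally delete them. Here is a minimal sketch of that logic using the `kubernetes` Python client. The namespace, label selector, and one-day cutoff are assumptions for illustration, not the actual notebook's settings.

```python
# Hedged sketch: find (and optionally delete) singleuser pods older than a
# cutoff. Namespace/label selector are assumed values; adjust per deployment.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=1)


def is_stale(start_time, now=None, max_age=MAX_AGE):
    """Return True if a pod started more than max_age ago (or never started)."""
    if start_time is None:  # pod never reached Running; treat as stale
        return True
    now = now or datetime.now(timezone.utc)
    return now - start_time > max_age


def find_stale_pods(namespace="binder", selector="component=singleuser-server"):
    # Requires the `kubernetes` package and cluster credentials.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=selector)
    return [p for p in pods.items if is_stale(p.status.start_time)]


if __name__ == "__main__":
    for pod in find_stale_pods():
        print(pod.metadata.name, pod.status.start_time)
        # Uncomment to actually clean up:
        # from kubernetes import client
        # client.CoreV1Api().delete_namespaced_pod(
        #     pod.metadata.name, pod.metadata.namespace)
```

Keeping the age check as a pure function makes it easy to test without a cluster; the API calls are isolated in `find_stale_pods`.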
Around 2023-06-21 17:15 CEST, we launched a cron job to work around this problem.
Further investigation is needed to find the root cause.
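For reference, a recurring cleanup like the one described above could be run in-cluster as a Kubernetes CronJob. This is a hypothetical sketch, not the actual GESIS job: the schedule, service account, image, and arguments are all assumed placeholders, and the service account would need RBAC permission to list and delete pods.

```yaml
# Hypothetical CronJob sketch; image and serviceAccountName are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cull-stale-user-pods
spec:
  schedule: "0 * * * *"        # hourly
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-culler   # assumed; needs list/delete on pods
          restartPolicy: OnFailure
          containers:
          - name: culler
            image: example.org/pod-culler:latest   # assumed cleanup-script image
            args: ["--max-age=1d", "--namespace=binder"]
```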