2i2c/ucmerced have 34 user pods running for 6-7 days #3130
So I searched for this error from the above log snippets. The JupyterHub documentation seems to suggest there's nothing we need to do about it: https://jupyterhub.readthedocs.io/en/stable/howto/log-messages.html#failing-suspected-api-request-to-not-running-server |
I googled the 424 HTTP error and found https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/424. So it seems that another process failed, but which one? |
I also see this
|
I know there can sometimes be intermittent errors in reporting activity because the hub is being upgraded etc. and temporarily down, so the failure to report activity could be a red herring. It would be good to know what activity is reported, though: is the user considered active by the server, and is that why JupyterHub isn't culling the server, or does JupyterHub consider the user's network activity recent, and is that why the user isn't culled? (I don't know how to figure out what jupyterhub-idle-culler thinks; even getting logs from it is hard.) |
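For what it's worth, here is a minimal sketch (assuming an admin-scoped API token and a hypothetical hub URL) of how one could ask the JupyterHub REST API what last activity it has recorded for each user and server:

```python
# Minimal sketch: ask the JupyterHub REST API what activity it has recorded.
# Assumes an admin-scoped token in JUPYTERHUB_API_TOKEN and a hypothetical hub URL.
import os
import requests

HUB_API = "https://ucmerced.2i2c.cloud/hub/api"  # hypothetical URL, adjust as needed
headers = {"Authorization": f"token {os.environ['JUPYTERHUB_API_TOKEN']}"}

for user in requests.get(f"{HUB_API}/users", headers=headers).json():
    # last_activity is updated both by proxied traffic and by the activity
    # reports the user server posts back to the hub
    print(user["name"], "last_activity:", user.get("last_activity"))
    for name, server in (user.get("servers") or {}).items():
        print("  server", repr(name), "last_activity:", server.get("last_activity"))
```

This at least shows what JupyterHub itself believes, which is the information jupyterhub-idle-culler acts on.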
There's a
|
Hmmmm... I think that could be what's shown if a request arrives at the Jupyter server without a cookie or query parameter token to associate with it. Not sure if it's relevant or not. It showed up quite soon after other activity. |
424 means that a user has a browser tab lost somewhere in the ether that's still making requests, even though the server has gone away. This used to be 503, but that was messing with error reporting metrics, so it was changed to 424 in jupyterhub/jupyterhub#3636. jupyterhub/jupyterhub#4491 could be related, and maybe the fix? |
Related on the jupyter forum: https://discourse.jupyter.org/t/idle-culler-not-culling/21301 |
Proposed test: I follow the instructions here https://infrastructure.2i2c.org/howto/custom-jupyterhub-image/ to create a new experimental hub image, which pins kubespawner to the most recent commit on main |
Sounds great to me @sgibson91!! |
I opened #3149, which demonstrates my steps for the proposal above. I have deployed this image, and hopefully the long-lived pods will be culled. |
I updated #3149 to use the correct image and redeployed, then checked that I could get a server, which I could. Now I'm just gonna watch the pods and see if they get cleaned up. |
So far, the new image doesn't seem to have changed anything |
Looking at the ucmerced hub, it doesn't stand out much except for the image it uses, I think. I'd go ahead and use staging.2i2c.cloud to try starting two different user servers, one using the ucmerced image and one using a modern image - will one but not the other be culled? I've gone ahead and done this now, where [email protected] uses jupyter/base-notebook:latest, and [email protected] uses the old image used in ucmerced. I won't access these servers at all, or have any web browser window open either. I'll check in at https://staging.2i2c.cloud/hub/admin later today, though, to see if they are culled. |
I updated the image this morning though? #3149 What is considered "old" and "new" here? |
Oh, I meant the user image, not the hub image. I've not touched that or anything else in ucmerced, but I've started an experiment with the user image specifically in staging. With "old" I meant that ucmerced's user-server image contains outdated software, because the image pins a lot of old stuff - so even a fresh build of the image gives us old software. |
I think what has gone wrong could be:
If this is the case, I think the fix could be to use a modern enough version of kubespawner going onwards, and then clean up the servers that are started in k8s but not known by JupyterHub to be running. We can't know for sure that we've resolved the issue this way; we can only hope that it doesn't occur again, and that using a modern kubespawner version was the fix. |
I've concluded this is an issue not only for ucmerced, but also for utoronto's prod hub. The utoronto hub also has a lot of user pods that aren't known by JupyterHub to be running, and because of this they can't be accessed via the proxy pod either - it only has routes configured for the user servers known to be running. I've grown more confident that this is caused by jupyterhub/kubespawner#742, and that the resolution is to clean up all stranded k8s pods and update to a new kubespawner version. |
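As a sanity check on the missing-routes part, here is a rough sketch (assuming z2jh's default proxy-api service name and port, and that it runs inside the hub pod where CONFIGPROXY_AUTH_TOKEN is set) of listing the routes the proxy actually knows about:

```python
# Rough sketch: list the routes configurable-http-proxy knows about, to compare
# against the user pods present in the namespace. Assumes it runs inside the hub
# pod, where CONFIGPROXY_AUTH_TOKEN is set and the z2jh "proxy-api" service
# (port 8001) is reachable.
import os
import requests

PROXY_API = "http://proxy-api:8001/api/routes"  # assumed z2jh service name and port
headers = {"Authorization": f"token {os.environ['CONFIGPROXY_AUTH_TOKEN']}"}

routes = requests.get(PROXY_API, headers=headers).json()
for path, route in sorted(routes.items()):
    # user-server routes look like /user/<name>/ and target the pod's IP
    print(path, "->", route.get("target"))
```

Any user pod without a corresponding /user/&lt;name&gt;/ route here would be unreachable through the proxy, matching what's described above.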
I've started writing a script to clean up our hubs, thinking that such a script is a relevant resource for anyone else reading about this bug in a z2jh or kubespawner changelog. |
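Not the actual script, but roughly the shape I'd expect it to take (assuming z2jh's default labels, kubespawner's default jupyter-&lt;username&gt; pod naming, an admin-scoped hub token, and kubectl credentials for the cluster):

```python
# Sketch of a cleanup approach: find singleuser-server pods in the namespace
# that JupyterHub has no record of, and (optionally) delete them.
# All names below (namespace, hub URL, label, pod naming) are assumptions based
# on z2jh/kubespawner defaults - adjust for the real deployment.
import os
import requests
from kubernetes import client, config

NAMESPACE = "ucmerced"                                  # assumed namespace
HUB_API = "https://ucmerced.2i2c.cloud/hub/api"         # hypothetical hub URL
headers = {"Authorization": f"token {os.environ['JUPYTERHUB_API_TOKEN']}"}

# Servers JupyterHub believes are running
users = requests.get(f"{HUB_API}/users", headers=headers).json()
known_pods = {
    f"jupyter-{user['name']}"                           # kubespawner's default pod naming
    for user in users
    if user.get("servers")
}

# User pods that actually exist in Kubernetes
config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(NAMESPACE, label_selector="component=singleuser-server")

for pod in pods.items:
    if pod.metadata.name not in known_pods:
        print("stranded pod, candidate for deletion:", pod.metadata.name)
        # v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)  # uncomment to actually delete
```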
So, to summarise the next action points:
|
I think this may have been closed by #3168. Please reopen if needed. |
@consideRatio, can you confirm if the UC Merced ones were cleaned up? Thanks! |
Well, this comment from @consideRatio seems to indicate UC Merced was done as well:
|
Yepp all hubs are cleaned up! |
I figure the jupyterhub-idle-culler has failed for some reason, perhaps because the servers still generate network activity regularly, or because their kernels are still considered active?
I'm not looking into this right now, as I'm already working on the "the cluster is slow" incident in #2947.
Related to #3042, where culling is configured to a max of 7 days as a precaution.
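For context, this is roughly how jupyterhub-idle-culler gets wired up as a hub-managed service (the z2jh chart generates equivalent config from its cull values; the numbers below are illustrative, not our actual settings):

```python
# Illustrative jupyterhub_config.py snippet, following the jupyterhub-idle-culler docs.
# The timeouts here are examples only, not the values deployed on these hubs.
import sys

c.JupyterHub.load_roles = [
    {
        "name": "jupyterhub-idle-culler-role",
        "scopes": [
            "list:users",
            "read:users:activity",
            "read:servers",
            "delete:servers",
        ],
        "services": ["jupyterhub-idle-culler-service"],
    }
]

c.JupyterHub.services = [
    {
        "name": "jupyterhub-idle-culler-service",
        "command": [
            sys.executable,
            "-m", "jupyterhub_idle_culler",
            "--timeout=3600",      # cull servers idle for more than an hour
            "--max-age=604800",    # cull servers older than 7 days even if active
        ],
    }
]
```

The culler only acts on servers the hub knows about and on the last_activity the hub has recorded, which would explain why pods the hub has no record of never get culled.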
Here is one pod, with the latest log entry from several days back.
Here are some hub logs: