2i2c/ucmerced have 34 user pods running for 6-7 days #3130

Closed
consideRatio opened this issue Sep 13, 2023 · 26 comments

@consideRatio commented Sep 13, 2023

I figure the jupyterhub idle-culling has failed for some reason, perhaps because these pods still generate network activity regularly, or because their kernels are still considered active?

I'm not looking into this right now; I'm already working on the "the cluster is slow" incident in #2947.

Related to #3042, where culling is configured to a max of 7 days as a precaution.
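
For reference, this is roughly how a 7-day max-age cull is wired up with jupyterhub-idle-culler in a jupyterhub_config.py. In our deployments the equivalent settings come from z2jh Helm values (cull.*), so treat this as an illustrative sketch only:

# illustrative jupyterhub_config.py snippet, not our actual config
import sys

c.JupyterHub.services = [
    {
        "name": "jupyterhub-idle-culler-service",
        "command": [
            sys.executable, "-m", "jupyterhub_idle_culler",
            "--timeout=3600",    # cull servers idle for more than an hour
            "--max-age=604800",  # cull any server older than 7 days, even if active
        ],
    }
]
c.JupyterHub.load_roles = [
    {
        "name": "jupyterhub-idle-culler-role",
        "scopes": ["list:users", "read:users:activity", "read:servers", "delete:servers"],
        "services": ["jupyterhub-idle-culler-service"],
    }
]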


Here is one pod, with the latest log entry from several days back.

[I 2023-09-07 03:53:53.735 SingleUserLabApp kernelmanager:321] Starting buffering for 9ae455a9-c907-46a7-b914-f72475f8c58d:aabf4f6a4aa249ea88f0d5ff2b0ed2de
[W 2023-09-07 04:41:24.819 SingleUserLabApp kernelmanager:637] Culling 'idle' kernel 'python3' (7537ac89-0bcd-480b-9c3a-456ebc7b8c04) with 0 connections due to 3791 seconds of inactivity.
[I 2023-09-07 04:41:24.823 SingleUserLabApp multikernelmanager:499] Kernel shutdown: 7537ac89-0bcd-480b-9c3a-456ebc7b8c04
[W 2023-09-07 04:51:24.820 SingleUserLabApp kernelmanager:637] Culling 'idle' kernel 'python3' (9ae455a9-c907-46a7-b914-f72475f8c58d) with 0 connections due to 3731 seconds of inactivity.
[I 2023-09-07 04:51:24.827 SingleUserLabApp multikernelmanager:499] Kernel shutdown: 9ae455a9-c907-46a7-b914-f72475f8c58d

Here are some hub logs:

[W 2023-09-13 15:10:14.237 JupyterHub base:1436] Failing suspected API request to not-running server: /hub/user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3
[W 2023-09-13 15:10:14.238 JupyterHub log:191] 424 GET /hub/user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3?1694617812678 ([email protected]@10.128.0.98) 10.03ms
[I 2023-09-13 15:10:18.847 JupyterHub log:191] 302 GET /user/[email protected]/api/kernels/a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55?1694617818742 -> /hub/user/[email protected]/api/kernels/a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55?1694617818742 (@10.128.0.116) 0.78ms
[W 2023-09-13 15:10:18.912 JupyterHub base:1436] Failing suspected API request to not-running server: /hub/user/[email protected]/api/kernels/a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55
[W 2023-09-13 15:10:18.913 JupyterHub log:191] 424 GET /hub/user/[email protected]/api/kernels/a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55?1694617818742 ([email protected]@10.128.0.116) 3.54ms
[I 2023-09-13 15:10:21.802 JupyterHub log:191] 200 POST /hub/api/users/[email protected]/activity ([email protected]@10.0.42.10) 39.79ms
[I 2023-09-13 15:10:27.170 JupyterHub log:191] 302 GET /user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3?1694617825675 -> /hub/user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3?1694617825675 (@10.128.0.17) 0.90ms
[W 2023-09-13 15:10:27.229 JupyterHub base:1436] Failing suspected API request to not-running server: /hub/user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3
[W 2023-09-13 15:10:27.230 JupyterHub log:191] 424 GET /hub/user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3?1694617825675 ([email protected]@10.128.0.17) 4.47ms
[I 2023-09-13 15:10:39.183 JupyterHub log:191] 302 GET /user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3?1694617837675 -> /hub/user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3?1694617837675 (@10.128.0.36) 0.86ms
[W 2023-09-13 15:10:39.243 JupyterHub base:1436] Failing suspected API request to not-running server: /hub/user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3
[W 2023-09-13 15:10:39.243 JupyterHub log:191] 424 GET /hub/user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3?1694617837675 ([email protected]@10.128.0.36) 3.03ms
[I 2023-09-13 15:10:39.476 JupyterHub log:191] 200 POST /hub/api/users/[email protected]/activity ([email protected]@10.0.22.115) 74.21ms
[I 2023-09-13 15:10:41.076 JupyterHub log:191] 200 GET /hub/metrics (@10.0.7.73) 7.21ms
[I 2023-09-13 15:10:43.190 JupyterHub log:191] 200 POST /hub/api/users/[email protected]/activity ([email protected]@10.0.22.137) 11.98ms
[I 2023-09-13 15:10:45.849 JupyterHub log:191] 302 GET /user/[email protected]/api/kernels/a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55?1694617845742 -> /hub/user/[email protected]/api/kernels/a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55?1694617845742 (@10.128.0.116) 0.70ms
[W 2023-09-13 15:10:45.925 JupyterHub base:1436] Failing suspected API request to not-running server: /hub/user/[email protected]/api/kernels/a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55
[W 2023-09-13 15:10:45.925 JupyterHub log:191] 424 GET /hub/user/[email protected]/api/kernels/a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55?1694617845742 ([email protected]@10.128.0.116) 15.18ms
[I 2023-09-13 15:10:45.936 JupyterHub log:191] 200 POST /hub/api/users/[email protected]/activity ([email protected]@10.0.22.152) 9.90ms
[I 2023-09-13 15:10:59.366 JupyterHub log:191] 200 POST /hub/api/users/[email protected]/activity ([email protected]@10.0.22.196) 17.76ms
[I 2023-09-13 15:11:00.075 JupyterHub log:191] 200 POST /hub/api/users/[email protected]/activity ([email protected]@10.0.22.74) 12.08ms
[I 2023-09-13 15:11:01.199 JupyterHub log:191] 302 GET /user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3?1694617859675 -> /hub/user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3?1694617859675 (@10.128.0.116) 0.72ms
[W 2023-09-13 15:11:01.513 JupyterHub base:1436] Failing suspected API request to not-running server: /hub/user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3
[W 2023-09-13 15:11:01.513 JupyterHub log:191] 424 GET /hub/user/[email protected]/api/kernels/87460ade-a29e-4419-8517-5ef1146adec3?1694617859675 ([email protected]@10.128.0.30) 15.73ms
[I 2023-09-13 15:11:02.846 JupyterHub log:191] 302 GET /user/[email protected]/api/kernels/a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55?1694617862740 -> /hub/user/[email protected]/api/kernels/a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55?1694617862740 (@10.128.0.116) 0.78ms
[W 2023-09-13 15:11:02.912 JupyterHub base:1436] Failing suspected API request to not-running server: /hub/user/[email protected]/api/kernels/a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55
[W 2023-09-13 15:11:02.913 JupyterHub log:191] 424 GET /hub/user/[email protected]/api/kernels/a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55?1694617862740 ([email protected]@10.128.0.116) 3.24ms
[I 2023-09-13 15:11:04.870 JupyterHub log:191] 200 POST /hub/api/users/[email protected]/activity ([email protected]@10.0.22.150) 20.00ms
[I 2023-09-13 15:11:07.231 JupyterHub log:191] 200 POST /hub/api/users/[email protected]/activity ([email protected]@10.0.22.169) 14.69ms
[I 2023-09-13 15:11:09.675 JupyterHub log:191] 200 POST /hub/api/users/[email protected]/activity ([email protected]@10.0.22.53) 11.83ms
[I 2023-09-13 15:11:12.278 JupyterHub log:191] 200 POST /hub/api/users/[email protected]/activity ([email protected]@10.0.22.102) 11.16ms
@sgibson91 commented:

So I searched for this error from the log snippets above: Failing suspected API request to not-running server

The JupyterHub documentation on this seems to suggest that there's nothing we need to do about it: https://jupyterhub.readthedocs.io/en/stable/howto/log-messages.html#failing-suspected-api-request-to-not-running-server

@sgibson91 commented:

I googled the 424 HTTP error from 424 GET /hub/user/[email protected]/api/kernels/a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55?1694617862740 ([email protected]@10.128.0.116) 3.24ms

https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/424 So it seems that another process failed, but which one?

@sgibson91 commented:

I also see this:

[W 2023-09-11 08:56:45.942 SingleUserLabApp zmqhandlers:227] WebSocket ping timeout after 119937 ms.
[I 2023-09-11 08:56:50.945 SingleUserLabApp kernelmanager:321] Starting buffering for a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55:37f58212-34a3-4875-9f05-0c359e1ac5cb
[E 2023-09-11 09:41:45.908 SingleUserLabApp mixins:573] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.7/site-packages/jupyterhub/singleuser/mixins.py", line 571, in notify
        await client.fetch(req)
    tornado.simple_httpclient.HTTPTimeoutError: Timeout while connecting
[W 2023-09-11 09:51:56.643 SingleUserLabApp kernelmanager:637] Culling 'idle' kernel 'ir' (a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55) with 0 connections due to 3610 seconds of inactivity.
[I 2023-09-11 09:51:56.644 SingleUserLabApp multikernelmanager:499] Kernel shutdown: a90d4fab-b4c4-42a5-b8dc-9c5a8f9dfc55

@consideRatio commented Sep 14, 2023

I know there can sometimes be intermittent errors when reporting activity, because the hub is being upgraded and temporarily down, etc., so chasing the failure to report activity could be a red herring.

It would be good to know what activity is being reported, though: is the server reporting the user as active, and that is why JupyterHub isn't culling the server, or is JupyterHub counting the user's network activity as activity, and that is why the user isn't culled?

(I don't know how to figure out what jupyterhub-idle-culler thinks; even logs from it are hard to get.)
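
For what it's worth, the idle culler works from the last_activity reported by the Hub's REST API, so one way to at least see that data is to query it directly. A minimal sketch, assuming an admin-scoped token and using the staging hub URL as a placeholder:

import datetime
import requests

HUB_API = "https://staging.2i2c.cloud/hub/api"  # placeholder hub URL
TOKEN = "REPLACE_WITH_ADMIN_SCOPED_TOKEN"       # placeholder token

resp = requests.get(f"{HUB_API}/users", headers={"Authorization": f"token {TOKEN}"})
resp.raise_for_status()

now = datetime.datetime.now(datetime.timezone.utc)
for user in resp.json():
    # "servers" is only populated for servers the Hub believes are running
    for server_name, server in (user.get("servers") or {}).items():
        last = server.get("last_activity")
        if not last:
            continue
        idle = now - datetime.datetime.fromisoformat(last.replace("Z", "+00:00"))
        print(f"{user['name']} server={server_name or 'default'} idle for {idle}")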

@sgibson91 commented:

There's a "No Hub user identified for request" message as well:

[I 2023-09-11 06:21:56.640 SingleUserLabApp kernelmanager:572] Culling kernels with idle durations > 3600 seconds at 300 second intervals ...
[I 2023-09-11 06:21:56.641 SingleUserLabApp kernelmanager:577] Culling kernels even with connected clients
[I 2023-09-11 06:21:56.643 SingleUserLabApp log:186] 201 POST /user/[email protected]/api/sessions?1694413316500 ([email protected]@10.128.0.17) 69.93ms
[I 2023-09-11 06:21:56.714 SingleUserLabApp log:186] 200 GET /user/[email protected]/api/sessions?1694413316638 ([email protected]@10.128.0.17) 3.20ms
[I 2023-09-11 06:21:56.792 SingleUserLabApp log:186] 200 PATCH /user/[email protected]/api/sessions/2489012b-9074-4204-9fd7-fd47c98166ed?1694413316716 ([email protected]@10.128.0.17) 2.57ms
[I 2023-09-11 06:21:56.799 SingleUserLabApp log:186] 200 GET /user/[email protected]/api/kernels?1694413316721 ([email protected]@10.128.0.17) 3.96ms
[I 2023-09-11 06:21:56.872 SingleUserLabApp log:186] 204 PUT /user/[email protected]/lab/api/workspaces/auto-J?1694413316787 ([email protected]@10.128.0.17) 11.43ms
[I 2023-09-11 06:21:56.877 SingleUserLabApp log:186] 200 PATCH /user/[email protected]/api/sessions/2489012b-9074-4204-9fd7-fd47c98166ed?1694413316790 ([email protected]@10.128.0.17) 3.19ms
[I 2023-09-11 06:21:56.878 SingleUserLabApp log:186] 200 GET /user/[email protected]/api/sessions?1694413316795 ([email protected]@10.128.0.17) 3.13ms
[I 2023-09-11 06:21:56.954 SingleUserLabApp log:186] 200 GET /user/[email protected]/api/kernelspecs?1694413316874 ([email protected]@10.128.0.17) 4.67ms
[I 2023-09-11 06:21:57.399 SingleUserLabApp log:186] 200 GET /user/[email protected]/api/terminals?1694413317313 ([email protected]@10.128.0.17) 2.75ms
[I 2023-09-11 06:21:57.504 SingleUserLabApp log:186] 204 PUT /user/[email protected]/lab/api/workspaces/auto-J?1694413317419 ([email protected]@10.128.0.17) 10.42ms
[I 2023-09-11 06:21:57.641 SingleUserLabApp log:186] 101 GET /user/[email protected]/api/kernels/21ca051b-75db-4380-88d2-da3a6180cf5b/channels?session_id=978bc379-a34a-43b0-b791-842f2e3eeb3f ([email protected]@10.128.0.35) 732.33ms
[I 2023-09-11 06:21:57.829 SingleUserLabApp kernelmanager:321] Starting buffering for 21ca051b-75db-4380-88d2-da3a6180cf5b:978bc379-a34a-43b0-b791-842f2e3eeb3f
[I 2023-09-11 06:21:57.850 SingleUserLabApp log:186] 200 GET /user/[email protected]/api/sessions?1694413317772 ([email protected]@10.128.0.17) 1.96ms
[I 2023-09-11 06:21:57.852 SingleUserLabApp log:186] 200 GET /user/[email protected]/api/kernels?1694413317777 ([email protected]@10.128.0.17) 1.60ms
[I 2023-09-11 06:21:57.855 SingleUserLabApp log:186] 101 GET /user/[email protected]/api/kernels/21ca051b-75db-4380-88d2-da3a6180cf5b/channels?session_id=196aba3d-7c1d-463b-99df-efcf0c14401d ([email protected]@10.128.0.116) 2.45ms
[I 2023-09-11 06:21:58.059 SingleUserLabApp log:186] 101 GET /user/[email protected]/api/kernels/21ca051b-75db-4380-88d2-da3a6180cf5b/channels?session_id=d62f6dd4-1d4b-4890-afff-fff24d8f6879 ([email protected]@10.128.0.57) 2.67ms
[I 2023-09-11 06:21:59.147 SingleUserLabApp log:186] 200 GET /user/[email protected]/api/metrics/v1?1694413319072 ([email protected]@10.128.0.17) 2.37ms
[I 2023-09-11 06:22:04.227 SingleUserLabApp log:186] 200 GET /user/[email protected]/api/metrics/v1?1694413324149 ([email protected]@10.128.0.17) 3.04ms
[I 2023-09-11 06:22:06.390 SingleUserLabApp log:186] 200 GET /user/[email protected]/api/contents?content=1&1694413326309 ([email protected]@10.128.0.17) 7.91ms
[W 2023-09-11 06:22:10.965 SingleUserLabApp auth:449] No Hub user identified for request

@consideRatio commented:

Hmmmm... I think that could be what's shown if a request arrives at the jupyter server without a cookie or query-parameter token to associate with it. Not sure if it's relevant or not; it showed up quite soon after other activity.

@yuvipanda commented:

424 means that a user has a browser tab lost somewhere in the ether that's still making requests, even though the server has gone away. This used to be a 503 earlier, but that was messing with error-reporting metrics, so it was changed to 424 in jupyterhub/jupyterhub#3636.

jupyterhub/jupyterhub#4491 could be related, and possibly the fix?

@consideRatio commented:

Related thread on the Jupyter forum: https://discourse.jupyter.org/t/idle-culler-not-culling/21301

@sgibson91 commented:

Proposed test

I follow the instructions at https://infrastructure.2i2c.org/howto/custom-jupyterhub-image/ to create a new experimental hub image that pins kubespawner to the most recent commit on main, as suggested by Simon in https://discourse.jupyter.org/t/idle-culler-not-culling/21301/4. We deploy this experimental image to ucmerced and see if the long-running pods are cleaned up?

@consideRatio commented:

Sounds great to me @sgibson91!!

@sgibson91 commented:

I opened #3149, which demonstrates my steps for the above. I have deployed this image, and hopefully the long-lived pods will be culled.

@sgibson91 commented:

I updated #3149 to change the correct image and redeployed, then checked that I could get a server, which I could. Now just gonna watch the pods and see if they get cleaned up.

@sgibson91 commented:

So far, the new image doesn't seem to have changed anything.

@consideRatio commented Sep 18, 2023

Looking at the ucmerced hub, it doesn't stand out much except for the image it uses, I think.

I'll go ahead and use staging.2i2c.cloud to try starting two different user servers, one using the ucmerced image and one using a modern image. Will one but not the other be culled?

I've gone ahead and done this now, where [email protected] uses jupyter/base-notebook:latest, and [email protected] uses the old image used in ucmerced.

(screenshot)

I won't access these servers at all, or keep any web browser window open now either. I'll check in at https://staging.2i2c.cloud/hub/admin later today, though, to see if they are culled.

@consideRatio commented:

Oh... JupyterHub doesn't consider the currently running servers to be relevant for culling, though. It sees them as not running at all.

(screenshot)

@sgibson91 commented:

uses the old image used in ucmerced

I updated the image this morning though? #3149

What is considered "old" and "new" here?

@consideRatio commented:

uses the old image used in ucmerced

I updated the image this morning though? #3149

What is considered "old" and "new" here?

Oh, I meant the user image, not the hub image. I've not touched that or anything else in ucmerced, but I've started an experiment with the user image specifically in staging. With "old" I meant that ucmerced's user-server image is using outdated software, because the image itself has pinned a lot of old stuff, so even a new build of the image gives us old software in it.

@sgibson91 commented:

Too many images!

@consideRatio commented:

I think what has gone wrong could be:

  1. Servers were running in ucmerced, everything was fine
  2. The hub restarted with the bug discussed in "return poll status after first load finish" (jupyterhub/kubespawner#742), and that led to JupyterHub losing track of started servers.
  3. Use of a modern JupyterHub at a later time didn't change anything, because the running servers aren't going to be detected as servers that need to be shut down or similar. jupyterhub-idle-culler only considers shutting down started servers, but JupyterHub doesn't see them as running!

If this is the case, I think the fix could be to use a modern enough version of kubespawner going forward, and then clean up the servers that are started in k8s but not known by JupyterHub to be started. We can't know for sure that we've resolved the issue like this; we can only hope that it doesn't occur again, and that the fix was to use a modern kubespawner version.

@consideRatio commented:

I've concluded this is an issue not only for ucmerced, but also for utoronto's prod hub.

The utoronto hub also has a lot of user pods that aren't known by JupyterHub to be running, and because of this they can't be accessed via the proxy pod either: it only has routes configured for the user servers known to be running. I'm growing more confident that this is caused by jupyterhub/kubespawner#742, and that the resolution is to clean up all stranded k8s pods and update to a new kubespawner version.

@consideRatio commented:

I've started writing a script to clean up our hubs, thinking that such a script is a relevant resource for anyone else reading about this bug in a z2jh or kubespawner changelog.
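
To illustrate the approach (this is not the actual script, just a minimal sketch; the namespace, token, and pod naming below are assumptions based on z2jh defaults): list the singleuser pods in the namespace, compare them with the servers JupyterHub believes are running, and flag the strays for deletion.

import requests
from kubernetes import client, config

NAMESPACE = "ucmerced"                     # assumed namespace
HUB_API = "http://hub:8081/hub/api"        # hub API as reachable in-cluster
TOKEN = "REPLACE_WITH_ADMIN_SCOPED_TOKEN"  # placeholder token

# Servers JupyterHub believes are running
users = requests.get(
    f"{HUB_API}/users", headers={"Authorization": f"token {TOKEN}"}
).json()
# Assumes kubespawner's default jupyter-<username> pod naming; usernames with
# special characters (like emails) get escaped, so adjust for real deployments.
known_pods = {f"jupyter-{u['name']}" for u in users if u.get("servers")}

# Pods actually running in the namespace
config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(NAMESPACE, label_selector="component=singleuser-server")

for pod in pods.items:
    if pod.metadata.name not in known_pods:
        print("stranded pod not tracked by the hub:", pod.metadata.name)
        # v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)  # uncomment to clean up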

@sgibson91 commented Sep 19, 2023

So to summarise the next action points:

@damianavila moved this from Todo 👍 to Waiting 🕛 in Sprint Board, Sep 25, 2023
@sgibson91 commented:

I think this may have been closed by #3168. Please reopen if needed.

@github-project-automation bot moved this from Waiting 🕛 to Done 🎉 in Sprint Board, Sep 27, 2023
@damianavila commented:

@consideRatio, can you confirm if the UC Merced ones were cleaned up? Thanks!

@damianavila commented Oct 3, 2023

Well, this comment from @consideRatio seems to indicate UC Merced was done as well:

I've tested this cleanup logic across all hubs in the 2i2c and utoronto clusters, and it seems to work out fine. It is meant to be applied, and then removed once it's run once.

@consideRatio commented:

Yepp all hubs are cleaned up!
