JupyterHub/KubeSpawner lost track of started user server, pod kept running #697

Closed

darabos opened this issue Dec 16, 2022 · 3 comments

darabos commented Dec 16, 2022

Bug description

Our Zero-to-JupyterHub 2.0.0 instance running on GKE (1.24.5-gke.600) failed to start a user server with this error message:

Multi-Attach error for volume "pvc-27f9f2c0-f18d-449f-a9c3-b0d909bd0ea0"
Volume is already exclusively attached to one node and can't be attached to another

The JupyterHub admin interface does not show the server as running, but in GKE the pod is still running, days after the incident. So the pod was leaked. (We actually have 4 cases of this.)
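
To find which pod is still holding a claim like this, here is a minimal sketch, assuming the official kubernetes Python client and cluster access. The namespace jh and the claim name are taken from the logs below; adjust both for your deployment.

# find_pvc_holder.py (illustrative sketch only)
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

claim = "claim-XXXXXX-40mail-2etau-2eac-2eil"  # claim name from the logs below
for pod in v1.list_namespaced_pod(namespace="jh").items:
    for vol in pod.spec.volumes or []:
        pvc = vol.persistent_volume_claim
        if pvc and pvc.claim_name == claim:
            print(pod.metadata.name, pod.status.phase, pod.spec.node_name)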

In the JupyterHub logs I see the following for this user:

[I 2022-12-12 09:15:45.005 JupyterHub roles:238] Adding role user for User: [email protected]
[I 2022-12-12 09:15:45.021 JupyterHub base:810] User logged in: [email protected]
[I 2022-12-12 09:15:45.192 JupyterHub log:186] 302 GET /hub/ -> /hub/spawn ([email protected]@10.128.0.69) 13.12ms
[I 2022-12-12 09:15:45.367 JupyterHub provider:651] Creating oauth client jupyterhub-user-XXXXXX%40mail.tau.ac.il
[I 2022-12-12 09:15:45.387 JupyterHub spawner:2509] Attempting to create pvc claim-XXXXXX-40mail-2etau-2eac-2eil, with timeout 3
[I 2022-12-12 09:15:45.389 JupyterHub log:186] 302 GET /hub/spawn -> /hub/spawn-pending/[email protected] ([email protected]@10.128.0.69) 38.38ms
[I 2022-12-12 09:15:45.423 JupyterHub spawner:2469] Attempting to create pod jupyter-XXXXXX-40mail-2etau-2eac-2eil, with timeout 3
[I 2022-12-12 09:15:45.550 JupyterHub pages:394] [email protected] is pending spawn
[I 2022-12-12 09:15:45.552 JupyterHub log:186] 200 GET /hub/spawn-pending/[email protected] ([email protected]@10.128.0.69) 4.76ms
[I 2022-12-12 09:16:24.943 JupyterHub base:963] User [email protected] took 39.588 seconds to start
[I 2022-12-12 09:16:24.943 JupyterHub proxy:333] Adding user [email protected] to proxy /user/[email protected]/ => http://10.108.1.31:8888
[I 2022-12-12 09:16:24.947 JupyterHub users:749] Server [email protected] is ready
[I 2022-12-12 09:16:24.948 JupyterHub log:186] 200 GET /hub/api/users/[email protected]/server/progress ([email protected]@10.128.0.69) 38939.51ms
[I 2022-12-12 09:16:25.114 JupyterHub log:186] 302 GET /hub/spawn-pending/[email protected] -> /user/[email protected]/ ([email protected]@10.128.0.69) 4.88ms
[I 221212 10:16:40 __init__:236] Culling server [email protected] (inactive for 01:00:14)
[I 2022-12-12 10:16:40.036 JupyterHub proxy:359] Removing user [email protected] from proxy (/user/[email protected]/)
[I 2022-12-12 10:16:40.039 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 10:16:43.062 JupyterHub base:1170] User [email protected] server took 3.025 seconds to stop
[I 2022-12-12 10:16:43.063 JupyterHub log:186] 204 DELETE /hub/api/users/XXXXXX%40mail.tau.ac.il/server ([email protected]) 3029.02ms
[I 2022-12-12 15:28:20.375 JupyterHub log:186] 302 GET /user/[email protected]/lab -> /hub/user/[email protected]/lab (@10.128.0.69) 1.01ms
[W 2022-12-12 15:28:20.593 JupyterHub log:186] 424 GET /hub/user/[email protected]/lab ([email protected]@10.128.0.69) 14.60ms
[I 2022-12-12 15:28:23.322 JupyterHub provider:651] Creating oauth client jupyterhub-user-XXXXXX%40mail.tau.ac.il
[I 2022-12-12 15:28:23.346 JupyterHub spawner:2509] Attempting to create pvc claim-XXXXXX-40mail-2etau-2eac-2eil, with timeout 3
[I 2022-12-12 15:28:23.349 JupyterHub log:186] 302 GET /hub/spawn/[email protected]?next=%2Fhub%2Fuser%2FXXXXXX%40mail.tau.ac.il%2Flab -> /hub/spawn-pending/[email protected]?next=%2Fhub%2Fuser%2FXXXXXX%40mail.tau.ac.il%2Flab ([email protected]@10.128.0.69) 46.38ms
[I 2022-12-12 15:28:23.363 JupyterHub spawner:2525] PVC claim-XXXXXX-40mail-2etau-2eac-2eil already exists, so did not create new pvc.
[I 2022-12-12 15:28:23.370 JupyterHub spawner:2469] Attempting to create pod jupyter-XXXXXX-40mail-2etau-2eac-2eil, with timeout 3
[I 2022-12-12 15:28:23.553 JupyterHub pages:394] [email protected] is pending spawn
[I 2022-12-12 15:28:23.556 JupyterHub log:186] 200 GET /hub/spawn-pending/[email protected]?next=%2Fhub%2Fuser%2FXXXXXX%40mail.tau.ac.il%2Flab ([email protected]@10.128.0.69) 9.76ms
[I 2022-12-12 15:31:36.825 JupyterHub log:186] 302 GET /hub/ -> /user/[email protected]/ ([email protected]@10.128.0.69) 8.11ms
[I 2022-12-12 15:31:39.064 JupyterHub log:186] 302 GET /user/[email protected]/files/Go%20to%20LynxKite.html -> /hub/user/[email protected]/files/Go%20to%20LynxKite.html (@10.128.0.69) 1.08ms
[I 2022-12-12 15:31:39.279 JupyterHub log:186] 303 GET /hub/user/[email protected]/files/Go%20to%20LynxKite.html ([email protected]@10.128.0.69) 4.71ms
[I 2022-12-12 15:31:39.502 JupyterHub pages:394] [email protected] is pending spawn
[I 2022-12-12 15:31:39.505 JupyterHub log:186] 200 GET /hub/spawn-pending/[email protected]?next=%2Fhub%2Fuser%2FXXXXXX%40mail.tau.ac.il%2Ffiles%2FGo%2520to%2520LynxKite.html ([email protected]@10.128.0.69) 7.97ms
[W 2022-12-12 15:33:23.346 JupyterHub user:824] [email protected]'s server failed to start in 300 seconds, giving up.
[I 2022-12-12 15:33:23.347 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:33:26.518 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:33:29.865 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:33:33.638 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:33:37.538 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:33:41.187 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[E 2022-12-12 15:33:41.229 JupyterHub gen:630] Exception in Future <Task finished name='Task-6089322' coro=<KubeSpawner._start() done, defined at /usr/local/lib/python3.9/site-packages/kubespawner/spawner.py:2621> exception=TimeoutError('pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil did not start in 300 seconds!')> after timeout
    asyncio.exceptions.TimeoutError: pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil did not start in 300 seconds!
[I 2022-12-12 15:33:44.551 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:33:44.568 JupyterHub pages:394] [email protected] is pending stop
[I 2022-12-12 15:33:44.570 JupyterHub log:186] 200 GET /hub/spawn-pending/[email protected]?next=%2Fhub%2Fuser%2FXXXXXX%40mail.tau.ac.il%2Flab ([email protected]@10.128.0.69) 11.14ms
[E 2022-12-12 15:34:50.986 JupyterHub app:2461] Failed to poll spawner for [email protected], assuming the spawner is not running.
[W 2022-12-12 15:34:50.987 JupyterHub app:2512] [email protected] appears to have stopped while the Hub was down
[W 2022-12-12 15:41:49.964 JupyterHub user:881] [email protected]'s server never showed up at http://hub-65fd484dc-cbx7j:60937/user/[email protected]/ after 300 seconds. Giving up.
[W 2022-12-12 15:41:49.965 JupyterHub spawner:2258] Pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil url changed! hub-65fd484dc-cbx7j:60937 -> 10.108.20.3:8888
[I 2022-12-12 15:41:49.974 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:41:53.017 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:41:56.367 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:42:00.034 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:42:03.956 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil

(I've replaced the user name with XXXXXX. Sorry for the long block.)

The user's server first started normally at 9:15. It was shut down for inactivity at 10:16. At 15:28 the user came back and JupyterHub tried to start the server again. Startup failed, and at 15:33 JupyterHub started deleting the pod. It kept trying for 10 minutes, but partway through it decided that the server "appears to have stopped while the Hub was down".

The Hub was down? Yes, looks like it went down at 15:33:

[C 2022-12-12 15:33:43.937 JupyterHub base:1037] Aborting due to 5 consecutive spawn failures

My guess is that the restart glitched the cleanup process and the pod was left running. The next time the user tried to start their server, they got the "Volume is already exclusively attached" error because the volume was still attached to the leaked pod.
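
Until the root cause is fixed, something like the following could run out of band to spot leaked pods: compare the pods KubeSpawner labels as singleuser servers against the servers the Hub believes are running. This is only a hedged sketch; it assumes the kubernetes and requests Python packages, an admin-scoped API token, the default z2jh hub service address, and KubeSpawner's component=singleuser-server label and username annotation.

# leak_check.py (illustrative sketch only)
import requests
from kubernetes import client, config

HUB_API = "http://hub:8081/hub/api"  # assumption: default z2jh hub service address
TOKEN = "..."                        # assumption: an admin-scoped JupyterHub API token

config.load_incluster_config()
v1 = client.CoreV1Api()

# Servers the Hub believes are running (REST API user models).
users = requests.get(
    f"{HUB_API}/users", headers={"Authorization": f"token {TOKEN}"}
).json()
active = {u["name"] for u in users if u.get("servers")}

# Pods KubeSpawner actually started; it labels them component=singleuser-server
# and annotates them with the username (verify the annotation on your version).
pods = v1.list_namespaced_pod("jh", label_selector="component=singleuser-server")
for pod in pods.items:
    user = (pod.metadata.annotations or {}).get("hub.jupyter.org/username")
    if user not in active:
        print("possibly leaked pod:", pod.metadata.name)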

Expected behaviour

The startup failure may be my fault, or GKE may have had a slow day. Either way, JupyterHub shouldn't leave pods running.

How to reproduce

Seems very difficult. I have saved the JupyterHub logs from the incident. They include some personal email addresses, but I can share them privately.

Your personal set up

It's a normal Zero-to-JupyterHub on GKE.

# config.yaml
proxy:
  secretToken: ...
  https:
    enabled: true
    hosts: ...
    letsencrypt: ...
hub:
  config:
    JupyterHub:
      authenticator_class: google
    GoogleOAuthenticator: ...
  extraConfig:
    extra: |
      c.KubeSpawner.http_timeout = 300
singleuser: (a custom image)
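
For context, the two 300-second windows in the logs above (pod start, then HTTP readiness) correspond to spawner timeouts. A sketch of the relevant settings, with values as implied by the log messages:

# extraConfig sketch: values as implied by the log messages above
c.KubeSpawner.start_timeout = 300  # "did not start in 300 seconds!"
c.KubeSpawner.http_timeout = 300   # "never showed up at ... after 300 seconds. Giving up."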

Thanks for Zero-to-JupyterHub! It's the best!

darabos added the bug label Dec 16, 2022
welcome bot commented Dec 16, 2022

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉


consideRatio transferred this issue from jupyterhub/zero-to-jupyterhub-k8s Feb 3, 2023
consideRatio changed the title from "Leaked instance" to "JupyterHub/KubeSpawner lost track of started user server, pod kept running" Feb 3, 2023
jabbera commented Mar 26, 2024

This was fixed here: #742
