This repository has been archived by the owner on Sep 3, 2022. It is now read-only.

Datalab won't connect to VM instance after long time waiting for it to be reachable at port 8081 #2124

Open
miguel2488 opened this issue Mar 15, 2019 · 9 comments

Comments

@miguel2488

Hi,

I've been working on this for days and have read a lot on Google about this issue, but I couldn't find anything that helps me solve it.

I created a Datalab instance from the gcloud shell like this:

datalab create --image-name c2-deeplearning-tf-1-13-cu100-20190227 --disk-size-gb 100 --machine-type n1-standard-8 my-instance --network-name my-net-01 --zone europe-west1-b

It all works fine: I'm asked to create a passphrase, the RSA keys are propagated, and then I get this message of death:

Waiting for Datalab to be reachable at http://localhost:8081/

I can SSH to the vm instance using the button to the right, or using gcloud compute ssh instance. No problems with that.

Running the datalab connect command with --ssh-log-level=debug, I get thousands of messages like this one:

[Screenshot of the SSH debug messages]

It walks through all the ports trying to connect to port 8081 but never succeeds, so after a long wait I finally get this message:

connection closed
attempting to reconnect

and the whole process starts again from the beginning.

This is a screenshot of my firewall rules:

[Screenshot of the firewall rules]

I think everything is OK here. What am I missing? Where's the problem? Can someone help, please? I've been stuck here for over a week now, so any help will be much appreciated.

Thank you very much in advance.
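
For anyone reproducing this, here is a minimal sketch for checking from the local machine whether anything answers on the tunnel port while datalab connect is running in another terminal. The function name is_reachable is made up for illustration, and the check only tells you that something responds at the URL; it cannot distinguish a broken SSH tunnel from a container that has not started yet.

```shell
# Sketch only: is_reachable is a hypothetical helper, not part of the datalab CLI.
# is_reachable URL [TIMEOUT_SECONDS]
# Succeeds if an HTTP server answers at URL (any status code), fails otherwise.
is_reachable() {
  # --noproxy '*' bypasses any configured proxy, since the target is local
  curl --silent --output /dev/null --noproxy '*' --max-time "${2:-5}" "$1"
}

# Example (run locally while `datalab connect` is waiting):
#   is_reachable http://localhost:8081/ && echo "something is answering" \
#     || echo "nothing is answering on 8081 yet"
```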

@antellgc

Having the same problems here. @miguel2488 have you had any luck with a fix?

@miguel2488
Author

Nope, nothing new here. I wasn't able to fix it, since I don't have a clue where the problem is coming from. Instead of using Datalab, I resigned myself to running Jupyter notebooks on the machine. I'm totally blind with this, and from what I've seen so far, no one seems to care about this thread. I wish you better luck.

@hacktuarial

I had the same problem, and observed that the container running jupyter on the VM took ~5 minutes to start up. My workaround was:

  • datalab create ... --ssh-log-level=debug
  • wait for the Connection refused messages to begin
  • CTRL+c to kill it
  • gcloud compute ssh ..., then run docker ps every 1-2 minutes until the logger and datalab containers appear
  • datalab connect ...

Then I was able to use datalab in the normal way.
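
The "run docker ps every 1-2 minutes" step can be sketched as a small polling loop to run on the VM after gcloud compute ssh. This is only an illustration: wait_for_container and the DOCKER override variable are hypothetical names, and the container name datalab is taken from the description above.

```shell
# Sketch only: poll `docker ps` until a container with the given name is running.
# DOCKER is a hypothetical override; use DOCKER="sudo docker" if docker needs root.
DOCKER="${DOCKER:-docker}"

# wait_for_container NAME [INTERVAL_SECONDS] [MAX_TRIES]
# Returns 0 once NAME appears in `docker ps`, 1 if it never does.
# Plain POSIX sh, so the helper variables below are not function-local.
wait_for_container() {
  name=$1
  interval=${2:-60}
  max_tries=${3:-15}
  try=1
  while [ "$try" -le "$max_tries" ]; do
    # --format '{{.Names}}' prints one running container name per line
    if $DOCKER ps --format '{{.Names}}' | grep -qxF -- "$name"; then
      echo "container '$name' is running (check $try)"
      return 0
    fi
    echo "container '$name' not running yet (check $try/$max_tries)"
    if [ "$try" -lt "$max_tries" ]; then
      sleep "$interval"
    fi
    try=$((try + 1))
  done
  return 1
}

# Example: check once a minute for up to 15 minutes, then reconnect locally.
#   wait_for_container datalab 60 15 && echo "now run: datalab connect ..."
```

With the defaults this checks for roughly 15 minutes, which comfortably covers the ~5 minute startup mentioned above; adjust the interval and number of tries as needed.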

@MchlUh

MchlUh commented Feb 5, 2020

Hello hacktuarial,
I have the same issue and tried your solution, but the datalab container never appears for me.
Did you simply run gcloud compute ssh ...(name of instance)?
Thanks for your help!

@hacktuarial

Yes, that's what I ran. Can you post a sample of your ssh logs? It sounds like the problem may be with the datalab create command.

@MchlUh

MchlUh commented Feb 5, 2020

Until now I was using a datalab connect ... command; this time I tried datalab create ... instead.
It works exactly as you said: the logger and datalab containers appeared!

It may have something to do with the way I originally created my instance. At the time I used:
datalab beta create-gpu datalab-instance-name

Anyway, I am now able to use Datalab !
Thanks :)

@MchlUh

MchlUh commented Feb 5, 2020

It seems that when creating an instance with a GPU, the same problem appears but this solution does not work.
It has now been an hour since I created the instance, and docker ps only shows the logger container but no datalab container.

@chanyou0311

I have a problem similar to @MichaelTheBrute's.

I tried to launch an instance of Datalab with the command below.

$ datalab beta create-gpu --machine-type n1-standard-4 --zone us-west1-b --accelerator-type nvidia-tesla-k80 --accelerator-count 1 datalab-instance
By accepting below, you will download and install the
following third-party software onto your managed GCE instances:
    NVidia GPU Driver: NVIDIA-Linux-x86_64-390.46
Do you accept (y/N)?: y
Creating the disk datalab-instance-pd
Creating the instance datalab-instance

Due to GPU Driver installation, please note that Datalab GPU instances take significantly longer to startup compared to non-GPU instances.
Created [https://www.googleapis.com/compute/beta/projects/xxxxxxxx/zones/us-west1-b/instances/datalab-instance].
Connecting to datalab-instance.
This will create an SSH tunnel and may prompt you to create an rsa key pair. To manage these keys, see https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys
Waiting for Datalab to be reachable at http://localhost:8081/

However, there was still no response after more than 30 minutes.
I had read that startup takes about 15 minutes, but this still seemed too long.

I made an ssh connection to the instance and started investigating.
As discussed before, I also ran the docker ps command.

datalab@datalab-instance ~ $ sudo docker ps -a
CONTAINER ID        IMAGE                                         COMMAND                  CREATED             STATUS              PORTS               NAMES
4994361cf048        gcr.io/google-containers/fluentd-gcp:2.0.17   "/bin/sh -c '/run.sh…"   19 minutes ago      Up 19 minutes       80/tcp              logger

The datalab container was not running.
However, when I ran the same command a few minutes later, I saw a container using the gcr.io/cloud-datalab/datalab-gpu:latest image, but only that one time.
(I forgot to take notes.)
Since then, I have not seen the container again.

Since CPU instances work correctly, I suspected that the GPU was not being set up correctly.
The GPU setup seems to be done in the startup script, so I checked that the script finished successfully.

datalab@datalab-instance ~ $ systemctl status google-startup-scripts.service
● google-startup-scripts.service - Google Compute Engine Startup Scripts
   Loaded: loaded (/usr/lib/systemd/system/google-startup-scripts.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2020-02-11 07:27:30 UTC; 34min ago
 Main PID: 421 (code=exited, status=0/SUCCESS)
      CPU: 881ms

I also checked the log with the journalctl command, and the script appeared to have finished successfully.

While doing so, I noticed that wait-for-startup-script.service had failed.

datalab@datalab-instance ~ $ systemctl --failed
  UNIT                            LOAD   ACTIVE SUB    DESCRIPTION
● wait-for-startup-script.service loaded failed failed Wait for the startup script to setup required directories
datalab@datalab-instance ~ $ sudo journalctl -u wait-for-startup-script.service
-- Logs begin at Tue 2020-02-11 06:59:19 UTC, end at Tue 2020-02-11 08:05:27 UTC. --
Feb 11 06:59:34 datalab-instance systemd[1]: Starting Wait for the startup script to setup required directories...
Feb 11 06:59:34 datalab-instance docker-credential-gcr[768]: ERROR: Unable to save docker config: mkdir /root/.docker: read-only file system
Feb 11 06:59:34 datalab-instance systemd[1]: wait-for-startup-script.service: Control process exited, code=exited status=1
Feb 11 06:59:34 datalab-instance systemd[1]: wait-for-startup-script.service: Failed with result 'exit-code'.
Feb 11 06:59:34 datalab-instance systemd[1]: Failed to start Wait for the startup script to setup required directories.
Feb 11 06:59:34 datalab-instance systemd[1]: wait-for-startup-script.service: Consumed 82ms CPU time
Feb 11 06:59:34 datalab-instance systemd[1]: Starting Wait for the startup script to setup required directories...
Feb 11 06:59:34 datalab-instance docker-credential-gcr[792]: ERROR: Unable to save docker config: mkdir /root/.docker: read-only file system
Feb 11 06:59:34 datalab-instance systemd[1]: wait-for-startup-script.service: Control process exited, code=exited status=1
Feb 11 06:59:34 datalab-instance systemd[1]: wait-for-startup-script.service: Failed with result 'exit-code'.
Feb 11 06:59:34 datalab-instance systemd[1]: Failed to start Wait for the startup script to setup required directories.
Feb 11 06:59:34 datalab-instance systemd[1]: wait-for-startup-script.service: Consumed 94ms CPU time

The log confirms that an error occurred in docker-credential-gcr: it could not create /root/.docker because the file system is read-only.
I don't understand what role this plays in the startup script, but I hope it helps.

I will continue to investigate.
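
The checks above can be repeated on another instance with a short sketch like the following. failed_units and count_readonly_errors are hypothetical helper names; the only commands assumed are systemctl, journalctl, awk and grep as used in this comment.

```shell
# Sketch only: helper names are made up for illustration.

# failed_units: print the name of every failed systemd unit, one per line.
failed_units() {
  # --no-legend drops the header/footer, --plain drops the leading bullet
  systemctl --failed --no-legend --plain | awk '{print $1}'
}

# count_readonly_errors: read journal text on stdin and print how many
# lines report a "read-only file system" failure.
count_readonly_errors() {
  # grep -c exits non-zero when the count is 0, so mask that status
  grep -c 'read-only file system' || true
}

# Example (on the VM):
#   for unit in $(failed_units); do
#     echo "$unit: $(sudo journalctl -u "$unit" --no-pager | count_readonly_errors) read-only errors"
#   done
```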

@chanyou0311

This may be related to pull request #2147.
