Agent fails to install SSH server when running in venv/Conda #3

Open
norrishd opened this issue Jul 12, 2021 · 5 comments

Comments

@norrishd

norrishd commented Jul 12, 2021

I've followed fairly straightforward steps to install a ClearML agent and connect to it using clearml-session, but get the following output:

Installing SSH Server on ip-172-31-4-42 [172.31.4.42]
Unable to load host key "/home/ubuntu/.clearml/venvs-builds/3.8/code/.clearml_session_sshd/ssh_host_rsa_key.pub": invalid format
Unable to load host key: /home/ubuntu/.clearml/venvs-builds/3.8/code/.clearml_session_sshd/ssh_host_rsa_key.pub
Unable to load host key "/home/ubuntu/.clearml/venvs-builds/3.8/code/.clearml_session_sshd/ssh_host_ecdsa_key.pub": invalid format
Unable to load host key: /home/ubuntu/.clearml/venvs-builds/3.8/code/.clearml_session_sshd/ssh_host_ecdsa_key.pub
Unable to load host key: /home/ubuntu/.clearml/venvs-builds/3.8/code/.clearml_session_sshd/ssh_host_ed25519_key.pub
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring

On the client side I then get:

Password: Error: incorrect password
Please enter password manually:

Any suggestions? Would the recommendation be to install/run the ClearML agent as root and/or using the system Python?

Steps to reproduce

On the agent:

# System: Ubuntu Focal 20.04, AMD64
# Install Miniconda, then
conda create -n clearml python=3.8
pip install clearml-agent
clearml-agent init
# Copy/paste credentials obtained from ClearML server
clearml-agent daemon --queue default --foreground

Then on the client:

clearml-session --public_ip true

# {
#     "base_task_id": null,
#     "git_credentials": false,
#     "jupyter_lab": true,
#     "password": "<long random-looking password>",
#     "public_ip": true,
#     "queue": "default",
#     "vscode_server": true
# }

norrishd changed the title from "Agent fails to install SSH server when install in venv/Conda" to "Agent fails to install SSH server when running in venv/Conda" on Jul 12, 2021

@jkhenning
Member

Hi @norrishd ,

Thanks for the details - I'll try to reproduce and update as soon as possible!

@bmartinn
Member

Hi @norrishd
The agent has no permissions to install the SSH server when running inside venv/conda.
I'm not sure how we can support it without having root access for it.
If an SSH daemon is already installed, it should be able to spin a second copy of it.
wdyt?
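
For illustration, a rough sketch of how a second, unprivileged sshd could be started by hand when the binary is already installed. This is not what clearml-session does internally; `KEY_DIR`, `PORT`, the `make_host_key` helper and the choice of an ed25519 key are all assumptions made for the example. Note that an sshd started by a non-root user can generally only accept logins for that same user.

```shell
# Sketch only: KEY_DIR and PORT are placeholder values, not ClearML settings.
KEY_DIR="$(mktemp -d)"
PORT=10022

# Hypothetical helper: writes ssh_host_ed25519_key (private key) and
# ssh_host_ed25519_key.pub (public key) into the directory given as $1.
make_host_key() {
    ssh-keygen -q -t ed25519 -N "" -f "$1/ssh_host_ed25519_key"
}

if command -v ssh-keygen >/dev/null 2>&1; then
    make_host_key "$KEY_DIR"
    # sshd has to be invoked by absolute path. -h normally points at the
    # private key file rather than the .pub file. The command is only
    # printed here, not executed.
    echo "/usr/sbin/sshd -D -e -f /dev/null -p $PORT -h $KEY_DIR/ssh_host_ed25519_key"
else
    echo "ssh-keygen not found; install the OpenSSH tools first"
fi
```

Running the printed command in a separate terminal keeps the daemon in the foreground (`-D`) and logs to stderr (`-e`), which makes host key errors like the ones in the log above easy to spot.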

@norrishd
Author

norrishd commented Jul 20, 2021

Thanks for the explanation @bmartinn! So do you mean that the current version is able to spin a second SSH daemon? (assuming there's an SSH daemon installed). If so that's very cool and would be fine (I must just be doing something wrong)

I tend to use venvs for everything just to avoid ever messing with the system Python. But I guess the use case for clearml-agent is that it's intended to run on servers (or in containers?) that are reserved for that purpose, so the recommendation is to use the system Python and install necessary packages there?

Would you also recommend running it as sudo, or does it not need that level of privileges?

@bmartinn
Member

> If so that's very cool and would be fine (I must just be doing something wrong)

Yes, at least in theory. If this doesn't work even though /usr/sbin/sshd is preinstalled, let me know what the setup is; it might be that we are missing something.
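
A quick way to check whether sshd is preinstalled and visible to the user running the agent, as a minimal sketch. The `find_sshd` helper is made up for this example; the fallback path is the `/usr/sbin/sshd` location mentioned above, and the `apt-get` hint assumes an Ubuntu host like the one in the report.

```shell
# Hypothetical helper: print the path of an sshd binary, or return 1.
# $1 is an optional fallback location (defaults to /usr/sbin/sshd), since
# /usr/sbin is often not on a non-root user's PATH.
find_sshd() {
    fallback="${1:-/usr/sbin/sshd}"
    if command -v sshd >/dev/null 2>&1; then
        command -v sshd
    elif [ -x "$fallback" ]; then
        echo "$fallback"
    else
        return 1
    fi
}

if path="$(find_sshd)"; then
    echo "sshd found at: $path"
else
    echo "sshd not found; installing it needs root, e.g. 'sudo apt-get install openssh-server'"
fi
```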

> ... But I guess the use case for clearml-agent is that it's intended to run on servers

The agent itself can be installed in a venv (even though it might be easier to install it system-wide).
The issue is the process the agent spins up. When the agent gets a job (a Task), it can either create a new temporary venv for the Task, install everything the Task needs there, spin up the process and leave; or it can spin up a container for the Task and then repeat the same process (venv creation) inside the container.
When the agent is used to spin up a clearml-session, the usual setup is that the agent is running in docker mode (i.e. with the --docker flag), so it spins up all jobs inside a container, including the clearml-session's remote interactive session.
Make sense?
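
As a sketch of the docker-mode setup described above: the snippet below checks that both `docker` and `clearml-agent` are on the PATH and then prints the daemon command, reusing the `default` queue and `--foreground` flag from the original report with `--docker` added. The `preflight` helper is made up for this example.

```shell
# Hypothetical helper: succeed only if every named tool is on the PATH,
# otherwise print the missing ones and return 1.
preflight() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    [ -z "$missing" ] || { echo "missing:$missing"; return 1; }
}

if preflight docker clearml-agent; then
    # Docker mode: each Task, including the clearml-session one, runs in a container.
    echo "ready, run: clearml-agent daemon --queue default --docker --foreground"
fi
```

Because the session then runs inside a container, the SSH server can be installed there without the host user needing root on the host itself.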

@norrishd
Author

Yep makes total sense, thanks 😁
