transports/ssh: support proxy_jump #4951

Merged: 11 commits into develop from feature/proxyjump, Jul 19, 2021

Contributor

@dev-zero dev-zero commented May 18, 2021

No description provided.

@codecov

codecov bot commented May 18, 2021

Codecov Report

Merging #4951 (9295bcf) into develop (3fe12c7) will decrease coverage by 0.01%.
The diff coverage is 81.36%.

@@             Coverage Diff             @@
##           develop    #4951      +/-   ##
===========================================
- Coverage    80.12%   80.12%   -0.00%     
===========================================
  Files          515      515              
  Lines        36699    36742      +43     
===========================================
+ Hits         29402    29435      +33     
- Misses        7297     7307      +10     
Flag Coverage Δ
django 74.60% <81.36%> (+0.01%) ⬆️
sqlalchemy 73.52% <81.36%> (+0.01%) ⬆️

Impacted Files Coverage Δ
aiida/transports/plugins/ssh.py 79.91% <81.36%> (-0.23%) ⬇️

@ltalirz
Member

ltalirz commented May 18, 2021

thanks for the addition @dev-zero !

from my side this is very welcome, we should just think about how to best avoid confusing new users with two ways of configuration and guide them to the new proxyjump command, both in the docs and through the verdi cli.

I could even imagine deprecating (and eventually removing) the proxycommand option in the cli, i.e. just keeping the support for it in the python API for existing profiles. But maybe keeping both is better...

@dev-zero
Contributor Author

The proxycommand may be needed for some more exotic cases where people are tunneling through nc.
Wrt docs and guiding the user: this is why I kept it as an RFC for now, I just need a band-aid for now and the proxycommand seems atm to be one of the culprits of hanging workers.

@dev-zero
Contributor Author

Oh, and how beautifully it works :-) just submitted 100+ calculations, not a single wait, no workers hitting 100% CPU

@dev-zero
Contributor Author

@ltalirz ok, updated the docs and help strings

@dev-zero dev-zero changed the title RFC: transports/ssh: support proxy_jump transports/ssh: support proxy_jump May 18, 2021
@dev-zero dev-zero force-pushed the feature/proxyjump branch from 210b199 to 6afe61f Compare May 18, 2021 14:18
@dev-zero dev-zero requested a review from mbercx May 18, 2021 14:18
@ltalirz
Member

ltalirz commented May 18, 2021

> @ltalirz ok, updated the docs and help strings

Thanks! With "docs" are you referring to the aiida-core documentation?
I guess the place I had in mind was this

By the way, do I understand correctly that for you this fix got rid of the symptom #4876 (comment) ?

Do you know whether it is due to the use of the ProxyJump or whether it is due to not using a separate process group?

It seems to me that the separate process group was introduced by @greschd in 00ed7f6 , perhaps he can comment on the reasoning behind it.

@dev-zero
Contributor Author

>> @ltalirz ok, updated the docs and help strings
>
> Thanks! With "docs" are you referring to the aiida-core documentation?
> I guess the place I had in mind was this

Ah, right, I thought more about the documentation strings for the CLI.

> By the way, do I understand correctly that for you this fix got rid of the symptom #4876 (comment)

That, and all other SSH-related issues I've been seeing.

> Do you know whether it is due to the use of the ProxyJump or whether it is due to not using a separate process group?

Guessing from paramiko/paramiko#1180 I'd say it is the implementation of ProxyCommand in general (and not necessarily ours) and the switch to ProxyJump avoids it.

@ltalirz
Member

ltalirz commented May 18, 2021

Wow, so using ProxyCommand with paramiko can let you end up in an infinite loop, and this has been known since 2017 (paramiko/paramiko#963) and never fixed despite the open PR paramiko/paramiko#1180. That's a little worrying and perhaps suggests looking at alternatives at some point, as suggested here...

In any case, in the short term it makes the proxyjump support even more pressing (if this does indeed get rid of the problem).

@sphuber @giovannipizzi Do you have any comments on the basics of the implementation here?
@greschd Should one use the separate process group?

@sphuber
Contributor

sphuber commented May 19, 2021

Interesting, glad you found the source of your problems. I have to say, I wouldn't have thought to look at the SSH connections in that infinite recursion error. The weird thing is, I have been using the current setup to submit calculations over SSH with proxy very heavily, as in tens of thousands of jobs, without problem. This was for the high-throughput project on Daint which we connect to through Ela. The configuration of the computer authentication in AiiDA that we have been using looks like

ssh -i ~/.ssh/id_rsa [email protected] /usr/bin/netcat daint.cscs.ch 22

Would you have expected this to also run into the issue you have been observing? Or is that the kind of setup that you referred to as

> The proxycommand may be needed for some more exotic cases where people are tunneling through nc

and so here nc is the netcat that we are using?

@dev-zero
Contributor Author

dev-zero commented May 19, 2021 via email

Member

@giovannipizzi giovannipizzi left a comment

Thanks @dev-zero !

Since this adds support for a new feature, and in addition solves issues, I'm in favour of this!
I didn't check the implementation in detail but the idea seems good.

If we believe that the proxy jump is the 'better' way of doing things, I agree on documenting it as the "default suggestion" in our docs (and just mention proxy_command for advanced users). I wouldn't deprecate the other one for now as there might be use cases for it (already the fact that proxy_command comes later, and the comment on proxy_command to use the proxy jump instead, are enough).

As a comment, I've also been using the proxy command with daint a lot, including recently. However, in the past (I think AiiDA 0.x) I remember having (different? or maybe similar) issues with it: when using a different nc executable on daint, and probably a different proxy_command string, the processes were remaining as zombies. This was happening on daint; maybe in the end it was also happening on my computer, but I only noticed because I was contacted by the admins, who saw hundreds of processes on the jump host ela :-D

Final comment - good to check the various YML we have lying around, to make sure that those 1. continue working or 2. are updated to the new proxy jump. In particular for AiiDAlab (pinging @yakutovicha and @csadorf).

On this PR: I would add changes to the docs as discussed, as well as tests on the CLI to test the new CLI option.

@dev-zero
Contributor Author

dev-zero commented May 19, 2021

for completeness the proxy_command I've been using (taken from a yml for Daint):

proxy_command          ssh -W eiger.cscs.ch:22 [email protected]

and the ps auxf output after verdi daemon stop and receiving the TIMEOUT:

tiziano  25417  0.1  0.5 730948 86728 ?        Sl   09:01   0:03 /scratch/tiziano/virtualenvs/aiida/bin/python /scratch/tiziano/virtualenvs/aiida/bin/verdi -p default daemon start-circus 6
tiziano  25421 92.9  1.1 1284608 193876 ?      Ssl  09:01  45:13  \_ /scratch/tiziano/virtualenvs/aiida/bin/python /scratch/tiziano/virtualenvs/aiida/bin/verdi -p default devel run_daemon
tiziano  26423  0.0  0.0      0     0 ?        Zs   09:06   0:00  |   \_ [ssh] <defunct>
tiziano  25422 61.8  1.1 1288476 195836 ?      Ssl  09:01  30:06  \_ /scratch/tiziano/virtualenvs/aiida/bin/python /scratch/tiziano/virtualenvs/aiida/bin/verdi -p default devel run_daemon
tiziano  28813  0.0  0.0      0     0 ?        Zs   09:22   0:00  |   \_ [ssh] <defunct>
tiziano  25424 93.4  1.2 1291724 199788 ?      Ssl  09:01  45:27  \_ /scratch/tiziano/virtualenvs/aiida/bin/python /scratch/tiziano/virtualenvs/aiida/bin/verdi -p default devel run_daemon
tiziano  26426  0.0  0.0      0     0 ?        Zs   09:06   0:00  |   \_ [ssh] <defunct>
tiziano  25425 79.7  1.1 1279192 186600 ?      Ssl  09:01  38:49  \_ /scratch/tiziano/virtualenvs/aiida/bin/python /scratch/tiziano/virtualenvs/aiida/bin/verdi -p default devel run_daemon
tiziano  27538  0.0  0.0      0     0 ?        Zs   09:13   0:00  |   \_ [ssh] <defunct>
tiziano  25426 92.6  1.0 1264632 172400 ?      Ssl  09:01  45:06  \_ /scratch/tiziano/virtualenvs/aiida/bin/python /scratch/tiziano/virtualenvs/aiida/bin/verdi -p default devel run_daemon
tiziano  26270  0.0  0.0      0     0 ?        Zs   09:05   0:00      \_ [ssh] <defunct>

@dev-zero
Contributor Author

dev-zero commented May 19, 2021

Update: one likely has to explicitly close the SSHConnection created for the proxy since we're opening a new channel and passing this channel as a sock to the next connection. Hence even if Paramiko calls close on the sock it will not close the complete connection. After explicitly closing the proxies the number of threads stay constant:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM nTH     TIME+ COMMAND
13760 rabbitmq  20   0 5433772 142620   8192 S 0.332 0.873 159   8:33.83 beam.smp                                                                                                                                                                                   
 8075 tiziano   20   0 3226700  91072  23068 S 0.000 0.557  35  12:56.47 mysqld                                                                                                                                                                                     
 8013 tiziano   20   0 2643216  46116  33088 S 0.000 0.282  28   0:45.53 akonadiserver                                                                                                                                                                              
 3353 tiziano   20   0 1884656 327704  36008 S 0.000 2.005  26   0:50.08 python                                                                                                                                                                                     
31340 tiziano   20   0 1744568 188948  36128 S 0.000 1.156  26   2:21.03 python                                                                                                                                                                                     
 4776 root      20   0 2610536 192040  51152 S 0.332 1.175  25 103:08.49 dockerd                                                                                                                                                                                    
15607 tiziano   20   0 1718640 177116  36276 S 0.000 1.084  24   1:04.16 python                                                                                                                                                                                     
15742 tiziano   20   0 1639916 174172  35828 S 0.000 1.066  23   0:06.34 python                                                                                                                                                                                     
 5154 root      20   0 1645060  56196  24920 S 0.000 0.344  21  21:02.97 docker-containe                                                                                                                                                                            
 4902 netdata   20   0  263016  61688   3288 S 0.664 0.377  19 144:28.07 netdata                                                                                                                                                                                    
 7991 tiziano   20   0 1991200  92156  68064 S 0.000 0.564  19   0:00.16 xdg-desktop-por                                                                                                                                                                            
 3430 tiziano   20   0 1451080 181576  36784 S 8.638 1.111  18   0:28.48 verdi                                                                                                                                                                                      
 3431 tiziano   20   0 1438212 173352  35668 S 0.000 1.061  16   0:24.93 verdi                                                                                                                                                                                      
 3432 tiziano   20   0 1433820 171460  35428 S 0.000 1.049  16   0:25.00 verdi                                                                                                                                                                                      
 1145 polkitd   20   0 2057348  43516  18060 S 0.000 0.266  12   0:01.20 polkitd                                                                                                                                                                                    
 7832 tiziano   20   0 2119452 326984 155244 S 0.000 2.001  11   1:09.44 plasmashell                                                                                                                                                                                
 8010 tiziano   20   0 1896744 193880 114028 S 0.000 1.186  11   1:42.14 nextcloud                                                                                                                                                                                  
11180 tiziano   20   0 1510604 227144 109160 S 0.000 1.390  11  20:49.84 kscreenlocker_g                                                                                                                                                                            
 3428 tiziano   20   0  697328  90900  20548 S 0.000 0.556   9   0:02.44 verdi                                                                                                                                                                                      
 3429 tiziano   20   0  954760 118936  24304 S 0.000 0.728   9   0:11.14 verdi                                                                                                                                                                                      
 3433 tiziano   20   0  697352  91084  20692 S 0.000 0.557   9   0:02.45 verdi 
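
The point about explicitly closing the proxy connections can be sketched roughly as follows. This is only an illustration under assumptions, not the code from this PR: `open_through_jumps` and the hop dictionaries are hypothetical names, and the only Paramiko calls used are `SSHClient.load_system_host_keys()`, `SSHClient.connect(..., sock=...)`, `get_transport().open_channel("direct-tcpip", ...)` and `SSHClient.close()`. Every hop's client is registered for closing, because closing the channel that was passed as `sock` does not close the proxy connection that owns it.

```python
from contextlib import ExitStack


def open_through_jumps(target, jumps, client_factory=None):
    """Connect to `target` through `jumps`; return (client, close_all).

    `target` and every jump are dicts such as
    {"host": "jump.example.org", "port": 22, "username": "alice"} (hypothetical shape).
    """
    if client_factory is None:
        import paramiko  # imported lazily so the sketch can be tried with stand-in clients

        client_factory = paramiko.SSHClient

    stack = ExitStack()  # callbacks run in reverse order: target first, then the proxies
    sock = None
    hops = list(jumps) + [target]
    try:
        for hop, next_hop in zip(hops, hops[1:] + [None]):
            client = client_factory()
            stack.callback(client.close)  # make sure this hop is closed explicitly later
            client.load_system_host_keys()
            client.connect(
                hop["host"],
                port=hop.get("port", 22),
                username=hop.get("username"),
                sock=sock,
            )
            if next_hop is not None:
                # Channel through this hop, used as the socket of the next connection.
                sock = client.get_transport().open_channel(
                    "direct-tcpip",
                    (next_hop["host"], next_hop.get("port", 22)),
                    ("", 0),
                )
    except Exception:
        stack.close()  # do not leak half-open proxy connections on failure
        raise
    return client, stack.close
```

Calling the returned `close_all()` closes the target connection first and then each proxy in reverse order, which is the step that keeps the thread count from growing in the sketch's model of the problem.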

There's one caveat to the solution: since this relies on more Paramiko ssh connections (one per proxy and one for the machine) I seem to spawn too many threads. From top (with the nTH (number of threads) column enabled):

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM nTH     TIME+ COMMAND                                                                                                                                                                                    
18450 tiziano   20   0 7858248 237060  37784 S 14.62 1.450 321  76:55.02 verdi                                                                                                                                                                                      
18449 tiziano   20   0 7805784 229932  37328 S 6.312 1.407 319  73:36.11 verdi                                                                                                                                                                                      
18447 tiziano   20   0 7859568 221544  37356 S 5.980 1.356 325  74:18.38 verdi                                                                                                                                                                                      
18448 tiziano   20   0 7851892 239564  38556 S 5.980 1.466 320  75:17.40 verdi                                                                                                                                                                                      
18451 tiziano   20   0 7832944 231792  37440 S 5.980 1.418 322  69:26.28 verdi                                                                                                                                                                                      
18452 tiziano   20   0 7795188 240476  37084 S 5.980 1.471 318  73:59.64 verdi      

which causes:

-bash: fork: retry: Resource temporarily unavailable

My guess is a combination of:

@dev-zero dev-zero force-pushed the feature/proxyjump branch 2 times, most recently from 0a28f31 to 7ee5b58 Compare May 19, 2021 09:39
@ramirezfranciscof ramirezfranciscof self-assigned this May 19, 2021
@dev-zero dev-zero force-pushed the feature/proxyjump branch from 7ee5b58 to 01b343c Compare May 19, 2021 12:27
@dev-zero
Contributor Author

Ok, all done. As far as I can tell, the following warnings are reported from bogus locations: the CI run for the previous commit showed the same messages, just emitted from a different location. I also ran the reported test case via a script that monkey-patched file and socket open/close and printed tracebacks based on tracemalloc, and couldn't reproduce it.

NEW:

2021-05-19T12:43:04.5716799Z tests/transports/test_ssh.py::test_gotocomputer_proxyjump
2021-05-19T12:43:04.5718005Z   /opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/abc.py:86: ResourceWarning: unclosed file <_io.FileIO name=74 mode='wb' closefd=True>
2021-05-19T12:43:04.5718810Z     _abc_init(cls)
2021-05-19T12:43:04.5719033Z 
2021-05-19T12:43:04.5719535Z tests/transports/test_ssh.py::test_gotocomputer_proxyjump
2021-05-19T12:43:04.5720710Z   /opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/abc.py:86: ResourceWarning: unclosed file <_io.FileIO name=75 mode='rb' closefd=True>
2021-05-19T12:43:04.5721480Z     _abc_init(cls)
2021-05-19T12:43:04.5721714Z 
2021-05-19T12:43:04.5722203Z tests/transports/test_ssh.py::test_gotocomputer_proxyjump
2021-05-19T12:43:04.5723389Z   /opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/abc.py:86: ResourceWarning: unclosed file <_io.FileIO name=77 mode='rb' closefd=True>
2021-05-19T12:43:04.5724162Z     _abc_init(cls)
2021-05-19T12:43:04.5724383Z 

https://pipelines.actions.githubusercontent.com/lx6f5qJRWQcbiE7vU42yTxrR7aWtThmdNBqldi7jyXdZiiVFpF/_apis/pipelines/1/runs/11842/signedlogcontent/24?urlExpires=2021-05-19T12%3A46%3A18.9245123Z&urlSigningMethod=HMACV1&urlSignature=tv%2BRl9us0wUkgaVtg%2BtoZNvWoqOs2UxxNoyZcT4ulwg%3D

PREVIOUS:

2021-05-16T16:22:23.8667487Z tests/workflows/arithmetic/test_add_multiply.py::test_run
2021-05-16T16:22:23.8668712Z   /opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/traceback.py:359: ResourceWarning: unclosed file <_io.FileIO name=59 mode='wb' closefd=True>
2021-05-16T16:22:23.8669617Z     result.append(FrameSummary(
2021-05-16T16:22:23.8669953Z 
2021-05-16T16:22:23.8670445Z tests/workflows/arithmetic/test_add_multiply.py::test_run
2021-05-16T16:22:23.8671647Z   /opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/traceback.py:359: ResourceWarning: unclosed file <_io.FileIO name=60 mode='rb' closefd=True>
2021-05-16T16:22:23.8672559Z     result.append(FrameSummary(
2021-05-16T16:22:23.8672976Z 
2021-05-16T16:22:23.8673450Z tests/workflows/arithmetic/test_add_multiply.py::test_run
2021-05-16T16:22:23.8674671Z   /opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/traceback.py:359: ResourceWarning: unclosed file <_io.FileIO name=63 mode='rb' closefd=True>
2021-05-16T16:22:23.8675574Z     result.append(FrameSummary(

https://pipelines.actions.githubusercontent.com/lx6f5qJRWQcbiE7vU42yTxrR7aWtThmdNBqldi7jyXdZiiVFpF/_apis/pipelines/1/runs/11774/signedlogcontent/24?urlExpires=2021-05-19T13%3A25%3A52.6201442Z&urlSigningMethod=HMACV1&urlSignature=eivYPNcjaiguS7TPL%2FQl29J2j%2Bp5CiGxEwGJtxeMU3k%3D

Also, those warnings from the previous commit don't show up in this PR's CI run and there was nothing in between which could have fixed them.

@dev-zero
Contributor Author

Since I was leaking threads/file descriptors/whatnot in an earlier version of this patch, I stumbled (via HN) over this blog article about file descriptors (and how a lot of other things like sockets and timers are file descriptors nowadays): http://0pointer.net/blog/file-descriptor-limits.html
For AiiDA the limit of open files per process is therefore effectively 1024, since Paramiko (and maybe others) uses select. Should you ever have the hunch that you might be hitting that limit without the code properly handling the resulting errors, you can count the open file descriptors of each verdi process:

$ for pid in $(pgrep verdi) ; do echo "PID = $pid with $(ls /proc/$pid/fd/ | wc -l) file descriptors" ; done

Might this be something to extend verdi daemon status with (together with "number of threads" and virt/res memory)? And maybe also add a check that the user didn't set the soft limit for the number of open files beyond 1024.
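
A check along those lines could look roughly like the sketch below. It is not part of this PR; the function names are made up for illustration, the value 1024 is assumed to match the usual `FD_SETSIZE` on Linux rather than being queried from the system, and counting descriptors through `/proc/<pid>/fd` only works on Linux.

```python
import os
import resource

# Typical FD_SETSIZE on Linux; an assumption, not read from the running system.
SELECT_FD_LIMIT = 1024


def count_open_fds(pid=None):
    """Return the number of open file descriptors of a process (Linux /proc only)."""
    pid = os.getpid() if pid is None else pid
    return len(os.listdir(f"/proc/{pid}/fd"))


def soft_limit_exceeds_select(soft_limit, select_limit=SELECT_FD_LIMIT):
    """True if descriptors above what select() can handle could be handed out."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit > select_limit


if __name__ == "__main__":
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open file descriptors: {count_open_fds()}")
    print(f"soft limit for open files: {soft}")
    if soft_limit_exceeds_select(soft):
        print("warning: soft limit is above what select()-based code can handle")
```

The same two numbers are what the shell one-liner above reports per `verdi` process, so this would mostly be a matter of where to surface them.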

@giovannipizzi
Member

Interesting... I read briefly the link. So we should make sure we don't use a limit over 1024, at least as long as we use paramiko? Ok for me to add to verdi test if it's important to check - @sphuber @ramirezfranciscof @chrisjsewell what do you think?

@dev-zero
Contributor Author

dev-zero commented Jun 3, 2021

> Interesting... I read briefly the link. So we should make sure we don't use a limit over 1024, at least as long as we use paramiko?

Correct. There might be other plugins which still use select (psycopg maybe?), but paramiko is definitely using it. Normally I wouldn't care about it too much because the soft limit is 1024, but I see something like ulimit -a unlimited when people are doing calculations and there may be a good chance that verdi is run with that setting too.

Member

@ltalirz ltalirz left a comment

thanks a lot @dev-zero

I was hoping @greschd could perhaps give a quick comment on whether this approach without separate process groups seems fine.

in the meanwhile I already gave a quick look

@dev-zero
Contributor Author

dev-zero commented Jun 6, 2021

@ltalirz I wasn't happy with the documentation and redid it. I hope it is clearer now.

@dev-zero
Contributor Author

dev-zero commented Jun 6, 2021

please squash commits on merge

@mbercx
Member

mbercx commented Jun 6, 2021

Sorry for not having looked at this earlier, but I'm still having a quick look now, so don't merge yet! 😅

@ltalirz ltalirz self-requested a review June 6, 2021 17:30
ltalirz previously approved these changes Jun 6, 2021
Member

@ltalirz ltalirz left a comment

> @ltalirz I wasn't happy with the documentation and redid it. I hope it is clearer now.

All good, and of course your point about the IdentityFile was well taken.

@mbercx I'm unsubscribing from this thread. Please take care of merging this after your review

Member

@mbercx mbercx left a comment

Thanks a lot for this, @dev-zero! We've been having issues with using ProxyCommand when connecting to CSCS clusters, so having support for ProxyJump will be a big plus.

I haven't reviewed the code in much detail, I trust @ltalirz has that covered. I just wanted to try this out by following the documentation to see if everything is clear. However, I haven't been able to make this work for my connection to daint. Maybe I messed up somewhere (it's a Sunday after all)? Here's the relevant piece in the ~/.ssh/config:

$ grep -A 8 daint-test ~/.ssh/config 
Host daint-test
  HostName daint.cscs.ch
  User mbercx
  IdentityFile ~/.ssh/id_rsa-daint
  ProxyJump [email protected]

Host ela.cscs.ch
  IdentityFile ~/.ssh/id_rsa-daint

ssh daint-test works fine. I first set up the computer with:

$ verdi computer setup --config computer/daint-test.yaml 
Success: Computer<5> daint-test created
Info: Note: before the computer can be used, it has to be configured with the command:
Info:   verdi computer configure ssh daint-test

where

$ more computer/daint-test.yaml 
---
label: daint-test
description: Piz Daint with XC50 Haswell GPU (12 CPUs, 1 GPU) - PRACE project
hostname: daint.cscs.ch
transport: ssh
scheduler: slurm
shebang: '#!/bin/bash'
work_dir: /scratch/snx3000/{username}/aiida
mpirun_command: srun -u -n {tot_num_mpiprocs} --hint=nomultithread --unbuffered
mpiprocs_per_machine: 12
prepend_text: |
   #SBATCH -A pr110
   #SBATCH -C gpu
append_text: ' '

I then configured the ssh interactively:

$ verdi computer configure ssh daint-test
Info: enter "?" for help
Info: enter "!" to ignore the default and set no value
User name [mbercx]: 
Port number [22]: 
Look for keys [True]: 
SSH key file []: 
Connection timeout in s [60]: 
Allow ssh agent [True]: 
SSH proxy jump []: [email protected]
SSH proxy command []: 
Compress file transfers [True]: 
GSS auth [False]: 
GSS kex [False]: 
GSS deleg_creds [False]: 
GSS host [daint.cscs.ch]: 
Load system host keys [True]: 
Key policy (RejectPolicy, WarningPolicy, AutoAddPolicy) [RejectPolicy]: 
Use login shell when executing command [True]: 
Connection cooldown time (s) [30.0]: 
Info: Configuring computer daint-test for user [email protected].
Success: daint-test successfully configured for [email protected]

When trying to run a job, it gets stuck in the upload step:

$ verdi process list
  PK  Created    Process label    Process State    Process status
----  ---------  ---------------  ---------------  ----------------------------------
 559  5m ago     PwBaseWorkChain  ⏵ Waiting        Waiting for child processes: 564
 564  5m ago     PwCalculation    ⏵ Waiting        Waiting for transport task: upload

Total results: 2

Info: last time an entry changed state: 5m ago (at 17:25:28 on 2021-06-06)

And the report has the following error message:

$ verdi process report 564
*** 564: None
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 1 LOG MESSAGES:
+-> ERROR at 2021-06-06 17:25:59.102930+00:00
 | Traceback (most recent call last):
 |   File "/home/mbercx/envs/aiida-dev/code/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
 |   File "/home/mbercx/envs/aiida-dev/code/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 80, in do_upload
 |     transport = await cancellable.with_interrupt(request)
 |   File "/home/mbercx/envs/aiida-dev/code/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/usr/lib/python3.8/asyncio/futures.py", line 175, in result
 |     raise self._exception
 |   File "/home/mbercx/envs/aiida-dev/code/aiida-core/aiida/engine/transports.py", line 89, in do_open
 |     transport.open()
 |   File "/home/mbercx/envs/aiida-dev/code/aiida-core/aiida/transports/plugins/ssh.py", line 502, in open
 |     proxy_client.connect(proxy['host'], **proxy_connargs)
 |   File "/home/mbercx/.virtualenvs/aiida-dev/lib/python3.8/site-packages/paramiko/client.py", line 435, in connect
 |     self._auth(
 |   File "/home/mbercx/.virtualenvs/aiida-dev/lib/python3.8/site-packages/paramiko/client.py", line 765, in _auth
 |     raise SSHException("No authentication methods available")
 | paramiko.ssh_exception.SSHException: No authentication methods available

My OpenSSH version is 7.6:

$ ssh -V
OpenSSH_7.6p1 Ubuntu-4ubuntu0.3, OpenSSL 1.0.2n  7 Dec 2017

And I'm using Ubuntu 18.04.5 LTS:

$ cat /etc/os-release 
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"

Let me know if I messed up somewhere! In this case, sorry for delaying the merge. 😅
I've also left some minor comments re the documentation.

@dev-zero
Contributor Author

dev-zero commented Jul 12, 2021

@mbercx difficult to say what went wrong, since the error from the log, which would contain more information (especially the final login information used to log in to the proxy), is missing.

Here's my config and the output of verdi computer test eiger:

$ verdi computer configure show eiger
* username               timuel
* port                   22
* look_for_keys          True
* key_filename
* timeout                60
* allow_agent            True
* proxy_jump             [email protected]
* proxy_command
* compress               True
* gss_auth               False
* gss_kex                False
* gss_deleg_creds        False
* gss_host               eiger.cscs.ch
* load_system_host_keys  True
* key_policy             RejectPolicy
* use_login_shell        True
* safe_interval          30.0
$ verdi computer test eiger
Info: Testing computer<eiger> for user<[email protected]>...
* Opening connection... [OK]
* Checking for spurious output... [OK]
* Getting number of jobs from scheduler... [OK]: 94 jobs found in the queue
* Determining remote user name... [OK]: timuel
* Creating and deleting temporary file... [OK]
Success: all 5 tests succeeded
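
For readers unsure what goes into the `proxy_jump` field: OpenSSH's ProxyJump accepts a comma-separated list of `[user@]host[:port]` entries, and the traceback earlier in this thread (`proxy_client.connect(proxy['host'], **proxy_connargs)`) suggests each hop ends up as a dictionary with at least a `host` key. The sketch below shows one way such a string could be split; `parse_proxy_jump` and the returned keys are hypothetical and not the parser used in this PR, the host names are placeholders, and `ssh://` URIs and bracketed IPv6 addresses are deliberately not handled.

```python
def parse_proxy_jump(spec, default_user=None, default_port=22):
    """Split a ProxyJump-style string into one dict per hop, in connection order."""
    hops = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas and surrounding whitespace
        # Everything before the last "@" is the user name; it may be absent.
        user, _, hostport = item.rpartition("@")
        host, _, port = hostport.partition(":")
        hops.append(
            {
                "host": host,
                "username": user or default_user,
                "port": int(port) if port else default_port,
            }
        )
    return hops


if __name__ == "__main__":
    # Placeholder hosts; the first hop is the one contacted directly.
    for hop in parse_proxy_jump("alice@jump1.example.org,jump2.example.org:2222", default_user="bob"):
        print(hop)
```

Hops without an explicit user fall back to `default_user` here, which is a design choice of the sketch and may differ from what the transport does.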

@dev-zero dev-zero requested a review from mbercx July 12, 2021 14:04
@ltalirz
Member

ltalirz commented Jul 12, 2021

thanks @dev-zero

@mbercx can you please check the changes and provide additional input on the issue you observed?
e.g. the configuration of the ela and daint hosts from your SSH config + the output of verdi computer test daint

Member

@mbercx mbercx left a comment

Hi @ltalirz and @dev-zero! So, it turns out the issue was related to the fact that I was relying on look_for_keys: True. I.e. when configuring the test computer with the following YAML file:

username: mbercx
port: 22
look_for_keys: True
key_filename: ""
timeout: 60
allow_agent: True
proxy_jump: [email protected]
proxy_command: ""
compress: True
gss_auth: False
gss_kex: False
gss_deleg_creds: False
gss_host: daint.cscs.ch
load_system_host_keys: True
key_policy: RejectPolicy
use_login_shell: True
safe_interval: 30.0

Sidenote: I'm using double quotes to avoid prompts when configuring the computer with the --config option. Can you see any issues with that? 🤔

This resulted in the error I also find when trying to run calculations when I do:

$ verdi computer test daint-test
Info: Testing computer<daint-test> for user<[email protected]>...
* Opening connection... 07/14/2021 06:24:16 AM <31330> aiida.transport.SshTransport: [ERROR] Error connecting to proxy 'ela.cscs.ch' through SSH: 
[SshTransport] No authentication methods available, connect_args were: {'username': 'mbercx', 'port': 22, 'look_for_keys': True, 'timeout': 60, 'allow_agent': True, 'compress': True, 'gss_auth': False, 'gss_kex': False, 'gss_deleg_creds': False, 'gss_host': 'daint.cscs.ch'}
[FAILED]: Error while trying to connect to the computer
  Use the `--print-traceback` option to see the full traceback.
Warning: 1 out of 0 tests failed

(I added another carriage return to make sure the No authentication methods available message is clearly visible.)

This may be related to the fact that I have a rather large number of keys in my ~/.ssh/ folder. If I specify the correct key in the filename:

key_filename: /home/mbercx/.ssh/id_rsa-daint

and reconfigure the computer, all is well on the western front:

$ verdi computer test daint-test
Info: Testing computer<daint-test> for user<[email protected]>...
* Opening connection... [OK]
* Checking for spurious output... [OK]
* Getting number of jobs from scheduler... 07/14/2021 06:33:20 AM <32573> aiida.scheduler.slurm: [WARNING] Unrecognized job_state 'RD' for job id 31333865
[OK]: 1887 jobs found in the queue
* Determining remote user name... [OK]: mbercx
* Creating and deleting temporary file... [OK]
Success: all 5 tests succeeded

I also tried submitting some calculations and didn't encounter any more issues!
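Since the working configuration above depends on setting key_filename explicitly, one way to find the right key is to check which IdentityFile the SSH config assigns to the proxy host. The following is a minimal sketch using only the standard library. The function name `identity_files_for` and the contents of `SAMPLE_CONFIG` are hypothetical, and the parser deliberately ignores `Match` blocks, negated patterns, quoted values and the `keyword=value` form.

```python
import fnmatch
import os

# Hypothetical SSH config; the host aliases and key path are illustrative only.
SAMPLE_CONFIG = """
Host ela ela.cscs.ch
    HostName ela.cscs.ch
    IdentityFile ~/.ssh/id_rsa-daint

Host daint
    HostName daint.cscs.ch
    ProxyJump ela
"""


def identity_files_for(host, config_text):
    """Return the IdentityFile entries of all Host blocks whose pattern matches `host`."""
    applies = True  # options before the first Host line apply to every host
    found = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # keyword without an argument
        keyword, value = parts[0].lower(), parts[1].strip()
        if keyword == "host":
            # A Host line may list several whitespace-separated patterns.
            applies = any(fnmatch.fnmatch(host, pattern) for pattern in value.split())
        elif keyword == "match":
            applies = False  # Match blocks are not evaluated in this sketch
        elif keyword == "identityfile" and applies:
            found.append(os.path.expanduser(value))
    return found


print(identity_files_for("ela.cscs.ch", SAMPLE_CONFIG))
```

The returned path could then be used as the key_filename value when reconfiguring the computer.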

@dev-zero
Contributor Author

Wrt large number of keys: AFAIK ssh considers every login attempt with a key it doesn't know as an unsuccessful login attempt and will likely disconnect after 3 attempts, probably resulting in the error message you see. If you have been using ProxyCommand before instead: this one was using your ~/.ssh/config for connecting to ela (since an ssh command was run), hence it was using the settings you have in there for ela (which probably included a hard-coded key).
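The explanation above can be illustrated with a toy model. This is not OpenSSH or Paramiko code, only a sketch of a server that counts every offered key as one authentication attempt. The function name `authenticate`, the key names and the limit of 3 are assumptions for illustration; in OpenSSH the actual limit is the server-side `MaxAuthTries` setting.

```python
def authenticate(offered_keys, authorized_key, max_auth_tries):
    """Return the number of the successful attempt, or None if the server gives up first."""
    for attempt, key in enumerate(offered_keys, start=1):
        if attempt > max_auth_tries:
            return None  # the server disconnects before this key is even tried
        if key == authorized_key:
            return attempt
    return None  # no offered key was accepted


# Hypothetical key names: the correct key is only the fourth one offered.
many_keys = ["id_rsa-a", "id_rsa-b", "id_rsa-c", "id_rsa-daint"]

print(authenticate(many_keys, "id_rsa-daint", max_auth_tries=3))
print(authenticate(["id_rsa-daint"], "id_rsa-daint", max_auth_tries=3))
```

In this model, offering only the correct key always succeeds on the first attempt, which corresponds to setting key_filename explicitly.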

@mbercx
Member

mbercx commented Jul 14, 2021

> Wrt large number of keys: AFAIK ssh considers every login attempt with a key it doesn't know as an unsuccessful login attempt and will likely disconnect after 3 attempts, probably resulting in the error message you see. If you have been using ProxyCommand before instead: this one was using your ~/.ssh/config for connecting to ela (since an ssh command was run), hence it was using the settings you have in there for ela (which probably included a hard-coded key).

Right, makes sense, thanks @dev-zero! This all looks good to me. Just updated the branch, can you (squash and) merge? I typically prefer to write my own commit messages, so I'll leave that to you. 🙃

@dev-zero
Contributor Author

@mbercx I'm not authorized to merge

@mbercx
Member

mbercx commented Jul 15, 2021

Alright, happy to write a short commit message, but since it'll be associated with your account, maybe you'd like to write it yourself so I can add it when I merge?

@ltalirz
Member

ltalirz commented Jul 15, 2021

@dev-zero just gave you maintainer rights for the aiida-core repo; let me know if you're still unable to merge

@dev-zero dev-zero merged commit da179dc into aiidateam:develop Jul 19, 2021
@dev-zero
Contributor Author

@ltalirz thanks, looks good

@dev-zero dev-zero deleted the feature/proxyjump branch July 21, 2021 14:07
sphuber pushed a commit that referenced this pull request Aug 8, 2021
SSH provides multiple ways to forward connections. The legacy way is via the SSH ProxyCommand directive, which spawns a separate process for each jump host/proxy. Controlling those processes is error-prone, and lingering or hanging processes have been observed (#4940 and others, depending on the setup).

This commit adds support for the SSH ProxyJump feature, which allows setting up an arbitrary number of proxy jumps without additional processes by creating TCP channels over existing (Paramiko) connections. This gives good control over the lifetime of the different connections and, since a user's SSH config is not re-read after the initial setup, a controlled environment.

Hence it has been decided to make this new directive the recommended default in the documentation while still supporting both ways.

Co-authored-by: Marnik Bercx <[email protected]>
Co-authored-by: Leopold Talirz <[email protected]>

Cherry-pick: da179dc
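
The idea described in the commit message, opening a TCP channel over an existing Paramiko connection and using it as the socket for the next hop, can be sketched roughly as follows. This is not the code merged in this pull request: `connect_via_jumps`, `close_all` and the `client_factory` parameter are hypothetical names, and the sketch assumes Paramiko's `SSHClient.connect(..., sock=...)` and `Transport.open_channel('direct-tcpip', ...)`. Paramiko is imported lazily so the chaining logic can be exercised with stand-in clients.

```python
def connect_via_jumps(hops, client_factory=None):
    """Connect through a chain of hosts and return the list of connected clients.

    `hops` is a list of (hostname, port, connect_kwargs) tuples, ordered from
    the first jump host to the final target.
    """
    if client_factory is None:
        import paramiko  # third-party; only needed for real connections

        def client_factory():
            client = paramiko.SSHClient()
            client.load_system_host_keys()
            return client

    clients = []
    sock = None  # the first hop is reached over a regular TCP connection
    for index, (hostname, port, connect_kwargs) in enumerate(hops):
        client = client_factory()
        client.connect(hostname, port=port, sock=sock, **connect_kwargs)
        clients.append(client)
        if index + 1 < len(hops):
            # Tunnel to the next hop through the connection just established;
            # no extra process is spawned, unlike with ProxyCommand.
            next_host, next_port, _ = hops[index + 1]
            sock = client.get_transport().open_channel(
                "direct-tcpip", (next_host, next_port), ("", 0)
            )
    return clients


def close_all(clients):
    """Close the connections in reverse order, target first and jump hosts last."""
    for client in reversed(clients):
        client.close()


# Example call (requires network access and valid credentials):
# clients = connect_via_jumps([
#     ("ela.cscs.ch", 22, {"username": "mbercx"}),
#     ("daint.cscs.ch", 22, {"username": "mbercx"}),
# ])
# target = clients[-1]
```

Keeping every intermediate client in the returned list is what makes the lifetime of each hop explicit: closing them in reverse order tears down the tunnel cleanly.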