usernetes with efa on AWS #322
Comments
Oh wow, it ran, but it was SO SLOW! I do wonder if I'm missing a bind, although I think it's there. EFA on the host I think is here:
and in the container I see it too:
# ls /dev/infiniband/uverbs0
/dev/infiniband/uverbs0
But even though EFA is installed in the container, it could still be that it's not working. I'm going to quickly run the tests.
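The device check above can be wrapped in a small helper so the same test runs on the host and inside the container. This is only a sketch: the function name `check_uverbs` is hypothetical, and the default directory `/dev/infiniband` is taken from the listing above. Seeing the device node only shows that the bind worked, not that EFA traffic actually flows.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list uverbs device nodes under a directory
# (default: /dev/infiniband, as in the listing above).
# Prints each node found; returns 1 if none exist.
check_uverbs() {
  local dir="${1:-/dev/infiniband}"
  local found=1
  local dev
  for dev in "$dir"/uverbs*; do
    # An unmatched glob stays literal, so confirm the path really exists.
    if [ -e "$dev" ]; then
      echo "$dev"
      found=0
    fi
  done
  if [ "$found" -ne 0 ]; then
    echo "no uverbs devices under $dir" >&2
  fi
  return "$found"
}
```

Calling `check_uverbs` with no argument on the host and again in the pod should list the same nodes if the device is bound through. If libfabric is installed, `fi_info -p efa` is a further check that the EFA provider is visible.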
Ah, got it working! I will post the full update tomorrow - basically I needed to hack the daemonset a bit, and then add the correct annotations for it to bind to the pod (of the job). Then I could run a sleep job, shell in, install It's super late here but I'm going to be doing experiments soon and can post the full details of the setup.
Update here - I think we've solved all the performance issues, but I need to do a few more scaled tests! @AkihiroSuda can I ask you a quick question? For your usernetes paper here: https://arxiv.org/pdf/2402.00365v1 does the component in Figure 1 (the "Intermediate NetNS" that incoming and outgoing traffic gets routed through) get bypassed when using EFA? https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html More specifically, in the diagram for EFA at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-basics, would you say the usernetes "Intermediate NetNS" is part of the TCP/IP stack? Or the ENA device? I ask because I did some tweaks to our setup and was able to get the same performance on bare metal as in usernetes, and I think I can explain it based on those diagrams (if this is the case). Thanks for your help!
The "intermediate netns" just refers to RootlessKit's namespace, i.e., the namespace of dockerd.
So if we use the libfabric API, which connects directly to the device (hardware), I'm guessing we are bypassing all of that.
Hey @AkihiroSuda - happy weekend! I have a full setup working with flux and usernetes on AWS, and I added in the Elastic Fabric Adapter (EFA), but it's absolutely not running. The link there has some background - it needs to leverage drivers on the host. Is there an extra bit of information or a bind I need to add to the docker-compose setup to allow for that to happen? For context, the operator installs OK, and lammps even starts up OK, but (on a very small problem size that is usually done in 1-2 seconds on a bad machine) it's basically hanging:
And the above hangs there. I'm thinking perhaps for usernetes I need to bind the driver location on the host to somewhere in usernetes? Or something else? Let me know if you have insights. As always, thank you for your help!