DASK Deployment using SLURM with GPUs #1381
Could you please report the output of print_affinity.py?

import pynvml
from dask_cuda.utils import get_cpu_affinity

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    cpu_affinity = get_cpu_affinity(i)
    print(type(get_cpu_affinity(i)), get_cpu_affinity(i))
Hi @pentschev, I have forgotten to mention that I have disabled the "os.sched_setaffinity(0, self.cores)", as attached below
Keep in mind doing that will likely result in degraded performance. Here's a previous comment I wrote about this on a similar issue.
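For context, the disabled call is the mechanism Dask-CUDA uses to pin each worker to the CPU cores nearest its GPU. A minimal, generic illustration of that mechanism (plain Python, not dask-cuda code):

import os

# Cores the kernel currently allows this process to run on.
allowed = sorted(os.sched_getaffinity(0))
print("allowed cores:", allowed)

# What the disabled line does inside the worker: pin the process to a chosen
# subset of cores (here just the first allowed core, as a harmless demo).
os.sched_setaffinity(0, {allowed[0]})
print("after pinning:", sorted(os.sched_getaffinity(0)))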
Thank you @pentschev for the reply about me disabling os.sched_setaffinity. I probably need some time to report that output. Regarding the "print_affinity.py":
Hi @pentschev, here are the reports:
nvidia-smi topo -m output
print_affinity.py output
@AquifersBSIM can you clarify what you mean by "I have not enabled the os.sched_setaffinity"? Do you mean that when you ran the above you had the line commented out, as in your previous #1381 (comment)? If so, that doesn't really matter for the experiment above. In any case, that unfortunately didn't clarify whether the failure was in obtaining the CPU affinity or whether something else happened. Would you please run the following modified version of the script on the compute node?
print_affinity2.py
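The body of print_affinity2.py is collapsed in the original thread and not shown here; the following is only a guess at such a script (assuming it prints the raw NVML CPU-affinity bitmask per GPU, which matches the list-of-integers output posted further down), with a decode into CPU indices added for readability:

import math
import os

import pynvml

# Sketch only: the real print_affinity2.py is not shown in this thread.
pynvml.nvmlInit()
n_words = math.ceil(os.cpu_count() / 64)  # one 64-bit bitmask word per 64 CPUs
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    words = list(pynvml.nvmlDeviceGetCpuAffinity(handle, n_words))
    # Decode the bitmask into CPU indices, e.g. [18374686479671689215, 255]
    # corresponds to CPUs 0-15 and 56-71.
    cpus = [w * 64 + b for w, word in enumerate(words) for b in range(64) if (word >> b) & 1]
    print(words, cpus)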
Furthermore, the output of
Hello @pentschev, regarding os.sched_setaffinity, I had the line commented out. Regarding the question "do you know if you're getting just a partition of the node or if you should have the full node with exclusive access for your allocation?": I am sure I am just getting a partition of the node. Information from
Information from
So if you're getting only a partition of the node, does that mean you don't have access to all the CPU cores as well? That could be the reason why properly determining the CPU affinity fails; TBH I have no experience with that sort of partitioning and don't know whether it is indeed supported by NVML either. If you know details, can you provide more information about the CPU status, e.g., how many physical CPUs (i.e., sockets) there are and how many cores you actually see with
Hi @pentschev, FWIW, here is the information that I have gotten from my admin. This is regarding why the CPU affinity fails: "most likely dask doesn't understand cgroups, which are used extensively in HPC, so it's trying to bind processes to the wrong cores, so affinity fails. affinity is VERY difficult to do correctly with modern NUMA and chiplets and cgroups and PCIe irq affinities and everything else." I believe this would be an explanation of the topology of the system/cluster: "affinity tries to lock a task to a core (or set of cores) and not let the kernel move it around. the idea is to keep a task right next to specific hardware, like a gpu or ram, so that it runs marginally faster. slurmd+cgroups give the job a fixed set of e.g. 8 cores - whatever your job requests."
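A generic way (not from the thread) to see that cgroup limit from inside a job is to compare the CPUs the OS reports with the CPUs the process is actually allowed to use:

import os

# Total CPUs the operating system reports for the node.
print("os.cpu_count():", os.cpu_count())

# CPUs this process is actually allowed to run on; under Slurm + cgroups this
# is typically only the cores granted to the job allocation.
print("sched_getaffinity:", sorted(os.sched_getaffinity(0)))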
Thanks for the details @AquifersBSIM, this is indeed helpful. You are partly right: Dask does not know anything about cgroups, nor should it (I think); all the handling is done via NVML. I inquired with the NVML team and it is not clear yet, but it could be a bug. I've been asked to get more details from you so we can confirm this. Could you help answer the following questions?
Hi @pentschev, these are my answers to the questions:
Thanks @AquifersBSIM for the information. We have tried to reproduce this on our end with cgroups but have been unsuccessful. To investigate this further we need to reproduce the issue on our end; could you please also confirm the following?
Hello @pentschev, thanks for the question and your help. I think I fixed the issue by requesting the whole node. Have a look at the following output: Allow me to send a new, easier script to run:

import os
import dask.array as da
from dask.distributed import Client
import time
from contextlib import contextmanager
from distributed.scheduler import logger
import socket
from dask_cuda import LocalCUDACluster


@contextmanager
def timed(txt):
    t0 = time.time()
    yield
    t1 = time.time()
    print("%32s time: %8.5f" % (txt, t1 - t0))


def example_function():
    print(f"start example")
    x = da.random.random((100_000, 100_000, 10), chunks=(10_000, 10_000, 5))
    y = da.random.random((100_000, 100_000, 10), chunks=(10_000, 10_000, 5))
    z = (da.arcsin(x) + da.arccos(y)).sum(axis=(1, 2)).compute()
    print(z)


if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)
    with timed("test"):
        example_function()

And this is my .sh
This is the output and traceback. Correct me if I am wrong, but I think Dask has worked, because the calculation actually started?
Information from
Thank you @AquifersBSIM, I appreciate the additional information, and I agree the affinity looks closer to what we expect; that now means you have 8 CPUs for each GPU, and those probably match how your cluster admin partitioned the CPUs/GPUs. Can you clarify whether the only changes you've made since the initial report are the
In your latest message you reported:
Is that all, or did you have other changes? I would also appreciate it if you could share as much of the information I requested previously in #1381 (comment) as possible; that can be valuable for us in identifying behavioral differences and in providing better instructions for setting up partitioning to match proper affinity, which seems like something our documentation is currently lacking.
@pentschev I have stumbled on the same issue. If this helps, here is a minimal example on an 8x A100 80GB PCIe system with 2 sockets and a total of 112 CPUs: a request for a specific CPU count will result in missing info in the CPU Affinity list as reported by

$ srun --gpus=A100_80GB:8 -c32 --mem 128GB --pty nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NODE NODE NODE NODE NODE NODE NODE 0-15,56-71 0 N/A
GPU1 NV12 X NODE NODE NODE NODE NODE NODE NODE 0-15,56-71 0 N/A
GPU2 NODE NODE X NV12 NODE NODE NODE NODE NODE 0-15,56-71 0 N/A
GPU3 NODE NODE NV12 X NODE NODE NODE NODE NODE 0-15,56-71 0 N/A
GPU4 NODE NODE NODE NODE X NV12 NODE NODE NODE 1 N/A
GPU5 NODE NODE NODE NODE NV12 X NODE NODE NODE 1 N/A
GPU6 NODE NODE NODE NODE NODE NODE X NV12 PHB 1 N/A
GPU7 NODE NODE NODE NODE NODE NODE NV12 X NODE 1 N/A
NIC0 NODE NODE NODE NODE NODE NODE PHB NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
Output of print_affinity2.py is:
[18374686479671689215, 255]
[18374686479671689215, 255]
[18374686479671689215, 255]
[18374686479671689215, 255]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
In exclusive mode the CPU Affinity will have the entire list:
$ srun -p dev -w ana --gpus=A100_80GB:8 --exclusive --mem 128GB --pty nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NODE NODE SYS SYS SYS SYS SYS 0-27,56-83 0 N/A
GPU1 NV12 X NODE NODE SYS SYS SYS SYS SYS 0-27,56-83 0 N/A
GPU2 NODE NODE X NV12 SYS SYS SYS SYS SYS 0-27,56-83 0 N/A
GPU3 NODE NODE NV12 X SYS SYS SYS SYS SYS 0-27,56-83 0 N/A
GPU4 SYS SYS SYS SYS X NV12 NODE NODE NODE 28-55,84-111 1 N/A
GPU5 SYS SYS SYS SYS NV12 X NODE NODE NODE 28-55,84-111 1 N/A
GPU6 SYS SYS SYS SYS NODE NODE X NV12 PHB 28-55,84-111 1 N/A
GPU7 SYS SYS SYS SYS NODE NODE NV12 X NODE 28-55,84-111 1 N/A
NIC0 SYS SYS SYS SYS NODE NODE PHB NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
The output of print_affinity2.py is:
[18374686479940059135, 1048575]
[18374686479940059135, 1048575]
[18374686479940059135, 1048575]
[18374686479940059135, 1048575]
[72057593769492480, 281474975662080]
[72057593769492480, 281474975662080]
[72057593769492480, 281474975662080]
[72057593769492480, 281474975662080]
There are obviously cases when one may wish to run Dask in non-exclusive mode, e.g. not use all of the GPUs on a node, or not hog all CPUs if not strictly necessary, and accept the degraded performance. One possible solution for such cases could be to wrap the call at dask_cuda/plugins.py line 15 (commit d70f28d) in a try/except:
try:
    os.sched_setaffinity(0, self.cores)
except:
    pass

and perhaps log a message in case affinity cannot be set (e.g. notify the user that this is suboptimal and they should run the job in exclusive mode).
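For illustration, a rough sketch of what that could look like; this is a simplified stand-in, not the actual dask_cuda.plugins.CPUAffinity source:

import logging
import os

logger = logging.getLogger(__name__)


class CPUAffinity:  # simplified stand-in, not the real dask-cuda plugin class
    def __init__(self, cores):
        self.cores = cores

    def setup(self, worker=None):
        try:
            # Pin the worker process to the cores NVML associates with its GPU.
            os.sched_setaffinity(0, self.cores)
        except OSError:
            # The allocation (e.g. a Slurm/cgroup partition) may not include
            # these cores; keep running, but warn about degraded performance.
            logger.warning(
                "Setting CPU affinity %s failed; continuing without it. "
                "Consider requesting the node in exclusive mode.",
                self.cores,
            )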
Hi, I've also run into this issue just recently. Just in case more example cases are needed, I'll post the different trials I've run and what I got to work:
Node information
Test 1: 1 task, 64 cores per task (Affinity Error)
This setup resulted in an affinity error since all of the cores allocated to the job were only part of socket 0, so GPU1 didn't have any cores available.
Test 2: 2 tasks, 32 cores per task (Affinity Error)
Since all of the cores in the prior test were only from socket 0, I thought creating two tasks (one for interacting with each GPU) could fix the issue by distributing the cores across both sockets. This was not the case, as all cores were still from socket 0.
Test 3: 64 tasks, 1 core per task (No error)
Requesting 64 tasks fixed the issue by evenly distributing the tasks across the two sockets. I'm not sure why the previous test didn't assign each task to cores from both sockets while the 64-task test evenly split them. But it looks like if you specify
Test 4: 2 tasks, 32 cores per task, 1 task per socket (No error)
Imo, this is probably the Slurm resource specification that most closely matches the job description: 2 total tasks, each managing cores assigned evenly to each GPU. I'm not sure if there is a real-world difference between this and the simpler
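As a general sanity check before starting LocalCUDACluster under any of these Slurm layouts (not from the thread, and assuming dask_cuda.utils.get_cpu_affinity returns the CPU indices NVML associates with a GPU, as in the print_affinity.py snippet above), one can verify that each GPU's reported affinity overlaps the cores the job was actually granted:

import os

import pynvml
from dask_cuda.utils import get_cpu_affinity

pynvml.nvmlInit()
granted = set(os.sched_getaffinity(0))  # cores Slurm/cgroups gave this job
for i in range(pynvml.nvmlDeviceGetCount()):
    reported = set(get_cpu_affinity(i))  # cores NVML associates with GPU i
    if reported & granted:
        print(f"GPU {i}: usable cores {sorted(reported & granted)}")
    else:
        print(f"GPU {i}: no overlap between NVML affinity and this allocation")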
@mdefende Be mindful that [...]. Can you check what your [...]?
Thanks everyone for the comments here and the thorough help with debugging, we really appreciate it! Since this isn't something Dask-CUDA can really fix on its own, because the limitation comes from the resources being allocated by Slurm, the best approach, as noted previously by @itzsimpl, is to allow Dask-CUDA to continue and print a warning to the user at the same time. This is now being addressed in #1420, along with some documentation to let other users troubleshoot more easily. Please feel free to try it out and provide feedback; hopefully it will be merged in the next couple of days, before the holidays.
Describe the issue:
I am running into an issue with deploying Dask using LocalCUDACluster() on an HPC system. I am trying to do RandomForest, and the amount of data I am inputting exceeds the limit of a single GPU. Hence, I am trying to utilize several GPUs to split the datasets. To start with, I ran the following, which is just an example script (from the Dask GitHub front page), shown in the code below:
Minimal Complete Verifiable Example:
In addition to that, I have this submission script
Error Message
Anything else we need to know?:
The traceback was pretty long; I gave only a snippet of it.
Environment: