DASK Deployment using SLURM with GPUs #1381

Closed
AquifersBSIM opened this issue Sep 6, 2024 · 18 comments · Fixed by #1420

@AquifersBSIM

Describe the issue:
I am running into an issue deploying Dask with LocalCUDACluster() on an HPC system. I am trying to run a random forest, and the amount of data I am inputting exceeds the memory of a single GPU, so I want to use several GPUs and split the dataset across them. To start with, I ran the following example script (taken from the Dask GitHub front page):

Minimal Complete Verifiable Example:

import glob

def main():

    # Read CSV file in parallel across workers
    import dask_cudf
    df = dask_cudf.read_csv(glob.glob("*.csv"))

    # Fit a NearestNeighbors model and query it
    from cuml.dask.neighbors import NearestNeighbors
    nn = NearestNeighbors(n_neighbors = 10, client=client)
    nn.fit(df)
    neighbors = nn.kneighbors(df)

if __name__ == "__main__":

    # Initialize UCX for high-speed transport of CUDA arrays
    from dask_cuda import LocalCUDACluster

    # Create a Dask single-node CUDA cluster w/ one worker per device
    cluster = LocalCUDACluster()
    
    from dask.distributed import Client
    client = Client(cluster)

    main()
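
For reference, the multi-GPU random forest the issue describes would follow the same pattern with cuML's Dask API. The snippet below is only a sketch: the CSV paths, the feature columns, and the "label" column name are placeholders, not part of the original report.

import glob

import dask_cudf
from cuml.dask.ensemble import RandomForestClassifier
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    # One Dask worker per visible GPU
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Each partition of the dask_cudf DataFrame lives on one GPU worker,
    # so the full dataset no longer has to fit on a single device.
    df = dask_cudf.read_csv(glob.glob("*.csv"))
    X = df.drop(columns=["label"])   # placeholder feature columns
    y = df["label"]                  # placeholder label column

    rf = RandomForestClassifier(n_estimators=100, client=client)
    rf.fit(X, y)
    predictions = rf.predict(X)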

In addition to that, here is my submission script:

#!/bin/bash
#
#SBATCH --job-name=dask_examples
#SBATCH --output=dask_examples.txt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=5G
#SBATCH --gres=gpu:4
ml conda
conda activate /fred/oz241/BSIM/conda_SVM/SVM
/usr/bin/time -v python 1.py

Error Message

Task exception was never retrieved
future: <Task finished name='Task-543' coro=<_wrap_awaitable() done, defined at /fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1474, in start_unsafe
    raise plugins_exceptions[0]
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 837, in wrapper
    return await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1876, in plugin_add
    result = plugin.setup(worker=self)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
    ^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable
    return await aw
           ^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 512, in start
    raise self.__startup_exc
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
           ^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 759, in start
    msg = await self._wait_until_connected(uid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 901, in _wait_until_connected
    raise msg["exception"]
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 965, in run
    async with worker:
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 537, in __aenter__
    await self
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 531, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
    ^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.
Task exception was never retrieved
future: <Task finished name='Task-541' coro=<_wrap_awaitable() done, defined at /fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1474, in start_unsafe
    raise plugins_exceptions[0]
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 837, in wrapper
    return await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1876, in plugin_add
    result = plugin.setup(worker=self)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
    ^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable
    return await aw
           ^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 512, in start
    raise self.__startup_exc
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
           ^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 759, in start
    msg = await self._wait_until_connected(uid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 901, in _wait_until_connected
    raise msg["exception"]
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 965, in run
    async with worker:
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 537, in __aenter__
    await self
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 531, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
    ^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.

Anything else we need to know?:
The full traceback is quite long; I have included only a snippet of it.

Environment:

  • Dask version: 2024.7.1
  • dask-jobqueue: 0.9.0
  • Python version: 3.11.9
  • Operating System: Linux (Slurm HPC)
  • Install method (conda, pip, source): conda
@pentschev
Member

Could you please report the output of nvidia-smi topo -m and the output of the script below? Please make sure both are run on a Slurm node where you experienced the original failure reported above.

print_affinity.py
import pynvml
from dask_cuda.utils import get_cpu_affinity


pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    cpu_affinity = get_cpu_affinity(i)
    print(type(cpu_affinity), cpu_affinity)

@AquifersBSIM
Author

Hi @pentschev, I forgot to mention that I have disabled the os.sched_setaffinity(0, self.cores) call, as shown below:

class CPUAffinity(WorkerPlugin):
    def __init__(self, cores):
        self.cores = cores

    def setup(self, worker=None):
        pass
        #os.sched_setaffinity(0, self.cores)

@pentschev
Member

Keep in mind doing that will likely result in degraded performance. Here's a previous comment I wrote about this on a similar issue.
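
For background on why the call fails in the first place: os.sched_setaffinity() pins the calling process to a set of cores, and the kernel rejects a set that is empty or has no overlap with the job's cgroup cpuset with EINVAL, which is the [Errno 22] Invalid argument in the traceback above. A minimal illustration (not dask-cuda code):

import os

# Cores the Slurm cgroup actually grants this process.
print("allowed cores:", sorted(os.sched_getaffinity(0)))

try:
    # An empty core set is rejected by the kernel with EINVAL, i.e. the
    # same OSError: [Errno 22] Invalid argument seen in the traceback.
    os.sched_setaffinity(0, set())
except OSError as exc:
    print("affinity call rejected:", exc)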

@AquifersBSIM
Author

AquifersBSIM commented Sep 11, 2024

Thank you @pentschev for the reply about my disabling os.sched_setaffinity. I will probably need some time to report the output of nvidia-smi topo -m.

Regarding the "print_affinity.py":
Do i have to enable back the os.sched_setaffinity for it work?

@AquifersBSIM
Author

AquifersBSIM commented Sep 12, 2024

Hi @pentschev, here are the outputs of nvidia-smi topo -m and print_affinity.py. For your information, I have not re-enabled os.sched_setaffinity.

nvidia-smi topo -m output

        GPU0    GPU1    GPU2    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     NV4     SYS             3               N/A
GPU1    NV4      X      NV4     SYS             1               N/A
GPU2    NV4     NV4      X      NODE            5               N/A
NIC0    SYS     SYS     NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

print_affinity.py output

<class 'list'> []
<class 'list'> []
<class 'list'> []

@pentschev
Member

@AquifersBSIM can you clarify what you mean by "I have not enabled the os.sched_setaffinity"? Do you mean that when you ran the above you had the line commented out as in your previous #1381 (comment)? If so, that doesn't really matter for the experiment above.

In any case, that unfortunately didn't really clarify whether the failure was in obtaining the CPU affinity or whether something else happened. Would you please run the following modified version of the script on the compute node?

print_affinity2.py
import math
from multiprocessing import cpu_count

import pynvml


pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    cpu_affinity = pynvml.nvmlDeviceGetCpuAffinity(handle, math.ceil(cpu_count() / 64))
    print(list(cpu_affinity))

Furthermore, the output of nvidia-smi topo -m looks very unusual on that system. Do you know whether you're getting just a partition of the node, or whether you should have the full node with exclusive access for your allocation? Could you also post the output of cat /proc/cpuinfo from that node?

@AquifersBSIM
Author

Hello @pentschev, regarding the "os.sched_setaffinity", I had the line commented out.

Regarding the question about whether I'm getting just a partition of the node or the full node with exclusive access: I am sure I am just getting a partition of the node.

Information from print_affinity2.py

[0]
[32768]
[0]

Information from cat /proc/cpuinfo
The output is very lengthy, so if it's alright, here is a snippet of it:

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7543 32-Core Processor
stepping        : 1
microcode       : 0xa0011d5
cpu MHz         : 3662.940
cache size      : 512 KB
physical id     : 0
siblings        : 32
core id         : 1
cpu cores       : 32
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca debug_swap
bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips        : 5589.37
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

@pentschev
Member

pentschev commented Oct 3, 2024

So if you're getting only a partition of the node, does that mean you don't have access to all the CPU cores as well? That could be the reason why properly determining the CPU affinity fails. TBH I have no experience with that sort of partitioning, and I don't know whether it is supported by NVML either. If you know the details, can you provide more information about the CPU setup, e.g., how many physical CPUs (i.e., sockets) there are, how many cores you actually see in /proc/cpuinfo, and anything else that would help us understand the topology of the system/cluster?

@AquifersBSIM
Author

AquifersBSIM commented Nov 1, 2024

Hi @pentschev, FWIW, here is the information I got from my admin.

This is regarding why the CPU affinity fails:

Most likely Dask doesn't understand cgroups, which are used extensively in HPC, so it's trying to bind processes to the wrong cores, and affinity fails.

Affinity is VERY difficult to do correctly with modern NUMA, chiplets, cgroups, PCIe IRQ affinities, and everything else.

I believe this is their explanation of the topology of the system/cluster (see the sketch after the quote for a concrete way to check the mismatch):

Affinity tries to lock a task to a core (or set of cores) and not let the kernel move it around. The idea is to keep a task right next to specific hardware, like a GPU or RAM, so that it runs marginally faster.

slurmd+cgroups give the job a fixed set of, e.g., 8 cores, whatever your job requests. But Dask's logic can't handle not having access to the whole node, so (most likely) it tries to lock/bind a task to a core that is not accessible from the cgroup, or it gets the core numbering totally wrong.
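
To make that mismatch concrete, here is a minimal check that can be run inside the allocation (a sketch only, assuming pynvml is installed as in the scripts above). Any GPU whose NVML-reported affinity does not intersect the cgroup's cpuset is one for which the worker cannot be pinned:

import math
import os
from multiprocessing import cpu_count

import pynvml

pynvml.nvmlInit()
cgroup_cores = os.sched_getaffinity(0)   # cores the Slurm cgroup grants
n_words = math.ceil(cpu_count() / 64)    # number of 64-bit mask words NVML returns

for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    words = pynvml.nvmlDeviceGetCpuAffinity(handle, n_words)
    nvml_cores = {
        idx * 64 + bit
        for idx, word in enumerate(words)
        for bit in range(64)
        if word & (1 << bit)
    }
    print(f"GPU {i}: usable cores = {sorted(cgroup_cores & nvml_cores)}")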

@pentschev
Member

Thanks for the details @AquifersBSIM , this is indeed helpful.

You are partly right: Dask does not know anything about cgroups, nor should it (I think); all the handling is done via NVML. I inquired with the NVML team and it is not yet clear, but it could be a bug. I've been asked to get more details from you so we can confirm this. Could you help answer the following questions? (A quick self-check for the cgroup version is sketched after the list.)

  1. What are the cgroup settings being specified to either the container/node partition for your allocation?
  2. Are you using cgroup v1 or v2?
  3. Are allocations running Docker containers or just a partition of the host OS limited by cgroup?
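
For question 2, a quick way to check from inside the allocation (a sketch; the path test assumes the standard /sys/fs/cgroup mount point):

import os

# cgroup v2 exposes a unified hierarchy with a cgroup.controllers file at its root.
version = "v2" if os.path.exists("/sys/fs/cgroup/cgroup.controllers") else "v1"
print("cgroup version:", version)

# Whichever version is in use, the cpuset actually imposed on this process
# is reflected in its scheduling affinity.
print("cores allowed by the allocation:", sorted(os.sched_getaffinity(0)))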

@AquifersBSIM
Author

Hi @pentschev, these are my answers to the questions:

  1. cpuset, memory, as set by slurm to match the requested cpu,mem resources of the user job
  2. v1
  3. I don't think I am using Docker containers

@pentschev
Member

Thanks @AquifersBSIM for the information. We have tried to reproduce this on our end with cgroups, but we have been unsuccessful. To investigate further we need to reproduce the issue locally, so could you please also confirm the following?

  1. Versions you're running of:
    1. OS
    2. NVIDIA driver
    3. NVML
    4. hwloc
  2. Exact cgroup options/cgroup->slurm translations that are specified to the allocation

@AquifersBSIM
Author

Hello @pentschev, thanks for the questions and your help. I think I fixed the issue by requesting the whole node; have a look at the output below.

First, here is a new, simpler script to run:

import os
import dask.array as da
from dask.distributed import Client
import time
from contextlib import contextmanager
from distributed.scheduler import logger
import socket
from dask_cuda import LocalCUDACluster

@contextmanager
def timed(txt):
    t0 = time.time()
    yield
    t1 = time.time()
    print("%32s time:  %8.5f" % (txt, t1 - t0))

def example_function():
    print(f"start example")
    x = da.random.random((100_000, 100_000, 10), chunks=(10_000, 10_000, 5))
    y = da.random.random((100_000, 100_000, 10), chunks=(10_000, 10_000, 5))
    z = (da.arcsin(x) + da.arccos(y)).sum(axis=(1, 2)).compute()
    print(z)


if __name__ == "__main__":
    
    cluster = LocalCUDACluster()
    client = Client(cluster)

    with timed("test"):
        example_function()

And this is my .sh

#!/bin/bash
#
#SBATCH --job-name=dask_examples
#SBATCH --output=dask_examples.txt
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=8
#SBATCH --time=1:00:00
#SBATCH --mem=64g
#SBATCH --gres=gpu:2
ml conda
conda activate /fred/oz241/BSIM/conda_SVM/SVM
echo 'THE MAIN SCRIPT'
python 2.py
echo 'PRINT AFFINITY SCRIPT'
python print_affinity2.py
cd $JOBFS
echo 'GETTING THE INFORMATION NOW'
echo 'INFORMATION 1'
nvidia-smi topo -m
echo 'INFORMATION 2'
cat /proc/cpuinfo

This is the output and traceback. Correct me if I am wrong, but I think Dask has worked, because the calculation actually ran to completion:

THE MAIN SCRIPT
start example
[1571490.92765732 1570013.54888877 1570977.20412696 ... 1570347.65845145
 1571052.66370303 1570350.35175422]
                            test time:  1816.48540
2024-11-06 23:12:29,206 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1250, in heartbeat
    response = await retry_operation(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils_comm.py", line 459, in retry_operation
    return await retry(
           ^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils_comm.py", line 438, in retry
    return await coro()
           ^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 1254, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 1013, in send_recv
    response = await comm.read(deserializers=deserializers)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/comm/tcp.py", line 236, in read
    convert_stream_closed_error(self, e)
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:52952 remote=tcp://127.0.0.1:46309>: Stream is closed

Information from print_affinity2.py

PRINT AFFINITY SCRIPT
[4278190080]
[65280]
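
For anyone reading along, these values are bitmask words with one bit per core. A small helper (hypothetical, not part of dask-cuda) makes them easier to read; each mask above decodes to a contiguous block of eight cores, one block per GPU:

def mask_to_cores(mask_words):
    # Expand NVML's 64-bit affinity mask words into explicit core indices.
    return [
        word_idx * 64 + bit
        for word_idx, word in enumerate(mask_words)
        for bit in range(64)
        if word & (1 << bit)
    ]

print(mask_to_cores([4278190080]))  # [24, 25, 26, 27, 28, 29, 30, 31]
print(mask_to_cores([65280]))       # [8, 9, 10, 11, 12, 13, 14, 15]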

@pentschev
Member

Thank you @AquifersBSIM, I appreciate the additional information. I agree the affinity now looks closer to what we expect: it means you have 8 CPUs for each GPU, which probably matches how your cluster admin partitioned the CPUs/GPUs. Can you clarify whether the only changes you made since the initial report are the SBATCH entries? Originally you reported:

#SBATCH --job-name=dask_examples
#SBATCH --output=dask_examples.txt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=5G
#SBATCH --gres=gpu:4

On your latest message you reported:

#SBATCH --job-name=dask_examples
#SBATCH --output=dask_examples.txt
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=8
#SBATCH --time=1:00:00
#SBATCH --mem=64g
#SBATCH --gres=gpu:2

Is that all or did you have other changes?

I would also appreciate it if you could share as much as possible of the information I requested previously in #1381 (comment). That can be valuable for us in identifying the behavioral difference, and also for providing better instructions on setting up partitioning that matches proper affinity, which seems to be something our documentation currently lacks.

@itzsimpl

@pentschev I have stumbled on the same issue. If it helps, here is a minimal example on an 8x A100 80GB PCIe system with 2 sockets and 112 CPUs in total: a request for a specific CPU count results in missing entries in the CPU Affinity column reported by nvidia-smi:

$ srun --gpus=A100_80GB:8 -c32 --mem 128GB --pty nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NODE    NODE    NODE    NODE    NODE    NODE    NODE    0-15,56-71      0               N/A
GPU1    NV12     X      NODE    NODE    NODE    NODE    NODE    NODE    NODE    0-15,56-71      0               N/A
GPU2    NODE    NODE     X      NV12    NODE    NODE    NODE    NODE    NODE    0-15,56-71      0               N/A
GPU3    NODE    NODE    NV12     X      NODE    NODE    NODE    NODE    NODE    0-15,56-71      0               N/A
GPU4    NODE    NODE    NODE    NODE     X      NV12    NODE    NODE    NODE            1               N/A
GPU5    NODE    NODE    NODE    NODE    NV12     X      NODE    NODE    NODE            1               N/A
GPU6    NODE    NODE    NODE    NODE    NODE    NODE     X      NV12    PHB             1               N/A
GPU7    NODE    NODE    NODE    NODE    NODE    NODE    NV12     X      NODE            1               N/A
NIC0    NODE    NODE    NODE    NODE    NODE    NODE    PHB     NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0

Output of print_affinity2.py is

[18374686479671689215, 255]
[18374686479671689215, 255]
[18374686479671689215, 255]
[18374686479671689215, 255]
[0, 0]
[0, 0]
[0, 0]
[0, 0]

In exclusive mode, the CPU Affinity column contains the full list:

$ srun -p dev -w ana --gpus=A100_80GB:8 --exclusive --mem 128GB --pty nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NODE    NODE    SYS     SYS     SYS     SYS     SYS     0-27,56-83      0               N/A
GPU1    NV12     X      NODE    NODE    SYS     SYS     SYS     SYS     SYS     0-27,56-83      0               N/A
GPU2    NODE    NODE     X      NV12    SYS     SYS     SYS     SYS     SYS     0-27,56-83      0               N/A
GPU3    NODE    NODE    NV12     X      SYS     SYS     SYS     SYS     SYS     0-27,56-83      0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NV12    NODE    NODE    NODE    28-55,84-111    1               N/A
GPU5    SYS     SYS     SYS     SYS     NV12     X      NODE    NODE    NODE    28-55,84-111    1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NV12    PHB     28-55,84-111    1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NV12     X      NODE    28-55,84-111    1               N/A
NIC0    SYS     SYS     SYS     SYS     NODE    NODE    PHB     NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0

The output of print_affinity2.py is

[18374686479940059135, 1048575]
[18374686479940059135, 1048575]
[18374686479940059135, 1048575]
[18374686479940059135, 1048575]
[72057593769492480, 281474975662080]
[72057593769492480, 281474975662080]
[72057593769492480, 281474975662080]
[72057593769492480, 281474975662080]

There are obviously cases in which one may wish to run Dask in non-exclusive mode, e.g. to not use all of the GPUs on a node, or to not hog all CPUs when that is not strictly necessary, and accept the degraded performance. One possible solution for such cases could be to wrap the line

os.sched_setaffinity(0, self.cores)
as:

try:
  os.sched_setaffinity(0, self.cores)
except:
  pass

and perhaps log a message in case affinity cannot be set (e.g. notify the user that this is suboptimal and they should run the job in exclusive mode).
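
A fuller version of that idea might look like the sketch below (illustrative only, not the exact change that ended up in #1420); it keeps the worker alive and logs a warning when pinning is impossible:

import logging
import os

from distributed.diagnostics.plugin import WorkerPlugin

logger = logging.getLogger(__name__)


class CPUAffinity(WorkerPlugin):
    # Illustrative variant that degrades gracefully when pinning fails.

    def __init__(self, cores):
        self.cores = cores

    def setup(self, worker=None):
        try:
            os.sched_setaffinity(0, self.cores)
        except OSError:
            logger.warning(
                "Setting CPU affinity to %s failed (likely a restricted "
                "cgroup cpuset); continuing without pinning. Performance may "
                "be degraded; consider an exclusive-node allocation.",
                self.cores,
            )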

@mdefende

Hi, I've also run into this issue recently. In case more example cases are needed, I'll post the different trials I ran and what I got to work:

Node information

OS: RHEL 7
Slurm version: 18.08.9
Hardware:
  - 2x NVIDIA 80GB A100s
  - 2 sockets (128 total cores)
  - GPU affinity:
    - GPU 0: cores 0-63
    - GPU 1: cores 64-127

Test 1: 1 task, 64 cores per task (Affinity Error)

[mdefende@login004 ~]$ srun --time=12:00:00 --mem=360G --ntasks=1 --cpus-per-task=64 --partition=amperenodes --gres=gpu:2 --pty nvidia-smi topo -m                                                                                
        GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     NODE    SYS     0-63    0               N/A
GPU1    SYS      X      SYS     NODE            1               N/A
NIC0    NODE    SYS      X      SYS
NIC1    SYS     NODE    SYS      X 

This setup resulted in an affinity error, since all of the cores allocated to the job were part of socket 0, so GPU1 didn't have any cores available.

Test 2: 2 tasks, 32 cores per task (Affinity Error)

[mdefende@login004 ~]$ srun --time=12:00:00 --mem=360G --ntasks=2 --cpus-per-task=32 --partition=amperenodes --gres=gpu:2 --pty nvidia-smi topo -m                                                                                
        GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     NODE    SYS     0-63    0               N/A
GPU1    SYS      X      SYS     NODE            1               N/A
NIC0    NODE    SYS      X      SYS
NIC1    SYS     NODE    SYS      X 

Since all of the cores in the prior test were only from socket 0, I thought creating two tasks (one for interacting with each GPU) could fix the issue by distributing the cores across both sockets. This was not the case as all cores were still from socket 0.

Test 3: 64 tasks, 1 core per task (No error)

[mdefende@login004 ~]$ srun --time=12:00:00 --mem=360G --ntasks=64 --partition=amperenodes --gres=gpu:2 --pty nvidia-smi topo -m                                                                                                  
        GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     NODE    SYS     0-31    0               N/A
GPU1    SYS      X      SYS     NODE    64-95   1               N/A
NIC0    NODE    SYS      X      SYS
NIC1    SYS     NODE    SYS      X 

Requesting 64 tasks fixed the issue by evenly distributing the tasks across the two sockets. I'm not sure why the previous test didn't assign each task cores from both sockets while the 64-task test split them evenly. But it looks like if you set ntasks to the number of GPUs and can bind each task to the specific socket assigned to each GPU, it should be fine. Luckily, that's possible by including ntasks-per-socket.

Test 4: 2 tasks, 32 cores per task, 1 task per socket (No error)

[mdefende@login004 ~]$ srun --time=12:00:00 --mem=360G --ntasks=2 --cpus-per-task=32 --ntasks-per-socket=1 --partition=amperenodes --gres=gpu:2 --pty nvidia-smi topo -m                                                          
        GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     NODE    SYS     0-31    0               N/A
GPU1    SYS      X      SYS     NODE    64-95   1               N/A
NIC0    NODE    SYS      X      SYS
NIC1    SYS     NODE    SYS      X 

Imo, this is probably the Slurm resource specification that most closely matches the job description: 2 total tasks, each managing the cores assigned to one GPU. I'm not sure whether there is a real-world difference between this and the simpler ntasks=64 setup in test 3. I'm somewhat naive about how other clusters are set up, but at least for cases where GPUs are assigned cores from specific sockets at the hardware level, you should be able to use the ntasks-per-socket option to solve the affinity issue.

@itzsimpl

@mdefende Be mindful that --ntasks=X will run the supplied command X times; the use of --pty when you submit the job masks this. Remove it and you will see the output of nvidia-smi topo -m repeated X times. This may not always be the desired result, e.g. if you wish to run a command only once (as a single task) but still want to give it X cores and full access to all GPUs (for P2P), because the command itself is capable of taking advantage of multiprocessing.

Can you check what your slurm.conf has under TaskPlugin and what the contents of gres.conf are, i.e. whether you have the GPUs listed by hand or use AutoDetect=nvml?

@pentschev
Member

Thanks everyone for the comments here and thorough help with debugging, we really appreciate it!

Since this isn't something Dask-CUDA can really fix on its own (the limitation comes from the resources allocated by Slurm), the best approach, as noted previously by @itzsimpl, is to allow Dask-CUDA to continue and print a warning to the user. This is now being addressed in #1420, along with some documentation to help other users troubleshoot more easily. Please feel free to try it out and provide feedback; hopefully it will be merged in the next couple of days, before the holidays.

@rapids-bot rapids-bot bot closed this as completed in fd8a736 Dec 20, 2024