Describe the issue:
I am running into an issue deploying Dask with LocalCUDACluster() on an HPC system. I am trying to train a RandomForest model, and the amount of input data exceeds the limit of a single GPU, so I am trying to use several GPUs to split the dataset. To start, I ran the following example script (taken from the Dask GitHub front page):
Minimal Complete Verifiable Example:
import glob

def main():
    # Read CSV file in parallel across workers
    import dask_cudf
    df = dask_cudf.read_csv(glob.glob("*.csv"))

    # Fit a NearestNeighbors model and query it
    from cuml.dask.neighbors import NearestNeighbors
    nn = NearestNeighbors(n_neighbors=10, client=client)
    nn.fit(df)
    neighbors = nn.kneighbors(df)

if __name__ == "__main__":
    # Initialize UCX for high-speed transport of CUDA arrays
    from dask_cuda import LocalCUDACluster

    # Create a Dask single-node CUDA cluster w/ one worker per device
    cluster = LocalCUDACluster()
    from dask.distributed import Client
    client = Client(cluster)
    main()
In addition to that, I have this submission script:
Error Message:
Task exception was never retrieved
future: <Task finished name='Task-543' coro=<_wrap_awaitable() done, defined at /fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
Traceback (most recent call last):
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
await wait_for(self.start_unsafe(), timeout=timeout)
^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
return await fut
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1474, in start_unsafe
raise plugins_exceptions[0]
^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 837, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1876, in plugin_add
result = plugin.setup(worker=self)
^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/dask_cuda/plugins.py", line 14, in setup
os.sched_setaffinity(0, self.cores)
^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable
return await aw
^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 512, in start
raise self.__startup_exc
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
return await fut
^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 369, in start_unsafe
response = await self.instantiate()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 452, in instantiate
result = await self.process.start()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 759, in start
msg = await self._wait_until_connected(uid)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 901, in _wait_until_connected
raise msg["exception"]
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 965, in run
async with worker:
^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 537, in __aenter__
await self
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 531, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.
Task exception was never retrieved
future: <Task finished name='Task-541' coro=<_wrap_awaitable() done, defined at /fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
Traceback (most recent call last):
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
await wait_for(self.start_unsafe(), timeout=timeout)
^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
return await fut
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1474, in start_unsafe
raise plugins_exceptions[0]
^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 837, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1876, in plugin_add
result = plugin.setup(worker=self)
^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/dask_cuda/plugins.py", line 14, in setup
os.sched_setaffinity(0, self.cores)
^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable
return await aw
^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 512, in start
raise self.__startup_exc
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
return await fut
^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 369, in start_unsafe
response = await self.instantiate()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 452, in instantiate
result = await self.process.start()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 759, in start
msg = await self._wait_until_connected(uid)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 901, in _wait_until_connected
raise msg["exception"]
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 965, in run
async with worker:
^^^^^^^^^^^^^^^^^
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 537, in __aenter__
await self
File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 531, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.
Anything else we need to know?:
The traceback was pretty long, so I have included only a snippet of it.
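The innermost error in the traceback is raised by `os.sched_setaffinity(0, self.cores)` in `dask_cuda/plugins.py`. On Linux, that call fails with `[Errno 22] Invalid argument` when the requested CPU set contains no CPU the process is permitted to run on. One possible cause here, which is only an assumption and not confirmed in this thread, is that the Slurm allocation restricts the job to CPUs that do not overlap with the ones chosen for each GPU worker. A minimal diagnostic sketch, to be run inside the same job, could look like the following. The helper `usable_cpus` and the `requested` CPU list are hypothetical and only illustrate the check; they are not part of dask-cuda.

```python
import os


def usable_cpus(requested, allowed=None):
    """Return the CPUs in `requested` that this process may actually run on."""
    if allowed is None:
        # Linux-only call: the CPU set the job is currently permitted to use.
        allowed = os.sched_getaffinity(0)
    return set(requested) & set(allowed)


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    # Hypothetical CPU list for illustration; dask-cuda computes its own per GPU.
    requested = {0, 1, 2, 3}
    allowed = os.sched_getaffinity(0)
    overlap = usable_cpus(requested, allowed)
    print("CPUs allowed for this job:", sorted(allowed))
    print("CPUs requested:", sorted(requested))
    print("Overlap:", sorted(overlap))
    if not overlap:
        # An empty overlap is the situation in which sched_setaffinity raises EINVAL.
        print("os.sched_setaffinity(0, requested) would fail with Errno 22")
```

If the overlap is empty for the CPU sets associated with the GPUs, the number of CPUs requested in the submission script is the first thing to look at.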
Environment:
Dask version: 2024.7.1
dask-jobqueue: 0.9.0
Python version: 3.11.9
Operating System: Linux (Slurm HPC)
Install method (conda, pip, source): conda
Given that this is related to LocalCUDACluster I would recommend opening this issue on https://github.com/rapidsai/dask-cuda instead. Unfortunately GitHub doesn't allow me to transfer this between orgs.