Launching dask-mpi clusters #44
Moving this to dask/dask-mpi. This isn't currently supported, but it could be. In principle we need to create a class that meets some of the interface in https://github.com/dask/distributed/blob/master/distributed/deploy/cluster.py. If you wanted to get a start on this, I would recommend subclassing from `Cluster`.
I think also that @andersy005 and @kmpaul may have had something like this in the past. I'm not seeing it now, though, so I'm curious if maybe I removed it when we redid things for 2.0. (Although the old version probably wouldn't have met the current-day constraints of the dask.distributed.Cluster interface.)
I feel like some context was lost in the transfer of this issue. @rcthomas, I hate to ask you to repeat yourself, but can you describe what kind of functionality (exactly) you are asking about?
Is this related to #9 in any way?
I think that the ask here is for a Python class that follows the Cluster interface so that it can be integrated into the JupyterLab "new cluster" button, as is depicted here: https://youtu.be/EX_voquHdk0
Ah! Yes. This would be very cool. ...Let's talk about what this would look like, then, because I'm actually wondering if this is more of a fit for Dask-Jobqueue.
Thanks for moving this to the right place. We're looking to enable both direct `srun`-based launching and launching via a batch script.
@rcthomas Is there a difference between the two approaches, other than that one uses a batch script and the other doesn't? In other words, would a user notice a difference if one of these approaches or the other was implemented behind the "new cluster" button?
That is one difference, and it is not clear to me that it's something that needs to be included or whether it can be handled via abstraction. Another difference is vanilla ...
A minimal implementation will probably look like this ...

```python
class MPICluster(dask.distributed.Cluster):
    async def _start(self):
        self.proc = subprocess.Popen("dask-mpi", ...)
        await scheduler_has_started_up()

    async def _close(self):
        self.proc.terminate()
```

And then you'll add a config file like the following:

```yaml
labextension:
  factory:
    module: 'dask_mpi'
    class: 'MPICluster'
    args: []
    kwargs: {}
```

I'm sure that it will be more complex than that, but in theory that should work.
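The sketch above awaits a `scheduler_has_started_up()` helper that isn't defined anywhere in this thread. One way to write it, as a minimal sketch, is to poll the scheduler's TCP address until a connection succeeds; the host, port, and timing parameters below are illustrative assumptions (8786 is dask-scheduler's usual default port, but a dask-mpi launch may be configured differently):

```python
import asyncio


async def scheduler_has_started_up(host="127.0.0.1", port=8786,
                                   timeout=60.0, interval=0.5):
    # Poll until a TCP connection to the scheduler address succeeds.
    # host/port/timeout/interval are illustrative defaults, not part
    # of any real dask-mpi API.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        try:
            _, writer = await asyncio.open_connection(host, port)
            writer.close()
            return True
        except OSError:
            if loop.time() > deadline:
                raise TimeoutError(
                    f"scheduler at {host}:{port} never came up")
            await asyncio.sleep(interval)
```

A real implementation would more likely wait on the scheduler file or connect a `dask.distributed.Client`, but a bare TCP probe shows the shape of the `await` in `_start`.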
I've been studying the code a bit and I have a few questions and comments. To make things a bit more concrete, I'm looking to encapsulate something like the following. Many examples will be less complicated, but we want to make a more user-friendly way to accomplish something like the following real (working!) example:
So @kmpaul, if ... The next thing I wondered was whether @mrocklin you really meant `Cluster` or `SpecCluster`. Taking a step back, though, all I want to do is ...
I meant Cluster. SpecCluster is based around running many individual worker objects, which you won't have. You have one big job, rather than a configuration of many jobs.
@rcthomas Ok. A question for you about what you actually need. The command that you quoted above. Is this what you want to run when you click on the Dask Labextension button?
Or is this what you want to run when you click the Dask Labextension button?
If you want to run the former command, then it falls within the purview of Dask-Jobqueue. However, if you want to run the latter command, then it falls within the purview of Dask-MPI, and all we would need to do is create the `MPICluster` class. What makes the most sense to you?
@kmpaul I am so sorry it's taken me 2 months to come back to this. You've got a point. Backing up just a bit: ... I'll note that in principle the 2 versions of the command (...) ... I'm in the same room as @lesteve at the Dask Developer workshop, and so we'll talk about this tomorrow.
We took a look at trying to adapt dask-jobqueue for salloc, and it won't work. We're able to submit the job, but dask-jobqueue looks for the job ID when the submit command returns. Of course, here it doesn't ever return until the job times out. I'm looking at options, which now look like either the original course I wanted to follow or an internal conversation about queue policy here.
@rcthomas No worries about getting busy with other things! I completely understand.

The fact that `salloc` blocks until the job finishes is the crux of the problem. In standard Slurm Dask-Jobqueue, the submit command (`sbatch`) returns immediately, and the job ID is parsed from its output.

So, instead, I tried to see if you could get the Slurm job ID from `salloc`'s early output:

```python
from subprocess import Popen, PIPE
from shlex import split

cmd = "salloc --account=NIOW0001 --ntasks=1 --time=00:02:00 --partition=dav sleep 10"
print(f'Running command: "{cmd}"')
p = Popen(split(cmd), stdout=PIPE, stderr=PIPE)

print('Looping until first output to stderr (or command completes)...')
while True:
    err = p.stderr.readline().decode().strip()
    if err or p.poll() is not None:
        break

if err:
    print(f'First output to stderr: "{err}"')
else:
    print('No output found on stderr.')

print('Waiting until command finishes...')
out, err = p.communicate()
print('Command finished.')
print()
print('STDOUT:')
print(out.decode())
print()
print('STDERR:')
print(err.decode())
```

On NCAR's Casper system, the output I get from this script looks like this:
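For reference, stock Slurm's `salloc` announces the allocation on stderr with messages like `salloc: Granted job allocation 1234567` (or `salloc: Pending job allocation 1234567` while the job waits in the queue). A small parser for that first stderr line could look like this sketch; the message format assumes unmodified Slurm, and a site-local `salloc` wrapper may print something different:

```python
import re


def parse_salloc_jobid(line):
    # Extract the Slurm job ID from a salloc stderr message such as
    # "salloc: Granted job allocation 1234567". Also accepts the
    # "Pending job allocation" variant printed while the job queues.
    m = re.search(r"(?:Granted|Pending) job allocation (\d+)", line)
    return int(m.group(1)) if m else None
```

This is the piece a Dask-Jobqueue-style class would need in place of parsing `sbatch`'s stdout.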
Would something like this work? You'd have to create a new Dask-Jobqueue class that held on to the running `salloc` process ...
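To make the "hold on to the process" idea concrete, here is a rough sketch, and explicitly *not* Dask-Jobqueue's real `Job`/`Cluster` API: a class that launches the blocking submit command, reads the job ID from its first stderr line, and cancels the allocation on close. The class name and the injectable `cancel_cmd` parameter are my own assumptions (defaulting to `scancel`, which is the real Slurm cancel command):

```python
import re
import subprocess
from shlex import split


class BlockingSallocJob:
    # Sketch only -- not dask-jobqueue's actual class hierarchy. The
    # point is keeping the Popen handle for a blocking `salloc` so the
    # allocation can be cancelled later.

    def __init__(self, submit_cmd, cancel_cmd="scancel"):
        self.cancel_cmd = cancel_cmd
        self.proc = subprocess.Popen(split(submit_cmd),
                                     stdout=subprocess.PIPE,
                                     stderr=subprocess.PIPE)
        # salloc announces the allocation on stderr, not stdout
        first_line = self.proc.stderr.readline().decode()
        m = re.search(r"job allocation (\d+)", first_line)
        self.job_id = m.group(1) if m else None

    def close(self):
        # Cancel the allocation, then reap the still-running process.
        if self.job_id is not None:
            subprocess.run([self.cancel_cmd, self.job_id], check=False)
        self.proc.terminate()
        self.proc.wait()
```

Because `cancel_cmd` is injectable, the lifecycle can be exercised without Slurm by pointing `submit_cmd` at any program that prints a salloc-style line to stderr and then blocks.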
@rcthomas That said, I still think that getting the ...
Hello all! Any advances on this issue?
No progress yet. Is this a priority for you? I can try to accelerate this, but many things have come up over the last few months that have gotten in the way.
We're heading in a different direction from salloc in the long run, FWIW.
@rcthomas Because of the issues above? Or because of another reason?
@kmpaul the queue we're putting these clusters into isn't really the right place for them. We've got another plan that's more in line with vanilla dask-jobqueue. Some kind of MPICluster object would still be something we'd want to work on. @jshleap yes, we're moving away from the salloc-srun setup (see above) and towards just sbatch.
Is this supported? We have the ability (through Slurm) to

```shell
srun ... dask-mpi ...
```

which will launch the cluster on a queue set aside for interactive work. From the command line it's a single command. Other facilities may not have this capability (it may require a batch script). We're willing to take a look into how to make this work if it's needed. cc @shreddd
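For sites without such an interactive queue, the batch-script alternative mentioned above could look roughly like the following Slurm job script. This is a sketch, not a recommended template: the resource numbers and partition name are placeholders, and while `--scheduler-file` is a real dask-mpi option, the exact flags you need depend on your dask-mpi version and site configuration:

```shell
#!/bin/bash
#SBATCH --job-name=dask-mpi
#SBATCH --ntasks=4
#SBATCH --time=01:00:00
#SBATCH --partition=regular   # placeholder partition name

# dask-mpi runs the scheduler on rank 0 and workers on the remaining
# ranks; clients find the scheduler through the shared scheduler file.
srun dask-mpi --scheduler-file="$HOME/scheduler.json"
```

Either way, the single command a user runs (or the script they submit) is what an `MPICluster` class would need to wrap.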