One worker does not shut down after code has finished running #58
Comments
I've reproduced this again with a smaller version of the problem that eliminates the unresponsive-loop warnings, so it appears those are not the problem. I've run this with a debug-level logger as well, which is below; the number between the logger name and the level is the process ID. One thing I notice on this run is that when the scheduler shuts down, the worker that fails to shut down tries to reconnect to the scheduler afterwards, unlike the other workers. As a temporary workaround, is there a way to have the worker quit if it disconnects from the scheduler for some amount of time? The log is too big to paste, so it's attached. I've also attached the previous full-size log. The "size" in the log names refers to the size of the problem, not the size of the log; just FYI, they are backwards, since the full-size log is info-level only.
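One possible way to get that behavior, sketched under the assumption that the installed distributed version accepts Worker's `death_timeout` argument and that the worker is started by hand rather than through `dask_mpi.initialize()`; whether dask-mpi can forward such an option to the workers it starts, and whether `death_timeout` also covers the reconnect-after-disconnect case described above, is not confirmed here:

```python
# Sketch only: a worker that shuts itself down if it cannot reach the
# scheduler for 60 seconds. The scheduler address below is a placeholder.
import asyncio

from distributed import Worker


async def run_worker(scheduler_address: str) -> None:
    # death_timeout is in seconds; the worker closes on its own if the
    # scheduler stays unreachable for that long.
    async with Worker(scheduler_address, death_timeout=60) as worker:
        await worker.finished()


if __name__ == "__main__":
    asyncio.run(run_worker("tcp://scheduler-host:8786"))
```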
Hey, @kjgoodrick. Thanks for the issue! I haven't had time to look into this, but I've noticed some worker hanging and other issues in a different set of work I am doing. Maybe the two issues are related, but maybe not. Both problems appear on Linux, so maybe there's something related.
@kjgoodrick: I can't reproduce this issue. Can you post a working minimal example script?
Hi @kmpaul: Thanks for looking into this! I will see what I can do to make this reproducible with some generic code. If I can get it to happen consistently, I'll let you know. Based on the second log I posted (smaller_log.txt), I think it may be difficult, though, as I think this is caused by a race condition: if the scheduler shuts down while the worker is communicating with it, the worker tries to send a heartbeat and then will try to reconnect forever. Based on the log, this is what I think is happening:
1.) A worker is reading the message from the scheduler telling it to close.
2.) While this is happening, the scheduler closes and the worker raises a CommClosedError, which is handled in the
3.)
4.) The connection was broken when the worker was sent the close message, so the status is still running, which gives us our first log message on line 7957.
5.) The worker then runs its heartbeat and for some reason the channel is busy, so it skips it (log line 7961).
Things get a little fuzzy for me here, but maybe a normal heartbeat was running at the same time? For whatever reason, at some point a heartbeat gets a response with a status of missing and then tries to reregister with the scheduler. When this happens, we get an
As a hacky solution I should just be able to set reconnect as
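A sketch of that hacky-solution idea, with similar caveats to the earlier sketch: the worker is constructed directly rather than through dask-mpi, the address is a placeholder, and the `reconnect` keyword is assumed to still be accepted by the installed distributed version (it was deprecated in later releases):

```python
# Sketch only: reconnect=False makes a dropped scheduler connection fatal,
# so the worker closes instead of retrying registration forever.
import asyncio

from distributed import Worker


async def run_worker(scheduler_address: str) -> None:
    async with Worker(scheduler_address, reconnect=False) as worker:
        await worker.finished()


if __name__ == "__main__":
    asyncio.run(run_worker("tcp://scheduler-host:8786"))
```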
@kjgoodrick: If you are correct, then it seems like this error should occur with non-MPI Dask, right? I can't see anything MPI-specific cropping up in the chain of events. Perhaps this should be a Distributed issue? @mrocklin: What do you think? (Or should I ask someone else for advice?)
@kmpaul If I am right, it does seem like the issue does not have to do with the MPI side of things. I have only seen this problem when running in an MPI environment, though. If I run the same code on just one CPU without the MPI initialization, it stops as it should every time. I don't know if this is caused by some part of the MPI setup or if it's just a coincidence, though.
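For reference, the non-MPI comparison described here would look roughly like the following; the workload is a stand-in, since the original script is not shown in the thread:

```python
# Minimal local (non-MPI) variant of the kind of check described above: the
# same style of dask workload run against a LocalCluster on a single CPU.
from distributed import Client


def main() -> None:
    # No dask_mpi.initialize(): Client() starts a LocalCluster in-process.
    with Client(n_workers=1, threads_per_worker=1) as client:
        futures = client.map(lambda x: x ** 2, range(100))
        print(sum(client.gather(futures)))


if __name__ == "__main__":
    main()
```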
Yeah. It's hard to diagnose, for sure. The test without the MPI initialization is suspicious, though.
What happened:
Code is run with mpirun across two nodes, each with 24 cores. Dask launches the scheduler on one core and 46 workers as expected. The code executes as expected and Dask begins to shut down the workers. However, one worker is not shut down and continues trying to reconnect with the scheduler until the job is canceled by the machine's job scheduler.

What you expected to happen:
All workers close and the program ends.
Minimal Complete Verifiable Example:
Anything else we need to know?:
The worker that does not stop is apparently never sent a stop command.
For example, a worker that does stop:
The script is launched with:
mpirun -n 48 python real_script.py
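real_script.py itself is not included in the issue; a generic dask-mpi script of roughly this shape (the workload below is a placeholder, not the author's code) would be:

```python
# real_script.py (illustrative stand-in; the actual workload is not shown in
# the issue). Launched as: mpirun -n 48 python real_script.py
# With dask-mpi, rank 0 becomes the scheduler, rank 1 runs this client code,
# and the remaining ranks become workers (46 workers for 48 ranks).
from dask_mpi import initialize
from distributed import Client


def main() -> None:
    # Turn the MPI ranks into a dask cluster before creating the Client;
    # Client() then picks up the scheduler address set by initialize().
    initialize()
    with Client() as client:
        futures = client.map(lambda x: x ** 2, range(1000))
        print(sum(client.gather(futures)))


if __name__ == "__main__":
    main()
```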
I also sometimes get messages like:
distributed.core - INFO - Event loop was unresponsive in Worker for 37.01s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Not sure if this is causing it, but it's the only thing that looks out of place in the logs. My workers do also write their work to disk. I also tried using the nanny option, but it does not work on the machine I'm using.
Environment:
Full log: