Dask Scheduler host/port Not Written to Skein Key-Value Storage When YARN Application Restarts #135
Comments
Are you using […]?

Not really. There's a way to make a job shut down if it has no pending work (by setting an idle timeout), but not if it has no workers. Having no workers is usually a temporary state (the cluster will scale back up), so this is a bit of an uncommon ask.
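As a rough sketch of the idle-timeout idea mentioned above, assuming it maps to distributed's `distributed.scheduler.idle-timeout` setting (that key name is my assumption, not confirmed in this thread; the environment archive and resource values are placeholders):

```python
import dask
from dask_yarn import YarnCluster

# Assumption: "idle timeout" here refers to distributed's scheduler
# idle-timeout, which shuts the scheduler down after a period with no
# pending work.
dask.config.set({"distributed.scheduler.idle-timeout": "30 minutes"})

# With deploy_mode="local" the scheduler runs in this process and picks up
# the config set above.
cluster = YarnCluster(
    environment="environment.tar.gz",  # hypothetical packed environment
    worker_vcores=2,
    worker_memory="4GiB",
    deploy_mode="local",
)
```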
Same as the above: no, not really. You could monitor this yourself by periodically checking […]. I suggest trying with a remote scheduler (unless that's already what you're doing) and seeing if that fixes things.
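One way to do that periodic checking from the client side, sketched here with `Client.scheduler_info()`; the minimum, timeout, and polling interval are made up for illustration:

```python
import time
from dask.distributed import Client

def wait_for_workers(client: Client, minimum: int = 1,
                     timeout: float = 600, interval: float = 15) -> int:
    """Poll the scheduler until at least `minimum` workers are connected.

    Raises TimeoutError if the worker count stays below `minimum` for
    `timeout` seconds, so the caller can fail fast instead of hanging.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        n_workers = len(client.scheduler_info().get("workers", {}))
        if n_workers >= minimum:
            return n_workers
        time.sleep(interval)
    raise TimeoutError(f"fewer than {minimum} workers after {timeout} seconds")
```

Recent versions of distributed also ship a built-in `Client.wait_for_workers` helper that covers the simple startup case.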
@jcrist we originally started with the remote scheduler but found this doubled our failure footprint: if either the node running the Scheduler or the node running the ApplicationMaster failed, the cluster would fail. That means two nodes (AppMaster and Scheduler) presented failure points for our workflow instead of one (AppMaster). I tried running the Scheduler inside the ApplicationMaster container, but dask-yarn didn't recognize this configuration and started a local Scheduler alongside the Scheduler running in the application. Is there a way to force the ApplicationMaster and Scheduler to run on the same physical node (when running the Scheduler in remote mode)?

Edit to address your question: I'm leaving the Scheduler out of the list of services, so it's created by default in local mode.
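For readers following the remote-vs-local discussion, the two modes look roughly like this in dask-yarn (the environment archive is a placeholder):

```python
from dask_yarn import YarnCluster

# deploy_mode="remote": the scheduler gets its own YARN container, separate
# from the application master (the extra failure point described above).
cluster = YarnCluster(environment="environment.tar.gz", deploy_mode="remote")

# deploy_mode="local": the scheduler runs in the client process instead of a
# YARN container, which is the setup used in this issue.
cluster = YarnCluster(environment="environment.tar.gz", deploy_mode="local")
```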
Hmmm, I could see how that would happen. I think this will require a patch to dask-yarn. The check here should be expanded to also check if the application master has a […].

If the scheduler goes down and restarts, how do you handle that from your client code? Dask (and dask-yarn) isn't really designed for this, so you must be doing something tricky to make it work.
This is essentially the problem I'm trying to mitigate: I'm trying to eliminate as many failure scenarios as possible. It seems to me that I'd prefer the AppMaster and Scheduler to live and die together, so to speak. Otherwise I double my failure footprint by having the Scheduler on a separate node from the AppMaster (either node failing will cause the cluster to fail). The number of different options for restarts and retries is a little complex, so I'm hunting for a setup that will either be robust to restarts or fail immediately in a way my client code can deal with. Right now we're running into situations where the client hangs because the workers and scheduler end up in a strange state.

Yes, possibly. It seems like a straightforward addition, but I need to talk to my teammates to see what they think about our options. BTW, thank you for all your contributions! I appreciate them.
If you run the scheduler on the application master, the cluster will properly handle restarts (the whole application will restart: scheduler, application master, and workers). But your client code won't handle this properly. Dask isn't designed to handle scheduler failures gracefully; usually in this case users resubmit the whole job. You could do this yourself in the client code by running a loop with a try-except block that tries to reconnect on scheduler failure using […] (see the sketch below).

Usually we recommend that users write their jobs in a way that if the scheduler fails, they restart their client script as well. This wouldn't play well with YARN application restarts as currently written, but it would if the scheduler and your client-side script were both running in the application master container.
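A sketch of that client-side loop; `YarnCluster.from_application_id` and the broad exception handling are assumptions about one way to wire it up, not a prescription from this thread:

```python
import time
from dask.distributed import Client
from dask_yarn import YarnCluster

def run_with_retries(app_id, work_fn, attempts=3, backoff=30):
    """Resubmit `work_fn(client)` from scratch if the scheduler connection dies.

    `work_fn` should be safe to re-run: when the scheduler fails, all of its
    futures are lost, so the whole unit of work is resubmitted.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            cluster = YarnCluster.from_application_id(app_id)
            with Client(cluster) as client:
                return work_fn(client)
        except Exception as exc:  # in practice, narrow this to comm/timeout errors
            last_error = exc
            print(f"attempt {attempt + 1} failed ({exc!r}); retrying in {backoff}s")
            time.sleep(backoff)
    raise RuntimeError("all attempts failed") from last_error
```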
Original issue:

We are running a Dask cluster using Dask-Yarn on AWS EMR. To make the system more robust, we run the Scheduler in local mode and allow multiple attempts of the YARN application using skein.ApplicationSpec(..., max_attempts=2). In the event that a node fails, and that node is running the ApplicationMaster, the YARN application will make a second attempt: it will restart itself and all the workers. However, the Scheduler will not see these new workers. It will sit there happily waiting on workers and blocking our application flow, since there's pending work to be done but no workers to do it on.
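The setup described above would look something like the sketch below; the service definition is abbreviated, and the application name, resource numbers, and instance count are placeholders:

```python
import skein
from dask_yarn import YarnCluster

# max_attempts=2 lets YARN retry the whole application once if the node
# running the application master dies. The scheduler service is left out of
# the spec, so dask-yarn starts the scheduler locally in the client process.
spec = skein.ApplicationSpec(
    name="dask-pipeline",
    max_attempts=2,
    services={
        "dask.worker": skein.Service(
            instances=4,
            resources=skein.Resources(memory="4 GiB", vcores=2),
            script="dask-yarn services worker",
        )
    },
)

cluster = YarnCluster.from_specification(spec)
```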
The problem is that when the YARN application retries after the ApplicationMaster has failed, the Scheduler's address and port number are not written to the key-value store, so the new workers don't know where to connect. This line (https://github.com/dask/dask-yarn/blob/master/dask_yarn/core.py#L558) appears not to re-run on the 2nd application attempt.
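For context, publishing an address through skein's key-value store looks roughly like this; a simplified illustration of the mechanism, not the actual dask-yarn code behind that line:

```python
import skein

# Run from inside a container of the running YARN application.
app = skein.ApplicationClient.from_current()

# Scheduler side: publish the address (values in the skein kv store are bytes).
app.kv["dask.scheduler"] = b"tcp://10.0.0.12:8786"  # hypothetical address

# Worker side: block until the address appears, then connect to it.
scheduler_address = app.kv.wait("dask.scheduler").decode()
```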
To verify this, on the 1st YARN application attempt you can navigate to http://appmaster:20888/proxy/attempt_num/kv and see the dask.scheduler and dask.dashboard URLs listed in the KV storage. Then, if you shut down the node running the ApplicationMaster, the application will start a 2nd attempt. Navigate to http://new_appmaster:20888/proxy/second_attempt_num/kv and you'll see that nothing is listed in the KV storage, and no workers in the Dask dashboard.
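An alternative to clicking through the proxy UI is to read the key-value store directly with skein; a sketch, with a placeholder application id:

```python
import skein

# Connect to the running YARN application by id (placeholder id shown here).
client = skein.Client()
app = client.connect("application_1566230000000_0001")

# On a healthy first attempt both keys are set; in the restart scenario above
# they are missing, which is why the new workers never find the scheduler.
print(app.kv.get("dask.scheduler"))
print(app.kv.get("dask.dashboard"))

client.close()
```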
Question(s):

1. Is there a way to make the YARN application shut down if the Scheduler loses all of its workers?
2. Is there a way to use the distributed.as_completed() method with Futures that somehow time out in the event that the Scheduler loses all its workers and they never come back?

Environment: