[Question] who runs the scheduler? #1

Open
justheuristic opened this issue Jun 14, 2022 · 2 comments

Comments

@justheuristic

First of all, thanks for the paper!
It was very intriguing to view model parallelism as an optimization problem in itself.

I wonder how such scheduling would work in a fully decentralized system.
Naively, you could run it concurrently on all nodes in the hope that they find the same solution.

However, this naive option may be difficult to implement in geographically distributed networks: if nodes observe slightly different network bandwidth, or if they take network measurements at different times, they may end up with different solutions.
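
To make the concern concrete, here is a toy sketch. It is not the scheduler from the paper: the objective (maximize the slowest link between neighbouring pipeline stages), the function name `best_pipeline_order`, and the bandwidth numbers are all assumptions made up for illustration. Two nodes run the same deterministic solver on measurements that differ by a few percent and arrive at different pipeline orders.

```python
from itertools import permutations


def best_pipeline_order(bandwidth):
    """Pick the node order whose slowest adjacent link is fastest.

    `bandwidth` maps frozenset({i, j}) to a measured bandwidth.
    Hypothetical toy objective, used only to show how solutions diverge.
    """
    nodes = sorted({n for pair in bandwidth for n in pair})

    def bottleneck(order):
        # Slowest link between consecutive pipeline stages.
        return min(bandwidth[frozenset(p)] for p in zip(order, order[1:]))

    # A pipeline and its reverse use the same links, so keep one orientation.
    candidates = [o for o in permutations(nodes) if o[0] < o[-1]]
    # permutations() yields lexicographic order and max() keeps the first
    # maximum, so ties resolve to the lexicographically smallest order.
    return max(candidates, key=bottleneck)


# Made-up measurements (same units) as seen from two different nodes.
seen_by_a = {frozenset({0, 1}): 10.0, frozenset({1, 2}): 10.2, frozenset({0, 2}): 12.0}
seen_by_b = {frozenset({0, 1}): 10.3, frozenset({1, 2}): 10.1, frozenset({0, 2}): 12.0}

print(best_pipeline_order(seen_by_a))  # (0, 2, 1)
print(best_pipeline_order(seen_by_b))  # (1, 0, 2)
```

The solver itself is deterministic; the disagreement comes entirely from the inputs, which is why some form of agreement on either the measurements or the final schedule seems necessary.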

Is there a way to guarantee that such a network stays consistent?
I mean, you can always elect a "leader" or let nodes vote on the solution, but perhaps there is a more natural way to approach this.
What would you suggest?
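
As a minimal sketch of the voting option mentioned above, assume that every node has already received every other node's proposed schedule through some reliable exchange, and that no node misbehaves. The helper name `agree_by_vote` and the schedule encoding (a tuple of node IDs) are hypothetical. Each node can then apply the same deterministic rule locally and reach the same answer.

```python
from collections import Counter


def agree_by_vote(proposals):
    """Return the schedule proposed by the most nodes.

    `proposals` maps node_id -> proposed schedule (a hashable tuple).
    Ties are broken by taking the smallest schedule, so every node that
    sees the same set of proposals returns the same result.
    """
    counts = Counter(proposals.values())
    top = max(counts.values())
    return min(schedule for schedule, votes in counts.items() if votes == top)


proposals = {0: (0, 2, 1), 1: (1, 0, 2), 2: (0, 2, 1)}
print(agree_by_vote(proposals))  # (0, 2, 1)
```

This only moves the problem to agreeing on the set of proposals, which is where a real consensus protocol would be needed.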

P.S. Another group that I'm in close contact with faced a similar issue in their paper, and they ended up with a heuristic load-balancing rule where nodes greedily switch pipeline stages. However, unlike your work, they do not prove that such a rule leads to optimal throughput.
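
For readers unfamiliar with the idea, greedy stage switching could look roughly like the sketch below. This is not the rule from the paper referred to above; it is a hypothetical toy that assumes identical workers, so that pipeline throughput is limited by the stage with the fewest workers. The function name `greedy_rebalance` is invented here.

```python
def greedy_rebalance(stage_sizes):
    """Move one worker at a time from the fullest to the emptiest stage.

    `stage_sizes[i]` is the number of workers serving pipeline stage i.
    Stops when the fullest and emptiest stages differ by at most one.
    """
    sizes = list(stage_sizes)
    while True:
        lo = sizes.index(min(sizes))  # current bottleneck stage
        hi = sizes.index(max(sizes))  # most over-provisioned stage
        if sizes[hi] - sizes[lo] <= 1:
            return sizes
        sizes[hi] -= 1
        sizes[lo] += 1


print(greedy_rebalance([5, 1, 3]))  # [3, 3, 3]
```

In this simplified setting the greedy rule does reach the best bottleneck, but nothing here says the same holds with heterogeneous devices or links.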

@BinhangYuan
Member

Hi @justheuristic ,

Thanks for your interest in our work! It is great to discuss this here.

As you may have seen, we limit our scope to the regime where network conditions are stable and can be estimated fairly accurately by network profiling. This assumption can certainly be violated in practice. How to handle a dynamic decentralized environment with fault tolerance is a very interesting problem; in fact, it is the highest-priority item on our todo list. But to be honest, we do not yet know what the optimal design should be.

BTW, we are aware of the Swarm parallelism paper. In fact, we appreciate this paper and the other papers from the group on this topic!

Best wishes,
Binhang

@justheuristic
Author

Hi, @BinhangYuan
I'm sorry, I didn't mean to skew the discussion towards the dynamic environment.

Can you please elaborate on what happens in your setup from a system design perspective?

In a static (or slowly changing) hardware configuration, one can indeed measure the network properties ahead of time. But how would nodes perform that in a decentralized setting? Would they elect a temporary "leader" that runs the profiler and optimization -- or follow some decentralized protocol?
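
The temporary-leader option could be sketched as follows. Everything here is an assumption for illustration: the "lowest node ID wins" election rule, the function names, and the `profile` and `optimize` callables, which stand in for the real profiler and scheduler and are supplied by the caller.

```python
def elect_leader(node_ids):
    # Hypothetical rule: the lowest ID wins. Any rule works as long as
    # every node evaluates it on the same membership list.
    return min(node_ids)


def schedule_via_leader(node_ids, profile, optimize):
    """Leader profiles the network, solves once, and shares the result.

    `profile(leader)` returns the leader's measurements and
    `optimize(measurements)` returns a schedule; both are placeholders.
    Returns the schedule each node ends up with after the broadcast.
    """
    leader = elect_leader(node_ids)
    schedule = optimize(profile(leader))
    return {node: schedule for node in node_ids}  # simulated broadcast


# Stand-in profiler and optimizer, just to make the sketch runnable.
result = schedule_via_leader(
    node_ids=[3, 1, 2],
    profile=lambda leader: {"measured_by": leader},
    optimize=lambda m: ("schedule-from-node", m["measured_by"]),
)
print(result[3])  # ('schedule-from-node', 1)
```

Because only one node measures and optimizes, all nodes trivially share one schedule, at the cost of depending on the leader's view of the network.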
