slowest rep worker blocks faster workers #195
Hi, I have tested again, changing the time the slowest worker takes to finish the job. In this case the master runs with 5 workers, jonh taking 10 seconds and mike 30 seconds:

go run pipeline.go master tcp://127.0.0.1:40899 5 agents

The expected result is 3 responses from jonh while waiting for mike, until the test stops, but that only happened twice; after that the slowest worker blocks the faster jobs. (NOTE: I've started mike 30 seconds after jonh.)
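For context, here is a minimal sketch of the kind of REP agent being tested. It is an approximation rather than the actual pipeline.go; the import paths, mangos version, and command-line handling are assumptions.

```go
package main

// Hypothetical REP agent that simulates a slow or fast job; not the
// reporter's actual code. Import paths assume the go-mangos/mangos
// layout of the time and may need adjusting.
import (
	"fmt"
	"os"
	"time"

	"github.com/go-mangos/mangos/protocol/rep"
	"github.com/go-mangos/mangos/transport/tcp"
)

func main() {
	name := os.Args[1]                         // e.g. "jonh" or "mike"
	delay, _ := time.ParseDuration(os.Args[2]) // e.g. "10s" or "30s"

	sock, err := rep.NewSocket()
	if err != nil {
		panic(err)
	}
	sock.AddTransport(tcp.NewTransport())
	if err := sock.Dial("tcp://127.0.0.1:40899"); err != nil {
		panic(err)
	}

	for {
		job, err := sock.Recv() // block until the master schedules a job here
		if err != nil {
			panic(err)
		}
		time.Sleep(delay) // simulate how long this worker takes
		reply := fmt.Sprintf("%s finished %q after %s", name, job, delay)
		if err := sock.Send([]byte(reply)); err != nil {
			panic(err)
		}
	}
}
```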
I confirm that I see the same behavior. This is clearly a bug.
So it looks like it isn't taking that long to go out. Rather, the problem is a bit more subtle. We complete the send, and then the underlying pipe is marked available for sending again. TCP buffering creates a problem here: the requests are getting backlogged against that slow pipe -- they've already been scheduled there. Basically, at the heart of it, TCP doesn't give us an adequate notion of flow control -- we really need some indication from the peer that the receiver is actually ready to work on the job (i.e. that a worker is blocking in Recv()). Having said that, I'm still researching, because I don't think this completely explains the behavior.
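To make the buffering point concrete, here is a small self-contained demonstration using plain net rather than mangos: the receiver accepts a connection and never reads, yet the sender's writes keep succeeding until the kernel buffers on both sides fill, so a completed send says nothing about whether a worker is actually ready.

```go
package main

// Demonstrates deep TCP buffering: the "slow worker" accepts a connection
// and never reads, yet the sender's writes keep succeeding until the local
// send buffer and the peer's receive buffer are both full.
import (
	"fmt"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		c, _ := ln.Accept()
		_ = c      // accept, then never call Read
		select {}  // park forever
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	conn.SetWriteDeadline(time.Now().Add(2 * time.Second))

	buf := make([]byte, 64*1024)
	total := 0
	for {
		n, err := conn.Write(buf)
		total += n
		if err != nil {
			// Only here, many kilobytes later, does the sender learn that
			// the other side isn't keeping up.
			fmt.Printf("write finally blocked after %d bytes: %v\n", total, err)
			return
		}
	}
}
```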
I think fixing this is going to require further protocol changes, unfortunately. :/
I believe I can mitigate this somewhat, if not fix it entirely. Basically, when we receive a message from a REP peer, we can prioritize that pipe for the next send. That means a slow pipe can still cause one more message to get queued, but it should not cause more than that. And we should prefer to avoid that pipe if other pipes are moving along.
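A rough sketch of that scheduling idea, using a hypothetical readyList type rather than the real mangos internals (this is not the code in the bug195 branch):

```go
// Hypothetical illustration of the proposed send scheduling: a pipe that has
// just delivered a reply is moved to the front of the ready list, so the next
// job goes to a peer that we know has a worker blocked in Recv().
package sched

type pipeID uint32

type readyList struct {
	ids []pipeID
}

// markReady is called when a reply arrives from a pipe: that peer has just
// finished a job, so prefer it for the next send.
func (r *readyList) markReady(id pipeID) {
	r.remove(id)
	r.ids = append([]pipeID{id}, r.ids...) // jump to the head of the queue
}

// markBusy is called after a job is sent down a pipe: push it to the back so
// other, possibly idle, peers are tried first.
func (r *readyList) markBusy(id pipeID) {
	r.remove(id)
	r.ids = append(r.ids, id)
}

// next returns the preferred pipe for the next send, if any.
func (r *readyList) next() (pipeID, bool) {
	if len(r.ids) == 0 {
		return 0, false
	}
	return r.ids[0], true
}

func (r *readyList) remove(id pipeID) {
	for i, v := range r.ids {
		if v == id {
			r.ids = append(r.ids[:i], r.ids[i+1:]...)
			return
		}
	}
}
```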
So my change in branch bug195 seems to fix this... but it comes with the possible risk of starving slow workers. A fast worker will continue to get jobs and stay at the head of the queue. That's probably a good thing, actually.

Deep TCP buffers are still a real problem. I'm not really sure what to do about it in the short term -- naive approaches won't work in the face of fan-out designs using devices, or systems with many workers sharing a single socket. We need an upper-layer flow control mechanism. Still, that branch seems to make things vastly better. There are a few jobs that get penalized badly by being stuck behind a slow worker, but the rest move along. If you've got a design that requires a mix of very long and very short workers, you might want to consider some changes to the approach:
None of these are particularly elegant, I understand. Consideration of this should be part of any new protocol design. (I am considering ways to enhance the protocol, as the current one is limited in too many ways.)
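As one illustration of what an upper-layer flow control mechanism can look like at the application level (an assumed workaround, not necessarily one of the suggestions above): invert the roles, so each worker REQs the master for a job only when it is idle, and the master, as the REP side, hands out one job per request. A slow worker then simply stops asking, and nothing gets queued behind it.

```go
package main

// Hedged sketch of a pull-based layout: the master is the REP side and only
// hands a job to a worker that has explicitly asked for one. Import paths
// assume the go-mangos/mangos layout and may need adjusting.
import (
	"fmt"

	"github.com/go-mangos/mangos/protocol/rep"
	"github.com/go-mangos/mangos/transport/tcp"
)

func main() {
	sock, err := rep.NewSocket()
	if err != nil {
		panic(err)
	}
	sock.AddTransport(tcp.NewTransport())
	if err := sock.Listen("tcp://127.0.0.1:40899"); err != nil {
		panic(err)
	}

	jobs := []string{"job-1", "job-2", "job-3"} // stand-in for a real job queue
	for i := 0; ; i++ {
		// A request only arrives when some worker is idle and asking, so a
		// slow worker never has jobs buffered against it.
		if _, err := sock.Recv(); err != nil {
			panic(err)
		}
		job := jobs[i%len(jobs)]
		if err := sock.Send([]byte(job)); err != nil {
			panic(err)
		}
		fmt.Println("handed out", job)
	}
}
```

The trade-off is that the master loses the ability to push work proactively, but no job can be scheduled onto a peer that is not ready to receive it.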
Hello!!

I'm trying to distribute job load using the REQ/REP protocol, with one master sending jobs and several agents taking these jobs and responding with information about the job, as I explained here (#189).

Due to issue #192 I'm currently working with c4b7a01.

When one agent takes more than one minute to answer, it seems like all outgoing messages from the master process are blocked for a while, until the slowest agent replies. (I've set option OptionRetryTime to 0, to simply wait until it finishes; my jobs can easily take more than 10 minutes.) This is the output for 2 agents (jonh taking 10 seconds to complete, and mike taking 300 seconds to complete), with the master working with 2 independent threads.
This is the output from the master after sending jobs and receiving responses from each agent.

I first started the jonh agent, which responded every 10 seconds, but when I started mike, the whole queue stopped.

I expected the master to receive 1 message from mike followed by 30 messages from jonh, but that only happened once.
This is the code I've run for this test:
https://gist.github.com/toni-moreno/bc4c1a05973923c7c011d29300a1c1b5
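For readers who cannot reach the gist, here is a minimal hedged sketch of the kind of REQ-side setup described above, including the OptionRetryTime setting. This is not the actual gist code; the import paths and the single-request loop are simplifying assumptions.

```go
package main

// Hypothetical master-side setup approximating the description above: one
// REQ socket handing jobs to connected REP agents, with the retry timer
// disabled so slow jobs are never resent. Not the gist code.
import (
	"fmt"
	"time"

	"github.com/go-mangos/mangos"
	"github.com/go-mangos/mangos/protocol/req"
	"github.com/go-mangos/mangos/transport/tcp"
)

func main() {
	sock, err := req.NewSocket()
	if err != nil {
		panic(err)
	}
	sock.AddTransport(tcp.NewTransport())
	// Wait indefinitely for a reply instead of retrying the request.
	if err := sock.SetOption(mangos.OptionRetryTime, time.Duration(0)); err != nil {
		panic(err)
	}
	if err := sock.Listen("tcp://127.0.0.1:40899"); err != nil {
		panic(err)
	}

	for i := 0; ; i++ {
		job := fmt.Sprintf("job-%d", i)
		if err := sock.Send([]byte(job)); err != nil {
			panic(err)
		}
		reply, err := sock.Recv()
		if err != nil {
			panic(err)
		}
		fmt.Printf("sent %s, got %q\n", job, reply)
	}
}
```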
You can start the master and the agents with these commands:
master
agents
Is there any option I'm missing to configure the outgoing queue so this works as I expect? Or is it a bug?
Thank you very much in advance!