slowest rep worker blocks faster workers #195

toni-moreno · 2020-03-27T07:13:52Z

Hello !!

I'm trying to distribute job load by using the REQ/REP protocol with one master sending jobs and several agents getting these jobs and responding with information about the job as I explained here (#189)

Due to this issue #192 i'm currently working with c4b7a01

When one agent takes more than one minute to answer it seems like all outgoing messages from the master process are blocked for a while until the slowest agent reply. ( I've set option OptionRetryTime to 0 , to simply wait until it finish), my jobs could take easily more than 10 minutes

This is the output for 2 agents (jonh taking 10 seconds to complete, and mike taking 300 seconds to complete), and master working with 2 independent threads.

This is the output from master after sent and receive responses from each agent

TRIGGER WORKER[0] Init
TRIGGER WORKER[1] Init
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:35:08] RECEIVED MSG [AGENT [jonh] 07:35:09] AT  [07:35:19]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:35:08] RECEIVED MSG [AGENT [jonh] 07:35:19] AT  [07:35:29]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:35:20] RECEIVED MSG [AGENT [jonh] 07:35:29] AT  [07:35:39]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:35:30] RECEIVED MSG [AGENT [jonh] 07:35:39] AT  [07:35:49]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:35:40] RECEIVED MSG [AGENT [jonh] 07:35:49] AT  [07:35:59]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:35:50] RECEIVED MSG [AGENT [jonh] 07:35:59] AT  [07:36:09]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:36:00] RECEIVED MSG [AGENT [jonh] 07:36:09] AT  [07:36:19]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:36:10] RECEIVED MSG [AGENT [jonh] 07:36:19] AT  [07:36:29]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:36:30] RECEIVED MSG [AGENT [jonh] 07:36:30] AT  [07:36:40]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:36:20] RECEIVED MSG [AGENT [mike] 07:36:20] AT  [07:41:20]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:41:20] RECEIVED MSG [AGENT [jonh] 07:41:20] AT  [07:41:30]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:36:40] RECEIVED MSG [AGENT [mike] 07:41:20] AT  [07:46:20]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:46:20] RECEIVED MSG [AGENT [jonh] 07:46:20] AT  [07:46:30]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:41:31] RECEIVED MSG [AGENT [mike] 07:46:20] AT  [07:51:20]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:51:20] RECEIVED MSG [AGENT [jonh] 07:51:20] AT  [07:51:30]

I first start "jonh agent" and responding every 10 seconds, but when I started "mike", all the queue stops .

I expect to have in master 1 message from mike followed by 30 messages from jonh , but only one happened.

This is the code I've runned for this test

https://gist.github.com/toni-moreno/bc4c1a05973923c7c011d29300a1c1b5

You could start master and agents with this commands :

master

go run pipeline.go master tcp://127.0.0.1:40899 2

agents

go run pipeline.go agent tcp://127.0.0.1:40899 jonh 10

go run pipeline.go agent  tcp://127.0.0.1:40899 mike 300

There is any missing option to configure the outgoing queue to make this work as I spect ? Or is a bug?

Thank you very much in advance!

The text was updated successfully, but these errors were encountered:

toni-moreno · 2020-03-27T10:37:07Z

Hi , I have test again changing the time that the slowest is taking in finish the job.

In this case master with 5 wokers, jonh taking 10 seconds and mike 30 secons.

go run pipeline.go master tcp://127.0.0.1:40899 5

agents

go run pipeline.go agent tcp://127.0.0.1:40899 jonh 10

go run pipeline.go agent  tcp://127.0.0.1:40899 mike 30

Expected results, are 3 responses from jonh while waiting for mike, until the test stops, but it did only twice, after that slowest are blocking the fastest job. (NOTE: I've started Mike 30 seconds after John)

user@ubuntu00:~/o/task_queue_poc/mangos/examples/pipeline$ go run pipeline.go master tcp://127.0.0.1:40899  5
TRIGGER WORKER[0] Init
TRIGGER WORKER[1] Init
TRIGGER WORKER[2] Init
TRIGGER WORKER[3] Init
TRIGGER WORKER[4] Init
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:17:53] RECEIVED MSG [AGENT [jonh] 11:17:57] AT  [11:18:07]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:17:53] RECEIVED MSG [AGENT [jonh] 11:18:07] AT  [11:18:17]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 11:17:53] RECEIVED MSG [AGENT [jonh] 11:18:17] AT  [11:18:27]
TRIGGER WORKER[2] SENDED  MSG [worker[2] - 11:17:53] RECEIVED MSG [AGENT [jonh] 11:18:27] AT  [11:18:37]
TRIGGER WORKER[4] SENDED  MSG [worker[4] - 11:17:53] RECEIVED MSG [AGENT [jonh] 11:18:37] AT  [11:18:47]
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:18:07] RECEIVED MSG [AGENT [jonh] 11:18:47] AT  [11:18:57]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 11:18:27] RECEIVED MSG [AGENT [mike] 11:18:27] AT  [11:18:57]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:18:17] RECEIVED MSG [AGENT [jonh] 11:18:57] AT  [11:19:07]
TRIGGER WORKER[2] SENDED  MSG [worker[2] - 11:18:37] RECEIVED MSG [AGENT [jonh] 11:19:07] AT  [11:19:17]
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:18:57] RECEIVED MSG [AGENT [jonh] 11:19:17] AT  [11:19:27]
TRIGGER WORKER[4] SENDED  MSG [worker[4] - 11:18:47] RECEIVED MSG [AGENT [mike] 11:18:57] AT  [11:19:27]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:19:07] RECEIVED MSG [AGENT [jonh] 11:19:27] AT  [11:19:37]
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:19:27] RECEIVED MSG [AGENT [jonh] 11:19:37] AT  [11:19:47]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:19:37] RECEIVED MSG [AGENT [jonh] 11:19:47] AT  [11:19:57]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 11:18:58] RECEIVED MSG [AGENT [mike] 11:19:27] AT  [11:19:57]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:19:57] RECEIVED MSG [AGENT [jonh] 11:19:57] AT  [11:20:07]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:20:08] RECEIVED MSG [AGENT [jonh] 11:20:08] AT  [11:20:18]
TRIGGER WORKER[2] SENDED  MSG [worker[2] - 11:19:17] RECEIVED MSG [AGENT [mike] 11:19:57] AT  [11:20:27]
TRIGGER WORKER[2] SENDED  MSG [worker[2] - 11:20:28] RECEIVED MSG [AGENT [jonh] 11:20:28] AT  [11:20:38]
TRIGGER WORKER[4] SENDED  MSG [worker[4] - 11:19:28] RECEIVED MSG [AGENT [mike] 11:20:27] AT  [11:20:57]
TRIGGER WORKER[4] SENDED  MSG [worker[4] - 11:20:58] RECEIVED MSG [AGENT [jonh] 11:20:58] AT  [11:21:08]
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:19:47] RECEIVED MSG [AGENT [mike] 11:20:57] AT  [11:21:27]
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:21:28] RECEIVED MSG [AGENT [jonh] 11:21:28] AT  [11:21:38]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 11:19:58] RECEIVED MSG [AGENT [mike] 11:21:27] AT  [11:21:57]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 11:21:58] RECEIVED MSG [AGENT [jonh] 11:21:58] AT  [11:22:08]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:20:18] RECEIVED MSG [AGENT [mike] 11:21:57] AT  [11:22:27]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:22:28] RECEIVED MSG [AGENT [jonh] 11:22:28] AT  [11:22:38]
TRIGGER WORKER[2] SENDED  MSG [worker[2] - 11:20:38] RECEIVED MSG [AGENT [mike] 11:22:27] AT  [11:22:57]
TRIGGER WORKER[2] SENDED  MSG [worker[2] - 11:22:58] RECEIVED MSG [AGENT [jonh] 11:22:58] AT  [11:23:08]
TRIGGER WORKER[4] SENDED  MSG [worker[4] - 11:21:08] RECEIVED MSG [AGENT [mike] 11:22:57] AT  [11:23:27]
TRIGGER WORKER[4] SENDED  MSG [worker[4] - 11:23:28] RECEIVED MSG [AGENT [jonh] 11:23:28] AT  [11:23:38]
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:21:38] RECEIVED MSG [AGENT [mike] 11:23:27] AT  [11:23:57]

gdamore · 2020-03-29T18:57:16Z

I confirm that I see the same behavior. This is clearly a bug.

gdamore · 2020-03-29T21:58:54Z

So it looks like it isn't taking that long to go out. Rather the problem is that bit more subtle. We complete the send, and then the underlying pipe is marked available for sending again. The TCP buffering creates a problem here. The requests are getting back logged against that slow pipe -- they've already been scheduled there.

Basically, at the heart of it, TCP doesn't give us an adequate idea of the flow control -- we really need to have some idea from the peer that the receiver is actually ready to work on the job (by having a worker blocking in Recv().

Having said that, I'm still researching, because I don't think this completely explains the behavior.

gdamore · 2020-03-29T22:00:15Z

I think fixing this is going to require further protocol changes, unfortunately. :/

gdamore · 2020-03-29T22:18:13Z

I believe I can mitigate this somewhat, if not entirely fixing it. Basically, when we receive a message from REP peer, we can prioritize that pipe for the next send. That means that a slow pipe can cause one more message to get queued, but it should not cause more than that. And we should prefer to avoid that pipe if other pipes are moving along.

gdamore · 2020-03-29T22:24:05Z

So my change in branch bug195 seems to fix this... but it comes at the possible of risking starvation for slow workers. A fast worker will continue to get jobs and get stuck at the head of the queue. That's probably a good thing actually.

gdamore · 2020-03-29T22:35:14Z

Deep TCP buffers are still a real problem. I'm not really sure what to do about it in the short term -- naive approaches won't work in the face of fan out designs using devices or systems with many workers sharing a single socket. We an upper layer flow control mechanism. Still that branch seems to make things vastly better. There are a few jobs that penalized badly by getting stuck behind a slow worker but the rest move along.

If you've got a design that requires a mix very long and very short workers, you might want to consider some changes to the approach:

Break the jobs up onto different sockets if you can precharacterize them.
Use an upper layer queueing design -- so your work request is "submit a job", but you get a response back saying it was queued with a job ID. Then you collect the status later.
Consider inverting the logic, where the REQ is actually trying to get the job, and the REP provides the work. When the REQ is done, it can post the result back to the REP socket as another request.

None of these are particularly elegant, I understand. Factoring in consideration of this should be part of what we consider in any new protocol design. (I am considering ways to enhance the protocol as the current one is limited in too many ways.)

toni-moreno changed the title ~~[BUG] On Req/Rep The slowest (>1minute?) agent block shared message queue.~~ [BUG] On Req/Rep The slowest agent block shared message queue. Mar 28, 2020

gdamore added the bug Something isn't working label Mar 29, 2020

gdamore changed the title ~~[BUG] On Req/Rep The slowest agent block shared message queue.~~ slowest rep worker blocks faster workers Mar 29, 2020

gdamore closed this as completed in 3139ecf Mar 29, 2020

gdamore added a commit that referenced this issue Aug 18, 2020

Backport #195 slowest rep worker blocks faster workers to v2

9f3b6a0

gdamore added a commit that referenced this issue Aug 18, 2020

Backport #195 slowest rep worker blocks faster workers to v2

36ac52b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

slowest rep worker blocks faster workers #195

slowest rep worker blocks faster workers #195

toni-moreno commented Mar 27, 2020 •

edited

Loading

toni-moreno commented Mar 27, 2020 •

edited

Loading

gdamore commented Mar 29, 2020

gdamore commented Mar 29, 2020

gdamore commented Mar 29, 2020

gdamore commented Mar 29, 2020

gdamore commented Mar 29, 2020

gdamore commented Mar 29, 2020

slowest rep worker blocks faster workers #195

slowest rep worker blocks faster workers #195

Comments

toni-moreno commented Mar 27, 2020 • edited Loading

toni-moreno commented Mar 27, 2020 • edited Loading

gdamore commented Mar 29, 2020

gdamore commented Mar 29, 2020

gdamore commented Mar 29, 2020

gdamore commented Mar 29, 2020

gdamore commented Mar 29, 2020

gdamore commented Mar 29, 2020

toni-moreno commented Mar 27, 2020 •

edited

Loading

toni-moreno commented Mar 27, 2020 •

edited

Loading