Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slowest rep worker blocks faster workers #195

Closed
toni-moreno opened this issue Mar 27, 2020 · 7 comments
Closed

slowest rep worker blocks faster workers #195

toni-moreno opened this issue Mar 27, 2020 · 7 comments
Labels
bug Something isn't working

Comments

@toni-moreno
Copy link

toni-moreno commented Mar 27, 2020

Hello !!

I'm trying to distribute job load by using the REQ/REP protocol with one master sending jobs and several agents getting these jobs and responding with information about the job as I explained here (#189)

Due to this issue #192 i'm currently working with c4b7a01

When one agent takes more than one minute to answer it seems like all outgoing messages from the master process are blocked for a while until the slowest agent reply. ( I've set option OptionRetryTime to 0 , to simply wait until it finish), my jobs could take easily more than 10 minutes

This is the output for 2 agents (jonh taking 10 seconds to complete, and mike taking 300 seconds to complete), and master working with 2 independent threads.

This is the output from master after sent and receive responses from each agent

TRIGGER WORKER[0] Init
TRIGGER WORKER[1] Init
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:35:08] RECEIVED MSG [AGENT [jonh] 07:35:09] AT  [07:35:19]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:35:08] RECEIVED MSG [AGENT [jonh] 07:35:19] AT  [07:35:29]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:35:20] RECEIVED MSG [AGENT [jonh] 07:35:29] AT  [07:35:39]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:35:30] RECEIVED MSG [AGENT [jonh] 07:35:39] AT  [07:35:49]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:35:40] RECEIVED MSG [AGENT [jonh] 07:35:49] AT  [07:35:59]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:35:50] RECEIVED MSG [AGENT [jonh] 07:35:59] AT  [07:36:09]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:36:00] RECEIVED MSG [AGENT [jonh] 07:36:09] AT  [07:36:19]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:36:10] RECEIVED MSG [AGENT [jonh] 07:36:19] AT  [07:36:29]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:36:30] RECEIVED MSG [AGENT [jonh] 07:36:30] AT  [07:36:40]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:36:20] RECEIVED MSG [AGENT [mike] 07:36:20] AT  [07:41:20]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:41:20] RECEIVED MSG [AGENT [jonh] 07:41:20] AT  [07:41:30]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:36:40] RECEIVED MSG [AGENT [mike] 07:41:20] AT  [07:46:20]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 07:46:20] RECEIVED MSG [AGENT [jonh] 07:46:20] AT  [07:46:30]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:41:31] RECEIVED MSG [AGENT [mike] 07:46:20] AT  [07:51:20]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 07:51:20] RECEIVED MSG [AGENT [jonh] 07:51:20] AT  [07:51:30]

I first start "jonh agent" and responding every 10 seconds, but when I started "mike", all the queue stops .

I expect to have in master 1 message from mike followed by 30 messages from jonh , but only one happened.

This is the code I've runned for this test

https://gist.github.com/toni-moreno/bc4c1a05973923c7c011d29300a1c1b5

You could start master and agents with this commands :

master

go run pipeline.go master tcp://127.0.0.1:40899 2

agents

go run pipeline.go agent tcp://127.0.0.1:40899 jonh 10
go run pipeline.go agent  tcp://127.0.0.1:40899 mike 300

There is any missing option to configure the outgoing queue to make this work as I spect ? Or is a bug?

Thank you very much in advance!

@toni-moreno
Copy link
Author

toni-moreno commented Mar 27, 2020

Hi , I have test again changing the time that the slowest is taking in finish the job.

In this case master with 5 wokers, jonh taking 10 seconds and mike 30 secons.

go run pipeline.go master tcp://127.0.0.1:40899 5

agents

go run pipeline.go agent tcp://127.0.0.1:40899 jonh 10
go run pipeline.go agent  tcp://127.0.0.1:40899 mike 30

Expected results, are 3 responses from jonh while waiting for mike, until the test stops, but it did only twice, after that slowest are blocking the fastest job. (NOTE: I've started Mike 30 seconds after John)

user@ubuntu00:~/o/task_queue_poc/mangos/examples/pipeline$ go run pipeline.go master tcp://127.0.0.1:40899  5
TRIGGER WORKER[0] Init
TRIGGER WORKER[1] Init
TRIGGER WORKER[2] Init
TRIGGER WORKER[3] Init
TRIGGER WORKER[4] Init
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:17:53] RECEIVED MSG [AGENT [jonh] 11:17:57] AT  [11:18:07]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:17:53] RECEIVED MSG [AGENT [jonh] 11:18:07] AT  [11:18:17]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 11:17:53] RECEIVED MSG [AGENT [jonh] 11:18:17] AT  [11:18:27]
TRIGGER WORKER[2] SENDED  MSG [worker[2] - 11:17:53] RECEIVED MSG [AGENT [jonh] 11:18:27] AT  [11:18:37]
TRIGGER WORKER[4] SENDED  MSG [worker[4] - 11:17:53] RECEIVED MSG [AGENT [jonh] 11:18:37] AT  [11:18:47]
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:18:07] RECEIVED MSG [AGENT [jonh] 11:18:47] AT  [11:18:57]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 11:18:27] RECEIVED MSG [AGENT [mike] 11:18:27] AT  [11:18:57]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:18:17] RECEIVED MSG [AGENT [jonh] 11:18:57] AT  [11:19:07]
TRIGGER WORKER[2] SENDED  MSG [worker[2] - 11:18:37] RECEIVED MSG [AGENT [jonh] 11:19:07] AT  [11:19:17]
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:18:57] RECEIVED MSG [AGENT [jonh] 11:19:17] AT  [11:19:27]
TRIGGER WORKER[4] SENDED  MSG [worker[4] - 11:18:47] RECEIVED MSG [AGENT [mike] 11:18:57] AT  [11:19:27]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:19:07] RECEIVED MSG [AGENT [jonh] 11:19:27] AT  [11:19:37]
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:19:27] RECEIVED MSG [AGENT [jonh] 11:19:37] AT  [11:19:47]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:19:37] RECEIVED MSG [AGENT [jonh] 11:19:47] AT  [11:19:57]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 11:18:58] RECEIVED MSG [AGENT [mike] 11:19:27] AT  [11:19:57]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:19:57] RECEIVED MSG [AGENT [jonh] 11:19:57] AT  [11:20:07]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:20:08] RECEIVED MSG [AGENT [jonh] 11:20:08] AT  [11:20:18]
TRIGGER WORKER[2] SENDED  MSG [worker[2] - 11:19:17] RECEIVED MSG [AGENT [mike] 11:19:57] AT  [11:20:27]
TRIGGER WORKER[2] SENDED  MSG [worker[2] - 11:20:28] RECEIVED MSG [AGENT [jonh] 11:20:28] AT  [11:20:38]
TRIGGER WORKER[4] SENDED  MSG [worker[4] - 11:19:28] RECEIVED MSG [AGENT [mike] 11:20:27] AT  [11:20:57]
TRIGGER WORKER[4] SENDED  MSG [worker[4] - 11:20:58] RECEIVED MSG [AGENT [jonh] 11:20:58] AT  [11:21:08]
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:19:47] RECEIVED MSG [AGENT [mike] 11:20:57] AT  [11:21:27]
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:21:28] RECEIVED MSG [AGENT [jonh] 11:21:28] AT  [11:21:38]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 11:19:58] RECEIVED MSG [AGENT [mike] 11:21:27] AT  [11:21:57]
TRIGGER WORKER[1] SENDED  MSG [worker[1] - 11:21:58] RECEIVED MSG [AGENT [jonh] 11:21:58] AT  [11:22:08]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:20:18] RECEIVED MSG [AGENT [mike] 11:21:57] AT  [11:22:27]
TRIGGER WORKER[0] SENDED  MSG [worker[0] - 11:22:28] RECEIVED MSG [AGENT [jonh] 11:22:28] AT  [11:22:38]
TRIGGER WORKER[2] SENDED  MSG [worker[2] - 11:20:38] RECEIVED MSG [AGENT [mike] 11:22:27] AT  [11:22:57]
TRIGGER WORKER[2] SENDED  MSG [worker[2] - 11:22:58] RECEIVED MSG [AGENT [jonh] 11:22:58] AT  [11:23:08]
TRIGGER WORKER[4] SENDED  MSG [worker[4] - 11:21:08] RECEIVED MSG [AGENT [mike] 11:22:57] AT  [11:23:27]
TRIGGER WORKER[4] SENDED  MSG [worker[4] - 11:23:28] RECEIVED MSG [AGENT [jonh] 11:23:28] AT  [11:23:38]
TRIGGER WORKER[3] SENDED  MSG [worker[3] - 11:21:38] RECEIVED MSG [AGENT [mike] 11:23:27] AT  [11:23:57]

@toni-moreno toni-moreno changed the title [BUG] On Req/Rep The slowest (>1minute?) agent block shared message queue. [BUG] On Req/Rep The slowest agent block shared message queue. Mar 28, 2020
@gdamore
Copy link
Contributor

gdamore commented Mar 29, 2020

I confirm that I see the same behavior. This is clearly a bug.

@gdamore gdamore added the bug Something isn't working label Mar 29, 2020
@gdamore
Copy link
Contributor

gdamore commented Mar 29, 2020

So it looks like it isn't taking that long to go out. Rather the problem is that bit more subtle. We complete the send, and then the underlying pipe is marked available for sending again. The TCP buffering creates a problem here. The requests are getting back logged against that slow pipe -- they've already been scheduled there.

Basically, at the heart of it, TCP doesn't give us an adequate idea of the flow control -- we really need to have some idea from the peer that the receiver is actually ready to work on the job (by having a worker blocking in Recv().

Having said that, I'm still researching, because I don't think this completely explains the behavior.

@gdamore
Copy link
Contributor

gdamore commented Mar 29, 2020

I think fixing this is going to require further protocol changes, unfortunately. :/

@gdamore gdamore changed the title [BUG] On Req/Rep The slowest agent block shared message queue. slowest rep worker blocks faster workers Mar 29, 2020
@gdamore
Copy link
Contributor

gdamore commented Mar 29, 2020

I believe I can mitigate this somewhat, if not entirely fixing it. Basically, when we receive a message from REP peer, we can prioritize that pipe for the next send. That means that a slow pipe can cause one more message to get queued, but it should not cause more than that. And we should prefer to avoid that pipe if other pipes are moving along.

@gdamore
Copy link
Contributor

gdamore commented Mar 29, 2020

So my change in branch bug195 seems to fix this... but it comes at the possible of risking starvation for slow workers. A fast worker will continue to get jobs and get stuck at the head of the queue. That's probably a good thing actually.

@gdamore
Copy link
Contributor

gdamore commented Mar 29, 2020

Deep TCP buffers are still a real problem. I'm not really sure what to do about it in the short term -- naive approaches won't work in the face of fan out designs using devices or systems with many workers sharing a single socket. We an upper layer flow control mechanism. Still that branch seems to make things vastly better. There are a few jobs that penalized badly by getting stuck behind a slow worker but the rest move along.

If you've got a design that requires a mix very long and very short workers, you might want to consider some changes to the approach:

  1. Break the jobs up onto different sockets if you can precharacterize them.
  2. Use an upper layer queueing design -- so your work request is "submit a job", but you get a response back saying it was queued with a job ID. Then you collect the status later.
  3. Consider inverting the logic, where the REQ is actually trying to get the job, and the REP provides the work. When the REQ is done, it can post the result back to the REP socket as another request.

None of these are particularly elegant, I understand. Factoring in consideration of this should be part of what we consider in any new protocol design. (I am considering ways to enhance the protocol as the current one is limited in too many ways.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants