Latency profile of broadcast vs mpmc #27

Open
jonathanstrong opened this issue Feb 28, 2018 · 7 comments

In latency benchmarks for scenarios with many producers and 4-12 consumers, I found the broadcast queue to have much worse performance at the tail end than separate mpmc queues that producers send to one after another.

This "parallel mpmc" pattern works like this (rough example):

// (using multiqueue's mpmc_queue; MPMCSender handles are Clone)
let mut receivers = Vec::new();
let mut senders = Vec::new();

// one mpmc queue per consumer
for _ in 0..N_CONSUMERS {
    let (tx, rx) = mpmc_queue(QUEUE_SIZE);
    senders.push(tx);
    receivers.push(rx);
}

// each producer thread gets a clone of every sender
for _ in 0..N_PRODUCERS {
    let my_senders: Vec<_> = senders.iter().map(|tx| tx.clone()).collect();
    // launch thread...
}

// later, in a producer thread: send the item to every consumer's queue
let item = // ...
for tx in &my_senders {
    tx.try_send(item.clone()).unwrap();
}

// in a consumer thread: drain this consumer's own queue
for msg in rx.try_iter() {
    // act on messages...
}

The broadcast setup followed the doc examples, including calling unsubscribe on the unused BroadcastReceiver; the difference is that each consumer reads a new stream of the same queue instead of having its own separate queue.
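
For context, a minimal sketch of that broadcast setup, assuming multiqueue's broadcast_queue / add_stream / unsubscribe API (the capacity, consumer count, and message type here are placeholders, not the benchmark's actual values):

use multiqueue::broadcast_queue;
use std::thread;

fn main() {
    let (tx, rx) = broadcast_queue::<u64>(4096);

    let consumers: Vec<_> = (0..6)
        .map(|_| {
            // each consumer reads its own stream of the same queue
            let stream = rx.add_stream();
            thread::spawn(move || {
                for _msg in stream {
                    // act on messages...
                }
            })
        })
        .collect();

    // the original handle is unused; unsubscribe it so it can't hold writers back
    rx.unsubscribe();

    // a producer clones the single sender and try_sends as before
    let my_tx = tx.clone();
    my_tx.try_send(42).ok();

    drop(my_tx);
    drop(tx); // consumer iterators end once every sender is gone
    for c in consumers {
        c.join().unwrap();
    }
}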

In a 40 producer, 6 consumer test, the two setups have nearly identical latency profiles up to the 99.9th percentile. But after that, the broadcast queue fares much worse:

percentile   broadcast   par mpmc
---------------------------------
99.99%       307µs       85µs
99.999%      4.7ms       593µs
99.9999%     8.7ms       600µs*

* one 9ms outlier

[latency percentile plots: broadcast vs parallel mpmc]

Caveats:

  • I'm fairly new to measuring latency, so I could have made an error. Latency was calculated by sending a timestamp over the queue/channel and comparing it to the current time when the consumer received it; the delta in nanoseconds (or 0, if the measured duration came out below 0ns) was recorded with a histogram from the hdrhistogram crate (see the sketch after this list).
  • I was specifically trying to measure infrequent, sporadic sends. Between sends, the producer threads would sleep for a random duration between 1µs and 500µs. The total number of messages sent over the lifetime of the tests above was around 50k.
  • The main system I ran the tests on was not tuned for latency, and I was not careful about minimizing activity on the server during the tests. However, other machines, including some with realtime kernels and no other activity, showed similar results.
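
For reference, a minimal sketch of that measurement approach, assuming the hdrhistogram crate's Histogram API and using std::sync::mpsc as a stand-in for the queue under test (the message count and sleep interval are illustrative, not the benchmark's actual values):

use hdrhistogram::Histogram;
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // std::sync::mpsc stands in for the queue under test here; the real
    // benchmark sends over the multiqueue mpmc/broadcast handles instead.
    let (tx, rx) = mpsc::channel::<Instant>();

    let producer = thread::spawn(move || {
        for _ in 0..10_000 {
            tx.send(Instant::now()).unwrap();
            // the real test sleeps a random 1µs-500µs between sends
            thread::sleep(Duration::from_micros(50));
        }
    });

    // track 1ns .. 10s with 3 significant figures
    let mut hist = Histogram::<u64>::new_with_bounds(1, 10_000_000_000, 3).unwrap();

    // the iterator ends once the producer drops its sender
    for sent_at in rx.iter() {
        let nanos = sent_at.elapsed().as_nanos() as u64;
        // clamp to at least 1ns, the histogram's lower bound
        hist.record(nanos.max(1)).unwrap();
    }

    producer.join().unwrap();
    println!("99.99%  = {} ns", hist.value_at_quantile(0.9999));
    println!("99.999% = {} ns", hist.value_at_quantile(0.99999));
}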

Questions

  • Do you have any thoughts on why this discrepancy exists?
  • Did the design anticipate dozens of producer handles/threads? Are there any problems you would foresee with that?
  • Do you have any suggestions for minimizing worst case latency for a many-producer, low-throughput scenario?

Thanks!

@schets (Owner) commented Mar 1, 2018

Can you post the entire code? This is somewhat expected on an oversubscribed system, due to the differences in the notification methods that broadcast and mpmc have to use.

The broadcast queue is mostly built for systems where the consumers have dedicated cores and are busy-waiting.

@jonathanstrong (Author)

Yes, will do. It will take a day or so since I need to pull out some proprietary code.

In the meantime, I have done a lot more testing, and it's fair to say results are all over the map.

These are results for the same two approaches, but on a VPS w/ 4 "dedicated" cores (8 w/ hyperthreads), a lowlatency kernel, isolcpus=4,5,6,7, and two consumers pinned to two of the isolated cores:

[latency percentile plots for the pinned-consumer runs: broadcast vs parallel mpmc]

So, slightly better performance from broadcast.

Each setup had one cataclysmic interruption over two hours (3ms and 9ms), which isn't included; I selected a calm hour for each (I figure the system is at fault there).

One big change: initially I had the queue size at 4096. I read in an LMAX FAQ that a smaller size allows the entire buffer to stay in cache, and the results above are for a queue size of 8. It might be a good idea to include some guidelines on this in the docs. I think your examples use a size of 4, but it didn't occur to me at the time that I should really set it in that range.

But the biggest takeaway is that these tests are very sensitive to the machine they're run on. It's pretty finicky; more tests = more doubt. I'd be interested in any input you can provide on best practices. I'll post the test code when I can untangle it a bit. Thanks!
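
(For reference, a minimal sketch of the consumer-pinning mentioned above, assuming the core_affinity crate; the crate choice and core indices are assumptions for illustration, not what the benchmark actually used:)

use std::thread;

fn main() {
    // cores 4-7 are the ones isolated with isolcpus=4,5,6,7 above;
    // pick two of them for the two consumers
    let core_ids = core_affinity::get_core_ids().expect("could not enumerate cores");
    let isolated: Vec<_> = core_ids.into_iter().filter(|c| c.id == 4 || c.id == 5).collect();

    let consumers: Vec<_> = isolated
        .into_iter()
        .map(|core| {
            thread::spawn(move || {
                // pin this consumer thread to its isolated core before busy-waiting
                let id = core.id;
                let pinned = core_affinity::set_for_current(core);
                assert!(pinned, "failed to pin to core {}", id);
                // ...busy-wait on the queue from here on...
            })
        })
        .collect();

    for c in consumers {
        c.join().unwrap();
    }
}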

@schets (Owner) commented Mar 2, 2018

One big change: initially I had the queue size at 4096. I read in an LMAX FAQ that a smaller size allows the entire buffer to stay in cache, and the results above are for a queue size of 8. It might be a good idea to include some guidelines on this in the docs. I think your examples use a size of 4, but it didn't occur to me at the time that I should really set it in that range.

I think this helps explain it! The thing with the broadcast queue is that no writer can ever write past a fallen-behind subscriber (the existence of destructors makes that trickier). With a small queue size, it's much easier for a reader to fall behind and block writers.
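
A toy sketch of that behaviour, assuming the broadcast_queue / add_stream / try_send / try_recv API (not taken from the benchmark code): with a tiny capacity, a writer's try_send keeps failing until the slowest stream catches up.

use multiqueue::broadcast_queue;

fn main() {
    // a tiny capacity makes it trivially easy for a stream to "fall behind"
    let (tx, rx) = broadcast_queue::<u64>(2);

    let fast = rx.add_stream();
    let slow = rx.add_stream();
    rx.unsubscribe();

    // fill the queue until the writer is blocked
    let mut i = 0u64;
    while tx.try_send(i).is_ok() {
        i += 1;
    }

    // the fast stream drains everything it can see...
    while fast.try_recv().is_ok() {}

    // ...but the slow stream has read nothing, and no writer can ever
    // write past a fallen-behind subscriber, so the writer is still stuck
    assert!(tx.try_send(i).is_err());

    // once the slow stream catches up, the writer can make progress again
    while slow.try_recv().is_ok() {}
    assert!(tx.try_send(i).is_ok());
}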

@jonathanstrong (Author)

Here is a working version of the benchmark code: https://gist.github.com/jonathanstrong/d88d474f5b0ea147e5bc98d9df0910a2

@ProfFan commented Jul 6, 2018

@jonathanstrong Any chance you can run the latency benchmarks on a system with a -lowlatency kernel? I think that would help eliminate the spikes and add a bit more reliability to the tests.

@jonathanstrong (Author)

@ProfFan apologies, I should have mentioned it: the tests were run on the lowlatency kernel, Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-128-lowlatency x86_64). If there are any other system-level improvements you would suggest, I'd definitely be interested; good information is hard to find.

@ProfFan commented Jul 6, 2018

@jonathanstrong Debian has a linux-image-rt package, which is the kernel patched with the PREEMPT-RT patches. (See https://packages.debian.org/stretch/linux-image-rt-amd64)

I think this would be pretty interesting especially for robotics and embedded developers (like me 😄 )
