Replies: 5 comments 9 replies
-
Before porting the system to a new networking setup, my advice would be to exactly replicate our numbers on a setup identical to ours, which can be done via your free CloudLab account, on one of their 100Gb RDMA clusters. I realize that your work isn’t really being done on CloudLab, but the idea is to change one thing at a time. Right now you are changing many things all at the same time: in such situations tuning is needed, and you don’t yet know the system well enough to set the relevant configuration parameters. Here are some factors to think about:
With direct 1KB multicast from a member to the other members, over RDMA, with adequate resources and an idle machine, this specific setup should yield numbers that match the Spindle paper.

Next, reconfigure to initiate the multicasts from a member of the top-level group, but not inside the same subgroup. This forces relaying and will let you measure the delay introduced by the indirection associated with relaying. Because your own work doesn't include any form of relaying, this is more to understand that if you set up a client external to the destination subgroup, you are deliberately placing delay on the critical path. That may be appropriate in many settings, but it wouldn't be a fair comparison for your work.

Finally, you might try again with an external client that is not a member of the top-level group. Here you will see the highest overheads of all. As a comment, I suspect that this is the case you measured in your experiment described above. As you can see, it is easy to accidentally force Derecho into a less efficient communication pattern, which will bring costs that your own work might be avoiding simply by doing a more direct experiment. In fact, the real comparison you want is probably our RDMA multicast, from a sender in the subgroup, versus your network-ordered Paxos, also from a sender in the subgroup.

Now, having done all of this on CloudLab, modify the config file to put Derecho into TCP mode by changing the Libfabric provider option (see the sketch at the end of this comment). Rerun the member-of-a-subgroup-to-the-other-members experiment. This time you should see the lower latencies you were expecting. If not, that might point to a bug: I'm not certain Lorenzo retested with TCP after doing the Spindle optimizations, and sometimes a change that speeds up RDMA performance is less effective with TCP. But Lorenzo is careful, and honestly, that would surprise me.

At this stage, one last step would be to port this exact setup back to your research cluster, still using TCP. Compare Derecho on 100Gb CloudLab TCP to the numbers on your cluster. How big a hit did it take? That would be due to your cluster having a different, and perhaps slower, TCP network and stack. But it would still be a reasonable comparison to publish, if you also note that with RDMA, Derecho latency would be sharply lower and throughput would be much higher. You will find that the reviewers of a paper like yours are far more forgiving of honest numbers than of any kind of distortion. So in particular, if you have better numbers on TCP with your router hardware feature and your version of Raft, but worse numbers when Derecho can run with RDMA on similar nodes, you would just make the case in your paper that many clusters lack RDMA but might be able to use your technology in the router, and that because you do better on pure TCP plus your feature, it represents an exciting option for developers who lack RDMA but do have the ability to deploy router upgrades. If it was me reviewing the paper, I would be open-minded about that: not every paper sets a totally new performance record.
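For concreteness, the provider switch I mention above lives in the derecho.cfg file that ships with the samples. This is only a sketch from memory, so treat the section and field names as approximate and check them against the sample config in your checkout:

```
[RDMA]
# provider = verbs     # RDMA (InfiniBand / RoCE) through Libfabric
provider = sockets     # plain TCP through the Libfabric sockets provider
domain = eth0          # the interface Libfabric should bind to on your nodes
```

Nothing else in the experiment needs to change; only the transport underneath does.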
-
You need @ellerre (Lorenzo Rosa) to comment on that latency, but because Spindle optimizes the whole system for throughput and views low latency as a secondary consideration, using opportunistic batching throughout, it certainly could increase latency during tests that send a stream of messages, and I think the ms unit is probably intended. We don't know of many situations where latency for a single isolated multicast would be the primary concern. If you do go down that route, check the work done by Joe Israelivitz at the University of Colorado on a protocol called Acuerdo, if that paper has been published (I saw a prepublication version a while back). Acuerdo was looking at exactly this issue of delay for individual Paxos multicasts. Honestly, while I do believe there are settings that might need a low-latency one-time atomic multicast, I don't see that as a common scenario. More often you are dealing with streams of events. That said, Spindle definitely does increase latency due to its batching, and we accepted that because our users turn out to care a lot about using the bandwidth capacity of their clusters, so they care about throughput even more than latency. You can legitimately use this as a contrast to the priorities of your work, but be careful to justify your choices with real use cases that actually demand ultra-low latency for isolated multicasts.

One more remark: Lorenzo wasn't actually trying to test our latency for an individual multicast, and his benchmark is designed to trigger batching. To get a fair number, you may have to slow the test down, to maybe 1 or 2 multicasts per second (see the sketch below). Ignore the first few (cold caches), but once the pathway warms up you should start to see latencies with none of the Spindle batching features kicking in.
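To make the pacing concrete, here is a minimal sketch of the kind of loop I have in mind. It is not our benchmark code: send_one_multicast stands for whatever your test already does for a single ordered_send (including waiting for the replies), and the warmup and pacing constants are just illustrative.

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Sketch: measure per-multicast latency at a rate low enough that
// opportunistic batching never has a chance to kick in.
template <typename SendOneMulticast>
void paced_latency_test(SendOneMulticast&& send_one_multicast,
                        int warmup = 10, int samples = 100) {
    std::vector<double> latencies_us;
    for (int i = 0; i < warmup + samples; ++i) {
        auto start = std::chrono::steady_clock::now();
        send_one_multicast();   // one ordered_send + wait for delivery/replies
        auto end = std::chrono::steady_clock::now();
        if (i >= warmup) {      // drop the cold-cache samples
            latencies_us.push_back(
                std::chrono::duration<double, std::micro>(end - start).count());
        }
        // Roughly 2 multicasts per second: far too slow for batching to engage.
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
    }
    std::sort(latencies_us.begin(), latencies_us.end());
    std::cout << "median isolated-multicast latency: "
              << latencies_us[latencies_us.size() / 2] << " us" << std::endl;
}
```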
-
Hello @Steamgjk , thank you for your interest in the work. Let me add some further considerations.

First, about Derecho on Libfabric on TCP being slow. I had never tested that configuration extensively, but I feel we cannot do much more to improve performance on that side. Derecho is built to take advantage of RDMA features and, as you pointed out, Libfabric supports RDMA over TCP mainly for testing and compatibility. Currently, I think that only DPDK (or other forms of acceleration, like XDP) can achieve RDMA-comparable performance on non-RDMA networks, and we are actively working to add DPDK support to Derecho. Probably your best chance to improve these numbers in a short time, as Ken suggested, is to try a userspace TCP implementation, for example mTCP, which in turn uses DPDK to optimize the data path.

Then, about the issues in reproducing the Spindle RDMA results. Using the 100 Gbps NIC will probably suffice to close the gap between the paper results and what you are getting now. You should also check that, if you are running on a multi-core node, you are pinning all the Derecho threads to execute on the same NUMA node (see the sketch at the end of this comment). Another possible source of overhead is that on CloudLab we have little control over the network topology: you may end up with nodes that are physically very distant, adding more network hops with respect to our internal testbed. Using IB or RoCE should not make much difference.

Also, I think you are not looking at the right Spindle figures. Figs. 4 and 5 in Spindle show only a partial result, considering only batching and not the other optimizations (yet opportunistic batching is shown to even improve latency). What one gets by running Derecho is the whole set of optimizations, including null-sends and smart locking, which are evaluated in the final figures (16 and 17). You are using a different message size, but it should be feasible to estimate the packets per second from those.

Finally, the Spindle results were obtained more than 2 years ago (Jan 2020), so it is possible that something has changed, even though I don't think performance is significantly worse. Feel free to contact me by email if you need the specific Spindle configuration files. You could also look at our DerechoDDS paper, which shows more recent (July 2021) performance measurements on a 40Gbps CloudLab NIC.
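About the NUMA pinning mentioned above: the quickest way to enforce it is to launch the whole process under numactl. The node index below is just an example (the application name is a placeholder too); it should be the node your NIC is attached to, which lstopo or /sys/class/net/&lt;iface&gt;/device/numa_node will tell you.

```
# Bind both the CPUs and the memory allocations to a single NUMA node.
# Node 0 is only an example; ./your_derecho_app is a placeholder.
numactl --cpunodebind=0 --membind=0 ./your_derecho_app
```

For the packets-per-second estimate, simply divide the bandwidth reported in Figs. 16-17 by your own message size; that ratio is what you can compare across message sizes.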
-
I completely agree with Lorenzo (including that we would welcome a contribution from you, if you find an angle to speed up the system on TCP network deployments!). Derecho itself is delivering messages as soon as it is safe to do that, using information-sharing paths that are as short as it is possible to make them. So if you want consensus guarantees (aka Paxos, State Machine Replication, Dynamic Uniform Agreement: they all mean almost exactly the same thing), you aren't going to make the data paths in the underlying algorithm shorter. The angles worth exploring, I think, center on reliable data transport rather than multicast ordering, but there might be an ordering opportunity too.

For transport, the system currently has point-to-point and multicast, and the multicast is done using one protocol for big messages and a different one for small messages. The big-message protocol, RDMC, isn't tuned for TCP (or DPDK, or XDP). But you won't get an SOSP paper from tuning RDMC, unless you can identify some deep opportunity or insight that we aren't thinking about at all. And this could actually be possible: these days people often want crypto "on the wire", and there are works that use erasure coding to reduce the amount of TCP retransmission. (RDMA is really basically TCP, but using credits to eliminate 99.999% of the packet drops by ensuring that when a sender sends, there is space available for the incoming messages on each router hop and at each destination. The NIC then implements ack/nack/congestion handling totally differently because those drops are so rare.) So integration with FPGAs or other accelerators could be exciting. This is still a space for innovation, I think. One could maybe do a datacenter transport, call it DCP, that would plug into Derecho using the SMC and RDMC API, leverage clever tricks to greatly improve performance in TCP settings, exploit hardware like bump-in-the-wire crypto, etc. I think this would be quite interesting.

The other angle worth thinking about is the null-send mechanism. Derecho sends actual null multicasts if a sender isn't ready, and with RDMA latency and speeds, that strategy is a good one. But with high latency and lower speeds, the delay to discover that everyone is waiting for some process to send a null multicast is going to be an issue, and we wouldn't be running at the highest possible speeds. In fact, in this one aspect, Derecho's optimality proof doesn't apply. Yet this is central to performance. So I think one could do a lot at that level, in terms of how many nulls to send when sending a null: if a process isn't going to send in this round, maybe it should send a message "ignore me for 2 rounds", or it could be 4, 8, etc. Then of course we need to somehow ensure that if it suddenly does need to send, but by now everyone else is idle, then even if it last said "skip me for 256 rounds" it still has a way to send its message promptly and have it delivered promptly (see the sketch below). I see a lot more that could be done in this layer of the solution. A higher-latency transport definitely might push us into domains of behavior where that would be needed. Networking over cellular 5G is especially interesting.

I'm not so crazy about ideas to run on unreliable transports. Derecho currently requires a reliable lower layer: FIFO, lossless, and without duplication. I don't see much advantage in modifying Derecho at a higher level to work around lower-level unreliability.
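Just to make the "skip me for N rounds" idea above concrete, here is a rough sketch of the sender-side bookkeeping. It is purely illustrative (the names are made up, and this is not how Derecho's null-send logic is written), and it deliberately leaves open the hard part: how a sender that announced a long idle window gets a new message delivered promptly anyway.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative sketch: instead of one null multicast per idle round, a sender
// announces exponentially growing "idle windows" and resets the moment it has
// real data again.
class NullSendWindow {
    uint32_t rounds_remaining = 0;  // rounds already covered by an announcement
    uint32_t next_window = 1;       // next window to announce: 1, 2, 4, ... up to a cap
public:
    // Called once per round when this sender has nothing to send.
    // Returns >0 if it should multicast "ignore me for that many rounds",
    // or 0 if a previous announcement still covers this round.
    uint32_t on_idle_round() {
        if (rounds_remaining > 0) {
            --rounds_remaining;
            return 0;
        }
        rounds_remaining = next_window;
        next_window = std::min<uint32_t>(next_window * 2, 256);
        return rounds_remaining;
    }
    // Called when real data shows up. The open question discussed above is how
    // the protocol lets this message through promptly even though the group
    // was told to skip this sender for many more rounds.
    void on_data_ready() {
        rounds_remaining = 0;
        next_window = 1;
    }
};
```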
The other thing: I really don't see any value in switching to a Paxos with a majority-write, quorum-read policy. Our system assumes that every write reaches every receiver, and this really does pay off for speedy reads. So although Acuerdo has a more standard Paxos and actually does outperform Derecho (built on the same lower layers!) for patterns in which some servers are very laggy, I really prefer Spindle's opportunistic batching as the way to handle those behaviors (Acuerdo was never compared against the Spindle version of Derecho, as far as I know). But the Acuerdo authors had a lot of trouble finding a way to publish their paper, so apparently the NSDI and SIGCOMM folks weren't excited by that topic either. Or maybe the paper was just not polished enough, or the experiments somehow weren't convincing enough. Personally, I really liked the draft I was shown. But even so, I think that opportunistic batching is inherently a better match to the balky-server scenario, and I bet we could equal or beat Acuerdo on the same workload using the Spindle optimizations.

One last direction of interest would be time-sensitive or real-time scenarios where low latency is the top priority rather than throughput. We haven't really studied this carefully, and it isn't clear to me whether we are missing any major opportunities as a result. As mentioned above, the experiment to run here would simply send one multicast at a time with delays between them, so we could have all our batching features enabled but the system would never see any opportunity to use batching. And then the question to ask would be: how close are we to the hardware's minimum latency? Is there a sense in which RDMC and SMC aren't properly optimized, or even aren't designed properly, for this case? A very interesting open topic.
-
Thanks for the insightful comments and help. Here are my numbers:
- RDMA, 1000B messages: all-sender total throughput reaches 626K msgs/sec; one-sender is about 313K msgs/sec.
- RDMA, 100B messages: throughput remains at a high level (even slightly better): all-sender 634K msgs/sec, one-sender 305K msgs/sec.
- After changing to the TCP backend: all-sender throughput is about 17.4K msgs/sec, one-sender 5.6K msgs/sec.
- Moving back to GCP: all-sender throughput is about 16.8K msgs/sec, one-sender 5.0K msgs/sec.

I think I have come to my conclusions. For now, I have got enough performance numbers for my reference. Thanks for the help!
-
Hi, Derecho Staff.
I conducted a simple test to measure the atomic multicast latency of Derecho in a public cloud.
simple_replicated_objects_json-cpp.zip
I launched 3 VMs in Google Cloud, and they form a subgroup. I use libfabric (TCP) for communication.
I use ramfs to store data in memory, with versioned data. I submit 10000 requests from one of the 3 VMs with ordered_send, then I measure the latency to commit each request (i.e., to replicate it to all processes).
I find that the median latency reaches 2500us. This is larger than I expected: the inter-VM latency I measured is 50-100us, and I think it only takes 2 message delays from ordered_send to the reply, so it should take about 200-300us to complete the replication of one request. Each request is small (a string like "myChange-0-xxxx").
I checked Fig. 13(b) in the TOCS paper: the atomic multicast latency for a 1B message (group size of 3) is between 100 and 1000us, so it is probably a few hundred us (300us or so). Since that figure was measured on a 100Gbps network with bare-metal machines, it seems reasonable that my latency in GCP would be larger. But I am curious whether there should be such a gap between the Derecho multicast latency (2500us) and the inter-VM latency (50-100us). Is libfabric so weak?
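To put the gap in numbers, using only my own measurements above:

```
expected  ≈ 200-300 us   (2 message delays over a 50-100 us inter-VM link, plus overhead)
measured  ≈ 2500 us      (median over 10000 ordered_send requests)
gap       ≈ 8-12x        (2500 / 300 ≈ 8.3, 2500 / 200 = 12.5)
```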
Since Derecho gets a multicast latency of 300us or so, what is the inter-machine latency (ping) in Fractus? Is there also a big gap between these two?
Thanks.