Replies: 5 comments 9 replies
-
Before porting the system to a new networking setup, my advice would be to exactly replicate our numbers on a setup identical to ours, which can be done via your free CloudLab account, on one of their 100Gb RDMA clusters. I realize that your work isn’t really being done on CloudLab, but the idea is to change one thing at a time. Right now you are changing many things all at the same time: in such situations tuning is needed, and you don’t yet know the system well enough to set the relevant configuration parameters. Here are some factors to think about:
With direct 1KB multicast from a member to the other members, over RDMA, with adequate resources and an idle machine, this specific setup should yield numbers that match the Spindle paper.

Next, reconfigure to initiate the multicasts from a member of the top-level group, but not inside the same subgroup. This forces relaying and will let you measure the delay introduced by the indirection associated with relaying. Because your own work doesn't include any form of relaying, this is more to understand that if you set up a client external to the destination subgroup, you are deliberately placing delay on the critical path. That may be appropriate in many settings, but it wouldn't be a fair comparison for your work.

Finally, you might try again with an external client that is not a member of the top-level group. Here you will see the highest overheads of all. As a comment, I suspect that this is the case you measured in your experiment described above. As you can see, it is easy to accidentally force Derecho into a less efficient communication pattern, which will bring costs that your own work might be avoiding simply by doing a more direct experiment. In fact, the real comparison you want is probably our RDMA multicast, from a sender in the subgroup, versus your network-ordered Paxos, also from a sender in the subgroup.

Now, having done all of this on CloudLab, modify the config file to put Derecho into TCP mode by changing the Libfabric provider option (see the sketch at the end of this comment). Rerun the member-of-a-subgroup-to-the-other-members experiment. This time you should see the lower latencies you were expecting. If not, that might point to a bug: I'm not certain Lorenzo retested with TCP after doing the Spindle optimizations, and sometimes a change that speeds up RDMA performance is less effective with TCP. But Lorenzo is careful, and honestly, that would surprise me.

At this stage, one last step would be to port this exact setup back to your research cluster, still using TCP. Compare Derecho on 100Gb CloudLab TCP to the numbers on your cluster. How big a hit did it take? That would be due to your cluster having a different, and perhaps slower, TCP network and stack. But it would still be a reasonable comparison to publish, if you also note that with RDMA, Derecho latency would be sharply lower and throughput would be much higher. You will find that the reviewers of a paper like yours are far more forgiving of honest numbers than of any kind of distortion. So in particular, if you have better numbers on TCP with your router hardware feature and your version of Raft, but worse numbers when Derecho can run with RDMA on similar nodes, you would just make the case in your paper that many clusters lack RDMA but might be able to use your technology in the router, and that because you do better on pure TCP plus your feature, it represents an exciting option for developers who lack RDMA but do have the ability to deploy router upgrades. If it was me reviewing the paper, I would be open-minded about that: not every paper sets a totally new performance record.
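For concreteness, the provider switch I mention above lives in the derecho.cfg file that ships with the samples. This is only a sketch from memory, so treat the section and field names as approximate and check them against the sample config in your checkout:

```
[RDMA]
# provider = verbs     # RDMA (InfiniBand / RoCE) through Libfabric
provider = sockets     # plain TCP through the Libfabric sockets provider
domain = eth0          # the interface Libfabric should bind to on your nodes
```

Nothing else in the experiment needs to change; only the transport underneath does.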
-
You need @ellerre (Lorenzo Rosa) to comment on that latency, but because Spindle optimizes the whole system for throughput and views low latency as a secondary consideration, using opportunistic batching throughout, it certainly could increase latency during tests that send a stream of messages, and I think the ms unit is probably intended. We don't know of many situations where latency for a single isolated multicast would be the primary concern. If you do go down that route, check the work done by Joe Israelivitz at the University of Colorado on a protocol called Acuerdo, if that paper has been published (I saw a prepublication version a while back). Acuerdo was looking at exactly this issue of delay for individual Paxos multicasts. Honestly, while I do believe there are settings that might need a low-latency one-time atomic multicast, I don't see that as a common scenario. More often you are dealing with streams of events. That said, Spindle definitely does increase latency due to its batching, and we accepted that because our users turn out to care a lot about using the bandwidth capacity of their clusters, so they care about throughput even more than latency. You can legitimately use this as a contrast to the priorities of your work, but be careful to justify your choices with real use cases that actually demand ultra-low latency for isolated multicasts.

One more remark: Lorenzo wasn't actually trying to test our latency for an individual multicast, and his benchmark is designed to trigger batching. To get a fair number, you may have to slow the test down, to maybe 1 or 2 multicasts per second (see the sketch below). Ignore the first few (cold caches), but once the pathway warms up you should start to see latencies with none of the Spindle batching features kicking in.
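To make the pacing concrete, here is a minimal sketch of the kind of loop I have in mind. It is not our benchmark code: send_one_multicast stands for whatever your test already does for a single ordered_send (including waiting for the replies), and the warmup and pacing constants are just illustrative.

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Sketch: measure per-multicast latency at a rate low enough that
// opportunistic batching never has a chance to kick in.
template <typename SendOneMulticast>
void paced_latency_test(SendOneMulticast&& send_one_multicast,
                        int warmup = 10, int samples = 100) {
    std::vector<double> latencies_us;
    for (int i = 0; i < warmup + samples; ++i) {
        auto start = std::chrono::steady_clock::now();
        send_one_multicast();   // one ordered_send + wait for delivery/replies
        auto end = std::chrono::steady_clock::now();
        if (i >= warmup) {      // drop the cold-cache samples
            latencies_us.push_back(
                std::chrono::duration<double, std::micro>(end - start).count());
        }
        // Roughly 2 multicasts per second: far too slow for batching to engage.
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
    }
    std::sort(latencies_us.begin(), latencies_us.end());
    std::cout << "median isolated-multicast latency: "
              << latencies_us[latencies_us.size() / 2] << " us" << std::endl;
}
```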
-
Hello @Steamgjk , thank you for your interest in the work. Let me add some further considerations.

First, about Derecho on Libfabric on TCP being slow. I had never tested that configuration extensively, but I feel we cannot do much more to improve performance on that side. Derecho is built to take advantage of RDMA features and, as you pointed out, Libfabric supports RDMA over TCP mainly for testing and compatibility. Currently, I think that only DPDK (or other forms of acceleration, like XDP) can achieve RDMA-comparable performance on non-RDMA networks, and we are actively working to add DPDK support to Derecho. Probably your best chance to improve these numbers in a short time, as Ken suggested, is to try a userspace TCP implementation, for example mTCP, which in turn uses DPDK to optimize the data path.

Then, about the issues in reproducing the Spindle RDMA results. Using the 100 Gbps NIC will probably suffice to close the gap between the paper results and what you are getting now. You should also check that, if you are running on a multi-core node, you are pinning all the Derecho threads to execute on the same NUMA node (see the sketch at the end of this comment). Another possible source of overhead is that on CloudLab we have little control over the network topology: you may end up with nodes that are physically very distant, adding more network hops with respect to our internal testbed. Using IB or RoCE should not make much difference.

Also, I think you are not looking at the right Spindle figures. Figs. 4 and 5 in Spindle show only a partial result, considering only batching and not the other optimizations (yet opportunistic batching is shown to even improve latency). What one gets by running Derecho is the whole set of optimizations, including null-sends and smart locking, which are evaluated in the final figures (16 and 17). You are using a different message size, but it should be feasible to estimate the packets per second from those.

Finally, the Spindle results were obtained more than 2 years ago (Jan 2020), so it is possible that something has changed, even though I don't think performance is significantly worse. Feel free to contact me by email if you need the specific Spindle configuration files. You could also look at our DerechoDDS paper, which shows more recent (July 2021) performance measurements on a 40Gbps CloudLab NIC.
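About the NUMA pinning mentioned above: the quickest way to enforce it is to launch the whole process under numactl. The node index below is just an example (the application name is a placeholder too); it should be the node your NIC is attached to, which lstopo or /sys/class/net/&lt;iface&gt;/device/numa_node will tell you.

```
# Bind both the CPUs and the memory allocations to a single NUMA node.
# Node 0 is only an example; ./your_derecho_app is a placeholder.
numactl --cpunodebind=0 --membind=0 ./your_derecho_app
```

For the packets-per-second estimate, simply divide the bandwidth reported in Figs. 16-17 by your own message size; that ratio is what you can compare across message sizes.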
-
I completely agree with Lorenzo (including that we would welcome a contribution from you, if you find an angle to speed up the system on TCP network deployments!). Derecho itself is delivering messages as soon as it is safe to do that, using information-sharing paths that are as short as it is possible to make them. So if you want consensus guarantees (aka Paxos, State Machine Replication, Dynamic Uniform Agreement: they all mean almost exactly the same thing), you aren't going to make the data paths in the underlying algorithm shorter. The angles worth exploring, I think, center on reliable data transport rather than multicast ordering, but there might be an ordering opportunity too.

For transport, the system currently has point-to-point and multicast, and the multicast is done using one protocol for big messages and a different one for small messages. The big-message protocol, RDMC, isn't tuned for TCP (or DPDK, or XDP). But you won't get an SOSP paper from tuning RDMC, unless you can identify some deep opportunity or insight that we aren't thinking about at all. And this could actually be possible: these days people often want crypto "on the wire", and there are works that use erasure coding to reduce the amount of TCP retransmission. (RDMA is really basically TCP, but using credits to eliminate 99.999% of the packet drops by ensuring that when a sender sends, there is space available for the incoming messages on each router hop and at each destination. The NIC then implements ack/nack/congestion handling totally differently because those drops are so rare.) So integration with FPGAs or other accelerators could be exciting. This is still a space for innovation, I think. One could maybe do a datacenter transport, call it DCP, that would plug into Derecho using the SMC and RDMC API, leverage clever tricks to greatly improve performance in TCP settings, exploit hardware like bump-in-the-wire crypto, etc. I think this would be quite interesting.

The other angle worth thinking about is the null-send mechanism. Derecho sends actual null multicasts if a sender isn't ready, and with RDMA latency and speeds, that strategy is a good one. But with high latency and lower speeds, the delay to discover that everyone is waiting for some process to send a null multicast is going to be an issue, and we wouldn't be running at the highest possible speeds. In fact, in this one aspect, Derecho's optimality proof doesn't apply. Yet this is central to performance. So I think one could do a lot at that level, in terms of how many nulls to send when sending a null: if a process isn't going to send in this round, maybe it should send a message "ignore me for 2 rounds", or it could be 4, 8, etc. Then of course we need to somehow ensure that if it suddenly does need to send, but by now everyone else is idle, then even if it last said "skip me for 256 rounds" it still has a way to send its message promptly and have it delivered promptly (see the sketch below). I see a lot more that could be done in this layer of the solution. A higher-latency transport definitely might push us into domains of behavior where that would be needed. Networking over cellular 5G is especially interesting.

I'm not so crazy about ideas to run on unreliable transports. Derecho currently requires a reliable lower layer: FIFO, lossless, and without duplication. I don't see much advantage in modifying Derecho at a higher level to work around lower-level unreliability.
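Just to make the "skip me for N rounds" idea above concrete, here is a rough sketch of the sender-side bookkeeping. It is purely illustrative (the names are made up, and this is not how Derecho's null-send logic is written), and it deliberately leaves open the hard part: how a sender that announced a long idle window gets a new message delivered promptly anyway.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative sketch: instead of one null multicast per idle round, a sender
// announces exponentially growing "idle windows" and resets the moment it has
// real data again.
class NullSendWindow {
    uint32_t rounds_remaining = 0;  // rounds already covered by an announcement
    uint32_t next_window = 1;       // next window to announce: 1, 2, 4, ... up to a cap
public:
    // Called once per round when this sender has nothing to send.
    // Returns >0 if it should multicast "ignore me for that many rounds",
    // or 0 if a previous announcement still covers this round.
    uint32_t on_idle_round() {
        if (rounds_remaining > 0) {
            --rounds_remaining;
            return 0;
        }
        rounds_remaining = next_window;
        next_window = std::min<uint32_t>(next_window * 2, 256);
        return rounds_remaining;
    }
    // Called when real data shows up. The open question discussed above is how
    // the protocol lets this message through promptly even though the group
    // was told to skip this sender for many more rounds.
    void on_data_ready() {
        rounds_remaining = 0;
        next_window = 1;
    }
};
```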
The other thing: I really don't see any value in switching to a Paxos with a majority-write, quorum-read policy. Our system assumes that every write reaches every receiver, and this really does pay off for speedy reads. So although Acuerdo has a more standard Paxos and actually does outperform Derecho (built on the same lower layers!) for patterns in which some servers are very laggy, I really prefer Spindle's opportunistic batching as the way to handle those behaviors (Acuerdo was never compared against the Spindle version of Derecho, as far as I know). But the Acuerdo authors had a lot of trouble finding a way to publish their paper, so apparently the NSDI and SIGCOMM folks weren't excited by that topic either. Or maybe the paper was just not polished enough, or the experiments somehow weren't convincing enough. Personally, I really liked the draft I was shown. But even so, I think that opportunistic batching is inherently a better match to the balky-server scenario, and I bet we could equal or beat Acuerdo on the same workload using the Spindle optimizations.

One last direction of interest would be time-sensitive or real-time scenarios where low latency is the top priority rather than throughput. We haven't really studied this carefully, and it isn't clear to me whether we are missing any major opportunities as a result. As mentioned above, the experiment to run here would simply send one multicast at a time with delays between them, so we could have all our batching features enabled but the system would never see any opportunity to use batching. And then the question to ask would be: how close are we to the hardware's minimum latency? Is there a sense in which RDMC and SMC aren't properly optimized, or even aren't designed properly, for this case? A very interesting open topic.
-
Thanks for the insightful comments and help. Here are my numbers:
- RDMA, 1000B messages: all-sender total throughput reaches 626K msgs/sec; one-sender is about 313K msgs/sec.
- RDMA, 100B messages: throughput remains at a high level (even slightly better): all-sender 634K msgs/sec, one-sender 305K msgs/sec.
- After changing to the TCP backend: all-sender throughput is about 17.4K msgs/sec, one-sender 5.6K msgs/sec.
- Moving back to GCP: all-sender throughput is about 16.8K msgs/sec, one-sender 5.0K msgs/sec.

I think I have come to my conclusions. For now, I have got enough performance numbers for my reference. Thanks for the help!
-
Hi, Derecho Staff.
I conducted a simple test to measure the atomic multicast latency of Derecho in a public cloud.
simple_replicated_objects_json-cpp.zip
I launched 3 VMs in Google Cloud, and they form a subgroup. I use libfabric (TCP) for communication.
I use ramfs to store data in memory, with versioned data. I submit 10000 requests from one of the 3 VMs with ordered_send, then I measure the latency to commit each request (i.e., to replicate it to all processes).
I find that the median latency reaches 2500us. This is larger than I expected: the inter-VM latency I measured is 50-100us, and I think it only takes 2 message delays from ordered_send to the reply, so it should take about 200-300us to complete the replication of one request. Each request is small (a string like "myChange-0-xxxx").
I checked Fig. 13(b) in the TOCS paper: the atomic multicast latency for a 1B message (group size of 3) is between 100 and 1000us, so it is probably a few hundred us (300us or so). Since that figure was measured on a 100Gbps network with bare-metal machines, it seems reasonable that my latency in GCP would be larger. But I am curious whether there should be such a gap between the Derecho multicast latency (2500us) and the inter-VM latency (50-100us). Is libfabric so weak?
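To put the gap in numbers, using only my own measurements above:

```
expected  ≈ 200-300 us   (2 message delays over a 50-100 us inter-VM link, plus overhead)
measured  ≈ 2500 us      (median over 10000 ordered_send requests)
gap       ≈ 8-12x        (2500 / 300 ≈ 8.3, 2500 / 200 = 12.5)
```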
Since Derecho gets a multicast latency of 300us or so, what is the inter-machine latency (ping) in Fractus? Is there also a big gap between these two?
Thanks.