Error handling and recovery for DC failures #422
@shamouda Since you've been working on adding redundancy and fault tolerance to DCs, maybe you can comment on whether you have also observed this kind of problem and whether your work will fix it.
Peter,
I read Matthew’s description differently. He talks about replicas, which makes me think that each node is emulating a full DC. The three nodes represent three DCs. Doesn’t this look a lot like the known issue with lack of anti-entropy?
Marc
… On 27 May 2020, at 12:02, Peter Zeller ***@***.***> wrote:
When a node within a DC temporarily fails for about 1 minute, the DC does not recover automatically after the node comes back up.
The cluster needs to be joined again manually.
One would expect that, with one node down, the DC would still be able to serve requests for keys not stored on the failed node. Apparently that is not the case.
This was reported by Matthew on Slack, full report below. I have not yet tried to reproduce it on my machine.
Hi, we have antidote running on three machines, all directly connected to each other via a dedicated interface (so 2 interfaces per machine, with 3 total wires). It behaves correctly in the absence of failures; however, there is an issue when we test bringing down interfaces between the nodes. With 3 nodes in a cluster, bringing down the interfaces of Node 1, one at a time, causes an asymmetric connection between the other two connected nodes. The behavior we are witnessing is that updates are not replicated in both directions. Node 2 can send updates that are replicated to Node 3, but not vice versa. Neither node 2 nor node 3 have had their interfaces touched, and their dedicated link remains healthy.
If we take down the interfaces on Node 1 all at once, the cluster stays healthy. We think this could be because, in the window between Node 1 losing its connection to Node 2 and losing its connection to Node 3, Node 1 reports Node 2's “failure” to Node 3, causing Nodes 1 and 3 to believe they form a majority partition. Then, when Node 1 loses its connection to Node 3, Node 3 believes it is alone. What is surprising to us is that Node 2's updates continue to reach Node 3 in this scenario, but not the reverse.
Have we hit upon the correct diagnosis for our strange behavior? If we have, do you folks know how we can resolve this network state?
We're bringing down the connection within a single datacenter and restoring it after about 1 minute (just long enough for timeouts to fire).
We do need to manually resubscribe to the restored node when it returns if we keep it down long enough, but that's not what worries us.
What worries us is that two unrelated nodes experience a communication interruption after we take one node down. All nodes are in the same DC.
We had assumed that the only possible effect of restricting communication to a single node in the DC is that the remaining healthy members would unsubscribe from that node. We did not anticipate that this could cause healthy nodes to unsubscribe from each other.
Actually I had gotten confused myself about our setup. Marc is right, we have each node representing its own DC.
We are using the native (Erlang) API to connect each DC via RPC, not one of the clients that Peter asked about on Slack.
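For context, connecting DCs over the native API normally follows the pattern from the Antidote setup documentation: each DC hands out a connection descriptor and then subscribes to the descriptors of the other DCs via rpc:call. The sketch below assumes three single-node DCs with placeholder node names; the antidote_dc_manager module and function names are taken from that documentation and may differ between Antidote versions, so treat them as assumptions rather than the exact calls used in this deployment.

```erlang
%% Sketch: connecting three single-node DCs over the native API,
%% run from an Erlang shell attached to any of the nodes.
%% Node names are placeholders; the antidote_dc_manager calls are
%% assumed from the Antidote setup docs and may vary by version.
Nodes = ['antidote@host1', 'antidote@host2', 'antidote@host3'],

%% Collect a connection descriptor from every DC ...
Descriptors = [begin
                   {ok, D} = rpc:call(N, antidote_dc_manager,
                                      get_connection_descriptor, []),
                   D
               end || N <- Nodes],

%% ... and have every DC subscribe to updates; following the docs,
%% the full descriptor list is passed to each DC.
[rpc:call(N, antidote_dc_manager, subscribe_updates_from, [Descriptors])
 || N <- Nodes].
```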
Thanks for the clarification. Maybe you can check whether the problem is fixed by the changes from pull request #421. You can either compile the branch yourself or use the corresponding Docker image.
Unfortunately those changes did not affect the behavior we are seeing.
When a DC temporarily fails for about 1 minute, the other DCs also fail to communicate with each other.
After the failing DC has restarted, the DCs need to be joined again manually.
This was reported by Matthew on Slack, full report below. I have not yet tried to reproduce it on my machine.
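For what it's worth, the manual re-join mentioned here appears to boil down to repeating the subscription step once the restarted DC is reachable again. A minimal sketch under the same assumptions as the snippet above (placeholder node names, antidote_dc_manager API as described in the Antidote setup docs):

```erlang
%% Sketch of the manual recovery step: once 'antidote@host1' is back,
%% fetch a fresh descriptor from it and have the other DCs subscribe
%% to it again. Names are placeholders and the API calls are assumed
%% from the Antidote setup docs; the restarted DC may also need to
%% re-subscribe to the others in the same way.
{ok, RestoredDesc} = rpc:call('antidote@host1', antidote_dc_manager,
                              get_connection_descriptor, []),
[rpc:call(N, antidote_dc_manager, subscribe_updates_from, [[RestoredDesc]])
 || N <- ['antidote@host2', 'antidote@host3']].
```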