[issue-172:dns] Gracefully fail DNS resolver interruptions #173
Relates to #172.
Synapse crashes if it can't resolve airbnb.com; the specific unhandled exception is a RuntimeError. I personally don't care that the resolver check is hard-coded to airbnb.com, and I also get that synapse is designed to fail fast. However, this fail-fast approach has an interesting side effect in certain use cases.
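For reference, here's a rough sketch of the kind of graceful handling I'm after. This is illustrative Python rather than Synapse's actual Ruby code, and the names (`check_dns`, the hostname, the retry/backoff values) are placeholders: catch the resolver error, log it, retry with backoff, and report failure instead of letting the exception kill the process.

```python
# Illustrative sketch only -- not Synapse's code. Shows a resolver
# health check that degrades on DNS failure instead of crashing.
import logging
import socket
import time

log = logging.getLogger(__name__)

def check_dns(hostname="example.com", retries=3, backoff=2.0):
    """Return True if hostname resolves, False otherwise.

    Never raises on a resolver interruption, so a transient DNS
    outage fails the check instead of killing the process.
    """
    for attempt in range(1, retries + 1):
        try:
            socket.getaddrinfo(hostname, None)
            return True
        except socket.gaierror as e:
            log.warning("DNS resolution of %s failed (attempt %d/%d): %s",
                        hostname, attempt, retries, e)
            time.sleep(backoff * attempt)
    return False
```

The point is that a temporary resolver outage shows up as a failed check that can recover, rather than terminating every synapse process at once.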
We were testing a pre-production Mesos cluster with synapse as our service discovery layer. We temporarily lost the ability to resolve DNS and synapse crashed on resolution failure. Normally this isn't a big deal.
We are also using Apache Aurora (aka TwitterScheduler) for Mesos, and we run synapse as a process member of the job's task. Under Aurora/Thermos, if a single process exits non-zero, the whole task fails. Normally this isn't a big deal either.
However, since DNS resolution was interrupted and synapse bailed on resolving 'airbnb.com', every synapse process throughout the fleet failed within seconds of each other. This caused every Task that synapse (using zookeeper_dns) was part of to fail. This in turn caused every instance of each Job to fail, which of course caused the jobs to be rescheduled. Had this been a production-sized cluster, we would have crushed the storage layer of our stack.