
[issue-172:dns] Gracefully fail DNS resolver interruptions #173

Conversation

JustinVenus

Relates to #172.

Synapse crashes if it can't resolve airbnb.com; the specific unhandled exception is RuntimeError. I personally don't care that the resolver check is hard-coded to airbnb.com, and I understand that Synapse is designed to fail fast. However, this fail-fast approach has an interesting side effect in certain use cases.

We were testing a pre-production Mesos cluster with synapse as our service discovery layer. We temporarily lost the ability to resolve DNS and synapse crashed on resolution failure. Normally this isn't a big deal.

We are also using Apache Aurora (aka TwitterScheduler) for Mesos and we are running synapse as a process member of the job's task. Under Aurora/Thermos if a single process exits non-zero the whole task fails. Normally this isn't a big deal either.

However, since DNS resolution was interrupted and synapse bailed on resolving 'airbnb.com', every synapse process throughout the fleet failed within seconds of the others. That caused every Task that synapse (using zookeeper_dns) was part of to fail, which in turn caused every instance of each Job to fail, and those jobs were all rescheduled. Had this been a production-sized cluster, we would have crushed the storage layer of our stack.
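For concreteness, here is a minimal, self-contained sketch of the behavior this PR is after: treat a resolver failure as an unhealthy check instead of letting the exception propagate and kill the process. The `dns_healthy?` helper and the exception list are illustrative assumptions, not Synapse's actual code.

```ruby
require 'resolv'

# Hypothetical helper: returns true if the canary hostname resolves,
# false on resolver failure, rather than letting the exception crash
# the whole process.
def dns_healthy?(hostname = 'airbnb.com', timeout_secs = 2)
  Resolv::DNS.open do |dns|
    dns.timeouts = timeout_secs
    dns.getaddress(hostname)
  end
  true
rescue Resolv::ResolvError, Resolv::ResolvTimeout, RuntimeError => e
  warn "DNS check for #{hostname} failed: #{e.class}: #{e.message}"
  false
end
```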

@jolynch
Collaborator

jolynch commented Feb 19, 2016

As written this won't work because of the main-loop ping check. Synapse really is designed, for the most part, to be crash-fail instead of crash-recovery (with the exception of running the HAProxy reload command, which is crash-recovery). Partly this is because it's easier to implement, and partly because it encourages good practice: making sure your entire control plane (Synapse/Zookeeper/whatever) can crash while your data plane (HAProxy) keeps on chugging.

I changed Nerve to crash-recover in our nerve fork prior to merging it back upstream. I'd be OK with a similar strategy for Synapse (detecting that watchers have failed and trying to re-establish them). @igor47, what are your thoughts on a change like that?
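A rough sketch of what that crash-recovery strategy could look like (an illustration, not Synapse's actual main loop): it assumes watcher objects expose `start`, `stop`, and `ping?`, and the `WatcherSupervisor` class and factory-lambda shape are made up for the example.

```ruby
# Re-establish failed watchers instead of exiting the process.
class WatcherSupervisor
  # factories: { name => lambda returning a fresh, unstarted watcher }
  def initialize(factories)
    @factories = factories
    @watchers  = {}
  end

  def run(interval = 1)
    loop do
      @factories.each do |name, build_watcher|
        current = @watchers[name]
        next if current && current.ping?

        # Reap the failed watcher rather than crashing the whole process.
        begin
          current.stop if current
        rescue StandardError => e
          warn "error stopping watcher #{name}: #{e.message}"
        end

        replacement = build_watcher.call
        replacement.start
        @watchers[name] = replacement
        warn "re-established watcher #{name}"
      end
      sleep interval
    end
  end
end
```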

@JustinVenus
Author

@jolynch I see, I missed the failed ping as the origin of the exception.

I do like the way your nerve fork recovers and that's essentially what I'm looking for in Synapse.

@jolynch
Collaborator

jolynch commented Feb 19, 2016

Just to clarify something, that is how airbnb/nerve works (I merged our fork back into upstream).

I'm on board with a similar piece of functionality for Synapse: basically, if something fails a ping, reap the watcher and relaunch it. The only worry is that with a flaky watcher you might end up restarting HAProxy a lot (check out restart_interval and restart_jitter in the config options for how to prevent that).

Actually, it would be more reliable if Synapse auto-remediated watchers, because as it stands, every time Synapse starts we restart HAProxy regardless of how recently it was reloaded. If we keep the process up, we can make use of the restart_interval parameter to properly limit HAProxy reloads.
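As an illustration of that kind of throttle (not Synapse's implementation; only the `restart_interval` and `restart_jitter` option names are borrowed from the config, and the reload command is a stand-in):

```ruby
# Rate-limit reloads: enforce a minimum interval between reloads, plus a
# random jitter so a fleet of instances doesn't reload in lockstep.
class ReloadThrottle
  def initialize(restart_interval: 2, restart_jitter: 0.0)
    @restart_interval = restart_interval  # minimum seconds between reloads
    @restart_jitter   = restart_jitter    # extra delay, as a fraction of the interval
    @next_allowed_at  = Time.at(0)
  end

  def reload!
    now = Time.now
    return false if now < @next_allowed_at  # too soon; skip this reload

    system('service haproxy reload')        # stand-in for the configured command
    delay = @restart_interval * (1.0 + rand * @restart_jitter)
    @next_allowed_at = now + delay
    true
  end
end
```

A real implementation would remember that a reload was skipped and perform it once the interval elapses, rather than dropping it.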

@igor47
Collaborator

igor47 commented Feb 19, 2016

@jolynch fair point about HAProxy restarts; I would be okay with an overhaul of Synapse to make it recover, especially since we've already done that to Nerve (and are going to do it more with your change to reload the config without restarting).
