[issue-172:dns] Gracefully fail DNS resolver interruptions #173
Relates to #172.
Synapse crashes if it can't resolve airbnb.com; the specific unhandled exception is a RuntimeError. I personally don't care that the resolver check is hard-coded to airbnb.com, and I also get that synapse is designed to fail fast. However, this fail-fast approach has an interesting side effect in certain use cases.
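For reference, here's a rough sketch of the kind of graceful handling I'm after. This is illustrative Python rather than Synapse's actual Ruby code, and the names (`check_dns`, the hostname, the retry/backoff values) are placeholders: catch the resolver error, log it, retry with backoff, and report failure instead of letting the exception kill the process.

```python
# Illustrative sketch only -- not Synapse's code. Shows a resolver
# health check that degrades on DNS failure instead of crashing.
import logging
import socket
import time

log = logging.getLogger(__name__)

def check_dns(hostname="example.com", retries=3, backoff=2.0):
    """Return True if hostname resolves, False otherwise.

    Never raises on a resolver interruption, so a transient DNS
    outage fails the check instead of killing the process.
    """
    for attempt in range(1, retries + 1):
        try:
            socket.getaddrinfo(hostname, None)
            return True
        except socket.gaierror as e:
            log.warning("DNS resolution of %s failed (attempt %d/%d): %s",
                        hostname, attempt, retries, e)
            time.sleep(backoff * attempt)
    return False
```

The point is that a temporary resolver outage shows up as a failed check that can recover, rather than terminating every synapse process at once.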
We were testing a pre-production Mesos cluster with synapse as our service discovery layer. We temporarily lost the ability to resolve DNS and synapse crashed on resolution failure. Normally this isn't a big deal.
We are also using Apache Aurora (aka TwitterScheduler) for Mesos, and we run synapse as a process member of the job's task. Under Aurora/Thermos, if a single process exits non-zero, the whole task fails. Normally this isn't a big deal either.
However, since DNS resolution was interrupted and synapse bailed on resolving 'airbnb.com', every synapse process throughout the fleet failed within seconds of each other. This caused every Task that synapse (using zookeeper_dns) was part of to fail. This in turn caused every instance of each Job to fail, which of course caused the jobs to be rescheduled. Had this been a production-sized cluster, we would have crushed the storage layer of our stack.