Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14), October 2014.
Terminology and key findings:
- Catastrophic: affects all or a majority of users
- Failures are the result of a complex sequence of events
- Catastrophic failures are caused by incorrect error handling
- Many are caused by a small set of trivial bug patterns
- Aspirator: a simple rule-based static checker
- Found 143 confirmed new bugs and bad practices
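Aspirator's rules are simple enough to sketch. The real tool is a static analysis over compiled Java bytecode; the following is only a toy source-level illustration (the class name and heuristics are mine, not the paper's) that flags catch blocks which are empty, only log, or are marked TODO/FIXME:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/**
 * Toy source-level illustration of Aspirator-style rules (NOT the real tool,
 * which analyzes compiled bytecode). Brace matching is naive (it ignores
 * braces inside strings and comments), which is fine for an illustration.
 */
public class ToyAspirator {

    // Matches the start of a catch clause up to and including its '{'.
    private static final Pattern CATCH = Pattern.compile("catch\\s*\\([^)]*\\)\\s*\\{");

    public static void main(String[] args) throws IOException {
        try (Stream<Path> files = Files.walk(Paths.get(args[0]))) {
            files.filter(p -> p.toString().endsWith(".java"))
                 .forEach(ToyAspirator::checkFile);
        }
    }

    private static void checkFile(Path file) {
        final String src;
        try {
            src = Files.readString(file);
        } catch (IOException e) {
            return; // skip unreadable files
        }
        Matcher m = CATCH.matcher(src);
        while (m.find()) {
            String body = handlerBody(src, m.end());
            if (body == null) continue; // unbalanced braces; give up on this match
            int line = 1 + (int) src.chars().limit(m.start()).filter(c -> c == '\n').count();
            String trimmed = body.trim();
            if (trimmed.isEmpty()) {
                report(file, line, "empty handler");
            } else if (trimmed.contains("TODO") || trimmed.contains("FIXME")) {
                report(file, line, "handler marked TODO/FIXME");
            } else if (onlyLogs(trimmed)) {
                report(file, line, "handler only logs the error");
            }
        }
    }

    // Text between the catch clause's braces, found by counting brace depth.
    private static String handlerBody(String src, int afterOpenBrace) {
        int depth = 1;
        for (int i = afterOpenBrace; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '{') depth++;
            else if (c == '}' && --depth == 0) return src.substring(afterOpenBrace, i);
        }
        return null;
    }

    // Rough heuristic: every statement in the handler looks like a log/print call.
    private static boolean onlyLogs(String body) {
        return Stream.of(body.split(";"))
                     .map(String::trim)
                     .filter(s -> !s.isEmpty())
                     .allMatch(s -> s.matches("(?s)(LOG|log|logger|System\\.(out|err))\\..*"));
    }

    private static void report(Path file, int line, String why) {
        System.out.println(file + ":" + line + ": suspicious catch block (" + why + ")");
    }
}
```

Running it as `java ToyAspirator path/to/source-tree` prints one line per suspicious handler; a real checker would also track which exceptions are over-caught and whether the handler aborts the whole process.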
Most failures require multiple input events to trigger, so stress testing needs to be complemented by other techniques such as fault injection. The order of those events also matters.
- A majority (77%) of failures require more than one input event to manifest, but most of the failures (90%) require no more than 3.
- The specific order of events is important in 88% of the failures that require multiple input events.
- Almost all (98%) of the failures are guaranteed to manifest on no more than 3 nodes; 84% will manifest on no more than 2 nodes.
- 74% of failures are deterministic
- 53% of the non-deterministic failures have timing constraints only on the input events, so they can still be reproduced by controlling input timing.
- 76% of the failures print explicit failure-related error messages
- For a majority (84%) of the failures, all of their triggering events are logged.
- Logs are noisy: the median number of log messages printed per failure is 824.
- A majority of the production failures (77%) can be reproduced by a unit test.
- Almost all catastrophic failures (92%) are the result of incorrect handling of non-fatal errors explicitly signaled in software
- 35% of the catastrophic failures are caused by trivial mistakes in error-handling logic: mistakes that simply violate best programming practices and can be detected without system-specific knowledge.
- In another 23% of catastrophic failures, the incorrect error handling would have been exposed by 100% statement coverage testing of the error-handling logic (see the unit-test sketch after this list).
- Complexity of failures (most aren't that complex)
- Role of timing (often deterministic enough that they could be reproduced in a test)
- Logs enable diagnosis opportunities (logs are good, but noisy)
- Failure reproducibility (these are catchable in tests if we know where to look)
- Catastrophic failures
- Trivial mistakes in error handlers
- System-specific bugs
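To make the reproducibility and coverage points concrete: because the mishandled errors are explicitly signaled in software, a unit test can usually force the error path directly instead of waiting for a rare fault. A minimal JUnit 5 sketch under assumed names (FlushTask and Storage are illustrative, not from the paper or the studied systems):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.io.IOException;
import org.junit.jupiter.api.Test;

/**
 * Hypothetical example: a unit test that forces the error-handling path
 * directly, giving the handler the statement coverage the study argues
 * would have exposed many catastrophic failures.
 */
class FlushErrorHandlingTest {

    /** Dependency whose failure we want to inject. */
    interface Storage {
        void write(byte[] data) throws IOException;
    }

    /** Component under test: must survive a failed write, not kill the node. */
    static class FlushTask {
        private final Storage storage;
        private boolean degraded = false;

        FlushTask(Storage storage) { this.storage = storage; }

        void flush(byte[] data) {
            try {
                storage.write(data);
            } catch (IOException e) {
                // Reasonable handling: record the failure and keep serving.
                // The anti-pattern would be an empty catch block here, or a
                // System.exit() on this non-fatal error.
                degraded = true;
            }
        }

        boolean isDegraded() { return degraded; }
    }

    @Test
    void flushSurvivesStorageFailure() {
        // Inject the non-fatal error explicitly instead of waiting for a real
        // disk fault: the error path is exercised deterministically.
        Storage failing = data -> { throw new IOException("injected fault"); };
        FlushTask task = new FlushTask(failing);

        task.flush("hello".getBytes());

        assertTrue(task.isDegraded(), "handler should record the failure");
        // If flush() had swallowed the error silently or called System.exit(),
        // this assertion (or the whole JVM) would have caught it.
    }
}
```

The fault is injected deterministically through the dependency, so the handler itself gets full statement coverage without any cluster setup.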
Triggering events:
- Starting up services: more than half of the failures require the start of some services
- Unreachable nodes: 24% of failures occur because a node is unreachable
- Configuration changes: 23% of failures are caused by configuration changes.
- 30% of these involve misconfigurations
- The remaining majority involve valid changes that enable rarely used features
- Adding a node: 15% of failures are triggered by adding a node
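Given that 90% of failures need no more than three input events, almost all manifest on no more than three nodes, and ordering matters, a harness can afford to enumerate every short ordered sequence of the event types above against a small test cluster. A hypothetical sketch (the Event, Cluster, and ClusterFactory abstractions are illustrative, not an API from the paper):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical harness: exhaustively try every ordered sequence of at most
 * MAX_EVENTS input events against a small (<= 3 node) test cluster, reflecting
 * the finding that most failures need few events but are order-sensitive.
 */
public class EventSequenceExplorer {

    static final int MAX_EVENTS = 3;

    /** The triggering-event categories from the study, as injectable actions. */
    enum Event { START_SERVICE, MAKE_NODE_UNREACHABLE, CHANGE_CONFIG, ADD_NODE }

    /** Stand-in for whatever brings up and checks a miniature cluster. */
    interface Cluster extends AutoCloseable {
        void apply(Event e) throws Exception;     // inject one input event
        void checkInvariants() throws Exception;  // e.g. data still readable
        @Override void close();
    }

    interface ClusterFactory { Cluster start(); }

    public static void explore(ClusterFactory factory) {
        for (List<Event> sequence : sequencesUpTo(MAX_EVENTS)) {
            try (Cluster cluster = factory.start()) {
                for (Event e : sequence) {
                    cluster.apply(e);
                }
                cluster.checkInvariants();
            } catch (Exception failure) {
                System.err.println("Failure for ordered sequence " + sequence + ": " + failure);
            }
        }
    }

    /** All ordered sequences (with repetition) of length 1..max. */
    static List<List<Event>> sequencesUpTo(int max) {
        List<List<Event>> result = new ArrayList<>();
        List<List<Event>> current = new ArrayList<>();
        current.add(new ArrayList<>());            // start from the empty prefix
        for (int len = 1; len <= max; len++) {
            List<List<Event>> next = new ArrayList<>();
            for (List<Event> prefix : current) {
                for (Event e : Event.values()) {
                    List<Event> extended = new ArrayList<>(prefix);
                    extended.add(e);
                    next.add(extended);
                }
            }
            result.addAll(next);
            current = next;
        }
        return result;
    }
}
```

With four event types and sequences of length at most three, that is only 4 + 16 + 64 = 84 ordered sequences per run, well within reach of an ordinary CI job.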
Threats to validity:
- Representativeness of the selected systems
- Representativeness of the selected failures
- Size of our sample set
- Possible observer errors
Takeaways:
- We should be able to catch most of these issues in unit testing
- Logs are useful but noisy
- We should put more effort into writing the error-handling code (see the before/after sketch after this list)
- Daniel Jackson's "Small Scope Hypothesis": most bugs have small counterexamples
- Most failures can be reproduced with unit tests
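As a closing illustration of "put more effort into error-handling code", here is a hypothetical before/after (none of it taken from the studied systems) contrasting the two trivial anti-patterns the paper highlights, swallowing a non-fatal error and over-catching then aborting, with a handler that keeps the failure local and visible:

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical before/after of the trivial error-handling mistakes the study describes. */
class ErrorHandlingExamples {
    private static final Logger LOG = Logger.getLogger("example");

    interface Storage { void write(byte[] data) throws IOException; }

    // ANTI-PATTERN 1: swallow the error. The failure surfaces much later,
    // far from its cause, as silent data loss or a confusing crash.
    void flushBad(Storage storage, byte[] data) {
        try {
            storage.write(data);
        } catch (IOException ignored) {
            // TODO: handle this   (the kind of handler Aspirator flags)
        }
    }

    // ANTI-PATTERN 2: over-catch and abort. A non-fatal, recoverable error
    // takes down the entire node, and possibly the cluster with it.
    void flushWorse(Storage storage, byte[] data) {
        try {
            storage.write(data);
        } catch (Throwable t) {
            System.exit(1);
        }
    }

    // BETTER: handle the specific error, preserve the evidence, degrade locally.
    void flushBetter(Storage storage, byte[] data) throws StorageUnavailableException {
        try {
            storage.write(data);
        } catch (IOException e) {
            LOG.log(Level.WARNING, "flush failed, entering degraded mode", e);
            throw new StorageUnavailableException(e); // let the caller decide
        }
    }

    static class StorageUnavailableException extends Exception {
        StorageUnavailableException(Throwable cause) { super(cause); }
    }
}
```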