From 1b0c19075c495cfa672c7a3987a11232671eaf36 Mon Sep 17 00:00:00 2001 From: asraa Date: Thu, 21 Jan 2021 11:33:23 -0500 Subject: [PATCH] [docs] Add guidance on ENVOY_BUG in STYLE.md (#14575) * Add guidanceon ENVOY_BUG and macro usage to STYLE.md Signed-off-by: Asra Ali --- STYLE.md | 118 +++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 88 insertions(+), 30 deletions(-) diff --git a/STYLE.md b/STYLE.md index 3feafbcc10a0..7156614b1d48 100644 --- a/STYLE.md +++ b/STYLE.md @@ -106,24 +106,35 @@ A few general notes on our error handling philosophy: * All error code returns should be checked. * At a very high level, our philosophy is that errors should be handled gracefully when caused by: - - Untrusted network traffic OR + - Untrusted network traffic (from downstream, upstream, or extensions like filters) - Raised by the Envoy process environment and are *likely* to happen + - Third party dependency return codes * Examples of likely environnmental errors include any type of network error, disk IO error, bad - data returned by an API call, bad data read from runtime files, etc. Errors in the Envoy - environment that are *unlikely* to happen after process initialization, should lead to process - death, under the assumption that the additional burden of defensive coding and testing is not an - effective use of time for an error that should not happen given proper system setup. Examples of - these types of errors include not being able to open the shared memory region, system calls that - should not fail assuming correct parameters (which should be validated via tests), etc. Examples - of system calls that should not fail when passed valid parameters include the kernel returning a - valid `sockaddr` after a successful call to `accept()`, `pthread_create()`, `pthread_join()`, etc. -* OOM events (both memory and FDs) are considered fatal crashing errors. An OOM error should never - silently be ignored and should crash the process either via the C++ allocation error exception, an - explicit `RELEASE_ASSERT` following a third party library call, or an obvious crash on a subsequent - line via null pointer dereference. This rule is again based on the philosophy that the engineering - costs of properly handling these cases are not worth it. Time is better spent designing proper system - controls that shed load if resource usage becomes too high, etc. -* The "less is more" error handling philosophy described in the previous two points is primarily + data returned by an API call, bad data read from runtime files, etc. This includes loading + configuration at runtime. +* Third party dependency return codes should be checked and gracefully handled. Examples include + HTTP/2 or JSON parsers. Some return codes may be handled by continuing, for example, in case of an + out of process RPC failure. +* Testing should cover any serious cases that may result in infinite loops, crashes, or serious + errors. Non-trivial invariants are also encouraged to have testing. Internal, localized invariants + may not need testing. +* Errors in the Envoy environment that are *unlikely* to happen after process initialization, should + lead to process death, under the assumption that the additional burden of defensive coding and + testing is not an effective use of time for an error that should not happen given proper system + setup. Examples of these types of errors include not being able to open the shared memory region, + system calls that should not fail assuming correct parameters (which should be validated via + tests), etc. Examples of system calls that should not fail when passed valid parameters include + the kernel returning a valid `sockaddr` after a successful call to `accept()`, `pthread_create()`, + `pthread_join()`, etc. However, system calls that require permissions may cause likely errors in + some deployments and need graceful error handling. +* OOM events (both memory and FDs) or ENOMEM errors are considered fatal crashing errors. An OOM + error should never silently be ignored and should crash the process either via the C++ allocation + error exception, an explicit `RELEASE_ASSERT` following a third party library call, or an obvious + crash on a subsequent line via null pointer dereference. This rule is again based on the + philosophy that the engineering costs of properly handling these cases are not worth it. Time is + better spent designing proper system controls that shed load if resource usage becomes too high, + etc. +* The "less is more" error handling philosophy described in the previous points is primarily based on the fact that restarts are designed to be fast, reliable and cheap. * Although we strongly recommend that any type of startup error leads to a fatal error, since this is almost always a result of faulty configuration which should be caught during a canary process, @@ -148,20 +159,67 @@ A few general notes on our error handling philosophy: continue seems ridiculous because *"this should never happen"*, it's a very good indication that the appropriate behavior is to terminate the process and not handle the error. When in doubt, please discuss. -* Per above it's acceptable to turn failures into crash semantics - via `RELEASE_ASSERT(condition)` or `PANIC(message)` if there is no other sensible behavior, e.g. - in OOM (memory/FD) scenarios. Only `RELEASE_ASSERT(condition)` should be used to validate - conditions that might be imposed by the external environment. `ASSERT(condition)` should be used - to document (and check in debug-only builds) program invariants. Use `ASSERT` liberally, but do - not use it for things that will crash in an obvious way in a subsequent line. E.g., do not do - `ASSERT(foo != nullptr); foo->doSomething();`. Note that there is a gray line between external - environment failures and program invariant violations. For example, memory corruption due to a - security issue (a bug, deliberate buffer overflow etc.) might manifest as a violation of program - invariants or as a detectable condition in the external environment (e.g. some library returning a - highly unexpected error code or buffer contents). Unfortunately no rule can cleanly cover when to - use `RELEASE_ASSERT` vs. `ASSERT`. In general we view `ASSERT` as the common case and - `RELEASE_ASSERT` as the uncommon case, but experience and judgment may dictate a particular approach - depending on the situation. + +# Macro Usage + +* The following macros are available: + - `RELEASE_ASSERT`: fatal check. + - `ASSERT`: fatal check in debug-only builds. These should be used to document (and check in + debug-only builds) program invariants. + - `ENVOY_BUG`: logs and increments a stat in release mode, fatal check in debug builds. These + should be used where it may be useful to detect if an efficient condition is violated in + production (and fatal check in debug-only builds). + +* Sub-macros alias the macros above and can be used to annotate specific situations: + - `ENVOY_BUG_ALPHA` (alias `ENVOY_BUG`): Used for alpha or rapidly changing protocols that need + detectability on probable conditions or invariants. + +* Per above it's acceptable to turn failures into crash semantics via `RELEASE_ASSERT(condition)` or + `PANIC(message)` if there is no other sensible behavior, e.g. in OOM (memory/FD) scenarios. +* Do not `ASSERT` on conditions imposed by the external environment. Either add error handling + (potentially with an `ENVOY_BUG` for detectability) or `RELEASE_ASSERT` if the condition indicates + that the process is unrecoverable. +* Use `ASSERT` and `ENVOY_BUG` liberally, but do not use them for things that will crash in an obvious + way in a subsequent line. E.g., do not do `ASSERT(foo != nullptr); foo->doSomething();`. +* Use `ASSERT`s for true invariants and well-defined conditions that are useful for tests, + debug-only checks and documentation. They may be `ENVOY_BUG`s if performance allows, see point + below. +* `ENVOY_BUG`s provide detectability and more confidence than an `ASSERT`. They are useful for + non-trivial conditions, those with complex control flow, and rapidly changing protocols. Testing + should be added to ensure that Envoy can continue to operate even if an `ENVOY_BUG` condition is + violated. +* Annotate conditions with comments on belief or reasoning, for example `Condition is guaranteed by + caller foo` or `Condition is likely to hold after processing through external library foo`. +* Macro usage should be understandable to a reader. Add comments if not. They should be robust to + future changes. +* Note that there is a gray line between external environment failures and program invariant + violations. For example, memory corruption due to a security issue (a bug, deliberate buffer + overflow etc.) might manifest as a violation of program invariants or as a detectable condition in + the external environment (e.g. some library returning a highly unexpected error code or buffer + contents). Unfortunately no rule can cleanly cover when to use `RELEASE_ASSERT` vs. `ASSERT`. In + general we view `ASSERT` as the common case and `RELEASE_ASSERT` as the uncommon case, but + experience and judgment may dictate a particular approach depending on the situation. The risk of + process death from `RELEASE_ASSERT` should be justified with the severity and possibility of the + condition to avoid unintentional crashes. You may use the following guide: + * If a violation is high risk (will cause a crash in subsequent data processing or indicates a + failure state beyond recovery), use `RELEASE_ASSERT`. + * If a violation is medium or low risk (Envoy can continue safely) and is not expensive, + consider `ENVOY_BUG`. + * Otherwise (if a condition is expensive or test-only), use `ASSERT`. + +Below is a guideline for macro usage. The left side of the table has invariants and the right side +has error conditions that can be triggered and should be gracefully handled. `ENVOY_BUG` represents +a middle ground that can be used for uncertain conditions that need detectability. `ENVOY_BUG`s can +also be added for errors if they warrant detection. + +| `ASSERT`/`RELEASE_ASSERT` | `ENVOY_BUG` | Error handling and Testing | +| --- | --- | --- | +| Low level invariants in data structures | | | +| Simple, provable internal class invariants | Complex, uncertain internal class invariants (e.g. need detectability if violated) | | +| Provable (pre/post)-conditions | Complicated but likely (pre-/post-) conditions that are low-risk (Envoy can continue safely) | Triggerable or uncertain conditions, may be based on untrusted data plane traffic or an extensions’ contract. | +| | Conditions in alpha or changing extensions that need detectability. (`ENVOY_BUG_ALPHA`) | | +| Unlikely environment errors after process initialization that would otherwise crash | | Likely environment errors, e.g. return codes from untrusted extensions, dependencies or system calls, network error, bad data read, permission based errors, etc. | +| Fatal crashing events. e.g. OOMs, deadlocks, no process recovery possible | | | # Hermetic and deterministic tests