Fill out support procedures section
cybertron committed Apr 12, 2024
1 parent 6ee45ce commit 33fd61f
Showing 1 changed file with 10 additions and 44 deletions.
54 changes: 10 additions & 44 deletions enhancements/network/configure-ovs-alternative.md
```diff
@@ -6,7 +6,7 @@ authors:
 reviewers:
 - "@jcaamano"
 - "@trozet"
-- "@sinnykumari"
+- "@yuqi-zhang"
 approvers:
 - "@knobunc"
 api-approvers:
```
```diff
@@ -386,50 +386,16 @@
 NA
 
 ## Support Procedures
 
-Describe how to
-- detect the failure modes in a support situation, describe possible symptoms (events, metrics,
-  alerts, which log output in which component)
+If there is a problem with the provided network configuration, NMState will
+fail to apply it and the deployment will fail because br-ex will not be
+present. However, because the node is required to have functional networking
+prior to this config being applied, NetworkManager should roll back any bad
+changes, leaving the node accessible via its original network config.
 
-  Examples:
-  - If the webhook is not running, kube-apiserver logs will show errors like "failed to call admission webhook xyz".
-  - Operator X will degrade with message "Failed to launch webhook server" and reason "WehhookServerFailed".
-  - The metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")`
-    will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire.
-
-- disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`)
-
-  - What consequences does it have on the cluster health?
-
-    Examples:
-    - Garbage collection in kube-controller-manager will stop working.
-    - Quota will be wrongly computed.
-    - Disabling/removing the CRD is not possible without removing the CR instances. Customer will lose data.
-      Disabling the conversion webhook will break garbage collection.
-
-  - What consequences does it have on existing, running workloads?
-
-    Examples:
-    - New namespaces won't get the finalizer "xyz" and hence might leak resource X
-      when deleted.
-    - SDN pod-to-pod routing will stop updating, potentially breaking pod-to-pod
-      communication after some minutes.
-
-  - What consequences does it have for newly created workloads?
-
-    Examples:
-    - New pods in namespace with Istio support will not get sidecars injected, breaking
-      their networking.
-
-  - Does functionality fail gracefully and will work resume when re-enabled without risking
-    consistency?
-
-    Examples:
-    - The mutating admission webhook "xyz" has FailPolicy=Ignore and hence
-      will not block the creation or updates on objects when it fails. When the
-      webhook comes back online, there is a controller reconciling all objects, applying
-      labels that were not applied during admission webhook downtime.
-    - Namespaces deletion will not delete all objects in etcd, leading to zombie
-      objects when another namespace with the same name is created.
+On day 2, Kubernetes-NMState has health probes that are run to verify every
+config applied. If there is a problem, the config will be rolled back and the
+NNCP will be set to an error state. At worst this should result in a temporary
+outage on a single node.
 
 ## Alternatives
```
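As a sketch of how a support engineer might observe the NNCP error state this commit describes, the following shows a hypothetical NodeNetworkConfigurationPolicy after a failed apply. The condition types follow the kubernetes-nmstate API; the policy name, node name, and message text are illustrative, not taken from this enhancement:

```yaml
# Hypothetical NNCP status after kubernetes-nmstate rolls back a bad config,
# roughly as it might appear via `oc get nncp <name> -o yaml`.
# Names and message text are illustrative.
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br-ex-policy              # hypothetical policy name
status:
  conditions:
  - type: Available
    status: "False"
    reason: FailedToConfigure     # apply failed; config was rolled back
  - type: Degraded
    status: "True"
    reason: FailedToConfigure
    message: "error reconciling NodeNetworkConfigurationPolicy on node worker-0"  # illustrative
```

Because the rollback restores the previous working configuration, the node should remain reachable over its original network while the NNCP reports this degraded state.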
