Fill out support procedures section
cybertron committed Apr 12, 2024
1 parent 6ee45ce commit 33fd61f
Showing 1 changed file with 10 additions and 44 deletions.
54 changes: 10 additions & 44 deletions enhancements/network/configure-ovs-alternative.md
```diff
@@ -6,7 +6,7 @@ authors:
 reviewers:
 - "@jcaamano"
 - "@trozet"
-- "@sinnykumari"
+- "@yuqi-zhang"
 approvers:
 - "@knobunc"
 api-approvers:
```
```diff
@@ -386,50 +386,16 @@
 NA
 
 ## Support Procedures
 
-Describe how to
-- detect the failure modes in a support situation, describe possible symptoms (events, metrics,
-  alerts, which log output in which component)
+If there is a problem with the provided network configuration, NMState will
+fail to apply it and the deployment will fail because br-ex will not be
+present. However, because the node is required to have functional networking
+prior to this config being applied, NetworkManager should roll back any bad
+changes, leaving the node accessible via its original network config.
 
-  Examples:
-  - If the webhook is not running, kube-apiserver logs will show errors like "failed to call admission webhook xyz".
-  - Operator X will degrade with message "Failed to launch webhook server" and reason "WehhookServerFailed".
-  - The metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")`
-    will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire.
-
-- disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`)
-
-  - What consequences does it have on the cluster health?
-
-    Examples:
-    - Garbage collection in kube-controller-manager will stop working.
-    - Quota will be wrongly computed.
-    - Disabling/removing the CRD is not possible without removing the CR instances. Customer will lose data.
-      Disabling the conversion webhook will break garbage collection.
-
-  - What consequences does it have on existing, running workloads?
-
-    Examples:
-    - New namespaces won't get the finalizer "xyz" and hence might leak resource X
-      when deleted.
-    - SDN pod-to-pod routing will stop updating, potentially breaking pod-to-pod
-      communication after some minutes.
-
-  - What consequences does it have for newly created workloads?
-
-    Examples:
-    - New pods in namespace with Istio support will not get sidecars injected, breaking
-      their networking.
-
-  - Does functionality fail gracefully and will work resume when re-enabled without risking
-    consistency?
-
-    Examples:
-    - The mutating admission webhook "xyz" has FailPolicy=Ignore and hence
-      will not block the creation or updates on objects when it fails. When the
-      webhook comes back online, there is a controller reconciling all objects, applying
-      labels that were not applied during admission webhook downtime.
-    - Namespaces deletion will not delete all objects in etcd, leading to zombie
-      objects when another namespace with the same name is created.
+On day 2, Kubernetes-NMState has health probes that are run to verify every
+config applied. If there is a problem, the config will be rolled back and the
+NNCP will be set to an error state. At worst this should result in a temporary
+outage on a single node.
 
 ## Alternatives
```
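As a sketch of how a support engineer might observe the NNCP error state this commit describes, the following shows a hypothetical NodeNetworkConfigurationPolicy after a failed apply. The condition types follow the kubernetes-nmstate API; the policy name, node name, and message text are illustrative, not taken from this enhancement:

```yaml
# Hypothetical NNCP status after kubernetes-nmstate rolls back a bad config,
# roughly as it might appear via `oc get nncp <name> -o yaml`.
# Names and message text are illustrative.
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br-ex-policy              # hypothetical policy name
status:
  conditions:
  - type: Available
    status: "False"
    reason: FailedToConfigure     # apply failed; config was rolled back
  - type: Degraded
    status: "True"
    reason: FailedToConfigure
    message: "error reconciling NodeNetworkConfigurationPolicy on node worker-0"  # illustrative
```

Because the rollback restores the previous working configuration, the node should remain reachable over its original network while the NNCP reports this degraded state.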
