From 38728d23ae05624da18676c0d868e1014d8b7af4 Mon Sep 17 00:00:00 2001 From: Sunyanan Choochotkaew Date: Tue, 5 Dec 2023 12:30:41 +0900 Subject: [PATCH 1/4] add missing field on unmanaged host Signed-off-by: Sunyanan Choochotkaew --- document/docs/release/alpha.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/document/docs/release/alpha.md b/document/docs/release/alpha.md index 7e98a556..129d6b7c 100644 --- a/document/docs/release/alpha.md +++ b/document/docs/release/alpha.md @@ -74,9 +74,15 @@ - hostIP: "" interfaceName: eth1 netAddress: 192.168.0.0/24 + pciAddress: "" + product: "" + vendor: "" - hostIP: "" interfaceName: eth2 netAddress: 192.168.1.0/24 + pciAddress: "" + product: "" + vendor: "" * Multi-gateway route configuration support From 675ff9664f3a2be9c4cac2dd470ada1a3c538d9a Mon Sep 17 00:00:00 2001 From: Sunyanan Choochotkaew Date: Tue, 5 Dec 2023 16:54:56 +0900 Subject: [PATCH 2/4] add link to github Signed-off-by: Sunyanan Choochotkaew --- document/docs/index.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/document/docs/index.md b/document/docs/index.md index de5ada41..5cf0a4d4 100644 --- a/document/docs/index.md +++ b/document/docs/index.md @@ -3,3 +3,5 @@ Multi-NIC CNI is an open source project implementing CNI plugin and operating on top of [Multus CNI](https://github.com/k8snetworkplumbingwg/multus-cni). However, unlike Multus, instead of defining and handling each secondary network interface one by one, this CNI automatically discovers all available secondary interfaces and handles them as a NIC pool. ![](./img/commonstack.png) + +Check out the project on GitHub ➡️ [Multi-NIC CNI](https://github.com/foundation-model-stack/multi-nic-cni). From fee2e459c53a481fd1809a7f6b9e982db917cf03 Mon Sep 17 00:00:00 2001 From: Sunyanan Choochotkaew Date: Tue, 12 Dec 2023 17:51:50 +0900 Subject: [PATCH 3/4] add instruction to add cni log to daemon Signed-off-by: Sunyanan Choochotkaew --- document/docs/user_guide/troubleshooting.md | 32 ++++++++++++++++++++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/document/docs/user_guide/troubleshooting.md b/document/docs/user_guide/troubleshooting.md index 7a2210d2..3fa763ce 100644 --- a/document/docs/user_guide/troubleshooting.md +++ b/document/docs/user_guide/troubleshooting.md @@ -154,6 +154,36 @@ If yes, try [restarting multi-nic-cni controller node](#restart-controller) to f ## Actions ### Get CNI log (available after v1.0.3) +To make CNI log available on the daemon pod, you may mount the the host log path to the daemon pod: + +- Run + +```bash +kubectl edit config.multinic multi-nicd +``` + +- Add the following mount items + +```yaml +# config/multi-nicd +spec: + daemon: + mounts: + ... + - hostpath: /var/log/multi-nic-cni.log + name: cni-log + podpath: /host/var/log/multi-nic-cni.log + - hostpath: /var/log/multi-nic-ipam.log + name: ipam-log + podpath: /host/var/log/multi-nic-ipam.log + # For AWS-IPVLAN main plugin log also add the following lines: + # - hostpath: /var/log/multi-nic-aws-ipvlan.log + # name: ipam-log + # podpath: /host/var/log/multi-nic-aws-ipvlan.log +``` + +Then, you can get CNI log from the following commands: + ```bash # default main plugin kubectl exec $(kubectl get po -owide -A|grep multi-nicd\ @@ -163,7 +193,7 @@ kubectl exec $(kubectl get po -owide -A|grep multi-nicd\ # multi-nic on aws main plugin kubectl exec $(kubectl get po -owide -A|grep multi-nicd\ |grep $FAILED_NODE|awk '{printf "%s -n %s", $2, $1}')\ --- cat /var/log/multi-nic-aws-ipvlan.log +-- cat /host/var/log/multi-nic-aws-ipvlan.log # IPAM plugin kubectl exec $(kubectl get po -owide -A|grep multi-nicd\ From f1cdc0157b55915721580c25950c4521ce00109e Mon Sep 17 00:00:00 2001 From: Sunyanan Choochotkaew Date: Mon, 15 Jan 2024 18:12:50 +0900 Subject: [PATCH 4/4] update troubleshooting steps with summary table Signed-off-by: Sunyanan Choochotkaew --- .../health-checker.md | 0 .../troubleshooting.md | 172 +++++++++++++++++- 2 files changed, 162 insertions(+), 10 deletions(-) rename document/docs/{user_guide => troubleshooting}/health-checker.md (100%) rename document/docs/{user_guide => troubleshooting}/troubleshooting.md (61%) diff --git a/document/docs/user_guide/health-checker.md b/document/docs/troubleshooting/health-checker.md similarity index 100% rename from document/docs/user_guide/health-checker.md rename to document/docs/troubleshooting/health-checker.md diff --git a/document/docs/user_guide/troubleshooting.md b/document/docs/troubleshooting/troubleshooting.md similarity index 61% rename from document/docs/user_guide/troubleshooting.md rename to document/docs/troubleshooting/troubleshooting.md index 3fa763ce..5a6d0982 100644 --- a/document/docs/user_guide/troubleshooting.md +++ b/document/docs/troubleshooting/troubleshooting.md @@ -1,4 +1,36 @@ # Manual Troubleshooting + +** Please first confirm feature supports on each multi-nic-cni release version from [here](../release/index.md). ** + + + +- [Issues](#issues) + - [Pod failed to start](#pod-failed-to-start) + - [Pod failed to start (Summary Table)](#pod-failed-to-start-summary-table) + - [Ping failed](#ping-failed) + - [TCP/UDP communication failed.](#tcpudp-communication-failed) +- [Actions](#actions) + - [Controller configuration](#controller-configuration) + - [Daemon configuration](#daemon-configuration) + - [List in-use pods](#list-in-use-pods) + - [Get CNI log (available after v1.0.3)](#get-cni-log-available-after-v103) + - [Get Controller log](#get-controller-log) + - [Get multi-nicd log](#get-multi-nicd-log) + - [Deploy multi-nicd config](#deploy-multi-nicd-config) + - [Set security groups](#set-security-groups) + - [Add secondary interfaces](#add-secondary-interfaces) + - [Restart controller](#restart-controller) + - [Restart multi-nicd](#restart-multi-nicd) + - [Check host secondary interfaces](#check-host-secondary-interfaces) + - [Update daemon pod to use latest version](#update-daemon-pod-to-use-latest-version) + - [Update controller to use latest version](#update-controller-to-use-latest-version) + - [Safe upgrade Multi-NIC CNI operator](#safe-upgrade-multi-nic-cni-operator) + + + +## Issues +There are commonly three steps of issue: at pod creation, simple ICMP (ping) communication, TCP/UDP communication. The most complicated one is at pod creation. + Before start troubleshooting, set common variables for reference simplicity. ```bash export FAILED_POD= # pod that fails to run @@ -7,18 +39,56 @@ export FAILED_NODE= # node where pod is deployed export FAILED_NODE_IP = # IP of FAILED_NODE export MULTI_NIC_NAMESPACE= # namespace where multi-nic cni operator is deployed, default=multi-nic-cni-operator ``` -## Issues + ### Pod failed to start + +**Issue:** Pod stays pending in `ContainerCreating` status. Get more information from `describe` ```bash kubectl describe $FAILED_POD -n $FAILED_POD_NAMESPACE ``` -- Error: `FailedCreatePodSandBox` - * [CNI binary not found](#cni-binary-not-exist) - * [IPAM ExecAdd failed](#ipam-execadd-failed) -##### CNI binary not exist +Find the following keyword from `FailedCreatePodSandBox`: + +* [Network not found](#network-not-found) +* [CNI binary not found](#cni-binary-not-found) +* [IPAM ExecAdd: failed](#ipam-execadd-failed) +* [IPAM plugin returned missing IP config](#ipam-plugin-returned-missing-ip-config) +* [zero config](#zero-config) + +### Pod failed to start (Summary Table) +For those who are familar to action command (e.g., list multinic CRs, list daemon pods), you may troubleshoot with the summary table: + +> - Investigate source of issue from top to bottom +> - *X* refers to no relevance +> - If the issue cannot be solved by configuration (multinicnetwork, annotation, host network, config.multinic) and last patch of [controller](#update-daemon-pod-to-use-latest-version) and [multi-nicd](#update-daemon-pod-to-use-latest-version), please report the [issue](https://github.com/foundation-model-stack/multi-nic-cni/issues) with the corresponding log. +> - *The solved bug on CNI binary requires node restart. + +Potential source of Issue|Network not found|CNI binary not found|- IPAM ExecAdd: failed
- IPAM plugin returned missing IP config|zero config|Fail execPlugin +---|---|---|---|---|--- +**multinicnetwork definition/annotation**|- annotation missing/mismatch
- multinicnetwork wrong configured|X|- IPAM wrong configured
- `masters` multinicnetwork spec missing (> 1 multinicnetwork)|**non-IP host:**
- no master name provided via multi-config or annotation|X +**host network**|X|X|X|**L3:**
- daemon communication blocked
**All:**
- interface missing
|X +**controller**|- net-attach-def not created|- daemon not created due to wrong configured (config.multinic)|**L3:**
- daemon/hostinterface not created
- CIDR/IPPool not created/unsynced|X|X +**daemon**
(multi-nicd)|X|X|**L3:**
- failed to discover hostinterface
- IP limit reach
**All cases:**
- hang on no-respond API server (should be fixed by [#172](https://github.com/foundation-model-stack/multi-nic-cni/pull/172))|X|X +**main CNI binary**
(multi-nic)|X|X|- *failed to clean up previous pod network (should be fixed by [#165](https://github.com/foundation-model-stack/multi-nic-cni/pull/165))|**host-device**
- *failed to clean up previous pod network (should be fixed by [#152](https://github.com/foundation-model-stack/multi-nic-cni/issues/152))|X +**ipam CNI binary**
(multi-nic-ipam)|X|X|- *failed to clean up previous ip allocation (should be fixed by [#104](https://github.com/foundation-model-stack/multi-nic-cni/pull/104))|X|X +**3rd-party CNI binary**|X|- binary missing|- 3rd-party IPAM failure|X|- 3rd-party main plugin failure + + + + +#### Network not found + +```bash +kubectl get multinicnetwork # multinicnetwork resource created +kubectl get $FAILED_POD -n $FAILED_POD_NAMESPACE -oyaml|grep "k8s.v1.cni.cncf.io/networks" # pod annotation matched +kubectl get net-attach-def # network-attachment-definition created +``` + +If net-attach-def is missing (`No resources found in default namespace`), check [controller log](#get-controller-log) to see whether the failure comes from misconfiguration in multinicnetwork (Marshal failure) or network-attachment-definition creation request to API server. + +#### CNI binary not found The binary file of CNI is not in the expected location read by Multus. The expected location can be found in Multus daemonset as below. ```bash @@ -54,7 +124,9 @@ The expected location is in *hostPath* of *cnibin*. - **missing other CNI such as ipvlan** The missing CNI may not be supported. -##### IPAM ExecAdd failed +#### IPAM ExecAdd: failed +This error occurs when CNI cannot execute Multi-NIC IPAM which can be caused by multiple reasons as follows. + - `failed to load netconf` The configuration cannot be loaded. This is delegated CNI (such as IPVLAN) issue. @@ -76,11 +148,12 @@ The expected location is in *hostPath* of *cnibin*. - If no interfaces in `.spec.interfaces`, check [HostInterface does not show the secondary interfaces.](#no-secondary-interfaces-in-hostinterface) - Check whether it reaches CIDR block limit, confirm [no available IP address](#no-secondary-interfaces-in-hostinterface) - Other cases, find more details from [multi-nicd log](#get-multi-nicd-log) + - Multi-nicd daemon pod has no response, [restart multi-nicd](#restart-multi-nicd) might help. + - other CNI plugin (such as aws-vpc-cni, sr-iov) failure, check each CNI log. - aws-vpc-cni: `/host/var/log/aws-routed-eni` - -###### HostInterface not created +##### HostInterface not created There are a couple of reasons that the HostInterface is not created. First check the multi-nicd DaemonSet. ```bash kubectl get ds multi-nicd -n $MULTI_NIC_NAMESPACE -oyaml @@ -102,7 +175,7 @@ kubectl get ds multi-nicd -n $MULTI_NIC_NAMESPACE -oyaml |grep $FAILED_NODE|awk '{printf "%s -n %s", $2, $1}') - Other cases, check [controller log](#get-controller-log) -###### No secondary interfaces in HostInterface +##### No secondary interfaces in HostInterface The HostInterface is created but there is no interface listed in the custom resource. @@ -116,14 +189,24 @@ kubectl logs --selector control-plane=controller-manager \ - If no line shown up and the full [controller log](#get-controller-log) keep printing `Fail to create hostinterface ... cannot update interfaces: Get "/interface": dial tcp :11000: i/o timeout`, check [set required security group rules](#set-security-groups) - Other cases, [check interfaces at node's host network](#check-host-secondary-interfaces) -###### No available IP address +##### No available IP address List corresponding Pod CIDR from HostInterface. ```bash kubectl get HostInterface $FAILED_NODE -oyaml ``` Check ippools.multinic.fms.io of the corresponding pod CIDR whether the IP address actually reach the limit. If yes, consider changing the host block and interface block in `multinicnetworks.multinic.fms.io`. +#### IPAM plugin returned missing IP config + +No IP address set from the multi-nic type IPAM without throwing an error. To troubleshoot, we need additional information from [IPAM CNI log](#get-cni-log-available-after-v103). + +#### Zero config + +Zero config occurs when CNI cannot generate configurations from the network-attachment-definition. To troubleshoot, we need additional information from [CNI log](#get-cni-log). + ### Ping failed +**Issue:** Pods cannot ping each other. + Check route status in multinicnetworks.multinic.fms.io. ```bash kubectl get multinicnetwork.multinic.fms.io multinic-ipvlanl3 -o json\ @@ -139,6 +222,8 @@ kubectl get multinicnetwork.multinic.fms.io multinic-ipvlanl3 -o json\ - *Success*: check [set required security group rules](#set-security-groups) ### TCP/UDP communication failed. +**Issue:** Pods can ping each other but do not get response from TCP/UDP communication such as iPerf. + Check whether the multi-nicd detects the other host interfaces. ```bash kubectl get po $(kubectl get po -owide -A|grep multi-nicd\ @@ -151,7 +236,68 @@ The nubmer in `multi-nicd-join` should be equal to accumulated number of interfa [Check whether the host secondary interfaces between hosts are connected](#check-host-secondary-interfaces). If yes, try [restarting multi-nic-cni controller node](#restart-controller) to forcefully synchronize host interfaces. + ## Actions +Available configurations on `config.multinic/multi-nicd`: + +### Controller configuration + +These following controller configuration values will be applied on-the-fly (no need to restart the controller pod). + +Configuration|Description|Default Value +---|---|--- +.spec.logLevel|controller's verbose log level|4 +.spec.urgentReconcileSeconds|time to requeue reconcile after instant failure in second unit|5 seconds +.spec.normalReconcileMinutes|time to requeue reconcile while waiting for initial configuration in minute unit|1 minute +.spec.longReconcileMinutes|time to requeue reconcile when sensing control traffic failure in minute unit|10 minutes +.spec.contextTimeoutMinutes|time out for API server call context in minute unit|2 minutes + +#### Log Levels + +Verbose Level | Information +---|--- +1|- critical error (cannot create/update resource by k8s API)
- "Set Config" key
- set up log
- config error +2|- significant events/failures of multinicnetwork +3|- significant events/failures of cidr +4 (default)|- significant events/failures of hostinterface +5|- significant events/failures of ippools +6|- significant events/failures of route configurations +7|- requeue
- get deleted resource
- debug pointers (e.g., start point of function call) + + +### Daemon configuration + +Configuration|Description|Type|Default Value +---|---|---|--- +.spec.daemon.port|multi-nicd serving port|int|11000 +.spec.daemon.mounts|additional host-path mount|HostPathMount| + +```yaml +# HostPathMount +mounts: +- name: mountName + podpath: path/on/pod + hostpath: path/on/host +``` + +Additionally, the following common [apps/DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) configurations are also available under `.spec.daemon`. +- nodeSelector +- image +- imagePullSecret +- imagePullPolicy +- securityContext +- env +- envFrom +- resources +- tolerations + +### List in-use pods + +**modify '< MULTINICNETWORK NAME HERE >'** in the following command with your target.multinicnetwork name + +```bash +kubectl get po -A -ojson| jq -r '.items[]|select(.metadata.annotations."k8s.v1.cni.cncf.io/networks"=="< MULTINICNETWORK NAME HERE >")|.metadata.namespace + " " + .metadata.name' +``` ### Get CNI log (available after v1.0.3) To make CNI log available on the daemon pod, you may mount the the host log path to the daemon pod: @@ -243,6 +389,12 @@ kubectl delete --selector control-plane=controller-manager \ -n $MULTI_NIC_NAMESPACE ``` +### Restart multi-nicd +```bash +kubectl delete po $(kubectl get po -owide -A|grep multi-nicd\ +|grep $FAILED_NODE|awk '{printf "%s -n %s", $2, $1}') +``` + ### Check host secondary interfaces Log in to FAILED_NODE with `oc debug node/$FAILED_NODE` or using [nettools](https://github.com/jedrecord/nettools) with `hostNetwork: true`. If secondary interfaces do not exist at the host network, [add the secondary interfaces](#add-secondary-interfaces-at-nodes-host-network)