---

copyright:
  years: 2021

lastupdated: "2021-04-28"

keywords: kubernetes, iks, logmet, logs, metrics, recovery, auto-recovery

subcollection: containers

---
{:DomainName: data-hd-keyref="APPDomain"}
{:DomainName: data-hd-keyref="DomainName"}
{:android: data-hd-operatingsystem="android"}
{:api: .ph data-hd-interface='api'}
{:apikey: data-credential-placeholder='apikey'}
{:app_key: data-hd-keyref="app_key"}
{:app_name: data-hd-keyref="app_name"}
{:app_secret: data-hd-keyref="app_secret"}
{:app_url: data-hd-keyref="app_url"}
{:authenticated-content: .authenticated-content}
{:beta: .beta}
{:c#: data-hd-programlang="c#"}
{:cli: .ph data-hd-interface='cli'}
{:codeblock: .codeblock}
{:curl: .ph data-hd-programlang='curl'}
{:deprecated: .deprecated}
{:dotnet-standard: .ph data-hd-programlang='dotnet-standard'}
{:download: .download}
{:external: target="_blank" .external}
{:faq: data-hd-content-type='faq'}
{:fuzzybunny: .ph data-hd-programlang='fuzzybunny'}
{:generic: data-hd-operatingsystem="generic"}
{:generic: data-hd-programlang="generic"}
{:gif: data-image-type='gif'}
{:go: .ph data-hd-programlang='go'}
{:help: data-hd-content-type='help'}
{:hide-dashboard: .hide-dashboard}
{:hide-in-docs: .hide-in-docs}
{:important: .important}
{:ios: data-hd-operatingsystem="ios"}
{:java: .ph data-hd-programlang='java'}
{:java: data-hd-programlang="java"}
{:javascript: .ph data-hd-programlang='javascript'}
{:javascript: data-hd-programlang="javascript"}
{:new_window: target="_blank"}
{:note: .note}
{:objectc: data-hd-programlang="objectc"}
{:org_name: data-hd-keyref="org_name"}
{:php: data-hd-programlang="php"}
{:pre: .pre}
{:preview: .preview}
{:python: .ph data-hd-programlang='python'}
{:python: data-hd-programlang="python"}
{:route: data-hd-keyref="route"}
{:row-headers: .row-headers}
{:ruby: .ph data-hd-programlang='ruby'}
{:ruby: data-hd-programlang="ruby"}
{:runtime: architecture="runtime"}
{:runtimeIcon: .runtimeIcon}
{:runtimeIconList: .runtimeIconList}
{:runtimeLink: .runtimeLink}
{:runtimeTitle: .runtimeTitle}
{:screen: .screen}
{:script: data-hd-video='script'}
{:service: architecture="service"}
{:service_instance_name: data-hd-keyref="service_instance_name"}
{:service_name: data-hd-keyref="service_name"}
{:shortdesc: .shortdesc}
{:space_name: data-hd-keyref="space_name"}
{:step: data-tutorial-type='step'}
{:subsection: outputclass="subsection"}
{:support: data-reuse='support'}
{:swift: .ph data-hd-programlang='swift'}
{:swift: data-hd-programlang="swift"}
{:table: .aria-labeledby="caption"}
{:term: .term}
{:tip: .tip}
{:tooling-url: data-tooling-url-placeholder='tooling-url'}
{:troubleshoot: data-hd-content-type='troubleshoot'}
{:tsCauses: .tsCauses}
{:tsResolve: .tsResolve}
{:tsSymptoms: .tsSymptoms}
{:tutorial: data-hd-content-type='tutorial'}
{:ui: .ph data-hd-interface='ui'}
{:unity: .ph data-hd-programlang='unity'}
{:url: data-credential-placeholder='url'}
{:user_ID: data-hd-keyref="user_ID"}
{:vbnet: .ph data-hd-programlang='vb.net'}
{:video: .video}
# Monitoring cluster health
{: #health-monitor}
Set up monitoring in {{site.data.keyword.containerlong}} to help you troubleshoot issues and improve the health and performance of your Kubernetes clusters and apps.
{: shortdesc}

Continuous monitoring and logging is key to detecting attacks on your cluster and troubleshooting issues as they arise. By continuously monitoring your cluster, you can better understand your cluster capacity and the availability of resources for your app. With this insight, you can prepare to protect your apps against downtime.
## Viewing metrics
{: #view_metrics}
Metrics help you monitor the health and performance of your clusters. You can use the standard Kubernetes and container runtime features to monitor the health of your clusters and apps.
{: shortdesc}

Every Kubernetes master is continuously monitored by IBM. {{site.data.keyword.containerlong_notm}} automatically scans every node where the Kubernetes master is deployed for vulnerabilities that are found in Kubernetes and OS-specific security fixes. If vulnerabilities are found, {{site.data.keyword.containerlong_notm}} automatically applies fixes and resolves vulnerabilities on your behalf to ensure master node protection. You are responsible for monitoring and analyzing the logs for the rest of your cluster components.

To avoid conflicts when using metrics services, make sure that clusters across resource groups and regions have unique names.
{: tip}
{{site.data.keyword.mon_full}}
:   Gain operational visibility into the performance and health of your apps and your cluster by deploying a {{site.data.keyword.mon_short}} agent to your worker nodes. The agent collects pod and cluster metrics, and sends these metrics to {{site.data.keyword.mon_full_notm}}. For more information about {{site.data.keyword.mon_full_notm}}, see the [service documentation](/docs/monitoring?topic=monitoring-getting-started). To set up the {{site.data.keyword.mon_short}} agent in your cluster, see [Viewing cluster and app metrics with {{site.data.keyword.mon_full_notm}}](#monitoring).

Kubernetes dashboard
:   The Kubernetes dashboard is an administrative web interface where you can review the health of your worker nodes, find Kubernetes resources, deploy containerized apps, and troubleshoot apps with logging and monitoring information. For more information about how to access your Kubernetes dashboard, see [Launching the Kubernetes dashboard for {{site.data.keyword.containerlong_notm}}](/docs/containers?topic=containers-deploy_app#cli_dashboard).
## Viewing cluster and app metrics with {{site.data.keyword.mon_full_notm}}
{: #monitoring}
Use the {{site.data.keyword.containerlong_notm}} observability plug-in to create a monitoring configuration for {{site.data.keyword.mon_full_notm}} in your cluster, and use this monitoring configuration to automatically collect and forward metrics to {{site.data.keyword.mon_full_notm}}.
{: shortdesc}

With {{site.data.keyword.mon_full_notm}}, you can collect cluster and pod metrics, such as the CPU and memory usage of your worker nodes, incoming and outgoing HTTP traffic for your pods, and data about several infrastructure components. In addition, the agent can collect custom application metrics by using either a Prometheus-compatible scraper or a StatsD facade.
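For example, with the StatsD facade, an app can emit custom metrics by using any standard StatsD client, or even a raw UDP datagram. A minimal sketch, assuming the agent's StatsD facade accepts metrics on the worker node's localhost interface at the conventional StatsD port 8125 (the metric name `myapp.logins` is a hypothetical example):

```
# Send a hypothetical StatsD counter metric from inside a container on the node.
# Assumes the agent's StatsD facade listens on 127.0.0.1:8125 (conventional StatsD port).
echo "myapp.logins:1|c" | nc -u -w1 127.0.0.1 8125
```
{: pre}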
Considerations for using the {{site.data.keyword.containerlong_notm}} observability plug-in:

- You can have only one monitoring configuration for {{site.data.keyword.mon_full_notm}} in your cluster at a time. If you want to send metrics to a different {{site.data.keyword.mon_full_notm}} service instance, use the `ibmcloud ob monitoring config replace` command.
- If you created a {{site.data.keyword.mon_short}} configuration in your cluster without using the {{site.data.keyword.containerlong_notm}} observability plug-in, you can use the `ibmcloud ob monitoring agent discover` command to make the configuration visible to the plug-in. Then, you can use the observability plug-in commands and functionality in the {{site.data.keyword.cloud_notm}} console to manage the configuration.
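For example, you can switch an existing configuration to a different {{site.data.keyword.mon_full_notm}} service instance. The following sketch assumes that `config replace` takes the same `--cluster` and `--instance` options as the `config create` command shown later in this topic; verify with `ibmcloud ob monitoring config replace --help`:

```
ibmcloud ob monitoring config replace --cluster <cluster_name_or_ID> --instance <new_monitoring_instance_name_or_ID>
```
{: pre}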
Before you begin:

- Verify that you are assigned the Editor platform access role and Manager service access role for {{site.data.keyword.mon_full_notm}}.
- Verify that you are assigned the Administrator platform access role and the Manager service access role for all Kubernetes namespaces in {{site.data.keyword.containerlong_notm}} to create the monitoring configuration. To view a monitoring configuration or launch the {{site.data.keyword.mon_short}} dashboard after the monitoring configuration is created, users must be assigned the Viewer platform access role and Reader service access role for the `ibm-observe` Kubernetes namespace in {{site.data.keyword.containerlong_notm}}.
- If you want to use the CLI to set up the monitoring configuration, install the {{site.data.keyword.cloud_notm}} CLI and the {{site.data.keyword.containerlong_notm}} observability plug-in.
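If the observability plug-in is not installed yet, a sketch of the install command (the plug-in name `observe-service` is an assumption; verify it with `ibmcloud plugin repo-plugins`):

```
# Install the observability plug-in for the IBM Cloud CLI.
ibmcloud plugin install observe-service
```
{: pre}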
To set up a monitoring configuration for your cluster:

1. Create an {{site.data.keyword.mon_full_notm}} service instance and note the name of the instance. The service instance must belong to the same {{site.data.keyword.cloud_notm}} account where you created your cluster, but can be in a different resource group and {{site.data.keyword.cloud_notm}} region than your cluster.

2. Set up a monitoring configuration for your cluster. When you create the monitoring configuration, a Kubernetes namespace `ibm-observe` is created and a {{site.data.keyword.mon_short}} agent is deployed as a Kubernetes daemon set to all worker nodes in your cluster. This agent collects cluster and pod metrics, such as the worker node CPU and memory usage, or the amount of incoming and outgoing network traffic to your pods.

   **From the console:**

   1. From the {{site.data.keyword.containerlong_notm}} console, select the cluster for which you want to create a {{site.data.keyword.mon_short}} configuration.
   2. On the cluster **Overview** page, click **Connect**.
   3. Select the region and the {{site.data.keyword.mon_full_notm}} service instance that you created earlier, and click **Connect**.
   **From the CLI:**

   1. Create the {{site.data.keyword.mon_short}} configuration. When you create the {{site.data.keyword.mon_short}} configuration, the access key that was last added is retrieved automatically. If you want to use a different access key, add the `--sysdig-access-key <access_key>` option to the command.

      To use a different service access key after you created the monitoring configuration, use the `ibmcloud ob monitoring config replace` command.
      {: tip}

      ```
      ibmcloud ob monitoring config create --cluster <cluster_name_or_ID> --instance <Monitoring_instance_name_or_ID>
      ```
      {: pre}

      Example output:
      ```
      Creating configuration...
      OK
      ```
      {: screen}
   2. Verify that the monitoring configuration was added to your cluster.

      ```
      ibmcloud ob monitoring config list --cluster <cluster_name_or_ID>
      ```
      {: pre}

      Example output:
      ```
      Listing configurations...
      OK
      Instance Name              Instance ID                            CRN
      IBM Cloud Monitoring-aaa   1a111a1a-1111-11a1-a1aa-aaa11111a11a   crn:v1:prod:public:sysdig:us-south:a/a11111a1aaaaa11a111aa11a1aa1111a:1a111a1a-1111-11a1-a1aa-aaa11111a11a::
      ```
      {: screen}
3. Optional: Verify that the {{site.data.keyword.mon_short}} agent was set up successfully.

   1. If you used the console to create the {{site.data.keyword.mon_short}} configuration, log in to your cluster: log in to your account, target the appropriate resource group if applicable, and set the context for your cluster.
   2. Verify that the daemon set for the {{site.data.keyword.mon_short}} agent was created and all instances are listed as `AVAILABLE`.

      ```
      kubectl get daemonsets -n ibm-observe
      ```
      {: pre}

      Example output:
      ```
      NAME           DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
      sysdig-agent   9         9         9         9            9           <none>          14m
      ```
      {: screen}

      The number of daemon set instances that are deployed equals the number of worker nodes in your cluster.
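      If some instances are not `AVAILABLE`, it can help to look at the individual agent pods and the worker nodes that they run on. A quick check with standard kubectl commands:

      ```
      kubectl get pods -n ibm-observe -o wide
      ```
      {: pre}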
   3. Review the configmap that was created for your {{site.data.keyword.mon_short}} agent.

      ```
      kubectl describe configmap -n ibm-observe
      ```
      {: pre}

4. Access the metrics for your pods and cluster from the {{site.data.keyword.mon_short}} dashboard.

   1. From the {{site.data.keyword.containerlong_notm}} console, select the cluster that you configured.
   2. On the cluster **Overview** page, click **Launch**. The {{site.data.keyword.mon_short}} dashboard opens.
   3. Review the pod and cluster metrics that the {{site.data.keyword.mon_short}} agent collected from your cluster. It might take a few minutes for your first metrics to show.

5. Review how you can work with the {{site.data.keyword.mon_short}} dashboard to further analyze your metrics.
## Reviewing cluster, master, and worker node states
{: #states}
Review the state of a Kubernetes cluster to get information about the availability and capacity of the cluster, and potential problems that might occur.
{: shortdesc}

To view information about a specific cluster, such as its zones, service endpoint URLs, Ingress subdomain, version, and owner, use the `ibmcloud ks cluster get --cluster <cluster_name_or_ID>` command. Include the `--show-resources` flag to view more cluster resources such as add-ons for storage pods or subnet VLANs for public and private IPs.
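For example, to view the details and the attached resources for a hypothetical cluster that is named `mycluster`:

```
ibmcloud ks cluster get --cluster mycluster --show-resources
```
{: pre}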
You can review information about the overall cluster, the IBM-managed master, and your worker nodes. To troubleshoot your cluster and worker nodes, see Troubleshooting clusters.
### Cluster states
{: #states_cluster}
You can view the current cluster state by running the `ibmcloud ks cluster ls` command and locating the **State** field.
{: shortdesc}
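For example, list all clusters in the targeted account and check the **State** column of the output:

```
ibmcloud ks cluster ls
```
{: pre}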
| Cluster state | Description |
|---|---|
| `Aborted` | The deletion of the cluster is requested by the user before the Kubernetes master is deployed. After the deletion of the cluster is completed, the cluster is removed from your dashboard. If your cluster is stuck in this state for a long time, open an {{site.data.keyword.cloud_notm}} support case. |
| `Critical` | The Kubernetes master cannot be reached or all worker nodes in the cluster are down. If you enabled {{site.data.keyword.keymanagementservicelong_notm}} in your cluster, the {{site.data.keyword.keymanagementserviceshort}} container might fail to encrypt or decrypt your cluster secrets. If so, you can view an error with more information when you run `kubectl get secrets`. |
| `Delete failed` | The Kubernetes master or at least one worker node cannot be deleted. List worker nodes by running `ibmcloud ks worker ls --cluster <cluster_name_or_ID>`. If worker nodes are listed, see Unable to create or delete worker nodes. If no workers are listed, open an {{site.data.keyword.cloud_notm}} support case. |
| `Deleted` | The cluster is deleted but not yet removed from your dashboard. If your cluster is stuck in this state for a long time, open an {{site.data.keyword.cloud_notm}} support case. |
| `Deleting` | The cluster is being deleted and cluster infrastructure is being dismantled. You cannot access the cluster. |
| `Deploy failed` | The deployment of the Kubernetes master could not be completed. You cannot resolve this state. Contact IBM Cloud support by opening an {{site.data.keyword.cloud_notm}} support case. |
| `Deploying` | The Kubernetes master is not fully deployed yet. You cannot access your cluster. Wait until your cluster is fully deployed to review the health of your cluster. |
| `Normal` | All worker nodes in a cluster are up and running. You can access the cluster and deploy apps to the cluster. This state is considered healthy and does not require an action from you. Although the worker nodes might be normal, other infrastructure resources, such as networking and storage, might still need attention. If you just created the cluster, some parts of the cluster that are used by other services, such as Ingress secrets or registry image pull secrets, might still be in process. |
| `Pending` | The Kubernetes master is deployed. The worker nodes are being provisioned and are not available in the cluster yet. You can access the cluster, but you cannot deploy apps to the cluster. |
| `Requested` | A request to create the cluster and order the infrastructure for the Kubernetes master and worker nodes is sent. When the deployment of the cluster starts, the cluster state changes to `Deploying`. If your cluster is stuck in the `Requested` state for a long time, open an {{site.data.keyword.cloud_notm}} support case. |
| `Updating` | The Kubernetes API server that runs in your Kubernetes master is being updated to a new Kubernetes API version. During the update, you cannot access or change the cluster. Worker nodes, apps, and resources that the user deployed are not modified and continue to run. Wait for the update to complete to review the health of your cluster. |
| `Unsupported` | The Kubernetes version that the cluster runs is no longer supported. Your cluster's health is no longer actively monitored or reported. Additionally, you cannot add or reload worker nodes. To continue receiving important security updates and support, you must update your cluster. Review the version update preparation actions, then update your cluster to a supported Kubernetes version. |
| `Warning` | At least one worker node in the cluster is not available, but other worker nodes are available and can take over the workload. Check the states of your worker nodes and troubleshoot the unavailable worker nodes. |
{: caption="Cluster states"}
### Master states
{: #states_master}
Your {{site.data.keyword.containerlong_notm}} cluster includes an IBM-managed master with highly available replicas, automatic security patch updates applied for you, and automation in place to recover in case of an incident. You can check the health, status, and state of the cluster master by running `ibmcloud ks cluster get --cluster <cluster_name_or_ID>`.
{: shortdesc}

**Master Health**

The Master Health reflects the state of master components and notifies you if something needs your attention. The health might be one of the following:

- `error`: The master is not operational. IBM is automatically notified and takes action to resolve this issue. You can continue monitoring the health until the master is `normal`. You can also open an {{site.data.keyword.cloud_notm}} support case.
- `normal`: The master is operational and healthy. No action is required.
- `unavailable`: The master might not be accessible, which means some actions such as resizing a worker pool are temporarily unavailable. IBM is automatically notified and takes action to resolve this issue. You can continue monitoring the health until the master is `normal`.
- `unsupported`: The master runs an unsupported version of Kubernetes. You must update your cluster to return the master to `normal` health.

**Master Status and State**

The Master Status provides details of what operation from the master state is in progress. The status includes a timestamp of how long the master has been in the same state, such as `Ready (1 month ago)`. The Master State reflects the lifecycle of possible operations that can be performed on the master, such as deploying, updating, and deleting. Each state is described in the following table.
| Master state | Description |
|---|---|
| `deployed` | The master is successfully deployed. Check the status to verify that the master is `Ready` or to see if an update is available. |
| `deploying` | The master is currently deploying. Wait for the state to become `deployed` before you work with your cluster, such as adding worker nodes. |
| `deploy_failed` | The master failed to deploy. IBM Support is notified and works to resolve the issue. Check the **Master Status** field for more information, or wait for the state to become `deployed`. |
| `deleting` | The master is currently deleting because you deleted the cluster. You cannot undo a deletion. After the cluster is deleted, you can no longer check the master state because the cluster is completely removed. |
| `delete_failed` | The master failed to delete. IBM Support is notified and works to resolve the issue. You cannot resolve the issue by trying to delete the cluster again. Instead, check the **Master Status** field for more information, or wait for the cluster to delete. You can also open an {{site.data.keyword.cloud_notm}} support case. |
| `updating` | The master is updating its Kubernetes version. The update might be a patch update that is automatically applied, or a minor or major version that you applied by updating the cluster. During the update, your highly available master can continue processing requests, and your app workloads and worker nodes continue to run. After the master update is complete, you can update your worker nodes. If the update is unsuccessful, the master returns to a `deployed` state and continues running the previous version. IBM Support is notified and works to resolve the issue. You can check if the update failed in the **Master Status** field. |
| `update_cancelled` | The master update is canceled because the cluster was not in a healthy state at the time of the update. Your master remains in this state until your cluster is healthy and you manually update the master. To update the master, use the `ibmcloud ks cluster master update` command, as shown in the example after this table. If you do not want to update the master to the default `major.minor` version during the update, include the `--version` flag and specify the latest patch version that is available for the `major.minor` version that you want, such as `1.20.6`. To list available versions, run `ibmcloud ks versions`. |
| `update_failed` | The master update failed. IBM Support is notified and works to resolve the issue. You can continue to monitor the health of the master until the master reaches a `normal` state. If the master remains in this state for more than 1 day, open an {{site.data.keyword.cloud_notm}} support case. IBM Support might identify other issues in your cluster that you must fix before the master can be updated. |
{: caption="Master states"}
{: summary="Table rows read from left to right, with the master state in column one and a description in column two."}
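For example, a sketch of manually resuming a canceled master update to a specific patch version (the cluster name and version are placeholders; run `ibmcloud ks versions` to see the versions that are actually available):

```
ibmcloud ks cluster master update --cluster <cluster_name_or_ID> --version 1.20.6
```
{: pre}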
### Worker node states
{: #states_workers}
You can view the current worker node state by running the `ibmcloud ks worker ls --cluster <cluster_name_or_ID>` command and locating the **State** and **Status** fields.
{: shortdesc}
| Worker node state | Description |
|---|---|
| `Critical` | A worker node can go into a `Critical` state for many reasons. If reloading the worker node does not resolve the issue, continue troubleshooting your worker node. You can configure health checks for your worker node and enable Autorecovery. If Autorecovery detects an unhealthy worker node based on the configured checks, Autorecovery triggers a corrective action like rebooting a VPC worker node or reloading the operating system on a classic worker node. For more information about how Autorecovery works, see the Autorecovery blog. |
| `Deleting` | You requested to delete the worker node, possibly as part of resizing a worker pool or autoscaling the cluster. Other operations cannot be issued against the worker node while the worker node deletes. You cannot reverse the deletion process. When the deletion process completes, you are no longer billed for the worker nodes. |
| `Deleted` | Your worker node is deleted and no longer listed in the cluster or billed. This state cannot be undone. Any data that was stored only on the worker node, such as container images, is also deleted. |
| `Deployed` | Updates are successfully deployed to your worker node. After updates are deployed, {{site.data.keyword.containerlong_notm}} starts a health check on the worker node. After the health check is successful, the worker node goes into a `Normal` state. Worker nodes in a `Deployed` state usually are ready to receive workloads, which you can check by running `kubectl get nodes` and confirming that the state shows `Normal`. |
| `Deploying` | When you update the Kubernetes version of your worker node, your worker node is redeployed to install the updates. If you reload or reboot your worker node, the worker node is redeployed to automatically install the latest patch version. If your worker node is stuck in this state for a long time, check whether a problem occurred during the deployment. |
| `Deploy_failed` | Your worker node could not be deployed. List the details for the worker node to find the details for the failure by running `ibmcloud ks worker get --cluster <cluster_name_or_id> --worker <worker_node_id>`. |
| `Normal` | Your worker node is fully provisioned and ready to be used in the cluster. This state is considered healthy and does not require an action from the user. Note: Although the worker nodes might be normal, other infrastructure resources, such as networking and storage, might still need attention. |
| `Provisioned` | Your worker node completed provisioning and is part of the cluster. Billing for the worker node begins. The worker node state soon reports a regular health state and status, such as `normal` and `ready`. |
| `Provisioning` | Your worker node is being provisioned and is not available in the cluster yet. You can monitor the provisioning process in the **Status** column of your CLI output. If your worker node is stuck in this state for a long time, check whether a problem occurred during the provisioning. |
| `Provision pending` | Another process is completing before the worker node provisioning process starts. You can monitor the other process that must complete first in the **Status** column of your CLI output. For example, in VPC clusters, the `Pending security group creation` status indicates that the security group for your worker nodes is created before the worker nodes can be provisioned. If your worker node is stuck in this state for a long time, check whether a problem occurred during the other process. |
| `Provision_failed` | Your worker node could not be provisioned. List the details for the worker node to find the details for the failure by running `ibmcloud ks worker get --cluster <cluster_name_or_id> --worker <worker_node_id>`. |
| `Reloading` | Your worker node is being reloaded and is not available in the cluster. You can monitor the reloading process in the **Status** column of your CLI output. If your worker node is stuck in this state for a long time, check whether a problem occurred during the reloading. |
| `Reloading_failed` | Your worker node could not be reloaded. List the details for the worker node to find the details for the failure by running `ibmcloud ks worker get --cluster <cluster_name_or_id> --worker <worker_node_id>`. |
| `Reload_pending` | A request to reload or to update the Kubernetes version of your worker node is sent. When the worker node is being reloaded, the state changes to `Reloading`. |
| `Unknown` | The Kubernetes master is not reachable. For example, the master might be temporarily unavailable during an update, or network connectivity between the worker node and the master might be interrupted. |
| `Warning` | Your worker node is reaching the limit for memory or disk space. You can either reduce the workload on your worker node or add a worker node to your cluster to help load balance the workload. |
{: caption="Worker node states"}
## Configuring health monitoring for worker nodes with Autorecovery
{: #autorecovery}
The Autorecovery system uses various checks to query worker node health status. If Autorecovery detects an unhealthy worker node based on the configured checks, Autorecovery triggers a corrective action like rebooting a VPC worker node or reloading the operating system on a classic worker node. Only one worker node undergoes a corrective action at a time. The worker node must successfully complete the corrective action before any other worker node undergoes a corrective action. For more information, see this Autorecovery blog post.
{: shortdesc}

Autorecovery requires at least one healthy worker node to function properly. Configure Autorecovery with active checks only in clusters with two or more worker nodes.
{: note}
Before you begin:

- Ensure that you have the following {{site.data.keyword.cloud_notm}} IAM roles:
  - Administrator platform access role for the cluster
  - Writer or Manager service access role for the `kube-system` namespace
- Log in to your account. If applicable, target the appropriate resource group. Set the context for your cluster.
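A minimal sketch of that login flow (the resource group `default` is a placeholder):

```
ibmcloud login
ibmcloud target -g default
ibmcloud ks cluster config --cluster <cluster_name_or_ID>
```
{: pre}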
To configure Autorecovery:

1. Follow the instructions to install the Helm version 3 client on your local machine.
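   The install step later in this topic pulls the chart from the `iks-charts` repository, so that repository must be added to your Helm client first. A sketch, assuming the repository URL `https://icr.io/helm/iks-charts` (verify the URL in the {{site.data.keyword.cloud_notm}} documentation for Helm repositories):

   ```
   # Add the IBM Cloud Kubernetes Service chart repository and refresh the local index.
   helm repo add iks-charts https://icr.io/helm/iks-charts
   helm repo update
   ```
   {: pre}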
2. Create a configuration map file that defines your checks in JSON format. For example, the following YAML file defines three checks: an HTTP check and two Kubernetes API server checks. Refer to the tables following the example YAML file for information about the three kinds of checks and information about the individual components of the checks. Define each check as a unique key in the `data` section of the configuration map.

   ```yaml
   kind: ConfigMap
   apiVersion: v1
   metadata:
     name: ibm-worker-recovery-checks
     namespace: kube-system
   data:
     checknode.json: |
       {
         "Check":"KUBEAPI",
         "Resource":"NODE",
         "FailureThreshold":3,
         "CorrectiveAction":"RELOAD",
         "CooloffSeconds":1800,
         "IntervalSeconds":180,
         "TimeoutSeconds":10,
         "Enabled":true
       }
     checkpod.json: |
       {
         "Check":"KUBEAPI",
         "Resource":"POD",
         "PodFailureThresholdPercent":50,
         "FailureThreshold":3,
         "CorrectiveAction":"RELOAD",
         "CooloffSeconds":1800,
         "IntervalSeconds":180,
         "TimeoutSeconds":10,
         "Enabled":true
       }
     checkhttp.json: |
       {
         "Check":"HTTP",
         "FailureThreshold":3,
         "CorrectiveAction":"REBOOT",
         "CooloffSeconds":1800,
         "IntervalSeconds":180,
         "TimeoutSeconds":10,
         "Port":80,
         "ExpectedStatus":200,
         "Route":"/myhealth",
         "Enabled":false
       }
   ```
   {: codeblock}
   Understanding the configmap components:

   | Component | Description |
   |---|---|
   | `name` | The configuration name `ibm-worker-recovery-checks` is a constant and cannot be changed. |
   | `namespace` | The `kube-system` namespace is a constant and cannot be changed. |
   | `checknode.json` | Defines a Kubernetes API node check that checks whether each worker node is in the `Ready` state. The check for a specific worker node counts as a failure if the worker node is not in the `Ready` state. The check in the example YAML runs every 3 minutes. If it fails three consecutive times, the worker node is reloaded. This action is equivalent to running `ibmcloud ks worker reload`. The node check is enabled until you set the **Enabled** field to `false` or remove the check. Reloading is supported only for worker nodes on classic infrastructure. |
   | `checkpod.json` | Defines a Kubernetes API pod check that checks the total percentage of `NotReady` pods on a worker node based on the total pods that are assigned to that worker node. The check for a specific worker node counts as a failure if the total percentage of `NotReady` pods is greater than the defined `PodFailureThresholdPercent`. The check in the example YAML runs every 3 minutes. If it fails three consecutive times, the worker node is reloaded. This action is equivalent to running `ibmcloud ks worker reload`. For example, the default `PodFailureThresholdPercent` is 50%. If the percentage of `NotReady` pods is greater than 50% three consecutive times, the worker node is reloaded. By default, pods in all namespaces are checked. To restrict the check to only pods in a specified namespace, add the `Namespace` field to the check. The pod check is enabled until you set the **Enabled** field to `false` or remove the check. Reloading is supported only for worker nodes on classic infrastructure. |
   | `checkhttp.json` | Defines an HTTP check that checks if an HTTP server that runs on your worker node is healthy. To use this check, you must deploy an HTTP server on every worker node in your cluster by using a [daemon set](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/){: external}. You must implement a health check that is available at the `/myhealth` path and that can verify whether your HTTP server is healthy. You can define other paths by changing the `Route` parameter. If the HTTP server is healthy, you must return the HTTP response code that is defined in `ExpectedStatus`. The HTTP server must be configured to listen on the private IP address of the worker node. You can find the private IP address by running `kubectl get nodes`. For example, consider two nodes in a cluster that have the private IP addresses 10.10.10.1 and 10.10.10.2. In this example, two routes are checked for a `200` HTTP response: `http://10.10.10.1:80/myhealth` and `http://10.10.10.2:80/myhealth`. The check in the example YAML runs every 3 minutes. If it fails three consecutive times, the worker node is rebooted. This action is equivalent to running `ibmcloud ks worker reboot`. The HTTP check is disabled until you set the **Enabled** field to `true`. |
   {: caption="Understanding the configmap components"}

   Understanding the individual components of checks:

   | Component | Description |
   |---|---|
   | `Check` | Enter the type of check that you want Autorecovery to use. `HTTP`: Autorecovery calls HTTP servers that run on each node to determine whether the nodes are running properly. `KUBEAPI`: Autorecovery calls the Kubernetes API server and reads the health status data reported by the worker nodes. |
   | `Resource` | When the check type is `KUBEAPI`, enter the type of resource that you want Autorecovery to check. Accepted values are `NODE` or `POD`. |
   | `FailureThreshold` | Enter the threshold for the number of consecutive failed checks. When this threshold is met, Autorecovery triggers the specified corrective action. For example, if the value is 3 and Autorecovery fails a configured check three consecutive times, Autorecovery triggers the corrective action that is associated with the check. |
   | `PodFailureThresholdPercent` | When the resource type is `POD`, enter the threshold for the percentage of pods on a worker node that can be in a [`NotReady`](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes){: external} state. This percentage is based on the total number of pods that are scheduled to a worker node. When a check determines that the percentage of unhealthy pods is greater than the threshold, the check counts as one failure. |
   | `CorrectiveAction` | Enter the action to run when the failure threshold is met. A corrective action runs only while no other workers are being repaired and when this worker node is not in a cool-off period from a previous action. `REBOOT`: Reboots the worker node. `RELOAD`: Reloads all of the necessary configurations for the worker node from a clean OS. |
   | `CooloffSeconds` | Enter the number of seconds that Autorecovery must wait to issue another corrective action for a node that was already issued a corrective action. The cool-off period starts at the time a corrective action is issued. |
   | `IntervalSeconds` | Enter the number of seconds in between consecutive checks. For example, if the value is 180, Autorecovery runs the check on each node every 3 minutes. |
   | `TimeoutSeconds` | Enter the maximum number of seconds that a check call to the database takes before Autorecovery terminates the call operation. The value for `TimeoutSeconds` must be less than the value for `IntervalSeconds`. |
   | `Port` | When the check type is `HTTP`, enter the port that the HTTP server must bind to on the worker nodes. This port must be exposed on the IP of every worker node in the cluster. Autorecovery requires a constant port number across all nodes for checking servers. Use [daemon sets](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/){: external} when you deploy a custom server into a cluster. |
   | `ExpectedStatus` | When the check type is `HTTP`, enter the HTTP server status that you expect to be returned from the check. For example, a value of 200 indicates that you expect an `OK` response from the server. |
   | `Route` | When the check type is `HTTP`, enter the path that is requested from the HTTP server. This value is typically the metrics path for the server that runs on all of the worker nodes. |
   | `Enabled` | Enter `true` to enable the check or `false` to disable the check. |
   | `Namespace` | Optional: To restrict `checkpod.json` to checking only pods in one namespace, add the `Namespace` field and enter the namespace. |
   {: caption="Understanding the individual components of checks"}
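   For example, a sketch of `checkpod.json` restricted to pods in the hypothetical `default` namespace by adding the optional `Namespace` field that is described in the preceding table:

   ```
   {
     "Check":"KUBEAPI",
     "Resource":"POD",
     "Namespace":"default",
     "PodFailureThresholdPercent":50,
     "FailureThreshold":3,
     "CorrectiveAction":"RELOAD",
     "CooloffSeconds":1800,
     "IntervalSeconds":180,
     "TimeoutSeconds":10,
     "Enabled":true
   }
   ```
   {: codeblock}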
3. Create the configuration map in your cluster.

   ```
   kubectl apply -f ibm-worker-recovery-checks.yaml
   ```
   {: pre}
4. Verify that you created the configuration map with the name `ibm-worker-recovery-checks` in the `kube-system` namespace with the proper checks.

   ```
   kubectl -n kube-system get cm ibm-worker-recovery-checks -o yaml
   ```
   {: pre}
5. Deploy Autorecovery into your cluster by installing the `ibm-worker-recovery` Helm chart.

   ```
   helm install ibm-worker-recovery iks-charts/ibm-worker-recovery --namespace kube-system
   ```
   {: pre}
6. After a few minutes, you can check the `Events` section in the output of the following command to see activity on the Autorecovery deployment.

   ```
   kubectl -n kube-system describe deployment ibm-worker-recovery
   ```
   {: pre}
7. If you do not see activity on the Autorecovery deployment, you can check the Helm deployment by running the tests that are included in the Autorecovery chart definition.

   ```
   helm test ibm-worker-recovery -n kube-system
   ```
   {: pre}
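If the tests pass but you still do not see check or corrective-action activity, the Autorecovery pod logs are the next place to look. A sketch, using the deployment name from the previous steps:

```
# Show recent Autorecovery logs, which include check results and corrective actions.
kubectl -n kube-system logs deployment/ibm-worker-recovery --tail=50
```
{: pre}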