copyright | lastupdated | keywords | subcollection | content-type | ||
---|---|---|---|---|---|---|
|
2021-04-28 |
kubernetes, iks, help, debug |
containers |
troubleshoot |
{:DomainName: data-hd-keyref="APPDomain"} {:DomainName: data-hd-keyref="DomainName"} {:android: data-hd-operatingsystem="android"} {:api: .ph data-hd-interface='api'} {:apikey: data-credential-placeholder='apikey'} {:app_key: data-hd-keyref="app_key"} {:app_name: data-hd-keyref="app_name"} {:app_secret: data-hd-keyref="app_secret"} {:app_url: data-hd-keyref="app_url"} {:authenticated-content: .authenticated-content} {:beta: .beta} {:c#: data-hd-programlang="c#"} {:cli: .ph data-hd-interface='cli'} {:codeblock: .codeblock} {:curl: .ph data-hd-programlang='curl'} {:deprecated: .deprecated} {:dotnet-standard: .ph data-hd-programlang='dotnet-standard'} {:download: .download} {:external: target="_blank" .external} {:faq: data-hd-content-type='faq'} {:fuzzybunny: .ph data-hd-programlang='fuzzybunny'} {:generic: data-hd-operatingsystem="generic"} {:generic: data-hd-programlang="generic"} {:gif: data-image-type='gif'} {:go: .ph data-hd-programlang='go'} {:help: data-hd-content-type='help'} {:hide-dashboard: .hide-dashboard} {:hide-in-docs: .hide-in-docs} {:important: .important} {:ios: data-hd-operatingsystem="ios"} {:java: .ph data-hd-programlang='java'} {:java: data-hd-programlang="java"} {:javascript: .ph data-hd-programlang='javascript'} {:javascript: data-hd-programlang="javascript"} {:new_window: target="_blank"} {:note .note} {:note: .note} {:objectc data-hd-programlang="objectc"} {:org_name: data-hd-keyref="org_name"} {:php: data-hd-programlang="php"} {:pre: .pre} {:preview: .preview} {:python: .ph data-hd-programlang='python'} {:python: data-hd-programlang="python"} {:route: data-hd-keyref="route"} {:row-headers: .row-headers} {:ruby: .ph data-hd-programlang='ruby'} {:ruby: data-hd-programlang="ruby"} {:runtime: architecture="runtime"} {:runtimeIcon: .runtimeIcon} {:runtimeIconList: .runtimeIconList} {:runtimeLink: .runtimeLink} {:runtimeTitle: .runtimeTitle} {:screen: .screen} {:script: data-hd-video='script'} {:service: architecture="service"} 
{:service_instance_name: data-hd-keyref="service_instance_name"} {:service_name: data-hd-keyref="service_name"} {:shortdesc: .shortdesc} {:space_name: data-hd-keyref="space_name"} {:step: data-tutorial-type='step'} {:subsection: outputclass="subsection"} {:support: data-reuse='support'} {:swift: .ph data-hd-programlang='swift'} {:swift: data-hd-programlang="swift"} {:table: .aria-labeledby="caption"} {:term: .term} {:tip: .tip} {:tooling-url: data-tooling-url-placeholder='tooling-url'} {:troubleshoot: data-hd-content-type='troubleshoot'} {:tsCauses: .tsCauses} {:tsResolve: .tsResolve} {:tsSymptoms: .tsSymptoms} {:tutorial: data-hd-content-type='tutorial'} {:ui: .ph data-hd-interface='ui'} {:unity: .ph data-hd-programlang='unity'} {:url: data-credential-placeholder='url'} {:user_ID: data-hd-keyref="user_ID"} {:vbnet: .ph data-hd-programlang='vb.net'} {:video: .video}
{: #cs_troubleshoot}
As you use {{site.data.keyword.containerlong}}, consider these techniques for general troubleshooting and debugging your cluster and cluster master. {: shortdesc}
General ways to resolve issues
- Keep your cluster environment up to date.
- Check monthly for available security and operating system patches to update your worker nodes.
- Update your cluster to the latest default version for {{site.data.keyword.containershort}}.
- Make sure that your command line tools are up to date.
- In the command line, you are notified when updates to the `ibmcloud` CLI and plug-ins are available. Be sure to keep your CLI up-to-date so that you can use all available commands and flags.
- Make sure that your `kubectl` CLI client matches the same Kubernetes version as your cluster server. Kubernetes does not support{: external} `kubectl` client versions that are 2 or more versions apart from the server version (n +/- 2).
Reviewing issues and status
- To see whether {{site.data.keyword.cloud_notm}} is available, check the {{site.data.keyword.cloud_notm}} status page{: external}.
- Filter for the Kubernetes Service component.
{: #debug_utility} {: troubleshoot} {: support}
While you troubleshoot, you can use the {{site.data.keyword.containerlong_notm}} Diagnostics and Debug Tool to run tests and gather pertinent information from your cluster. {: shortdesc}
Infrastructure provider:
Before you begin:
If you previously installed the debug tool by using Helm, first uninstall the `ibmcloud-iks-debug` Helm chart.
- Find the installation name of your Helm chart.
helm list -n <namespace> | grep ibmcloud-iks-debug
{: pre}
Example output:
<helm_chart_name> 1 Thu Sep 13 16:41:44 2019 DEPLOYED ibmcloud-iks-debug-1.0.0 default
{: screen}
- Uninstall the debug tool installation by deleting the Helm chart.
helm uninstall <helm_chart_name> -n <namespace>
{: pre}
- Verify that the debug tool pods are removed. When the uninstallation is complete, no pods are returned by the following command.
kubectl get pod --all-namespaces | grep ibmcloud-iks-debug
{: pre}
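The verification step above can be wrapped in a small check. This is a minimal sketch, not part of the debug tool: `debug_pods_removed` is a hypothetical helper name, and it only inspects whatever text is piped to it.

```sh
# Reads `kubectl get pod --all-namespaces` output on stdin and reports
# whether any line still mentions the ibmcloud-iks-debug chart.
# debug_pods_removed is a hypothetical helper, not an IBM Cloud command.
debug_pods_removed() {
  if grep -q 'ibmcloud-iks-debug'; then
    echo "debug tool pods still present"
    return 1
  fi
  echo "no debug tool pods found"
}

# Usage against a live cluster:
#   kubectl get pod --all-namespaces | debug_pods_removed
```

The function returns a non-zero exit code while pods remain, so it can be polled in a loop until the uninstallation completes.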
To enable and use the Diagnostics and Debug Tool add-on:
- In your cluster dashboard{: external}, click the name of the cluster where you want to install the debug tool add-on.
- Click the Add-ons tab.
- On the Diagnostics and Debug Tool card, click Install.
- In the dialog box, click Install. Note that it can take a few minutes for the add-on to be installed. To resolve some common issues that you might encounter during the add-on deployment, see Reviewing add-on state and statuses.
- On the Diagnostics and Debug Tool card, click Dashboard.
- In the debug tool dashboard, select individual tests or a group of tests to run. Some tests check for potential warnings, errors, or issues, and some tests only gather information that you can reference while you troubleshoot. For more information about the function of each test, click the information icon next to the test's name.
- Click Run.
- Check the results of each test.
  - If any test fails, click the information icon next to the test's name in the left column for information about how to resolve the issue.
  - You can also use the results of tests to gather information, such as complete YAMLs, that can help you debug your cluster in the following sections.
{: #debug_clusters} {: troubleshoot} {: support}
Review the options to debug your clusters and find the root causes for failures. {: shortdesc}
Infrastructure provider:
- List your cluster and find the `State` of the cluster.
ibmcloud ks cluster ls
{: pre}
- Review the `State` of your cluster. If your cluster is in a Critical, Delete failed, or Warning state, or is stuck in the Pending state for a long time, start debugging the worker nodes.

You can view the current cluster state by running the `ibmcloud ks cluster ls` command and locating the State field. {: shortdesc}

Cluster state | Description |
---|---|
`Aborted` | The deletion of the cluster is requested by the user before the Kubernetes master is deployed. After the deletion of the cluster is completed, the cluster is removed from your dashboard. If your cluster is stuck in this state for a long time, open an [{{site.data.keyword.cloud_notm}} support case](/docs/containers?topic=containers-get-help). |
`Critical` | The Kubernetes master cannot be reached or all worker nodes in the cluster are down. If you enabled {{site.data.keyword.keymanagementservicelong_notm}} in your cluster, the {{site.data.keyword.keymanagementserviceshort}} container might fail to encrypt or decrypt your cluster secrets. If so, you can view an error with more information when you run `kubectl get secrets`. |
`Delete failed` | The Kubernetes master or at least one worker node cannot be deleted. List worker nodes by running `ibmcloud ks worker ls --cluster <cluster_name_or_ID>`. If worker nodes are listed, see [Unable to create or delete worker nodes](/docs/containers?topic=containers-cs_troubleshoot_clusters#infra_errors). If no workers are listed, open an [{{site.data.keyword.cloud_notm}} support case](/docs/containers?topic=containers-get-help). |
`Deleted` | The cluster is deleted but not yet removed from your dashboard. If your cluster is stuck in this state for a long time, open an [{{site.data.keyword.cloud_notm}} support case](/docs/containers?topic=containers-get-help). |
`Deleting` | The cluster is being deleted and cluster infrastructure is being dismantled. You cannot access the cluster. |
`Deploy failed` | The deployment of the Kubernetes master could not be completed. You cannot resolve this state. Contact IBM Cloud support by opening an [{{site.data.keyword.cloud_notm}} support case](/docs/containers?topic=containers-get-help). |
`Deploying` | The Kubernetes master is not fully deployed yet. You cannot access your cluster. Wait until your cluster is fully deployed to review the health of your cluster. |
`Normal` | All worker nodes in a cluster are up and running. You can access the cluster and deploy apps to the cluster. This state is considered healthy and does not require an action from you. Although the worker nodes might be normal, other infrastructure resources, such as [networking](/docs/containers?topic=containers-cs_troubleshoot_network) and [storage](/docs/containers?topic=containers-cs_troubleshoot_storage), might still need attention. If you just created the cluster, some parts of the cluster that are used by other services, such as Ingress secrets or registry image pull secrets, might still be in process. |
`Pending` | The Kubernetes master is deployed. The worker nodes are being provisioned and are not available in the cluster yet. You can access the cluster, but you cannot deploy apps to the cluster. |
`Requested` | A request to create the cluster and order the infrastructure for the Kubernetes master and worker nodes is sent. When the deployment of the cluster starts, the cluster state changes to `Deploying`. If your cluster is stuck in the `Requested` state for a long time, open an [{{site.data.keyword.cloud_notm}} support case](/docs/containers?topic=containers-get-help). |
`Updating` | The Kubernetes API server that runs in your Kubernetes master is being updated to a new Kubernetes API version. During the update, you cannot access or change the cluster. Worker nodes, apps, and resources that the user deployed are not modified and continue to run. Wait for the update to complete to review the health of your cluster. |
`Unsupported` | The [Kubernetes version](/docs/containers?topic=containers-cs_versions#cs_versions) that the cluster runs is no longer supported. Your cluster's health is no longer actively monitored or reported. Additionally, you cannot add or reload worker nodes. To continue receiving important security updates and support, you must update your cluster. Review the [version update preparation actions](/docs/containers?topic=containers-cs_versions#prep-up), then [update your cluster](/docs/containers?topic=containers-update#update) to a supported Kubernetes version. |
`Warning` | One of the following applies. (1) At least one worker node in the cluster is not available, but other worker nodes are available and can take over the workload. Try to [reload](/docs/containers?topic=containers-cli-plugin-kubernetes-service-cli#cs_worker_reload) the unavailable worker nodes. (2) Your cluster has zero worker nodes, such as if you created a cluster without any worker nodes or manually removed all the worker nodes from the cluster. [Resize your worker pool](/docs/containers?topic=containers-add_workers#resize_pool) to add worker nodes to recover from a `Warning` state, and then [update the Calico node entries for your worker nodes](/docs/containers?topic=containers-cs_troubleshoot_clusters#zero_nodes_calico_failure). (3) A control plane operation for your cluster failed. View the cluster in the console or run `ibmcloud ks cluster get --cluster <cluster_name_or_ID>` to [check the **Master Status** for further debugging](/docs/containers?topic=containers-cs_troubleshoot#debug_master). |
{: caption="Cluster states"}
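As a rough aid for scripting around the state descriptions above, the following sketch maps a `State` value to a coarse next step. The function name `next_step_for_state` and the five category labels are assumptions made for illustration; they compress the descriptions and do not replace them.

```sh
# Maps a cluster State value to a coarse follow-up category.
# next_step_for_state and its output labels are hypothetical.
next_step_for_state() {
  case "$1" in
    "Normal")
      echo "healthy" ;;
    "Deploying"|"Pending"|"Requested"|"Updating"|"Deleting")
      echo "wait" ;;
    "Critical"|"Warning"|"Delete failed")
      echo "debug" ;;
    "Aborted"|"Deleted"|"Deploy failed")
      echo "support-case" ;;
    "Unsupported")
      echo "update-cluster" ;;
    *)
      echo "unknown" ;;
  esac
}

# Usage: pass the value of the State column from `ibmcloud ks cluster ls`.
#   next_step_for_state "Delete failed"
```

States in the `wait` group can still need a support case if the cluster stays in them for a long time, so treat the result as a starting point only.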
The [Kubernetes master](/docs/containers?topic=containers-service-arch) is the main component that keeps your cluster up and running. The master stores cluster resources and their configurations in the etcd database that serves as the single point of truth for your cluster. The Kubernetes API server is the main entry point for all cluster management requests from the worker nodes to the master, or when you want to interact with your cluster resources.
If a master failure occurs, your workloads continue to run on the worker nodes, but you cannot use `kubectl` commands to work with your cluster resources or view the cluster health until the Kubernetes API server in the master is back up. If a pod goes down during the master outage, the pod cannot be rescheduled until the worker node can reach the Kubernetes API server again.
During a master outage, you can still run `ibmcloud ks` commands against the {{site.data.keyword.containerlong_notm}} API to work with your infrastructure resources, such as worker nodes or VLANs. If you change the current cluster configuration by adding or removing worker nodes to the cluster, your changes do not happen until the master is back up.
Do not restart or reboot a worker node during a master outage. This action removes the pods from your worker node. Because the Kubernetes API server is unavailable, the pods cannot be rescheduled onto other worker nodes in the cluster.
{: #debug_master} {: troubleshoot} {: support}
Infrastructure provider:
Your {{site.data.keyword.containerlong_notm}} cluster includes an IBM-managed master with highly available replicas, automatic security patch updates applied for you, and automation in place to recover in case of an incident. You can check the health, status, and state of the cluster master by running `ibmcloud ks cluster get --cluster <cluster_name_or_ID>`. {: shortdesc}
Master Health
The Master Health reflects the state of master components and notifies you if something needs your attention. The health might be one of the following:
- `error`: The master is not operational. IBM is automatically notified and takes action to resolve this issue. You can continue monitoring the health until the master is `normal`. You can also open an {{site.data.keyword.cloud_notm}} support case.
- `normal`: The master is operational and healthy. No action is required.
- `unavailable`: The master might not be accessible, which means some actions such as resizing a worker pool are temporarily unavailable. IBM is automatically notified and takes action to resolve this issue. You can continue monitoring the health until the master is `normal`.
- `unsupported`: The master runs an unsupported version of Kubernetes. You must update your cluster to return the master to `normal` health.
Master Status and State
The Master Status provides details of what operation from the master state is in progress. The status includes a timestamp of how long the master has been in the same state, such as `Ready (1 month ago)`. The Master State reflects the lifecycle of possible operations that can be performed on the master, such as deploying, updating, and deleting. Each state is described in the following table.
Master state | Description |
---|---|
`deployed` | The master is successfully deployed. Check the status to verify that the master is `Ready` or to see if an update is available. |
`deploying` | The master is currently deploying. Wait for the state to become `deployed` before working with your cluster, such as adding worker nodes. |
`deploy_failed` | The master failed to deploy. IBM Support is notified and works to resolve the issue. Check the Master Status field for more information, or wait for the state to become `deployed`. |
`deleting` | The master is currently deleting because you deleted the cluster. You cannot undo a deletion. After the cluster is deleted, you can no longer check the master state because the cluster is completely removed. |
`delete_failed` | The master failed to delete. IBM Support is notified and works to resolve the issue. You cannot resolve the issue by trying to delete the cluster again. Instead, check the Master Status field for more information, or wait for the cluster to delete. You can also open an {{site.data.keyword.cloud_notm}} support case. |
`updating` | The master is updating its Kubernetes version. The update might be a patch update that is automatically applied, or a minor or major version that you applied by updating the cluster. During the update, your highly available master can continue processing requests, and your app workloads and worker nodes continue to run. After the master update is complete, you can update your worker nodes. If the update is unsuccessful, the master returns to a `deployed` state and continues running the previous version. IBM Support is notified and works to resolve the issue. You can check if the update failed in the Master Status field. |
`update_cancelled` | The master update is canceled because the cluster was not in a healthy state at the time of the update. Your master remains in this state until your cluster is healthy and you manually update the master. To update the master, use the `ibmcloud ks cluster master update` command. If you do not want to update the master to the default major.minor version during the update, include the `--version` flag and specify the latest patch version that is available for the major.minor version that you want, such as `1.20.6`. To list available versions, run `ibmcloud ks versions`. |
`update_failed` | The master update failed. IBM Support is notified and works to resolve the issue. You can continue to monitor the health of the master until the master reaches a normal state. If the master remains in this state for more than 1 day, open an {{site.data.keyword.cloud_notm}} support case. IBM Support might identify other issues in your cluster that you must fix before the master can be updated. |
{: caption="Master states"}
{: summary="Table rows read from left to right, with the master state in column one and a description in column two."}
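To read the master fields from a script, you might filter the `ibmcloud ks cluster get` output. The sketch below assumes that each field is printed on its own line in the form `Master State:   deployed`; that layout and the helper name `master_field` are assumptions, so check them against the output of your CLI version.

```sh
# Prints the value of one "Field: value" line read from stdin.
# Assumes one field per line, for example "Master State:   deployed".
# master_field is a hypothetical helper, not an IBM Cloud command.
master_field() {
  sed -n "s/^$1:[[:space:]]*//p" | head -n 1
}

# Usage against a live cluster:
#   ibmcloud ks cluster get --cluster <cluster_name_or_ID> | master_field "Master Health"
```

Because the match is anchored on the full field name followed by a colon, `Master State` does not accidentally pick up the `Master Status` line.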
{: #ts_clis} {: troubleshoot} {: support}
Review the following common reasons for CLI connection issues or command failures. {: shortdesc}
Infrastructure provider:
{: #ts_firewall_clis}
{: tsSymptoms}
When you run `ibmcloud`, `kubectl`, or `calicoctl` commands from the CLI, they fail.
{: tsCauses} You might have corporate network policies that prevent access from your local system to public endpoints via proxies or firewalls.
{: tsResolve} Allow TCP access for the CLI commands to work. This task requires the Administrator {{site.data.keyword.cloud_notm}} IAM platform access role for the cluster.
{: #kubectl_fails}
{: tsSymptoms}
When you run `kubectl` commands against your cluster, your commands fail with an error message similar to the following.
No resources found.
Error from server (NotAcceptable): unknown (get nodes)
{: screen}
invalid object doesn't have additional properties
{: screen}
error: No Auth Provider found for name "oidc"
{: screen}
{: tsCauses}
You have a different version of `kubectl` than your cluster version. Kubernetes does not support{: external} `kubectl` client versions that are 2 or more versions apart from the server version (n +/- 2). If you use a community Kubernetes cluster, you might also have the {{site.data.keyword.openshiftshort}} version of `kubectl`, which does not work with community Kubernetes clusters.
To check your client `kubectl` version against the cluster server version, run `kubectl version --short`.
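The version rule above can be expressed as a quick comparison of minor versions. This is a minimal sketch under the assumption that both versions look like `1.20.6` or `v1.20.6`; `minor_of` and `kubectl_skew_ok` are hypothetical helper names, and you supply the two version strings yourself from the `kubectl version` output.

```sh
# Extracts the minor number from a version such as "1.20.6" or "v1.20.6".
minor_of() {
  v=${1#v}        # drop an optional leading "v"
  v=${v#*.}       # drop the major number and the first dot
  echo "${v%%.*}" # keep everything before the next dot
}

# Succeeds when client and server minor versions differ by less than 2.
kubectl_skew_ok() {
  client_minor=$(minor_of "$1")
  server_minor=$(minor_of "$2")
  diff=$((client_minor - server_minor))
  if [ "$diff" -lt 0 ]; then diff=$((0 - diff)); fi
  [ "$diff" -lt 2 ]
}

# Usage:
#   kubectl_skew_ok 1.20.6 1.19.9 && echo "supported skew"
```

The check compares only the minor number, so it does not catch a mismatch between a community `kubectl` and an {{site.data.keyword.openshiftshort}} build.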
{: tsResolve}
Install the version of `kubectl` that matches the Kubernetes version of your cluster.
If you have multiple clusters at different Kubernetes versions or different container platforms such as {{site.data.keyword.openshiftshort}}, download each `kubectl` version binary file to a separate directory. Then, you can set up an alias in your local command-line interface (CLI) profile to point to the `kubectl` binary file directory that matches the `kubectl` version of the cluster that you want to work with, or you might be able to use a tool such as `brew switch kubernetes-cli <major.minor>`.
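One way to organize the separate directories is sketched below. The `$HOME/kubectl-versions/<major.minor>/kubectl` layout, the `KUBECTL_VERSIONS_DIR` variable, and the `kubectl_path_for` helper are all assumptions chosen for illustration; any layout works as long as each alias points at the right binary.

```sh
# Prints the path of the kubectl binary stored for a major.minor version.
# The directory layout and KUBECTL_VERSIONS_DIR are assumptions.
kubectl_path_for() {
  echo "${KUBECTL_VERSIONS_DIR:-$HOME/kubectl-versions}/$1/kubectl"
}

# Example aliases for a shell profile (alias names are illustrative):
#   alias kubectl120="$(kubectl_path_for 1.20)"
#   alias kubectl119="$(kubectl_path_for 1.19)"
```

With aliases like these, you pick the client per command instead of switching a single global `kubectl`.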
{: #exec_logs_fail}
{: tsSymptoms}
If you run commands such as `kubectl exec`, `kubectl attach`, `kubectl proxy`, `kubectl port-forward`, or `kubectl logs`, you see the following message.
<workerIP>:10250: getsockopt: connection timed out
{: screen}
{: tsCauses} The OpenVPN connection between the master node and worker nodes is not functioning properly.
{: tsResolve}
1. In classic clusters, if you have multiple VLANs for your cluster, multiple subnets on the same VLAN, or a multizone classic cluster, you must enable Virtual Routing and Forwarding (VRF) for your IBM Cloud infrastructure account so that your worker nodes can communicate with each other on the private network. To enable VRF, see Enabling VRF. To check whether a VRF is already enabled, use the `ibmcloud account show` command. If you cannot or do not want to enable VRF, enable VLAN spanning. To perform this action, you need the Network > Manage Network VLAN Spanning infrastructure permission, or you can request the account owner to enable it. To check whether VLAN spanning is already enabled, use the `ibmcloud ks vlan spanning get --region <region>` command.
2. Restart the OpenVPN client pod.
kubectl delete pod -n kube-system -l app=vpn
{: pre}
3. If you still see the same error message, then the worker node that the VPN pod is on might be unhealthy. To restart the VPN pod and reschedule it to a different worker node, cordon, drain, and reboot the worker node that the VPN pod is on.
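The cordon, drain, and reboot sequence from the last step can be laid out as follows. This sketch only prints the commands so that you can review them before running anything; `vpn_worker_plan` is a hypothetical helper, and the `--ignore-daemonsets` flag is an assumption about what a drain typically needs in a cluster that runs daemon sets.

```sh
# Prints, without running them, the commands to move the VPN pod off a worker.
# Arguments: cluster name or ID, worker ID, and the node name that kubectl
# reports for that worker. vpn_worker_plan is a hypothetical helper.
vpn_worker_plan() {
  cluster=$1
  worker=$2
  node=$3
  echo "kubectl cordon $node"
  echo "kubectl drain $node --ignore-daemonsets"
  echo "ibmcloud ks worker reboot --cluster $cluster --worker $worker"
}

# To find the node that hosts the VPN pod first:
#   kubectl get pod -n kube-system -l app=vpn -o wide
# After the reboot finishes, `kubectl uncordon <node>` makes the node
# schedulable again.
```

Printing the plan first keeps the destructive steps explicit, which matters because a drain evicts every pod on the node, not only the VPN pod.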
{: #infra_errors} {: troubleshoot} {: support}
You cannot perform infrastructure-related commands on your cluster, such as:
- Adding worker nodes in an existing cluster or when creating a new cluster
- Removing worker nodes
- Reloading or rebooting worker nodes
- Resizing worker pools
- Updating your cluster
- Deleting your cluster
Review the error messages in the following sections to troubleshoot infrastructure-related issues that are caused by incorrect cluster permissions, orphaned clusters in other infrastructure accounts, or a time-based one-time passcode (TOTP) on the account.
{: #cs_credentials}
{: tsSymptoms} You cannot manage worker nodes for your cluster, and you receive an error message similar to one of the following.
We were unable to connect to your IBM Cloud infrastructure account.
Creating a standard cluster requires that you have either a
Pay-As-You-Go account that is linked to an IBM Cloud infrastructure
account term or that you have used the {{site.data.keyword.containerlong_notm}}
CLI to set your {{site.data.keyword.cloud_notm}} Infrastructure API keys.
{: screen}
'Item' must be ordered with permission.
{: screen}
The worker node instance '<ID>' cannot be found. Review '<provider>' infrastructure user permissions.
{: screen}
The worker node instance cannot be found. Review '<provider>' infrastructure user permissions.
{: screen}
The worker node instance cannot be identified. Review '<provider>' infrastructure user permissions.
{: screen}
The IAM token exchange request failed with the message: <message>
IAM token exchange request failed: <message>
{: screen}
The cluster could not be configured with the registry. Make sure that you have the Administrator role for {{site.data.keyword.registrylong_notm}}.
{: screen}
{: tsCauses} The infrastructure credentials that are set for the region and resource group are missing the appropriate infrastructure permissions. The user's infrastructure permissions are most commonly stored as an API key for the region and resource group. More rarely, if you use a different {{site.data.keyword.cloud_notm}} account type, you might have set infrastructure credentials manually.
{: tsResolve} The account owner must set up the infrastructure account credentials properly. The credentials depend on what type of infrastructure account you are using.
Before you begin, log in to your account. If applicable, target the appropriate resource group. Set the context for your cluster.
- Identify what user credentials are used for the region and resource group's infrastructure permissions.
-
Check the API key for a region and resource group of the cluster.
ibmcloud ks api-key info --cluster <cluster_name_or_ID>
{: pre}
Example output:
Getting information about the API key owner for cluster <cluster_name>...
OK
Name          Email
<user_name>   <[email protected]>
{: screen}
-
Check if the classic infrastructure account for the region and resource group is manually set to use a different IBM Cloud infrastructure account.
ibmcloud ks credential get --region <us-south>
{: pre}
Example output if credentials are set to use a different classic account. In this case, the user's infrastructure credentials are used for the region and resource group that you targeted, even if a different user's credentials are stored in the API key that you retrieved in the previous step.
OK Infrastructure credentials for user name <[email protected]> set for resource group <resource_group_name>.
{: screen}
Example output if credentials are not set to use a different classic account. In this case, the API key owner that you retrieved in the previous step has the infrastructure credentials that are used for the region and resource group.
FAILED No credentials set for resource group <resource_group_name>.: The user credentials could not be found. (E0051)
{: screen}
-
- Validate the infrastructure permissions that the user has.
-
List the suggested and required infrastructure permissions for the region and resource group.
ibmcloud ks infra-permissions get --region <region>
{: pre}
For console and CLI commands to assign these permissions, see Classic infrastructure roles. {: tip}
-
Make sure that the infrastructure credentials owner for the API key or the manually-set account has the correct permissions.
-
If necessary, you can change the API key or manually-set infrastructure credentials owner for the region and resource group.
-
- Test that the changed permissions permit authorized users to perform infrastructure operations for the cluster.
-
For example, you might try to delete a worker node.
ibmcloud ks worker rm --cluster <cluster_name_or_ID> --worker <worker_node_ID>
{: pre}
-
Check to see if the worker node is removed.
ibmcloud ks worker get --cluster <cluster_name_or_ID> --worker <worker_node_ID>
{: pre}
Example output if the worker node removal is successful. The `worker get` operation fails because the worker node is deleted. The infrastructure permissions are correctly set up.
FAILED The specified worker node could not be found. (E0011)
{: screen}
-
If the worker node is not removed, review the State and Status fields and the common issues with worker nodes to continue debugging.
-
If you manually set credentials and still cannot see the cluster's worker nodes in your infrastructure account, you might check whether the cluster is orphaned.
-
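The first step above comes down to reading which of the two example outputs `ibmcloud ks credential get` returns. The following sketch classifies that text; it matches only the two example messages shown earlier, and `credential_source` with its three labels is a hypothetical helper.

```sh
# Reads `ibmcloud ks credential get --region <region>` output on stdin and
# prints which credentials apply, based on the two example outputs above.
# credential_source and its labels are hypothetical.
credential_source() {
  out=$(cat)
  case "$out" in
    *"No credentials set"*)
      echo "api-key-owner" ;;   # the API key owner's permissions are used
    *"Infrastructure credentials for user name"*)
      echo "manually-set" ;;    # manually set credentials take precedence
    *)
      echo "unknown" ;;
  esac
}

# Usage (2>&1 is there in case the CLI writes the FAILED message to stderr):
#   ibmcloud ks credential get --region <region> 2>&1 | credential_source
```

An `unknown` result means the output matched neither example, for instance because the login expired, so check the raw output in that case.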
{: #orphaned}
Infrastructure provider: Classic
{: tsSymptoms} You cannot manage worker nodes for your cluster, or view the cluster worker nodes in your classic IBM Cloud infrastructure account. However, you can update and manage other clusters in the account.
Further, you verified that you have the proper infrastructure credentials.
You might receive an error message in your worker node status similar to the following.
Incorrect account for worker - The 'classic' infrastructure user credentials changed and no longer match the worker node instance infrastructure account.
{: screen}
{: tsCauses} The cluster might be provisioned in a classic IBM Cloud infrastructure account that is no longer linked to your {{site.data.keyword.containerlong_notm}} account. The cluster is orphaned. Because the resources are in a different account, you do not have the infrastructure credentials to modify the resources.
Consider the following example scenario to understand how clusters might become orphaned.
- You have an {{site.data.keyword.cloud_notm}} Pay-As-You-Go account.
- You create a cluster named `Cluster1`. The worker nodes and other infrastructure resources are provisioned into the infrastructure account that comes with your Pay-As-You-Go account.
- Later, you find out that your team uses a legacy or shared classic IBM Cloud infrastructure account. You use the `ibmcloud ks credential set` command to change the IBM Cloud infrastructure credentials to use your team account.
- You create another cluster named `Cluster2`. The worker nodes and other infrastructure resources are provisioned into the team infrastructure account.
- You notice that `Cluster1` needs a worker node update, a worker node reload, or you just want to clean it up by deleting it. However, because `Cluster1` was provisioned into a different infrastructure account, you cannot modify its infrastructure resources. `Cluster1` is orphaned.
- You follow the resolution steps in the following section, but do not set your infrastructure credentials back to your team account. You can delete `Cluster1`, but now `Cluster2` is orphaned.
- You change your infrastructure credentials back to the team account that created `Cluster2`. Now, you no longer have an orphaned cluster!
{: tsResolve}
-
Check which infrastructure account the region that your cluster is in currently uses to provision clusters. Replace `<region>` with the {{site.data.keyword.cloud_notm}} region that the cluster is in.
ibmcloud ks credential get --region <region>
{: pre}
If you see a message similar to the following, then the account uses the default, linked infrastructure account.
No credentials set for resource group <resource group>.: The user credentials could not be found.
{: screen}
-
Check which infrastructure account was used to provision the cluster.
- In the Worker Nodes tab, select a worker node and note its ID.
- Open the menu and click Classic Infrastructure.
- From the infrastructure navigation pane, click Devices > Device List.
- Search for the worker node ID that you previously noted.
- If you do not find the worker node ID, the worker node is not provisioned into this infrastructure account. Switch to a different infrastructure account and try again.
-
Compare the infrastructure accounts.
-
If the worker nodes are in the linked infrastructure account: Use the `ibmcloud ks credential unset` command to resume using the default infrastructure credentials that are linked with your Pay-As-You-Go account.
-
If the worker nodes are in a different infrastructure account: Use the `ibmcloud ks credential set` command to change your infrastructure credentials to the account that the cluster worker nodes are provisioned in, which you found in the previous step.
If you no longer have access to the infrastructure credentials, you can open an {{site.data.keyword.cloud_notm}} support case to determine an email address for the administrator of the other infrastructure account. However, {{site.data.keyword.cloud_notm}} Support cannot remove the orphaned cluster for you, and you must contact the administrator of the other account to get the infrastructure credentials. {: note}
-
If the infrastructure accounts match: Check the rest of the worker nodes in the cluster and see if any has a different infrastructure account. Make sure that you checked the worker nodes in the cluster that has the credentials issue. Review other common infrastructure credential issues.
-
-
Now that the infrastructure credentials are updated, retry the blocked action, such as updating or deleting a worker node, and verify that the action succeeds.
-
If you have other clusters in the same region and resource group that require the previous infrastructure credentials, repeat Step 3 to reset the infrastructure credentials to the previous account. Note that if you created clusters with a different infrastructure account than the account that you switch to, you might orphan those clusters.
Tired of switching infrastructure accounts each time you need to perform a cluster or worker action? Consider re-creating all the clusters in the region and resource group in the same infrastructure account. Then, migrate your workloads and remove the old clusters from the different infrastructure account. {: note}
{: #vpe-ts}
Infrastructure provider: VPC Kubernetes version 1.20 or later
{: tsSymptoms} You cannot manage worker nodes for your cluster, and you receive an error message similar to one of the following.
Worker deploy failed due to network communications failing to master or registry endpoints. Please verify your network setup is allowing traffic from this subnet then attempt a worker replace on this worker
{: screen}
Pending endpoint gateway creation
{: screen}
{: tsCauses} In clusters that run Kubernetes version 1.20 or later, worker nodes can communicate with the Kubernetes master through the cluster's virtual private endpoint (VPE). One VPE gateway resource is created per cluster in your VPC. If the VPE gateway for your cluster is not correctly created in your VPC, the VPE gateway is deleted from your VPC, or the IP address that is reserved for the VPE is deleted from your VPC subnet, worker nodes lose connectivity with the Kubernetes master.
{: tsResolve} Re-establish the VPE connection between your worker nodes and Kubernetes master.
- To check the VPE gateway for your cluster in the VPC infrastructure console, open the Virtual private endpoint gateways for VPC dashboard{: external} and look for the VPE gateway in the format
iks-<cluster_ID>
.
- If the gateway for your cluster is not listed, continue to the next step.
- If the gateway for your cluster is listed but its status is not
Stable
, open a support case. In the case details, include the cluster ID.
- If the gateway for your cluster is listed and its status is
Stable
, you might have firewall or security group rules that are blocking worker node communication to the cluster master. Configure your security group rules to allow outgoing traffic to the appropriate ports and IP addresses.
-
Refresh the cluster master. If the VPE gateway did not exist in your VPC, it is created, and connectivity to the reserved IP addresses on the subnets that your worker nodes are connected to is re-established. After you refresh the cluster, wait a few minutes to allow the operation to complete.
ibmcloud ks cluster master refresh -c <cluster_name_or_ID>
{: pre}
-
Verify that the VPE gateway for your cluster is created by opening the Virtual private endpoint gateways for VPC dashboard{: external} and looking for the VPE gateway in the format
iks-<cluster_ID>
. -
If you still cannot manage worker nodes after the cluster master is refreshed, replace the worker nodes that you cannot access.
-
List all worker nodes in your cluster and note the name of the worker node that you want to replace.
kubectl get nodes
{: pre}
The name that is returned in this command is the private IP address that is assigned to your worker node. You can find more information about your worker node when you run the
ibmcloud ks worker ls --cluster <cluster_name_or_ID>
command and look for the worker node with the same Private IP address. -
Replace the worker node. As part of the replace process, the pods that run on the worker node are drained and rescheduled onto remaining worker nodes in the cluster. The worker node is also cordoned, or marked as unavailable for future pod scheduling. Use the worker node ID that is returned from the
ibmcloud ks worker ls --cluster <cluster_name_or_ID>
command.
ibmcloud ks worker replace --cluster <cluster_name_or_ID> --worker <worker_node_ID>
{: pre}
-
Verify that the worker node is replaced.
ibmcloud ks worker ls --cluster <cluster_name_or_ID>
{: pre}
-
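The gateway checks in the first step can be summarized as a small decision helper. This is a minimal sketch, not part of the CLI: the function names `vpe_gateway_name` and `vpe_next_step` and the returned labels are hypothetical, and the status value is whatever you read from the VPC console.

```shell
#!/bin/sh
# Sketch only: maps the VPE gateway status from the VPC console to the next
# troubleshooting step described above. Function names are hypothetical.

# Build the expected gateway name for a cluster ID (format: iks-<cluster_ID>).
vpe_gateway_name() {
  printf 'iks-%s\n' "$1"
}

# $1 is the gateway status from the console, or empty if no gateway is listed.
vpe_next_step() {
  case "$1" in
    "")     echo "refresh-master" ;;         # gateway not listed: refresh the cluster master
    Stable) echo "check-security-groups" ;;  # gateway healthy: review firewall and security group rules
    *)      echo "open-support-case" ;;      # any other status: open a case and include the cluster ID
  esac
}
```

For example, `vpe_next_step ""` prints `refresh-master`, which corresponds to running the `ibmcloud ks cluster master refresh` command.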
{: #cs_totp}
Infrastructure provider: Classic
{: tsSymptoms} You cannot manage worker nodes for your cluster, and you receive an error message similar to one of the following.
Unable to connect to the IBM Cloud account. Ensure that you have a paid account.
{: screen}
Cannot authenticate the infrastructure user: Time-based One Time Password authentication is required to log in with this user.
{: screen}
{: tsCauses} Your {{site.data.keyword.cloud_notm}} account uses its own automatically linked infrastructure through a Pay-As-You-Go account. However, the account administrator enabled the time-based one-time passcode (TOTP) option, so users are prompted for a passcode at login. This type of multifactor authentication (MFA) is account-based, and affects all access to the account. TOTP MFA also affects the access that {{site.data.keyword.containerlong_notm}} requires to make calls to {{site.data.keyword.cloud_notm}} infrastructure. If TOTP is enabled for the account, you cannot create and manage clusters and worker nodes in {{site.data.keyword.containerlong_notm}}.
{: tsResolve} The {{site.data.keyword.cloud_notm}} account owner or an account administrator must either:
- Disable TOTP for the account, and continue to use the automatically linked infrastructure credentials for {{site.data.keyword.containerlong_notm}}.
- Continue to use TOTP, but create an infrastructure API key that {{site.data.keyword.containerlong_notm}} can use to make direct calls to the {{site.data.keyword.cloud_notm}} infrastructure API.
To disable TOTP MFA for the account:
- Log in to the {{site.data.keyword.cloud_notm}} console{: external}. From the menu bar, select Manage > Access (IAM).
- In the left navigation, click the Settings page.
- Under Multifactor authentication, click Edit.
- Select None, and click Update.
To use TOTP MFA and create an infrastructure API key for {{site.data.keyword.containerlong_notm}}:
- From the {{site.data.keyword.cloud_notm}}{: external} console, select Manage > Access (IAM) > Users and click the name of the account owner. Note: If you do not use the account owner's credentials, first ensure that the user whose credentials you use has the correct permissions.
- In the API Keys section, find or create a classic infrastructure API key.
- Use the infrastructure API key to set the infrastructure API credentials for {{site.data.keyword.containerlong_notm}}. Repeat this command for each region where you create clusters.
ibmcloud ks credential set classic --infrastructure-username <infrastructure_API_username> --infrastructure-api-key <infrastructure_API_authentication_key> --region <region>
{: pre}
- Verify that the correct credentials are set.
ibmcloud ks credential get --region <region>
{: pre}
Example output:
Infrastructure credentials for user name [email protected] set for resource group default.
{: screen}
- To ensure that existing clusters use the updated infrastructure API credentials, run
ibmcloud ks api-key reset --region <region>
in each region where you have clusters.
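Because the credential commands must be repeated for each region, a short loop can help. The following is a minimal sketch that uses only the commands shown in the steps above. The `run` and `set_credentials_for_regions` function names and the `INFRA_USER`, `INFRA_API_KEY`, and `DRY_RUN` variables are hypothetical. By default the commands are only printed, and the printed text includes the API key, so avoid sharing dry-run output.

```shell
#!/bin/sh
# Sketch only: repeats the credential steps above for several regions.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
# INFRA_USER and INFRA_API_KEY are hypothetical variable names.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"        # print the command instead of running it
  else
    "$@"             # run the command as given
  fi
}

set_credentials_for_regions() {
  for region in "$@"; do
    # Set the classic infrastructure credentials for this region.
    run ibmcloud ks credential set classic \
      --infrastructure-username "$INFRA_USER" \
      --infrastructure-api-key "$INFRA_API_KEY" \
      --region "$region"
    # Verify which credentials are set.
    run ibmcloud ks credential get --region "$region"
    # Make existing clusters use the updated credentials.
    run ibmcloud ks api-key reset --region "$region"
  done
}
```

To review the commands first, call `set_credentials_for_regions <region1> <region2>` with `DRY_RUN` unset. Set `DRY_RUN=0` to run them.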
{: #ts_no_vpc}
{: tsSymptoms} You try to create a VPC cluster by using the {{site.data.keyword.containerlong_notm}} console{: external}. You have an existing VPC{: external} in your account, but when you try to select an existing Virtual Private Cloud to create the cluster in, you see the following error message:
No VPC is available. Create a VPC.
{: screen}
{: tsCauses}
During cluster creation, the {{site.data.keyword.containerlong_notm}} console uses the API key that is set for the default
resource group to list the VPCs that are available in your {{site.data.keyword.cloud_notm}} account. If no API key is set for the default
resource group, no VPCs are listed in the {{site.data.keyword.containerlong_notm}} console, even if your VPC exists in a different resource group and an API key is set for that resource group.
{: tsResolve}
To set an API key for the default
resource group, use the {{site.data.keyword.containerlong_notm}} CLI.
-
Log in to the command line as the account owner. If you want a different user than the account owner to set the API key, first ensure that the API key owner has the correct permissions.
ibmcloud login [--sso]
{: pre}
-
Target the
default
resource group.
ibmcloud target -g default
{: pre}
-
Set the API key for the region and resource group.
ibmcloud ks api-key reset --region <region>
{: pre}
-
In the {{site.data.keyword.containerlong_notm}} console{: external}, click Refresh VPCs. Your available VPCs are now listed in a drop-down menu.
{: #ts_image_pull_create}
Infrastructure provider:
{: tsSymptoms} When you created a cluster, you received an error message similar to the following.
Your cluster cannot pull images from the {{site.data.keyword.registrylong_notm}} 'icr.io' domains because an IAM access policy could not be created. Make sure that you have the IAM Administrator platform access role to {{site.data.keyword.registrylong_notm}}. Then, create an image pull secret with IAM credentials to the registry by running 'ibmcloud ks cluster pull-secret apply'.
{: screen}
{: tsCauses} During cluster creation, a service ID is created for your cluster and assigned the Reader service access policy to {{site.data.keyword.registrylong_notm}}. Then, an API key for this service ID is generated and stored in an image pull secret to authorize the cluster to pull images from {{site.data.keyword.registrylong_notm}}.
To successfully assign the Reader service access policy to the service ID during cluster creation, you must have the Administrator platform access policy to {{site.data.keyword.registrylong_notm}}.
{: tsResolve}
Steps:
- Make sure that the account owner gives you the Administrator role to {{site.data.keyword.registrylong_notm}}.
ibmcloud iam user-policy-create <your_user_email> --service-name container-registry --roles Administrator
{: pre}
- Use the
ibmcloud ks cluster pull-secret apply
command to re-create an image pull secret with the appropriate registry credentials.
{: #webhooks_update}
Infrastructure provider:
{: tsSymptoms} During a master operation such as updating your cluster version, the cluster had a broken webhook application. Now, master operations cannot complete. You see an error similar to the following:
Cannot complete cluster master operations because the cluster has a broken webhook application. For more information, see the troubleshooting docs: 'https://ibm.biz/master_webhook'
{: screen}
{: tsCauses} Your cluster has configurable Kubernetes webhook resources, validating or mutating admission webhooks, that can intercept and modify requests from various services in the cluster to the API server in the cluster master. Because webhooks can change or reject requests, broken webhooks can impact the functionality of the cluster in various ways, such as preventing you from updating the master version or other maintenance operations. For more information, see the Dynamic Admission Control{: external} in the Kubernetes documentation.
Potential causes for broken webhooks include:
- The underlying resource that issues the request is missing or unhealthy, such as a Kubernetes service, endpoint, or pod.
- The webhook is part of an add-on or other plug-in application that did not install correctly or is unhealthy.
- Your cluster might have a networking connectivity issue that prevents the webhook from communicating with the Kubernetes API server in the cluster master.
{: tsResolve} Identify and restore the resource that causes the broken webhook.
-
Create a test pod to get an error that identifies the broken webhook. The error message might have the name of the broken webhook.
kubectl run webhook-test --generator=run-pod/v1 --image pause:latest
{: pre}
In the following example, the webhook is
trust.hooks.securityenforcement.admission.cloud.ibm.com
.
Error from server (InternalError): Internal error occurred: failed calling webhook "trust.hooks.securityenforcement.admission.cloud.ibm.com": Post https://ibmcloud-image-enforcement.ibm-system.svc:443/mutating-pods?timeout=30s: dial tcp 172.21.xxx.xxx:443: connect: connection timed out
{: screen}
-
Get the name of the broken webhook.
-
If the error message has a broken webhook, replace
trust.hooks.securityenforcement.admission.cloud.ibm.com
with the broken webhook that you previously identified.
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o jsonpath='{.items[?(@.webhooks[*].name=="trust.hooks.securityenforcement.admission.cloud.ibm.com")].metadata.name}{"\n"}'
{: pre}
Example output:
image-admission-config
{: pre}
-
If the error does not have a broken webhook, list all the webhooks in your cluster and check their configurations in the following steps.
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations
{: pre}
-
-
Review the service and location details of the mutating or validating webhook configuration in the
clientConfig
section in the output of the following commands. Replace
image-admission-config
with the name that you previously identified. If the webhook exists outside the cluster, contact the cluster owner to check the webhook status.
kubectl get mutatingwebhookconfiguration image-admission-config -o yaml
{: pre}
kubectl get validatingwebhookconfigurations image-admission-config -o yaml
{: pre}
Example output:
clientConfig: caBundle: <redacted> service: name: <name> namespace: <namespace> path: /inject port: 443
{: screen}
-
Optional: Back up the webhooks, especially if you do not know how to reinstall the webhook.
kubectl get mutatingwebhookconfiguration <name> -o yaml > mutatingwebhook-backup.yaml
{: pre}
kubectl get validatingwebhookconfiguration <name> -o yaml > validatingwebhook-backup.yaml
{: pre}
-
Check the status of the related service and pods for the webhook.
- Check the service Type, Selector, and Endpoint fields.
kubectl describe service -n <namespace> <service_name>
{: pre}
- If the service type is ClusterIP, check that the OpenVPN pod is in a Running status so that the webhook can connect securely to the Kubernetes API in the cluster master. If the pod is not healthy, check the pod events, logs, worker node health, and other components to troubleshoot. For more information, see Debugging app deployments.
kubectl describe pods -n kube-system -l app=vpn
{: pre}
- If the service does not have an endpoint, check the health of the backing resources, such as a deployment or pod. If the resource is not healthy, check the pod events, logs, worker node health, and other components to troubleshoot. For more information, see Debugging app deployments.
kubectl get all -n <namespace> -l <key=value>
{: pre}
- If the service does not have any backing resources, or if troubleshooting the pods does not resolve the issue, remove the webhook.
kubectl delete mutatingwebhookconfiguration <name>
{: pre}
-
Retry the cluster master operation, such as updating the cluster.
-
If you still see the error, you might have worker node or network connectivity issues.
- Worker node troubleshooting.
- Make sure that the webhook can connect to the Kubernetes API server in the cluster master. For example, if you use Calico network policies, security groups, or some other type of firewall, set up your classic or VPC cluster with the appropriate access.
- If the webhook is managed by an add-on that you installed, uninstall the add-on. Common add-ons that cause webhook issues include the following:
-
Re-create the webhook or reinstall the add-on.
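When the test pod in the first step fails, the webhook name can be pulled out of the error text instead of being copied by hand. The following is a minimal sketch that assumes the error contains the phrase `failed calling webhook "<name>"`, as in the example above. The function name `extract_webhook_name` is hypothetical.

```shell
#!/bin/sh
# Sketch only: reads error text on stdin and prints the quoted webhook name
# that follows 'failed calling webhook'. Prints nothing if no name is found.
extract_webhook_name() {
  sed -n 's/.*failed calling webhook "\([^"]*\)".*/\1/p'
}
```

You can pipe saved error text into the function, for example `printf '%s\n' "$error_text" | extract_webhook_name`, and then use the result in the `jsonpath` query from the second step.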
{: #portieris_enable}
Infrastructure provider:
{: tsSymptoms} Portieris image security enforcement add-on does not install. You see a master status similar to the following:
Image security enforcement update cancelled. CAE008: Cannot enable Portieris image security enforcement because the cluster already has a conflicting image admission controller installed. For more information, see the troubleshooting docs: 'https://ibm.biz/portieris_enable'
{: screen}
{: tsCauses} Your cluster has a conflicting image admission controller already installed, which prevents the image security enforcement cluster add-on from installing. When you have more than one image admission controller in your cluster, pods might not run.
Potential conflicting image admission controller sources include:
- The deprecated container image security enforcement Helm chart.
- A previous manual installation of the open source Portieris{: external} project.
{: tsResolve} Identify and remove the conflicting image admission controller.
-
Check for existing image admission controllers.
-
Check if you have an existing container image security enforcement deployment in your cluster. If no output is returned, you do not have the deployment.
kubectl get deploy cise-ibmcloud-image-enforcement -n ibm-system
{: pre}
Example output:
NAME READY UP-TO-DATE AVAILABLE AGE cise-ibmcloud-image-enforcement 3/3 3 3 129m
{: pre}
-
Check if you have an existing Portieris deployment in your cluster. If no output is returned, you do not have the deployment.
kubectl get deployment --all-namespaces -l app=portieris
{: pre}
Example output:
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE portieris portieris 3/3 3 3 8m8s
{: pre}
-
-
Uninstall the conflicting deployment.
- For container image security enforcement, see the {{site.data.keyword.registrylong_notm}} documentation.
- For Portieris, see the open source documentation{: external}.
-
Confirm that conflicting admission controllers are removed by checking that the cluster no longer has a mutating webhook configuration for an image admission controller.
kubectl get MutatingWebhookConfiguration image-admission-config
{: pre}
Example output:
Error from server (NotFound): mutatingwebhookconfigurations.admissionregistration.k8s.io "image-admission-config" not found
{: pre}
-
Retry installing the add-on by running the
ibmcloud ks cluster image-security enable
command.
{: #cs_cluster_pending}
Infrastructure provider:
{: tsSymptoms} When you deploy your cluster, it remains in a pending state and doesn't start.
{: tsCauses} If you just created the cluster, the worker nodes might still be configuring. If you have already waited a while, you might have an invalid VLAN.
{: tsResolve}
You can try one of the following solutions:
- Check the status of your cluster by running
ibmcloud ks cluster ls
. Then, check to be sure that your worker nodes are deployed by running
ibmcloud ks worker ls --cluster <cluster_name>
.
- Check to see whether your VLAN is valid. To be valid, a VLAN must be associated with infrastructure that can host a worker with local disk storage. You can list your VLANs by running
ibmcloud ks vlan ls --zone <zone>
. If the VLAN does not show in the list, then it is not valid. Choose a different VLAN.
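The VLAN check can be scripted by searching the list output for the VLAN ID. The following is a minimal sketch that assumes the VLAN ID is the first whitespace-separated field on each line of the `ibmcloud ks vlan ls` output. Verify that assumption against your CLI version. The function name `vlan_is_listed` is hypothetical.

```shell
#!/bin/sh
# Sketch only: succeeds if the VLAN ID ($1) is the first field of any line
# read from stdin. Assumes the ID is the first column of the list output.
vlan_is_listed() {
  awk -v id="$1" '$1 == id { found = 1 } END { exit !found }'
}
```

For example, `ibmcloud ks vlan ls --zone <zone> | vlan_is_listed <vlan_ID> || echo "VLAN not valid for this zone"` prints the message only when the VLAN is missing from the list.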
{: #cs_cluster_access}
Infrastructure provider:
{: tsSymptoms}
- You are not able to find a cluster. When you run
ibmcloud ks cluster ls
, the cluster is not listed in the output.
- You are not able to work with a cluster. When you run
ibmcloud ks cluster config
or other cluster-specific commands, the cluster is not found.
{: tsCauses}
In {{site.data.keyword.cloud_notm}}, each resource must be in a resource group. For example, cluster mycluster
might exist in the default
resource group. When the account owner gives you access to resources by assigning you an {{site.data.keyword.cloud_notm}} IAM platform access role, the access can be to a specific resource or to the resource group. When you are given access to a specific resource, you don't have access to the resource group. In this case, you don't need to target a resource group to work with the clusters you have access to. If you target a different resource group than the group that the cluster is in, actions against that cluster can fail. Conversely, when you are given access to a resource as part of your access to a resource group, you must target a resource group to work with a cluster in that group. If you don't target your CLI session to the resource group that the cluster is in, actions against that cluster can fail.
If you cannot find or work with a cluster, you might be experiencing one of the following issues:
- You have access to the cluster and the resource group that the cluster is in, but your CLI session is not targeted to the resource group that the cluster is in.
- You have access to the cluster, but not as part of the resource group that the cluster is in. Your CLI session is targeted to this or another resource group.
- You don't have access to the cluster.
{: tsResolve} To check your user access permissions:
-
List all of your user permissions.
ibmcloud iam user-policies <your_user_name>
{: pre}
-
Check if you have access to the cluster and to the resource group that the cluster is in.
- Look for a policy that has a Resource Group Name value of the cluster's resource group and a Memo value of
Policy applies to the resource group
. If you have this policy, you have access to the resource group. For example, this policy indicates that a user has access to the
test-rg
resource group:
Policy ID:   3ec2c069-fc64-4916-af9e-e6f318e2a16c
Roles:       Viewer
Resources:
             Resource Group ID     50c9b81c983e438b8e42b2e8eca04065
             Resource Group Name   test-rg
             Memo                  Policy applies to the resource group
{: screen}
- Look for a policy that has a Resource Group Name value of the cluster's resource group, a Service Name value of
containers-kubernetes
or no value, and a Memo value of
Policy applies to the resource(s) within the resource group
. If you have this policy, you have access to clusters or to all resources within the resource group. For example, this policy indicates that a user has access to clusters in the
test-rg
resource group:
Policy ID:   e0ad889d-56ba-416c-89ae-a03f3cd8eeea
Roles:       Administrator
Resources:
             Resource Group ID     a8a12accd63b437bbd6d58fb6a462ca7
             Resource Group Name   test-rg
             Service Name          containers-kubernetes
             Service Instance
             Region
             Resource Type
             Resource
             Memo                  Policy applies to the resource(s) within the resource group
{: screen}
- If you have both of these policies, skip to Step 4, first bullet. If you don't have the policy from Step 2a, but you do have the policy from Step 2b, skip to Step 4, second bullet. If you do not have either of these policies, continue to Step 3.
-
Check if you have access to the cluster, but not as part of access to the resource group that the cluster is in.
- Look for a policy that has no values besides the Policy ID and Roles fields. If you have this policy, you have access to the cluster as part of access to the entire account. For example, this policy indicates that a user has access to all resources in the account:
Policy ID:   8898bdfd-d520-49a7-85f8-c0d382c4934e
Roles:       Administrator, Manager
Resources:
             Service Name
             Service Instance
             Region
             Resource Type
             Resource
{: screen}
- Look for a policy that has a Service Name value of
containers-kubernetes
and a Service Instance value of the cluster's ID. You can find a cluster ID by running
ibmcloud ks cluster get --cluster <cluster_name>
. For example, this policy indicates that a user has access to a specific cluster:
Policy ID:   140555ce-93ac-4fb2-b15d-6ad726795d90
Roles:       Administrator
Resources:
             Service Name       containers-kubernetes
             Service Instance   df253b6025d64944ab99ed63bb4567b6
             Region
             Resource Type
             Resource
{: screen}
- If you have either of these policies, skip to the second bullet point of step 4. If you do not have either of these policies, skip to the third bullet point of step 4.
-
Depending on your access policies, choose one of the following options.
-
If you have access to the cluster and to the resource group that the cluster is in:
-
Target the resource group. Note: You can't work with clusters in other resource groups until you untarget this resource group.
ibmcloud target -g <resource_group>
{: pre}
-
Target the cluster.
ibmcloud ks cluster config --cluster <cluster_name_or_ID>
{: pre}
-
-
If you have access to the cluster but not to the resource group that the cluster is in:
- Do not target a resource group. If you already targeted a resource group, untarget it:
ibmcloud target --unset-resource-group
{: pre}
- Target the cluster.
ibmcloud ks cluster config --cluster <cluster_name_or_ID>
{: pre}
-
If you do not have access to the cluster:
- Ask your account owner to assign an {{site.data.keyword.cloud_notm}} IAM platform access role to you for that cluster.
- Do not target a resource group. If you already targeted a resource group, untarget it:
ibmcloud target --unset-resource-group
{: pre}
- Target the cluster.
ibmcloud ks cluster config --cluster <cluster_name_or_ID>
{: pre}
-
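The branching between Steps 2, 3, and 4 can be written down as a small decision helper. The following is a minimal sketch, and the function name `choose_access_option` and its output labels are hypothetical. The steps do not say what to do when you have only the Step 2a policy, so this sketch assumes that case falls through to requesting access.

```shell
#!/bin/sh
# Sketch only: maps the policy checks from Steps 2 and 3 to an option in Step 4.
# Each argument is "yes" or "no":
#   $1  you have the Step 2a policy (applies to the resource group)
#   $2  you have the Step 2b policy (applies to resources within the group)
#   $3  you have either Step 3 policy (account-wide or cluster-specific)
choose_access_option() {
  if [ "$1" = yes ] && [ "$2" = yes ]; then
    echo "target-resource-group"   # Step 4, first option
  elif [ "$2" = yes ] || [ "$3" = yes ]; then
    echo "unset-resource-group"    # Step 4, second option
  else
    # Assumption: covers "no policies" and the unspecified "2a only" case.
    echo "request-access"          # Step 4, third option
  fi
}
```

For example, `choose_access_option no yes no` prints `unset-resource-group`, which matches the instruction to skip to the second bullet of Step 4 when you have only the Step 2b policy.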
{: #cs_firewall}
Infrastructure provider: Classic
{: tsSymptoms} When the worker nodes in your cluster cannot communicate on the private network, you might see various symptoms.
-
Sample error message when you run kubectl exec, attach, logs, proxy, or port-forward:
Error from server: error dialing backend: dial tcp XXX.XXX.XXX:10250: getsockopt: connection timed out
{: screen}
-
Sample error message when
kubectl proxy
succeeds, but the Kubernetes dashboard is not available:
timeout on 172.xxx.xxx.xxx
{: screen}
-
Sample error message when
kubectl proxy
fails or the connection to your service fails:
Connection refused
{: screen}
Connection timed out
{: screen}
Unable to connect to the server: net/http: TLS handshake timeout
{: screen}
{: tsCauses} To access resources in the cluster, your worker nodes must be able to communicate on the private network. You might have set up a Vyatta or another firewall, or customized your existing firewall settings in your IBM Cloud infrastructure account. {{site.data.keyword.containerlong_notm}} requires certain IP addresses and ports to be opened to allow communication from the worker node to the Kubernetes master and vice versa. If your worker nodes are spread across multiple zones, you must allow private network communication by enabling VLAN spanning. Communication between worker nodes might also fail if your worker nodes are stuck in a reloading loop.
{: tsResolve}
-
List the worker nodes in your cluster and verify that your worker nodes are not stuck in a
Reloading
state.
ibmcloud ks worker ls --cluster <cluster_name_or_id>
{: pre}
-
If you have a multizone cluster and your account is not enabled for VRF, verify that you enabled VLAN spanning for your account.
-
If you have a Vyatta or custom firewall settings, make sure that you opened up the required ports to allow the cluster to access infrastructure resources and services.