**1. What kops version are you running?** The command `kops version` will display this information.

1.29.2 (git-v1.29.2)

**2. What Kubernetes version are you running?** `kubectl version` will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.

v1.29.6

**3. What cloud provider are you using?**

AWS

**4. What commands did you run? What is the simplest way to reproduce this issue?**

After editing the kops config with the new Kubernetes version, I ran the following commands:

```shell
kops get assets --copy --state $KOPS_REMOTE_STATE
kops update cluster $CLUSTER_NAME --state $KOPS_REMOTE_STATE --allow-kops-downgrade
kops update cluster $CLUSTER_NAME --yes --state $KOPS_REMOTE_STATE
kops rolling-update cluster $CLUSTER_NAME --state $KOPS_REMOTE_STATE
kops rolling-update cluster $CLUSTER_NAME --yes --state $KOPS_REMOTE_STATE --post-drain-delay 75s --drain-timeout 30m
```
**5. What happened after the commands executed?**

The upgrade started smoothly and the master nodes were updated successfully. However, the update got stuck on the instance groups that use a warm pool: warm-pool instances were joining the cluster instead of simply warming up and then powering off.
The following error kept appearing in the kops update logs:

```
I1002 12:02:19.415658      31 instancegroups.go:565] Cluster did not pass validation, will retry in "30s": node "i-04b854ec78e845f96" of role "node" is not ready, system-node-critical pod "aws-node-4chll" is pending, system-node-critical pod "ebs-csi-node-wcz74" is pending, system-node-critical pod "efs-csi-node-7q2j8" is pending, system-node-critical pod "kube-proxy-i-04b854ec78e845f96" is pending, system-node-critical pod "node-local-dns-mdvq7" is pending.
```
Those nodes showed up as `NotReady,SchedulingDisabled` in `kubectl get nodes`. I waited for 10 minutes, but there was no progress, so I manually deleted the problematic nodes. That resolved the issue and the cluster upgrade resumed smoothly.

After completing the upgrade, I ran another test by manually removing warmed-up instances from the AWS console. New warm-pool instances were created and again joined the Kubernetes cluster, where they stayed in the `NotReady,SchedulingDisabled` state until I removed them manually.
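For reference, the manual cleanup described above can be sketched like this (a hypothetical helper, not part of kops; it assumes a configured `kubectl` context and the `NotReady,SchedulingDisabled` status string exactly as printed by `kubectl get nodes`):

```shell
# Print the names of nodes whose STATUS column is exactly
# "NotReady,SchedulingDisabled", reading "kubectl get nodes --no-headers"
# style output (NAME STATUS ROLES AGE VERSION) on stdin.
stuck_nodes() {
  awk '$2 == "NotReady,SchedulingDisabled" { print $1 }'
}

# Against a live cluster, the stuck warm-pool nodes could then be removed with:
#   kubectl get nodes --no-headers | stuck_nodes | xargs -r -n1 kubectl delete node
```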
Cluster autoscaler logs for one of those nodes:

```
1002 13:02:34.149584       1 pre_filtering_processor.go:57] Node i-0cfcda3548f955e05 should not be processed by cluster autoscaler (no node group config)
```
And the relevant log line from the kops-controller:

```
E1002 13:02:10.796429       1 controller.go:329] "msg"="Reconciler error" "error"="error identifying node \"i-0cfcda3548f955e05\": found instance \"i-0cfcda3548f955e05\", but state is \"stopped\"" "Node"={"name":"i-0cfcda3548f955e05"} "controller"="node" "controllerGroup"="" "controllerKind"="Node" "name"="i-0cfcda3548f955e05" "namespace"="" "reconcileID"="b532008b-db8f-4273-90ad-f0bf9d40858c"
```
kube-system pods are also stuck pending on those nodes for some reason:
**6. What did you expect to happen?**

I expected the warm-pool nodes to be started and subsequently shut down without joining the cluster.
**7. Please provide your cluster manifest. Execute `kops get --name my.example.com -o yaml` to display your cluster manifest. You may want to remove your cluster name and other sensitive information.**
```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  generation: 4
  name: develop.company.com
spec:
  api:
    loadBalancer:
      class: Network
      sslCertificate: arn:aws:acm:eu-west-1:1234:certificate/1111
      type: Internal
  assets:
    containerProxy: public.ecr.aws/12344
    fileRepository: https://bucket.s3.eu-west-1.amazonaws.com/
  authentication:
    aws: {}
  authorization:
    rbac: {}
  certManager:
    defaultIssuer: selfsigned
    enabled: true
  channel: stable
  cloudLabels:
    Prometheus: "true"
    aws-region: eu-west-1
  cloudProvider: aws
  configBase: s3://tf-remotestate-eu-west-1-123456/kops/develop.company.com
  dnsZone: '###'
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: eu-west-1a
    - instanceGroup: master-eu-west-1b
      name: eu-west-1b
    - instanceGroup: master-eu-west-1c
      name: eu-west-1c
    manager:
      env:
      - name: ETCD_LISTEN_METRICS_URLS
        value: http://0.0.0.0:8081
      - name: ETCD_METRICS
        value: basic
    memoryRequest: 100Mi
    name: main
    version: 3.4.13
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: eu-west-1a
    - instanceGroup: master-eu-west-1b
      name: eu-west-1b
    - instanceGroup: master-eu-west-1c
      name: eu-west-1c
    memoryRequest: 100Mi
    name: events
    version: 3.4.13
  externalPolicies:
    master:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    node:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    - arn:aws:iam::1234:policy/nodes-extra.develop.company.com
  fileAssets:
  - content: |
      # https://raw.githubusercontent.com/kubernetes/website/master/content/en/examples/audit/audit-policy.yaml
      apiVersion: audit.k8s.io/v1 # This is required.
      kind: Policy
      # Don't generate audit events for all requests in RequestReceived stage.
      omitStages:
      - "RequestReceived"
      rules:
      # Log pod changes at RequestResponse level
      - level: RequestResponse
        resources:
        - group: ""
          # Resource "pods" doesn't match requests to any subresource of pods,
          # which is consistent with the RBAC policy.
          resources: ["pods"]
      # Log "pods/log", "pods/status" at Metadata level
      - level: Metadata
        resources:
        - group: ""
          resources: ["pods/log", "pods/status"]
      # Don't log requests to a configmap called "controller-leader"
      - level: None
        resources:
        - group: ""
          resources: ["configmaps"]
          resourceNames: ["controller-leader"]
      # Don't log watch requests by the "system:kube-proxy" on endpoints or services
      - level: None
        users: ["system:kube-proxy"]
        verbs: ["watch"]
        resources:
        - group: "" # core API group
          resources: ["endpoints", "services"]
      # Don't log authenticated requests to certain non-resource URL paths.
      - level: None
        userGroups: ["system:authenticated"]
        nonResourceURLs:
        - "/api*" # Wildcard matching.
        - "/version"
      # Log the request body of configmap changes in kube-system.
      - level: Request
        resources:
        - group: "" # core API group
          resources: ["configmaps"]
        # This rule only applies to resources in the "kube-system" namespace.
        # The empty string "" can be used to select non-namespaced resources.
        namespaces: ["kube-system"]
      # Log configmap and secret changes in all other namespaces at the Metadata level.
      - level: Metadata
        resources:
        - group: "" # core API group
          resources: ["secrets", "configmaps"]
      # Log all other resources in core and extensions at the Request level.
      - level: Request
        resources:
        - group: "" # core API group
        - group: "extensions" # Version of group should NOT be included.
      # A catch-all rule to log all other requests at the Metadata level.
      - level: Metadata
        # Long-running requests like watches that fall under this rule will not
        # generate an audit event in RequestReceived.
        omitStages:
        - "RequestReceived"
    name: kubernetes-audit.yaml
    path: /srv/kubernetes/assets/audit.yaml
    roles:
    - Master
  iam:
    allowContainerRegistry: true
    legacy: false
    serviceAccountExternalPermissions:
    - aws:
        policyARNs:
        - arn:aws:iam::1234:policy/dub-company-aws-efs-csi-driver
      name: efs-csi-controller-sa
      namespace: kube-system
    - aws:
        policyARNs:
        - arn:aws:iam::1234:policy/dub-company-aws-lb-controller
      name: aws-lb-controller-aws-load-balancer-controller
      namespace: kube-system
    - aws:
        policyARNs:
        - arn:aws:iam::1234:policy/dub-company-cluster-autoscaler
      name: cluster-autoscaler-aws-cluster-autoscaler
      namespace: kube-system
  kubeAPIServer:
    authenticationTokenWebhookConfigFile: /srv/kubernetes/aws-iam-authenticator/kubeconfig.yaml
    runtimeConfig:
      autoscaling/v2beta1: "true"
  kubeControllerManager:
    horizontalPodAutoscalerCpuInitializationPeriod: 20s
    horizontalPodAutoscalerDownscaleDelay: 5m0s
    horizontalPodAutoscalerDownscaleStabilization: 5m0s
    horizontalPodAutoscalerInitialReadinessDelay: 20s
    horizontalPodAutoscalerSyncPeriod: 5s
    horizontalPodAutoscalerTolerance: 100m
    horizontalPodAutoscalerUpscaleDelay: 3m0s
  kubeDNS:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kops.k8s.io/instancegroup
              operator: In
              values:
              - workers-misc
    externalCoreFile: |
      amazonaws.com:53 {
        errors
        log . {
          class denial error
        }
        health :8084
        prometheus :9153
        forward . 169.254.169.253 {
        }
        cache 30
      }
      .:53 {
        errors
        health :8080
        ready :8181
        autopath @kubernetes
        kubernetes cluster.local {
          pods verified
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 169.254.169.253
        cache 300
      }
    nodeLocalDNS:
      cpuRequest: 25m
      enabled: true
      memoryRequest: 5Mi
      provider: CoreDNS
      tolerations:
      - effect: NoSchedule
        operator: Exists
  kubeProxy:
    metricsBindAddress: 0.0.0.0
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    maxPods: 35
    resolvConf: /etc/resolv.conf
  kubernetesApiAccess:
  - 10.0.0.0/8
  kubernetesVersion: 1.29.6
  masterPublicName: api.develop.company.com
  networkCIDR: 10.0.128.0/20
  networkID: vpc-1234
  networking:
    amazonvpc:
      env:
      - name: WARM_IP_TARGET
        value: "5"
      - name: MINIMUM_IP_TARGET
        value: "8"
      - name: DISABLE_METRICS
        value: "true"
  nonMasqueradeCIDR: 100.64.0.0/10
  podIdentityWebhook:
    enabled: true
  rollingUpdate:
    maxSurge: 100%
  serviceAccountIssuerDiscovery:
    discoveryStore: s3://infra-eu-west-1-discovery
    enableAWSOIDCProvider: true
  sshAccess:
  - 10.0.0.0/8
  subnets:
  - cidr: 10.0.128.0/22
    id: subnet-123
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 10.0.132.0/22
    id: subnet-123
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  - cidr: 10.0.136.0/22
    id: subnet-132
    name: eu-west-1c
    type: Private
    zone: eu-west-1c
  - cidr: 10.0.140.0/24
    id: subnet-1123
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  - cidr: 10.0.141.0/24
    id: subnet-132
    name: utility-eu-west-1b
    type: Utility
    zone: eu-west-1b
  - cidr: 10.0.142.0/24
    id: subnet-123
    name: utility-eu-west-1c
    type: Utility
    zone: eu-west-1c
  topology:
    dns:
      type: Public
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-02T10:12:50Z"
  labels:
    kops.k8s.io/cluster: develop.company.com
  name: master-eu-west-1a
spec:
  additionalSecurityGroups:
  - sg-1234
  cloudLabels:
    k8s.io/cluster-autoscaler/develop.company.com: ""
    k8s.io/cluster-autoscaler/disabled: ""
    k8s.io/cluster-autoscaler/master-template/label: ""
  image: ami-09634b5569ee59efb
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: masters
    kops.k8s.io/spotinstance: "false"
    on-demand: "true"
  role: Master
  rootVolumeType: gp3
  subnets:
  - eu-west-1a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-02T10:12:50Z"
  labels:
    kops.k8s.io/cluster: develop.company.com
  name: master-eu-west-1b
spec:
  additionalSecurityGroups:
  - sg-123
  cloudLabels:
    k8s.io/cluster-autoscaler/develop.company.com: ""
    k8s.io/cluster-autoscaler/disabled: ""
    k8s.io/cluster-autoscaler/master-template/label: ""
  image: ami-09634b5569ee59efb
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: masters
    kops.k8s.io/spotinstance: "false"
    on-demand: "true"
  role: Master
  rootVolumeType: gp3
  subnets:
  - eu-west-1b
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-02T10:12:51Z"
  labels:
    kops.k8s.io/cluster: develop.company.com
  name: master-eu-west-1c
spec:
  additionalSecurityGroups:
  - sg-123
  cloudLabels:
    k8s.io/cluster-autoscaler/develop.company.com: ""
    k8s.io/cluster-autoscaler/disabled: ""
    k8s.io/cluster-autoscaler/master-template/label: ""
  image: ami-09634b5569ee59efb
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: masters
    kops.k8s.io/spotinstance: "false"
    on-demand: "true"
  role: Master
  rootVolumeType: gp3
  subnets:
  - eu-west-1c
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-02T10:12:51Z"
  generation: 2
  labels:
    kops.k8s.io/cluster: develop.company.com
  name: workers-app
spec:
  additionalSecurityGroups:
  - sg-132
  - sg-3322
  additionalUserData:
  - content: |
      #!/bin/bash
      echo "Starting additionalUserData"
      echo "This script will execute before nodeup.sh because cloud-init executes scripts in alphabetic order by name"
      export DEBIAN_FRONTEND=noninteractive
      apt-get update
      # Install some tools
      apt install -y nfs-common # Required to make EFS volume mount
      apt install -y containerd # Required for nerdctl to work, container not installed until nodeup runs
      echo $(containerd --version)
      wget https://github.com/containerd/nerdctl/releases/download/v1.7.2/nerdctl-1.7.2-linux-amd64.tar.gz -O /tmp/nerdctl.tar.gz
      tar -C /usr/local/bin/ -xzf /tmp/nerdctl.tar.gz
      echo $(nerdctl version)
      apt install -y awscli
      echo $(aws --version)
      # Get some temporary aws ecr credentials
      DOCKER_PASSWORD=$(aws ecr get-login-password --region eu-west-1)
      DOCKER_USER=AWS
      DOCKER_REGISTRY=1234.dkr.ecr.eu-west-1.amazonaws.com
      PASSWD=$(echo "$DOCKER_USER:$DOCKER_PASSWORD" | tr -d '\n' | base64 -i -w 0)
      CONFIG="\
      {\n  \"auths\": {\n    \"$DOCKER_REGISTRY\": {\n      \"auth\": \"$PASSWD\"\n    }\n  }\n}\n"
      mkdir -p ~/.docker
      printf "$CONFIG" > ~/.docker/config.json
      echo "Pulling images from ECR"
      nerdctl pull --namespace k8s.io 1234.dkr.ecr.eu-west-1.amazonaws.com/fluent-bit:2.2.2
      nerdctl pull --namespace k8s.io 1234.dkr.ecr.eu-west-1.amazonaws.com/nginx-prometheus-exporter:0.9.0
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/dns/k8s-dns-node-cache:1.23.0
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/amazon-k8s-cni-init:v1.18.1
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/amazon-k8s-cni:v1.18.1
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/kube-proxy:v1.28.11
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/ebs-csi-driver/aws-ebs-csi-driver:v1.30.0
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/eks-distro/kubernetes-csi/node-driver-registrar:v2.10.0-eks-1-29-5
      nerdctl pull --namespace k8s.io public.ecr.aws/1234545/kubernetes-csi/livenessprobe:v2.12.0-eks-1-29-5
      echo "Remove and unmask containerd so it can be reinstalled by nodeup and configured how it wants it."
      apt remove -y containerd
      systemctl unmask containerd
      echo "Finishing additionalUserData"
    name: all-images.sh
    type: text/x-shellscript
  cloudLabels:
    k8s.io/cluster-autoscaler/develop.company.com: ""
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/node-template/label: ""
  image: ami-09634b5569ee59efb
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: c5.18xlarge
  maxSize: 10
  minSize: 1
  nodeLabels:
    Environment: company-develop
    Group: company-develop-app
    Name: company-develop-infra-app
    Service: company
    kops.k8s.io/instancegroup: workers-app
    kops.k8s.io/spotinstance: "false"
    on-demand: "true"
  role: Node
  rootVolumeType: gp3
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c
  suspendProcesses:
  - AZRebalance
  warmPool:
    enableLifecycleHook: true
    maxSize: 10
    minSize: 5
```
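Since the warm pool is configured per instance group here, a temporary mitigation (just a sketch, not a confirmed fix) would be to drop the `warmPool` block from the `workers-app` spec via `kops edit ig workers-app` until the NotReady behavior is resolved, then apply with `kops update cluster --yes`:

```yaml
# End of the workers-app InstanceGroup spec with the warm pool disabled by
# removing the warmPool block (hypothetical mitigation, not a confirmed fix):
  suspendProcesses:
  - AZRebalance
  # warmPool:
  #   enableLifecycleHook: true
  #   maxSize: 10
  #   minSize: 5
```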
**8. Please run the commands with most verbose logging by adding the `-v 10` flag. Paste the logs into this report, or in a gist and provide the gist link here.**
**9. Anything else do we need to know?**
Sorry for the direct message; last time you helped solve the issue quickly :).

We rely heavily on kops (40+ clusters) and use warm pools. In the recent 1.29 releases the warm pool handling was changed by the following PRs, which introduced the issue described above.
/kind bug