Error: unhealthy cluster - https://localhost:2379 #3

Closed
hernad opened this issue Jul 1, 2020 · 24 comments

Comments

@hernad

hernad commented Jul 1, 2020

Hi, my bootstrap node reports this error:

[root@okd4-snc-bootstrap ~]# journalctl -b -f -u bootkube.service

...
Jul 01 16:27:29 okd4-snc-bootstrap.snc.test bootkube.sh[674]: {"level":"warn","ts":"2020-07-01T16:27:29.381Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-9f885c4d-609f-444a-beab-393ba59f3c08/localhost:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Jul 01 16:27:29 okd4-snc-bootstrap.snc.test bootkube.sh[674]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Jul 01 16:27:29 okd4-snc-bootstrap.snc.test bootkube.sh[674]: Error: unhealthy cluster
@hernad
Author

hernad commented Jul 1, 2020

DNS setup:

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> axfr @192.168.168.10 snc.test
; (1 server found)
;; global options: +cmd
snc.test.               10800   IN      SOA     dc1.sa.out.ba. root.sa.out.ba. 18 28800 7200 604800 3600
snc.test.               900     IN      NS      ns.snc.test.
api.okd4-snc.snc.test.  900     IN      A       192.168.168.164
api.okd4-snc.snc.test.  900     IN      A       192.168.168.165
_etcd-server-ssl._tcp.okd4-snc.snc.test. 900 IN SRV 0 10 2380 etcd-0.okd4-snc.snc.test.
okd4-snc-host.snc.test. 900     IN      A       192.168.168.160
api-int.okd4-snc.snc.test. 900  IN      A       192.168.168.164
api-int.okd4-snc.snc.test. 900  IN      A       192.168.168.165
*.apps.okd4-snc.snc.test. 900   IN      CNAME   okd4-snc-master.snc.test.
okd4-snc-bootstrap.snc.test. 900 IN     A       192.168.168.164
ns.snc.test.            900     IN      A       192.168.168.10
etcd-0.okd4-snc.snc.test. 900   IN      A       192.168.168.165
okd4-snc-master.snc.test. 900   IN      A       192.168.168.165
snc.test.               10800   IN      SOA     dc1.sa.out.ba. root.sa.out.ba. 18 28800 7200 604800 3600

@hernad
Author

hernad commented Jul 1, 2020

[root@hp-144 okd4-snc]# cat ~/bin/setSncEnv.sh

export SNC_DOMAIN=snc.test
export SNC_HOST=192.168.168.160
#export SNC_NAMESERVER=${SNC_HOST}
export SNC_NAMESERVER=192.168.168.10
export SNC_NETMASK=255.255.255.0
export SNC_GATEWAY=192.168.168.254
export INSTALL_HOST_IP=${SNC_HOST}
export INSTALL_ROOT=/usr/share/nginx/html/install
export INSTALL_URL=http://${SNC_HOST}/install
export OKD4_SNC_PATH=/root/okd4-snc
export OKD_REGISTRY=quay.io/openshift/okd
export OKD_RELEASE=4.4.0-0.okd-2020-05-23-055148-beta5

@hernad
Author

hernad commented Jul 1, 2020

[root@hp-144 okd4-snc]# cat install-config-snc.yaml

apiVersion: v1
baseDomain: snc.test
metadata:
  name: okd4-snc
networking:
  networkType: OpenShiftSDN
  clusterNetwork:
  - cidr: 10.100.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
compute:
- name: worker
  replicas: 0
controlPlane:
  name: master
  replicas: 1
platform:
  none: {}
pullSecret: '{"auths":{"fake":{"auth": "bar"}}}'
sshKey: ssh-rsa AAAABetc etc etc root@hp-144

@hernad
Author

hernad commented Jul 1, 2020

FCOS defined in ~/bin/DeployOkdSnc.sh

CPU="4"
MEMORY="16384"
DISK="200"
FCOS_VER=31.20200505.2.0
FCOS_STREAM=testing

@hernad changed the title from "https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded" to "Error: unhealthy cluster - https://localhost:2379" on Jul 1, 2020
@hernad
Author

hernad commented Jul 1, 2020

nginx on host is ok:

curl http://okd4-snc-host.snc.test/install/fcos/ignition/bootstrap.ign

100 279k 100 279k 0 0 18.2M 0 --:--:-- --:--:-- --:--:-- 19.4M

@cgruver
Owner

cgruver commented Jul 1, 2020

The FCOS version is old because recent versions of FCOS broke my install. I'm working on an alternative that works with the live ISO. The replacement of the isolinux.cfg that I'm doing in the deployment script no longer works with more recent versions of FCOS... I don't know why yet.

The error that you reported above is normal while the bootstrap is starting up. It can take a few minutes before it's up and listening on port 2379.

How long did you wait?
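
Because the bootstrap services can take a few minutes to come up, polling is easier than watching the journal by hand. Below is a minimal sketch of a retry helper. The function name `retry` and the attempt and delay values are illustrative assumptions; nothing like it ships with the deployment scripts discussed here.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it succeeds or MAX_ATTEMPTS is reached.
# Hypothetical helper for illustration only.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Stop as soon as the command succeeds.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is spent.
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

On the bootstrap node, something like `retry 60 10 bash -c 'exec 3<>/dev/tcp/localhost/2379'` would probe the etcd port every 10 seconds for up to about 10 minutes. This assumes a bash build with `/dev/tcp` support, and it only proves that something accepts TCP connections, not that etcd is healthy.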

@hernad
Author

hernad commented Jul 1, 2020

Just to add, I have also tried the installation with newer FCOS images and OKD 4.5.x, with no success. Specifically, with these FCOS versions:

#FCOS_VER=32.20200615.3.0
#FCOS_STREAM=stable

#FCOS_VER=32.20200629.2.0
#FCOS_STREAM=testing

#FCOS_VER=31.20200517.3.0
#FCOS_STREAM=stable

and this OKD

#export OKD_RELEASE=4.5.0-0.okd-2020-06-29-110348-beta6

@cgruver
Owner

cgruver commented Jul 1, 2020

It's possible that during the install of the bootstrap node it is upgrading to FCOS 32 which we are having some issues with.

See: okd-project/okd#229 and okd-project/okd#238

@cgruver
Owner

cgruver commented Jul 1, 2020

I've got a long weekend coming up with the holiday here in the States. I hope to get some work done on this. Recent versions of FCOS seem to have broken it.

@hernad
Author

hernad commented Jul 1, 2020

Thanks for your feedback @cgruver.

> How long did you wait?

As long as I am writing this :). About 30 minutes. Still the same...

@hernad
Author

hernad commented Jul 1, 2020

This is master side virsh console:

[root@hp-144 okd4-snc]# virsh console okd4-snc-master

Connected to domain okd4-snc-master
[ ***  ] A start job is running for Ignition (fetch) (50min 37s / no limit)
[ 3040.240220] ignition[527]: GET https://api-int.okd4-snc.snc.test:22623/config/master: attempt #606
[ 3040.247110] ignition[527]: GET error: Get https://api-int.okd4-snc.snc.test:22623/config/master: dial tcp 192.168.168[     *] A start job is running for Ignition (fetch) (50min 42s / no limit)
[ 3045.248142] ignition[527]: GET https://api-int.okd4-snc.snc.test:22623/config/master: attempt #607
[ 3045.256863] ignition[527]: GET error: Get https://api-int.okd4-snc.snc.test:22623/config/master: dial tcp 192.168.168[ ***  ] A start job is running for Ignition (fetch) (50min 47s / no limit)
[ 3050.257876] ignition[527]: GET https://api-int.okd4-snc.snc.test:22623/config/master: attempt #608
[ 3050.266642] ignition[527]: GET error: Get https://api-int.okd4-snc.snc.test:22623/config/master: dial tcp 192.168.168[     *] A start job is running for Ignition (fetch) (50min 49s / no limit)
...

@hernad
Author

hernad commented Jul 1, 2020

The master is obviously stuck in the Ignition phase.

@hernad
Author

hernad commented Jul 1, 2020

> It's possible that during the install of the bootstrap node it is upgrading to FCOS 32 which we are having some issues with.

You are right. After logging in to the bootstrap node over SSH, it reports Fedora CoreOS 31.20200521.20.0, so there was an upgrade from 31.20200505.2.0.

@cgruver
Owner

cgruver commented Jul 1, 2020

Yes, what you are seeing is similar to the problem that I am having now building a full cluster. The master nodes cannot pull the ignition from the bootstrap node. I think this is related to the issues I listed above.

try:

curl -v --insecure https://api-int.okd4-snc.snc.test:22623/config/master

See if you get a 500 error. That's what I am seeing. The bootstrap node is failing to serve up the ignition files.

@cgruver
Owner

cgruver commented Jul 1, 2020

Track progress here: okd-project/okd#239

@cgruver
Owner

cgruver commented Jul 1, 2020

Digging deeper, I'm not sure you are seeing the issue that we have with FCOS 32 and OKD 4.5...

What is the output of: curl -v --insecure https://api-int.okd4-snc.snc.test:22623/config/master

Run it several times to make sure that DNS round-robin is working. It should hit your bootstrap node.
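
To see which addresses round-robin can hand out, it helps to list every A record for the name first. Here is a sketch, assuming the input is zone-transfer text in the `dig axfr` layout posted earlier in this thread. The function name `a_records` is made up for illustration.

```shell
#!/bin/sh
# a_records FQDN
# Reads zone-transfer text (name TTL IN A address) on stdin and prints
# every IPv4 address recorded for FQDN. FQDN must carry its trailing
# dot, as it does in dig output. Hypothetical helper for illustration.
a_records() {
  awk -v name="$1" '$1 == name && $3 == "IN" && $4 == "A" { print $5 }'
}
```

For example, `dig axfr @192.168.168.10 snc.test | a_records api-int.okd4-snc.snc.test.` lists the candidate addresses. Repeated `curl` runs should then reach each of them in turn, including the bootstrap node.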

@hernad
Author

hernad commented Jul 1, 2020

curl -v --insecure https://api-int.okd4-snc.snc.test:22623/config/master

* About to connect() to api-int.okd4-snc.snc.test port 22623 (#0)
*   Trying 192.168.168.165...
* Connection refused
*   Trying 192.168.168.164...
* Connection refused
* Failed connect to api-int.okd4-snc.snc.test:22623; Connection refused
* Closing connection 0
curl: (7) Failed connect to api-int.okd4-snc.snc.test:22623; Connection refused

There is no service on port 22623?!

@hernad
Author

hernad commented Jul 1, 2020

active services on bootstrap:

[root@okd4-snc-bootstrap ~]# netstat -tlnp

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:34083         0.0.0.0:*               LISTEN      805/crio            
tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN      894/kubelet         
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd           
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      723/sshd            
tcp        0      0 0.0.0.0:49241           0.0.0.0:*               LISTEN      798/rpc.statd       
tcp6       0      0 :::6080                 :::*                    LISTEN      3705/kube-etcd-sign 
tcp6       0      0 :::10250                :::*                    LISTEN      894/kubelet         
tcp6       0      0 :::6443                 :::*                    LISTEN      3705/kube-etcd-sign 
tcp6       0      0 :::10255                :::*                    LISTEN      894/kubelet         
tcp6       0      0 :::111                  :::*                    LISTEN      1/systemd           
tcp6       0      0 :::40593                :::*                    LISTEN      798/rpc.statd       
tcp6       0      0 :::22                   :::*                    LISTEN      723/sshd
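
The listing shows nothing bound to 2379 (the etcd endpoint from the bootkube error) or to 22623 (the port the master's Ignition fetch targets). Here is a sketch for checking this mechanically, assuming the `netstat -tlnp` column layout shown above. `port_listening` is a hypothetical helper name.

```shell
#!/bin/sh
# port_listening PORT
# Reads `netstat -tlnp` output on stdin and exits 0 if any socket in
# LISTEN state is bound to PORT, otherwise exits 1.
# Hypothetical helper for illustration only.
port_listening() {
  awk -v port="$1" '
    $6 == "LISTEN" {
      # The port is the last colon-separated field of the local address,
      # which covers both 0.0.0.0:22 and :::6443 forms.
      n = split($4, parts, ":")
      if (parts[n] == port) found = 1
    }
    END { exit found ? 0 : 1 }
  '
}
```

For example, `netstat -tlnp | port_listening 22623 || echo "22623 is not listening"`.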

@hernad
Author

hernad commented Jul 1, 2020

> Digging deeper, I'm not sure you are seeing the issue that we have with FCOS 32 and OKD 4.5...

I have noticed that.

@cgruver
Owner

cgruver commented Jul 2, 2020

Try tearing it down, and running everything again.

DestroyBootstrap.sh 
UnDeploySncNode.sh 

Double check your DNS config against the files that I provided. This entry may be incorrect:

_etcd-server-ssl._tcp.okd4-snc.snc.test. 900 IN SRV 0 10 2380 etcd-0.okd4-snc.snc.test.

I believe that there should not be a . after _etcd-server-ssl._tcp.okd4-snc.snc.test

Also note that after the bootstrap process completes, you will have to remove the A records for api and api-int that refer to the bootstrap node IP. That is why I include the remove-after-bootstrap in my example zone file.
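
The post-bootstrap cleanup can be scripted. The sketch below rests on a loud assumption: that the temporary records carry a `remove-after-bootstrap` marker somewhere on the same line, for instance as a trailing comment. The exact marker layout in the example zone file is not shown in this thread, so adjust the pattern to match it. `strip_bootstrap_records` is a hypothetical name.

```shell
#!/bin/sh
# strip_bootstrap_records
# Reads a zone file on stdin and writes it to stdout without any line
# containing the marker "remove-after-bootstrap".
# ASSUMPTION: the marker sits on the same line as each temporary record.
strip_bootstrap_records() {
  # grep -v exits 1 when it prints nothing; treat that as success too.
  grep -v 'remove-after-bootstrap' || [ $? -eq 1 ]
}
```

Write the result to a new file, review it, and only then swap it in and reload the name server.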

@cgruver
Owner

cgruver commented Jul 2, 2020

I just pushed an update that works with FCOS 32 and OKD 4 Beta 6.

I also tested it with Beta 5.

@hernad
Author

hernad commented Jul 3, 2020

@cgruver, great work!

Yesterday I finally got a working cluster using this configuration:
https://github.com/hernad/okd4-snc-qemu

It is mostly based on your work. The difference is that the Ignition file is loaded via a QEMU firmware option.
The advantage of this configuration is that the nginx HTTP server is not needed. I had success with the latest FCOS 32 testing image and OKD 4.5.
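
For readers unfamiliar with the firmware route: Fedora CoreOS documents passing an Ignition config to a QEMU guest through the `fw_cfg` device under the key `opt/com.coreos/config`. The sketch below only builds that argument. Whether the linked repository does it exactly this way is an assumption, and both `ignition_fw_cfg_arg` and the file path in the usage note are hypothetical.

```shell
#!/bin/sh
# ignition_fw_cfg_arg IGNITION_FILE
# Prints the QEMU argument that exposes IGNITION_FILE to the guest via
# fw_cfg. Hypothetical helper for illustration only.
ignition_fw_cfg_arg() {
  case $1 in
    # QEMU treats commas as option separators, so refuse such paths
    # rather than emit a broken argument.
    *,*) echo "path must not contain a comma: $1" >&2; return 1 ;;
  esac
  printf '%s%s\n' '-fw_cfg name=opt/com.coreos/config,file=' "$1"
}
```

With libvirt, the result could be passed as `virt-install ... --qemu-commandline="$(ignition_fw_cfg_arg /var/lib/libvirt/images/master.ign)"`, which removes the need to serve the file over HTTP.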

> I just pushed an update that works with FCOS 32 and OKD 4 Beta 6

I will try this once I finish investigating my first working cluster :)

Again, thanks for your work and support.

@hernad
Author

hernad commented Jul 3, 2020

> I believe that there should not be a . after _etcd-server-ssl._tcp.okd4-snc.snc.test

For your information, the dot at the end is OK. It is standard in DNS zone configuration and means "this is a fully qualified name - STOP".

I have seen similar examples in the OKD documentation where the FQDN ends with a dot.
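
The trailing-dot rule is easy to demonstrate. In zone-file syntax a name ending in a dot is absolute, while a name without one has the zone origin appended. Here is a sketch of that expansion, with `qualify_name` as a hypothetical helper:

```shell
#!/bin/sh
# qualify_name NAME ORIGIN
# Prints the absolute name a zone file would produce for NAME when the
# zone origin is ORIGIN. Hypothetical helper for illustration only.
qualify_name() {
  name=$1
  origin=${2%.}   # accept the origin with or without its own trailing dot
  case $name in
    *.) printf '%s\n' "$name" ;;                # already absolute
    *)  printf '%s.%s.\n' "$name" "$origin" ;;  # relative: append origin
  esac
}
```

With origin `snc.test`, the name `etcd-0.okd4-snc.snc.test.` is left alone, whereas the same name written without the dot would expand to `etcd-0.okd4-snc.snc.test.snc.test.`. That is why the dot in the SRV record is correct.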

@cgruver
Owner

cgruver commented Jul 3, 2020

Excellent!

I will take a look at your config. Eliminating the Nginx server will simplify the deployment for folks.

@cgruver cgruver closed this as completed Jul 3, 2020