bug(config): some parameters are incorrect #20

Open
divfor opened this issue Sep 5, 2021 · 25 comments

@divfor

divfor commented Sep 5, 2021

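config/compose/certs/ is supposed to contain two files, but they ended up as directories, so nginx fails to load the certificates on startup.
Also, registry:5000 in nginx.conf does not seem to be replaced with the IP automatically; fixing it by hand works.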

@divfor
Author

divfor commented Sep 5, 2021

NTP_SERVER is reported as unbound; the only workaround is to run NTP_SERVER=xxx ./install.sh.

@divfor divfor changed the title from "domain.crt and domain.key were expanded into directories" to "Some errors found while downloading and using kubeplay" Sep 5, 2021
@muzi502
Member

muzi502 commented Sep 5, 2021

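Known issue. It was fixed earlier, but I forgot to rebuild the installation package. I have just rebuilt it; please download it again and retry: https://github.com/k8sli/kubeplay/releases/tag/v0.1.0-alpha.3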

@divfor
Author

divfor commented Sep 5, 2021

TASK [cluster/bootstrap-os : Configure offline resources repository on apt package manager] ************************
changed: [node1]
changed: [node2]
Sunday 05 September 2021  15:10:28 +0000 (0:00:00.591)       0:00:05.231 ******
Sunday 05 September 2021  15:10:28 +0000 (0:00:00.046)       0:00:05.278 ******

TASK [cluster/bootstrap-os : Update apt repository cache] **********************************************************
fatal: [node2]: FAILED! => changed=false
  msg: 'Failed to update apt cache: E:The method driver /usr/lib/apt/methods/192.168.100.25 could not be found., W:Is the package apt-transport-192.168.100.25 installed?, E:Failed to fetch 192.168.100.25://8080/ubuntu/amd64/bionic/InRelease  , E:Some index files failed to download. They have been ignored, or old ones used instead.'
fatal: [node1]: FAILED! => changed=false
  msg: 'Failed to update apt cache: E:The method driver /usr/lib/apt/methods/192.168.100.25 could not be found., W:Is the package apt-transport-192.168.100.25 installed?, E:Failed to fetch 192.168.100.25://8080/ubuntu/amd64/bionic/InRelease  , E:Some index files failed to download. They have been ignored, or old ones used instead.'

NO MORE HOSTS LEFT *************************************************************************************************

PLAY RECAP *********************************************************************************************************
node1                      : ok=9    changed=3    unreachable=0    failed=1    skipped=17   rescued=0    ignored=0
node2                      : ok=9    changed=3    unreachable=0    failed=1    skipped=23   rescued=0    ignored=0

Sunday 05 September 2021  15:11:00 +0000 (0:00:31.961)       0:00:37.240 ******
===============================================================================
cluster/bootstrap-os : Update apt repository cache --------------------------------------------------------- 31.96s
Gather minimal facts ---------------------------------------------------------------------------------------- 1.09s
download : download | Download files / images --------------------------------------------------------------- 0.86s
cluster/bootstrap-os : Configure offline resources repository on apt package manager ------------------------ 0.59s
Gather necessary facts (hardware) --------------------------------------------------------------------------- 0.54s
Gather necessary facts (network) ---------------------------------------------------------------------------- 0.40s
cluster/bootstrap-os : Backup system default package manager repo file -------------------------------------- 0.32s
cluster/bootstrap-os : Create remote_tmp for it is used by another module ----------------------------------- 0.28s
cluster/bootstrap-os : gather os specific variables --------------------------------------------------------- 0.13s
cluster/bootstrap-os : include_tasks ------------------------------------------------------------------------ 0.06s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts --------------------------------------------- 0.05s
container-engine/nerdctl : nerdctl | Copy nerdctl binary from download dir ---------------------------------- 0.05s
download : download | Get kubeadm binary and list of required images ---------------------------------------- 0.05s
download : prep_download | Set image pull/info command for containerd and crio on localhost ----------------- 0.05s
cluster/bootstrap-os : Configure offline resources repository on yum package manager ------------------------ 0.05s
kubespray-defaults : Configure defaults --------------------------------------------------------------------- 0.05s
download : prep_download | Create staging directory on remote node ------------------------------------------ 0.05s
download : prep_download | Set image pull/info command for containerd and crio ------------------------------ 0.05s
container-engine/crictl : install crictĺ -------------------------------------------------------------------- 0.05s
container-engine/nerdctl : nerdctl | Download nerdctl ------------------------------------------------------- 0.04s
 ######  01-cluster-bootstrap-os installation failed  ######

@muzi502
Member

muzi502 commented Sep 5, 2021

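The URL 192.168.100.25://8080/ubuntu/amd64/bionic/InRelease is malformed, probably because the config file was filled in incorrectly.

From the root of the installation package, run grep 'offline_resources_url' config/kubespray/env.yml and check whether the setting is wrong.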
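
The apt message about a missing method driver /usr/lib/apt/methods/192.168.100.25 shows that apt took the IP address for the URL scheme, which is what happens when a source URL has no http:// prefix. As a rough sketch only, a check like the one below could flag that before the playbook runs. The helper names has_url_scheme and get_offline_url are hypothetical and not part of kubeplay, and the sketch assumes the value is meant to carry an http:// or https:// scheme and sits on a plain "key: value" line.

```shell
# Hypothetical helper: succeed only when the value starts with http:// or https://.
has_url_scheme() {
  case "$1" in
    http://*|https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print the first offline_resources_url value found in a file.
# Assumes a simple "offline_resources_url: value" line, not arbitrary YAML.
get_offline_url() {
  sed -n 's/^[[:space:]]*offline_resources_url:[[:space:]]*//p' "$1" | head -n 1
}
```

One possible use from the installation package root: url="$(get_offline_url config/kubespray/env.yml)"; has_url_scheme "$url" || echo "offline_resources_url has no scheme: $url".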

@muzi502 muzi502 changed the title from "Some errors found while downloading and using kubeplay" to "bug(config): some parameters are incorrect" Sep 5, 2021
@muzi502 muzi502 added the bug Something isn't working label Sep 5, 2021
@muzi502 muzi502 added this to the v0.1.0-alpha.3 milestone Sep 5, 2021
@muzi502 muzi502 self-assigned this Sep 5, 2021
@divfor
Author

divfor commented Sep 5, 2021

root@fredvb:~/kubeplay# grep 'offline_resources_url' config/kubespray/env.yml
offline_resources_url: 192.168.100.25:8080

@divfor
Author

divfor commented Sep 5, 2021

Across repeated runs, it randomly aborts with the error on the last line:

INFO[0000] Creating container nginx
INFO[0000] Creating container registry
✔ The registry container is running.
✔ The nginx container is running.
✖ Error: the http://192.168.100.25:8080/certs/rootCA.crt website is not running, and the status code is 000!
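
Because the failure is intermittent, one possible explanation is that the check runs before nginx is ready to serve requests. The sketch below shows a generic wait-and-retry loop under that assumption; it is not how install.sh is known to work. The helpers retry and url_is_up are hypothetical names, and using curl here is also an assumption (000 is what curl's %{http_code} prints when no HTTP response was received).

```shell
# Hypothetical helper: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS tries have been made.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Hypothetical helper: succeed only when the URL answers with HTTP 200.
url_is_up() {
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")"
  [ "$code" = "200" ]
}
```

For example, retry 10 2 url_is_up "http://192.168.100.25:8080/certs/rootCA.crt" would poll the certificate URL up to ten times, two seconds apart.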

@muzi502
Member

muzi502 commented Sep 5, 2021

Please post your config.yaml.

@divfor
Author

divfor commented Sep 5, 2021

This one shows up every time:

✔ Updated the apt list file
E: Failed to fetch file:/root/kubeplay/resources/nginx/ubuntu/amd64/bionic/Packages  File not found - /root/kubeplay/resources/nginx/ubuntu/amd64/bionic/Packages (2: No such file or directory)
E: Some index files failed to download. They have been ignored, or old ones used instead.

@divfor
Author

divfor commented Sep 5, 2021

root@fredvb:~/kubeplay# cat config.yaml
compose:
  # Compose bootstrap node ip, default is local internal ip
  internal_ip: 192.168.100.25
  # Nginx http server bind port for download files and packages
  nginx_http_port: 8080
  # Registry domain for CRI runtime download images
  registry_domain: kube.registry.local
kubespray:
  # Kubernetes version by default, only support v1.20.6
  kube_version: v1.21.4
  # For deploy HA cluster you must configure a external apiserver access ip
  external_apiserver_access_ip: 192.168.100.5
  # Set network plugin to calico with vxlan mode by default
  kube_network_plugin: calico
  #Container runtime, only support containerd if offline deploy
  container_manager: containerd
  # Now only support host if use containerd as CRI runtime
  etcd_deployment_type: host
  # Settings for etcd event server
  etcd_events_cluster_setup: true
  etcd_events_cluster_enabled: true
# Cluster nodes inventory info
inventory:
  all:
    vars:
      ansible_port: 22
      ansible_user: root
      ansible_ssh_pass: q1w2e3r4
      # ansible_ssh_private_key_file: /kubespray/config/id_rsa
    hosts:
      node1:
        ansible_host: 192.168.100.4
      node2:
        ansible_host: 192.168.100.5
    children:
      kube_control_plane:
        hosts:
          node2:
      kube_node:
        hosts:
          node1:
      etcd:
        hosts:
          node2:
      k8s_cluster:
        children:
          kube_control_plane:
          kube_node:
      gpu:
        hosts: {}
      calico_rr:
        hosts: {}
### Default parameters ###
## This filed not need config, will auto update,
## if no special requirement, do not modify these parameters.
default:
  # NTP server ip address or domain, default is internal_ip
  ntp_server:
    - 192.168.100.25
  # Registry ip address, default is internal_ip
  registry_ip: 192.168.100.25
  # Offline resource url for download files, default is internal_ip:nginx_http_port
  offline_resources_url: 192.168.100.25:8080
  # Use nginx and registry provide all offline resources
  offline_resources_enabled: true
  # Image repo in registry
  image_repository: library
  # Kubespray container image for deploy user cluster or scale
  kubespray_image: "kube.registry.local/library/kubespray:v2.16.0-154-geb42915a"
  # Auto generate self-signed certificate for registry domain
  generate_domain_crt: true
  # For nodes pull image, use 443 as default
  registry_https_port: 443
  # For push image to this registry, use 5000 as default, and only bind at 127.0.0.1
  registry_push_port: 5000
  # Set false to disable download all container images on all nodes
  download_container: false

@muzi502
Member

muzi502 commented Sep 5, 2021

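Unless you have special requirements, leave the parameters under the default field as they are; they do not need to be modified. The documentation here is probably unclear, and I will update it shortly.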

@divfor
Author

divfor commented Sep 5, 2021

I reverted the default section, and now I am back at the following error:

TASK [cluster/bootstrap-os : Configure offline resources repository on apt package manager] ************************
changed: [node1]
changed: [node2]
Sunday 05 September 2021  16:25:26 +0000 (0:00:00.613)       0:00:05.384 ******
Sunday 05 September 2021  16:25:26 +0000 (0:00:00.046)       0:00:05.431 ******

TASK [cluster/bootstrap-os : Update apt repository cache] **********************************************************
fatal: [node2]: FAILED! => changed=false
  msg: 'Failed to update apt cache: unknown reason'
fatal: [node1]: FAILED! => changed=false
  msg: 'Failed to update apt cache: unknown reason'

NO MORE HOSTS LEFT *************************************************************************************************

PLAY RECAP *********************************************************************************************************
node1                      : ok=9    changed=2    unreachable=0    failed=1    skipped=17   rescued=0    ignored=0
node2                      : ok=9    changed=2    unreachable=0    failed=1    skipped=23   rescued=0    ignored=0

Sunday 05 September 2021  16:28:29 +0000 (0:03:03.812)       0:03:09.243 ******
===============================================================================
cluster/bootstrap-os : Update apt repository cache -------------------------------------------------------- 183.81s
Gather minimal facts ---------------------------------------------------------------------------------------- 1.11s
download : download | Download files / images --------------------------------------------------------------- 0.87s
cluster/bootstrap-os : Configure offline resources repository on apt package manager ------------------------ 0.61s
Gather necessary facts (hardware) --------------------------------------------------------------------------- 0.54s
Gather necessary facts (network) ---------------------------------------------------------------------------- 0.41s
cluster/bootstrap-os : Backup system default package manager repo file -------------------------------------- 0.27s
cluster/bootstrap-os : Create remote_tmp for it is used by another module ----------------------------------- 0.26s
download : prep_download | Create local cache for files and images on control node -------------------------- 0.13s
kubespray-defaults : Populates no_proxy to all hosts -------------------------------------------------------- 0.10s
cluster/bootstrap-os : gather os specific variables --------------------------------------------------------- 0.08s
cluster/bootstrap-os : include_tasks ------------------------------------------------------------------------ 0.06s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts --------------------------------------------- 0.06s
download : prep_download | Set image pull/info command for containerd and crio on localhost ----------------- 0.05s
container-engine/crictl : install crictĺ -------------------------------------------------------------------- 0.05s
download : prep_download | Set image pull/info command for docker on localhost ------------------------------ 0.05s
download : prep_download | Check that local user is in group or can become root ----------------------------- 0.05s
download : prep_download | Set a few facts ------------------------------------------------------------------ 0.05s
kubespray-defaults : Configure defaults --------------------------------------------------------------------- 0.05s
download : prep_download | Set image pull/info command for docker ------------------------------------------- 0.05s
✖ ######  01-cluster-bootstrap-os installation failed  ######
root@fredvb:~/kubeplay#

@muzi502
Member

muzi502 commented Sep 5, 2021

You may have downloaded the wrong installation package. The OS is Ubuntu 18.04; is the package you downloaded also the 18.04 one?

@divfor
Author

divfor commented Sep 5, 2021

Both are 18.04. It looks like iptables is not set up correctly: after nerdctl starts the containers, iptables does not allow traffic to ports 8080 and 443.

@divfor
Author

divfor commented Sep 5, 2021

After I manually added iptables -A FORWARD -p tcp --dport 8080 -j ACCEPT, the 'Failed to update apt cache: unknown reason' error went away.
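
Running -A repeatedly appends a duplicate rule each time. A minimal sketch of a repeatable variant is shown below: it uses iptables -C to test for the rule first and only appends when it is missing. The function name allow_forward_port and the IPTABLES override variable are hypothetical, added here for illustration and testing; this is a workaround sketch, not something kubeplay is known to do.

```shell
# Command used to manage rules; overridable so the logic can be exercised without root.
IPTABLES="${IPTABLES:-iptables}"

# Hypothetical helper: allow forwarded TCP traffic to PORT, without adding duplicates.
allow_forward_port() {
  port="$1"
  # -C returns non-zero when the rule does not exist yet.
  if ! "$IPTABLES" -C FORWARD -p tcp --dport "$port" -j ACCEPT 2>/dev/null; then
    "$IPTABLES" -A FORWARD -p tcp --dport "$port" -j ACCEPT
  fi
}
```

Run as root on the bootstrap node, for example: for p in 8080 443; do allow_forward_port "$p"; done. Rules added this way do not survive a reboot unless they are saved separately.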

@muzi502
Member

muzi502 commented Sep 5, 2021

E: Failed to fetch file:/root/kubeplay/resources/nginx/ubuntu/amd64/bionic/Packages File not found - /root/kubeplay/resources/nginx/ubuntu/amd64/bionic/Packages (2: No such file or directory)
E: Some index files failed to download. They have been ignored, or old ones used instead.

Run ls to check whether this directory exists. This error occurs when the downloaded installation package version does not match the OS 🤔.

@divfor
Author

divfor commented Sep 6, 2021

That directory does not exist; there is only one .gz file and two directories:

root@fredvb:~/kubeplay/resources/nginx/ubuntu/amd64/bionic# ls
archive.ubuntu.com  download.docker.com  Packages.gz

My installation package is kubeplay-v0.1.0-alpha.3-ubuntu-bionic-amd64.tar.gz.
All nodes run Ubuntu Server 18.04.5.

@divfor
Author

divfor commented Sep 6, 2021

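About this local repo: I remember one of your documents mentioning that, when building directly FROM nginx:1.9.1, the two COPY --from [bionic|focal] /ubuntu /usr/share/nginx/html lines are wrong. Changing them to COPY --from [bionic|focal] /ubuntu /usr/share/nginx/html/ubuntu worked for me. For the case above, the path seems to be different again. That document also mentions that type=tar can produce a tarball for import, but the entrypoint is lost on import, so the built-in nginx does not start. To fix this, add the --change 'CMD /usr/sbin/nginx -g "daemon off;"' option when importing.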

@divfor
Author

divfor commented Sep 6, 2021

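Found two more failure points:

  1. If a node already has a newer version of containerd installed, the package install reports that no allow-downgrade option was given, gives up, and exits with an error.

  2. Even with exactly the same kernel version, 4.15.0-154-generic #161-Ubuntu, some nodes have no bridge-nf-call-iptables entry, and the run exits with an error: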

fatal: [node1]: FAILED! => changed=false
  msg: |-
    Failed to reload sysctl: net.ipv4.ip_forward = 1
    net.ipv4.ip_local_reserved_ports = 30000-32767
    sysctl: cannot stat /proc/sys/net/bridge/bridge-nf-call-iptables: No such file or directory
    sysctl: cannot stat /proc/sys/net/bridge/bridge-nf-call-ip6tables: No such file or directory
changed: [node2]

@muzi502
Member

muzi502 commented Sep 6, 2021

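I test on virtual machines created from each Linux distribution's cloud-init image. For systems that have been modified or have conflicting packages installed, a successful installation cannot be guaranteed.

bridge-nf-call-iptables is a kernel parameter that must be enabled. I recommend installing on a fresh machine.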

@divfor
Author

divfor commented Sep 6, 2021

modprobe br_netfilter solved this problem.
https://blog.csdn.net/shida_csdn/article/details/99571884
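
modprobe only loads the module until the next reboot. The sketch below, which is an illustration and not part of kubeplay, checks whether the two bridge sysctl entries from the error exist and writes a modules-load.d entry so the module is loaded at boot on systemd-based systems such as Ubuntu 18.04. The helper names bridge_sysctls_present and persist_module are hypothetical, and the optional directory arguments exist only to make the helpers easy to try outside /proc and /etc.

```shell
# Hypothetical helper: succeed when both bridge netfilter sysctl entries exist.
bridge_sysctls_present() {
  sysctl_dir="${1:-/proc/sys/net/bridge}"
  [ -e "$sysctl_dir/bridge-nf-call-iptables" ] &&
    [ -e "$sysctl_dir/bridge-nf-call-ip6tables" ]
}

# Hypothetical helper: record a module name so it is loaded at boot.
# Writes MODULE.conf into the given directory (default /etc/modules-load.d).
persist_module() {
  module="$1"
  conf_dir="${2:-/etc/modules-load.d}"
  printf '%s\n' "$module" > "$conf_dir/$module.conf"
}
```

As root on each node, one possible sequence is: bridge_sysctls_present || modprobe br_netfilter, followed by persist_module br_netfilter.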

@divfor
Author

divfor commented Sep 6, 2021

  1. install.sh remove does not clean up the offline source list:
root@node2:~# ll /etc/apt/sources.list.d/offline-resources.list*
-rw-r--r-- 1 root root 66 Sep  6 15:18 /etc/apt/sources.list.d/offline-resources.list
-rw-r--r-- 1 root root 66 Sep  6 14:51 /etc/apt/sources.list.d/offline-resources.list.bak
root@node2:~# apt update
Err:1 http://192.168.100.25:8080/ubuntu/amd64 bionic/ InRelease
  Could not connect to 192.168.100.25:8080 (192.168.100.25). - connect (111: Connection refused)
Reading package lists... Done
Building dependency tree
Reading state information... Done
All packages are up to date.
W: Failed to fetch http://192.168.100.25:8080/ubuntu/amd64/bionic/InRelease  Could not connect to 192.168.100.25:8080 (192.168.100.25). - connect (111: Connection refused)
W: Some index files failed to download. They have been ignored, or old ones used instead.

@divfor
Author

divfor commented Sep 6, 2021

  1. The error about Packages not being found: the actual directory layout is:
root@fredvb:~/kubeplay/resources/nginx/ubuntu/amd64/bionic# tree -L 2
.
├── archive.ubuntu.com
│   └── ubuntu
├── download.docker.com
│   └── linux
└── Packages.gz

4 directories, 1 file

@divfor
Author

divfor commented Sep 6, 2021

It finally succeeded once, after I removed cgroup v2 and rebooted.

===============================================================================
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS templates --------------------------------------------------------------------------- 4.58s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources -------------------------------------------------------------------------------------- 4.52s
download : download | Download files / images ---------------------------------------------------------------------------------------------------- 0.81s
Gather minimal facts ----------------------------------------------------------------------------------------------------------------------------- 0.65s
Gather necessary facts (hardware) ---------------------------------------------------------------------------------------------------------------- 0.60s
kubernetes-apps/ansible : Kubernetes Apps | Wait for kube-apiserver ------------------------------------------------------------------------------ 0.53s
Gather necessary facts (network) ----------------------------------------------------------------------------------------------------------------- 0.42s
kubernetes-apps/ansible : Kubernetes Apps | Delete kubeadm CoreDNS ------------------------------------------------------------------------------- 0.35s
kubernetes-apps/ansible : Kubernetes Apps | Register coredns deployment annotation `createdby` --------------------------------------------------- 0.31s
kubernetes-apps/ansible : Kubernetes Apps | Delete kubeadm Kube-DNS service ---------------------------------------------------------------------- 0.24s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down nodelocaldns Template ----------------------------------------------------------------------- 0.19s
kubernetes-apps/metallb : Kubernetes Apps | Install and configure MetalLB ------------------------------------------------------------------------ 0.18s
kubernetes-apps/metallb : Kubernetes Apps | Set apparmor_enabled --------------------------------------------------------------------------------- 0.14s
kubespray-defaults : Set no_proxy to all assigned cluster IPs and hostnames ---------------------------------------------------------------------- 0.14s
kubernetes-apps/external_cloud_controller/openstack : External OpenStack Cloud Controller | Generate Manifests ----------------------------------- 0.13s
kubernetes-apps/container_engine_accelerator/nvidia_gpu : Container Engine Acceleration Nvidia GPU | Create manifests for nvidia accelerators ---- 0.11s
kubernetes-apps/csi_driver/cinder : Cinder CSI Driver | Write cacert file ------------------------------------------------------------------------ 0.10s
kubespray-defaults : Gather ansible_default_ipv4 from all hosts ---------------------------------------------------------------------------------- 0.10s
download : prep_download | On localhost, check if passwordless root is possible ------------------------------------------------------------------ 0.10s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down Secondary CoreDNS Template ------------------------------------------------------------------ 0.09s
✔ ######  05-cluster-apps successfully installed  ############  kubernetes cluster successfully installed  ######

@divfor
Author

divfor commented Sep 6, 2021

These are the things I still have to fix manually for now:

#!/bin/bash

# one shot
# iptables -A FORWARD -p tcp -m tcp --dport 443 -j ACCEPT
# iptables -A FORWARD -p tcp -m tcp --dport 8080 -j ACCEPT
# for i in nodes; do ssh $i modprobe br_netfilter; done

for h in x99u d9020 fredvb; do
  ssh $h 'rm -rf /etc/apt/sources.list.d/offline-resources.list*'
done

It is strange that no iptables rules are added to allow ports 8080 and 443 for the two containers started by nerdctl.

@muzi502
Member

muzi502 commented Sep 6, 2021

These are the things I still have to fix manually for now:

#!/bin/bash

# one shot
# iptables -A FORWARD -p tcp -m tcp --dport 443 -j ACCEPT
# iptables -A FORWARD -p tcp -m tcp --dport 8080 -j ACCEPT

for h in x99u d9020 fredvb; do
  ssh $h 'rm -rf /etc/apt/sources.list.d/offline-resources.list*'
done

This will be fixed later; these leftover files will be cleaned up on removal.
