FAQ - Installation - Rancher 2.4.8 Chinese Documentation

Agent cannot connect to the Rancher server

ERROR: `https://x.x.x.x/ping` is not accessible (Failed to connect to x.x.x.x port 443: Connection timed out)

ERROR: https://x.x.x.x/ping is not accessible (Failed to connect to x.x.x.x port 443: Connection timed out)

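If the error above appears in cattle-cluster-agent or cattle-node-agent, the agent cannot reach the Rancher server. Check network connectivity with the following steps:

- From the agent's host, connect to port 443 of the Rancher server, for example: telnet x.x.x.x 443
- From inside the container, connect to port 443 of the Rancher server, for example: telnet x.x.x.x 443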
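The error message names the `/ping` endpoint, so the same two checks can also be made by requesting that URL directly. The sketch below is only an illustration: `ping_url` and `check_ping` are hypothetical helper names, not part of Rancher, and it assumes `curl` is installed wherever it runs (host or container).

```shell
# Hypothetical helpers; not part of Rancher.

# Build the URL that appears in the agent's error message.
ping_url() {
  printf 'https://%s/ping\n' "$1"
}

# Succeed only if the endpoint answers over HTTPS within the timeouts.
# -k skips certificate verification: this tests reachability, not trust.
check_ping() {
  curl -k -s -o /dev/null --connect-timeout 5 --max-time 10 "$(ping_url "$1")"
}
```

For example, `check_ping x.x.x.x && echo reachable || echo "not reachable"` reports whether port 443 answered. A timeout here points to the same routing or firewall problem that the agent is reporting.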

ERROR: `https://rancher.my.org/ping` is not accessible (Could not resolve host: rancher.my.org)

ERROR: https://rancher.my.org/ping is not accessible (Could not resolve host: rancher.my.org)

If the error above appears in cattle-cluster-agent or cattle-node-agent, the agent cannot resolve the Rancher server's domain name. Check name resolution with the following step:

- From inside the container, reach the Rancher server by its domain name, for example: ping rancher.my.org
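Because `ping` can also fail when ICMP is filtered, it may help to test name resolution on its own. A minimal sketch, assuming `getent` is available in the environment where it runs; `resolves_host` is a hypothetical helper name.

```shell
# Hypothetical helper; assumes getent is installed where it runs.
# Prints the resolved address and succeeds, or fails if the name does not resolve.
resolves_host() {
  getent hosts "$1"
}
```

For example, `resolves_host rancher.my.org || echo "cannot resolve"`. A failure here, combined with a successful connection by IP address, suggests the problem is DNS rather than the network path.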

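This problem is very common on internal networks that have no DNS server. Adding a mapping to the /etc/hosts file does not fix it, because cattle-node-agent inherits the nameserver from the host's /etc/resolv.conf and uses it as its DNS server.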
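To see which nameservers a node would pass on to the agent, the host's resolv.conf can be inspected. This is a small sketch; `list_nameservers` is a hypothetical helper name, and it reads /etc/resolv.conf unless another file is given.

```shell
# Hypothetical helper: print each nameserver listed in a resolv.conf-style file.
# Defaults to the host's /etc/resolv.conf; commented-out entries are skipped.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

Running `list_nameservers` on each node shows whether it points at a server that knows the Rancher domain name.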

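To solve this, set up a DNS server in the environment, configure the correct mapping between the domain name and the IP address, and point the nameserver of every node to that DNS server.

Alternatively, use HostAliases: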

kubectl -n cattle-system patch deployments cattle-cluster-agent --patch '{
    "spec": {
        "template": {
            "spec": {
                "hostAliases": [
                    {
                        "hostnames": [
                            "{{ rancher_server_hostname }}"
                        ],
                        "ip": "{{ rancher_server_ip }}"
                    }
                ]
            }
        }
    }
}'

kubectl -n cattle-system patch daemonsets cattle-node-agent --patch '{
    "spec": {
        "template": {
            "spec": {
                "hostAliases": [
                    {
                        "hostnames": [
                            "{{ rancher_server_hostname }}"
                        ],
                        "ip": "{{ rancher_server_ip }}"
                    }
                ]
            }
        }
    }
}'
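In the two commands above, `{{ rancher_server_hostname }}` and `{{ rancher_server_ip }}` are placeholders that must be replaced with real values before running them. One way to avoid editing the JSON by hand is to generate it. The sketch below is an illustration only: `hostalias_patch` is a hypothetical helper name, and `rancher.my.org` and `10.0.0.10` are example values, not values from your environment.

```shell
# Hypothetical helper: print the hostAliases patch for a given hostname and IP.
hostalias_patch() {
  printf '{"spec":{"template":{"spec":{"hostAliases":[{"hostnames":["%s"],"ip":"%s"}]}}}}\n' "$1" "$2"
}

# Example usage (values are assumptions; substitute your own):
#   patch="$(hostalias_patch rancher.my.org 10.0.0.10)"
#   kubectl -n cattle-system patch deployments cattle-cluster-agent --patch "$patch"
#   kubectl -n cattle-system patch daemonsets cattle-node-agent --patch "$patch"
```

Generating the patch once and applying it to both workloads keeps the Deployment and the DaemonSet consistent.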

etcd fails to start when creating a Kubernetes cluster

When a Kubernetes cluster is created with rke, the cluster state stays at Provisioning and the UI shows the following error:

[etcd] Failed to bring up Etcd Plane: etcd cluster is unhealthy: hosts [10.0.2.15] failed to report healthy. Check etcd container logs on each host for more information

The etcd logs show the following errors:

2020-05-25 08:43:41.515364 I | embed: ready to serve client requests
2020-05-25 08:43:41.523589 I | embed: serving client requests on [::]:2379
2020-05-25 08:43:41.536538 I | embed: rejected connection from "10.0.2.15:39550" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2020-05-25 08:43:46.545930 I | embed: rejected connection from "10.0.2.15:39554" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2020-05-25 08:43:51.554070 I | embed: rejected connection from "10.0.2.15:39556" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2020-05-25 08:44:34.072012 I | embed: rejected connection from "10.0.2.15:39703" (error "EOF", ServerName "")
2020-05-25 08:44:46.520865 I | embed: rejected connection from "10.0.2.15:39560" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")

These errors are caused by a certificate problem that prevents etcd from starting. There are two main possible causes:

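- The host clocks are out of sync.
- The host was previously part of a Kubernetes cluster, and the cluster was reinstalled without fully cleaning up the leftover data.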

Solution:

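- Check the host clocks and synchronize them across all hosts.
- Follow the node cleanup instructions to remove all leftover data from the host, then add it to the cluster again.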
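For the clock check, comparing epoch timestamps from two hosts gives a rough measure of the drift. The sketch below is only an illustration: `clock_skew` is a hypothetical helper name, `node2` is a hypothetical host name, and the usage assumes SSH access between the hosts.

```shell
# Hypothetical helper: absolute difference, in seconds, between two epoch timestamps.
clock_skew() {
  skew=$(( $1 - $2 ))
  if [ "$skew" -lt 0 ]; then
    skew=$(( 0 - skew ))
  fi
  printf '%s\n' "$skew"
}

# Example usage (node2 is an assumed host name):
#   clock_skew "$(date -u +%s)" "$(ssh node2 date -u +%s)"
```

The result includes the SSH round-trip time, so treat a difference of a second or two as noise. A difference of minutes or more points to the clock problem described above.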