- Troubleshooting
- I can’t access some resources when installing Karmada
- Member cluster health checking does not work
- x509: certificate signed by unknown authority issue when using karmadactl init
- karmada-webhook keeps on crashing due to “too many open files”
- ServiceAccounts deployed to the Karmada control-plane don’t generate tokens
- Schedule failed due to “cluster(s) did not have the API resource”
Troubleshooting
I can’t access some resources when installing Karmada
Pull images from Kubernetes Image Registry (registry.k8s.io).
You can run the following commands to change the image registry in Mainland China.
sed -i'' -e "s#registry.k8s.io#registry.aliyuncs.com/google_containers#g" artifacts/deploy/karmada-etcd.yaml
sed -i'' -e "s#registry.k8s.io#registry.aliyuncs.com/google_containers#g" artifacts/deploy/karmada-apiserver.yaml
sed -i'' -e "s#registry.k8s.io#registry.aliyuncs.com/google_containers#g" artifacts/deploy/kube-controller-manager.yaml
To speed up downloading Go packages in Mainland China, run the following command before installation.
export GOPROXY=https://goproxy.cn
Member cluster health checking does not work
If your environment is similar to the following: after registering a member cluster to Karmada in push mode, kubectl get cluster showed the cluster status as ready. Then, after enabling the firewall between the member cluster and Karmada and waiting for a long time, the cluster status was still ready instead of changing to fail.
The cause of the problem is that the firewall did not close the already-established TCP connections between the member cluster and Karmada.
- Log in to the node where the member cluster's apiserver is located.
- Use the tcpkill command to close the TCP connections.
# ens192 is the name of the network card used by the member cluster to communicate with Karmada.
tcpkill -9 -i ens192 src host ${KARMADA_APISERVER_IP} and dst port ${MEMBER_CLUSTER_APISERVER_PORT}
x509: certificate signed by unknown authority issue when using karmadactl init
When using the karmadactl init command to install Karmada, the init command raises an error log as follows:
deploy.go:55] Post "https://192.168.24.211:32443/api/v1/namespaces": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "karmada")
Cause: Karmada has been installed on the cluster before. karmada-etcd uses hostPath mode to mount local storage, so residual data remains after karmada-etcd is uninstalled. You need to delete the files in the default directory /var/lib/karmada-etcd. If the karmadactl --etcd-data parameter was used, delete the corresponding directory instead.
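For the default setup, the cleanup amounts to removing the leftover data directory on the node where karmada-etcd ran (run as root on that node):

```shell
# Remove residual etcd data left by a previous Karmada installation.
# /var/lib/karmada-etcd is the default directory; if you installed with a
# custom `karmadactl init --etcd-data` path, remove that directory instead.
rm -rf /var/lib/karmada-etcd
```

After the directory is gone, rerunning karmadactl init generates a fresh certificate chain without the stale "karmada" CA.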
karmada-webhook keeps on crashing due to “too many open files”
When using hack/local-up-karmada
to install Karmada, karmada-webhook keeps crashing, raising the error log as follows:
I1121 06:33:46.144605 1 webhook.go:83] karmada-webhook version: version.Info{GitVersion:"v1.3.0-425-gf7cac365", GitCommit:"f7cac365d743e5e40493f9ad90352f30123f7f1d", GitTreeState:"dirty", BuildDate:"2022-11-21T06:25:19Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
I1121 06:33:46.167045 1 webhook.go:113] registering webhooks to the webhook server
I1121 06:33:46.169425 1 internal.go:362] "Starting server" path="/metrics" kind="metrics" addr="[::]:8080"
I1121 06:33:46.169569 1 internal.go:362] "Starting server" kind="health probe" addr="[::]:8000"
I1121 06:33:46.169670 1 shared_informer.go:285] caches populated
I1121 06:33:46.169828 1 internal.go:567] "Stopping and waiting for non leader election runnables"
I1121 06:33:46.169848 1 internal.go:571] "Stopping and waiting for leader election runnables"
I1121 06:33:46.169856 1 internal.go:577] "Stopping and waiting for caches"
I1121 06:33:46.169883 1 internal.go:581] "Stopping and waiting for webhooks"
I1121 06:33:46.169899 1 internal.go:585] "Wait completed, proceeding to shutdown the manager"
E1121 06:33:46.169909 1 webhook.go:132] webhook server exits unexpectedly: too many open files
E1121 06:33:46.169926 1 run.go:74] "command failed" err="too many open files"
It’s a resource exhaustion issue caused by inotify limits that are too low. You can fix it with:
sysctl -w fs.inotify.max_user_watches=100000
sysctl -w fs.inotify.max_user_instances=100000
Related Issue: https://github.com/kubernetes-sigs/kind/issues/2928
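Note that sysctl -w changes are lost on reboot. To make the limits persistent, you can also drop them into a sysctl configuration file (the filename below is just an example) and reload with sysctl --system:

```
# /etc/sysctl.d/99-karmada-inotify.conf (example filename)
fs.inotify.max_user_watches=100000
fs.inotify.max_user_instances=100000
```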
ServiceAccounts deployed to the Karmada control-plane don’t generate tokens
To improve token security and scalability, the Kubernetes community proposed KEP-1205, which introduces a mechanism for issuing ServiceAccount tokens instead of directly mounting secrets generated for a ServiceAccount into Pods. For details, see ServiceAccount automation. This feature is called BoundServiceAccountTokenVolume and reached GA in Kubernetes v1.22.
With the GA of the BoundServiceAccountTokenVolume feature, the Kubernetes community considers it unnecessary and insecure to automatically generate tokens for ServiceAccounts. Therefore, KEP-2799 was proposed. One purpose of this KEP is to stop automatically generating token secrets for ServiceAccounts, and the other is to clean up token secrets generated for unused ServiceAccounts.
For the first purpose, Kubernetes provides the LegacyServiceAccountTokenNoAutoGeneration feature gate, which entered the Beta phase in Kubernetes v1.24. This is why the Karmada control-plane does not generate tokens: Karmada uses karmada-apiserver v1.24. If you still want to generate a token secret for a ServiceAccount in the previous way, you can refer to this section.
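If you do need a long-lived token, Kubernetes still issues one when you create a Secret of type kubernetes.io/service-account-token annotated with the ServiceAccount's name. A minimal sketch (my-sa and my-sa-token are placeholder names; the ServiceAccount must already exist in the same namespace):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-sa-token          # placeholder name
  namespace: default
  annotations:
    kubernetes.io/service-account.name: my-sa   # existing ServiceAccount
type: kubernetes.io/service-account-token
```

After this Secret is applied to karmada-apiserver, the token controller should populate its data.token field.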
Schedule failed due to “cluster(s) did not have the API resource”
The Karmada detector focuses only on resources in the karmada-apiserver preferred version.
For example, assuming karmada-apiserver is at v1.25, its HPA resource is served in both the autoscaling/v1 and autoscaling/v2 versions. However, since the preferred version of HPA is autoscaling/v2, the detector only lists/watches the autoscaling/v2 version. If a user creates an HPA with autoscaling/v1, Kubernetes generates create events for both versions, but only the autoscaling/v2 create event is watched by the detector.
Given this background, you need to pay attention to the following two points:
- When writing a propagation policy, its resourceSelector field only supports resources in the karmada-apiserver preferred version.
- The member cluster apiserver should support the resource version that karmada-apiserver prefers.
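You can check which version karmada-apiserver prefers for a group through API group discovery. The snippet below parses a sample discovery response; against a real control plane you would feed it the output of kubectl get --raw /apis/autoscaling instead of the canned JSON (whose values here are illustrative):

```shell
# An API group discovery document reports the server's preferred version in
# its .preferredVersion field. The JSON below mimics an apiserver's answer
# for the autoscaling group; against karmada-apiserver, replace the echo
# with: kubectl get --raw /apis/autoscaling
discovery='{"kind":"APIGroup","name":"autoscaling","preferredVersion":{"groupVersion":"autoscaling/v2","version":"v2"}}'
echo "$discovery" | python3 -c 'import json,sys; print(json.load(sys.stdin)["preferredVersion"]["groupVersion"])'
# prints: autoscaling/v2
```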
More specifically, still taking HPA as an example, you are advised to use the autoscaling/v2 HPA in both the resource template and the propagation policy, just like:
Propagate autoscaling/v2 by selecting autoscaling/v2
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: test-hpa
namespace: default
spec:
behavior:
scaleUp:
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 4
periodSeconds: 15
selectPolicy: Max
stabilizationWindowSeconds: 0
maxReplicas: 10
minReplicas: 1
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: d1
---
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: test-hpa-pp
spec:
placement:
clusterAffinity:
clusterNames:
- member1
resourceSelectors:
- apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
name: test-hpa
namespace: default
However, if you insist on propagating an autoscaling/v1 HPA template, you can still succeed if you define the resourceSelector in the propagation policy with apiVersion: autoscaling/v2, just like:
Propagate autoscaling/v1 by selecting autoscaling/v2
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: test-hpa
spec:
maxReplicas: 5
minReplicas: 1
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: nginx
targetCPUUtilizationPercentage: 10
---
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
name: test-hpa-pp
spec:
placement:
clusterAffinity:
clusterNames:
- member1
resourceSelectors:
- apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
name: test-hpa
namespace: default
Then Karmada finally propagates the autoscaling/v2 HPA to member clusters. If your member clusters don’t support the autoscaling/v2 HPA, you will get a propagation failure event like “cluster(s) did not have the API resource”.