In versions prior to v1.0, OVN's RAFT mode has a defect that prevents the cluster from recovering after a machine restart; upgrading to a post-1.0 version is recommended. Applying the fix on a pre-1.0 version requires restarting all Pods on the container network, and even after the fix the problem may still recur when a machine restarts.
Check /var/log/openvswitch/ovsdb-server-nb.log and /var/log/openvswitch/ovsdb-server-sb.log on the affected machine. ERR logs similar to the following confirm that the problem is caused by the OVN Raft implementation (a quick check command is sketched after the sample lines below):
- 2020-05-15T02:52:55.703Z|00335|raft|ERR|Dropped 15 log messages in last 13 seconds (most recently, 2 seconds ago) due to excessive rate
- 2020-05-15T02:52:55.703Z|00336|raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 61188 is past last synced index 0: 3a8b vote_request: term=53161 last_log_index=61188 last_log_term=52219
- 2020-05-15T02:53:06.803Z|00337|raft|ERR|Dropped 15 log messages in last 11 seconds (most recently, 2 seconds ago) due to excessive rate
- 2020-05-15T02:53:06.803Z|00338|raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 61188 is past last synced index 0: 3a8b vote_request: term=53169 last_log_index=61188 last_log_term=52219
- 2020-05-15T02:53:18.409Z|00339|raft|ERR|Dropped 13 log messages in last 12 seconds (most recently, 2 seconds ago) due to excessive rate
- 2020-05-15T02:53:18.409Z|00340|raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 61188 is past last synced index 0: 3a8b vote_request: term=53176 last_log_index=61188 last_log_term=52219
- 2020-05-15T02:53:30.920Z|00341|raft|ERR|Dropped 15 log messages in last 12 seconds (most recently, 1 seconds ago) due to excessive rate
- 2020-05-15T02:53:30.920Z|00342|raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 61188 is past last synced index 0: 3a8b vote_request: term=53184 last_log_index=61188 last_log_term=52219
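For convenience, the log check can be scripted. The following is a minimal sketch assuming the default log paths named above; the exact wording of the error may vary between OVN versions, so treat the grep patterns as an approximation.

```bash
# Scan both ovsdb-server logs for the Raft ERR entries shown above.
for f in /var/log/openvswitch/ovsdb-server-nb.log \
         /var/log/openvswitch/ovsdb-server-sb.log; do
  # "raft|ERR" and "deferred vote_request" are literal substrings of the
  # sample errors; | is a plain character in grep's basic regex syntax.
  grep 'raft|ERR' "$f" | grep 'deferred vote_request' && \
    echo ">>> Raft errors found in $f"
done
```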
Fix procedure:
- Record the current replica counts of ovn-central and kube-ovn-controller and the machines they run on (a recording sketch follows this list), then stop both components:
- kubectl scale deployment -n kube-ovn --replicas=0 ovn-central
- kubectl scale deployment -n kube-ovn --replicas=0 kube-ovn-controller
- On every machine that hosts ovn-central, delete the db files under /etc/origin/openvswitch:
- rm -rf /etc/origin/openvswitch/ovnnb_db.db
- rm -rf /etc/origin/openvswitch/ovnsb_db.db
- Back up and delete the metis webhook so that Pod creation is not blocked:
- kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io metis -o yaml > metis.yaml
- kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io metis
- Scale ovn-central back to its previous replica count and wait for all Pods to become Ready (see the wait sketch after this list):
- kubectl scale deployment -n kube-ovn ovn-central --replicas=<XXX>
- Scale kube-ovn-controller back to its previous replica count and wait for all Pods to become Ready:
- kubectl scale deployment -n kube-ovn kube-ovn-controller --replicas=<XXX>
- Delete all Pods in container network mode so they are recreated:
- for ns in $(kubectl get ns --no-headers -o custom-columns=NAME:.metadata.name); do
- for pod in $(kubectl get pod --no-headers -n "$ns" --field-selector spec.restartPolicy=Always -o custom-columns=NAME:.metadata.name,HOST:spec.hostNetwork | awk '{if ($2!="true") print $1}'); do
- kubectl delete pod "$pod" -n "$ns"
- done
- done
- Recreate the metis webhook: remove the creationTimestamp, resourceVersion, selfLink, and uid fields from metis.yaml (a cleanup sketch follows this list), then apply it again:
- kubectl apply -f metis.yaml
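For the first step, recording the replica counts and host machines can be scripted instead of noted by hand. A minimal sketch, assuming the Deployments carry the conventional app=<name> label; the output file name state.txt is only an illustration:

```bash
# Save replica counts and current Pod placement so the deployments
# can be scaled back to the same size later.
for d in ovn-central kube-ovn-controller; do
  echo "$d replicas: $(kubectl get deployment -n kube-ovn "$d" \
    -o jsonpath='{.spec.replicas}')"
  kubectl get pod -n kube-ovn -l app="$d" -o wide
done | tee state.txt
```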
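Likewise, "wait for all Pods to become Ready" in the two scale-up steps can be done with kubectl rollout status, which blocks until a Deployment reports all replicas available:

```bash
# Each command returns once the deployment has rolled out completely.
kubectl rollout status deployment/ovn-central -n kube-ovn
kubectl rollout status deployment/kube-ovn-controller -n kube-ovn
```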
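For the final step, the server-populated metadata fields can be stripped from metis.yaml before re-applying. A rough sketch that deletes lines by field name; inspect the result manually in case a name such as uid also occurs elsewhere in the manifest:

```bash
# Remove read-only metadata so the webhook object can be recreated,
# then re-apply it with kubectl apply -f metis.yaml as in the last step.
sed -i -e '/creationTimestamp:/d' \
       -e '/resourceVersion:/d' \
       -e '/selfLink:/d' \
       -e '/uid:/d' metis.yaml
```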