# Large deployments of K8s
For large scale deployments, consider the following configuration changes:
* Tune Ansible settings for the `forks` and `timeout` vars to fit large
  numbers of nodes being deployed (see the example command at the end of
  this page).

* Override containers' `foo_image_repo` vars to point to an intranet
  registry (a sketch follows this list).
* Override the `download_run_once: true` and/or `download_localhost: true`
  vars. See download modes for details.

* Adjust the `retry_stagger` global var as appropriate. It should provide a
  sane load on the delegate (the first K8s master node) when retrying failed
  push or download operations (see the group_vars sketch after this list).
* Tune parameters for DNS-related applications. Those are `dns_replicas`,
  `dns_cpu_limit`, `dns_cpu_requests`, `dns_memory_limit` and
  `dns_memory_requests` (example below). Please note that limits must always
  be greater than or equal to requests.
* Tune CPU/memory limits and requests. Those are located in roles' defaults
  and named like `foo_memory_limit`, `foo_memory_requests` and
  `foo_cpu_limit`, `foo_cpu_requests` (example below). Note that 'Mi' memory
  units for K8s will be submitted as 'M' if applied for `docker run`, and
  K8s CPU units will end up with the 'm' skipped for docker as well. This is
  required as docker does not understand K8s units well.
* Tune `kubelet_status_update_frequency` to increase the reliability of the
  kubelet, and `kube_controller_node_monitor_grace_period`,
  `kube_controller_node_monitor_period` and
  `kube_controller_pod_eviction_timeout` for better Kubernetes reliability
  (example below). Check out Kubernetes Reliability.
* Tune network prefix sizes. Those are `kube_network_node_prefix`,
  `kube_service_addresses` and `kube_pods_subnet` (sizing example below).

* Add calico-rr nodes if you are deploying with Calico or Canal. Nodes
  recover from host/network interruption much quicker with calico-rr. Note
  that the calico-rr role must be on a host without the kube-master or
  kube-node role (but the etcd role is okay); see the inventory sketch after
  this list.
* Check out the Inventory section of the Getting started guide for tips on
  creating a large scale Ansible inventory.
* Set `etcd_events_cluster_setup: true` to store events in a separate,
  dedicated etcd instance (example below).
For example, when deploying 200 nodes, you may want to run Ansible with
`--forks=50`, `--timeout=600` and define `retry_stagger: 60`.
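Put together, the invocation might look like the sketch below; the inventory
path follows the usual Kubespray sample layout and is an assumption, so
adjust it to your own tree:

```ShellSession
$ ansible-playbook -i inventory/mycluster/hosts.yaml -b \
    --forks=50 --timeout=600 \
    -e retry_stagger=60 \
    cluster.yml
```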