Troubleshooting Deployments on Amazon EKS
Help diagnose and fix issues you may encounter in your Kubeflow deployment
404 Command Not Found when downloading scripts
➜ curl https://raw.githubusercontent.com/kubeflow/kubeflow/$\{KUBEFLOW_TAG\}/scripts/download.sh | bash
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 15 100 15 0 0 84 0 --:--:-- --:--:-- --:--:-- 84
bash: line 1: 404:: command not found
Remove the escape backslashes around {KUBEFLOW_TAG} so that the shell expands the variable. The URL should contain ${KUBEFLOW_TAG}, not $\{KUBEFLOW_TAG\}.
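As a minimal sketch of the corrected command, assuming the script is downloaded with curl (which the progress output above suggests), the URL can be built with ordinary shell expansion. The tag value v0.5.0 and the helper name download_url are illustrative assumptions, not part of the original instructions.

```shell
# Example tag only (an assumption); use the release you are actually installing.
KUBEFLOW_TAG="${KUBEFLOW_TAG:-v0.5.0}"

# download_url is a hypothetical helper: it prints the script URL for a given tag.
download_url() {
  printf 'https://raw.githubusercontent.com/kubeflow/kubeflow/%s/scripts/download.sh\n' "$1"
}

# The braces are not escaped, so the shell substitutes the tag into the URL.
url="$(download_url "${KUBEFLOW_TAG}")"
echo "${url}"

# To download and run the script (not executed here):
#   curl -fsSL "${url}" | bash
# -f makes curl exit with an error on an HTTP 404 instead of printing the
# error body, so nothing is piped into bash when the URL is wrong.
```

Printing the URL before piping it to bash is a cheap way to confirm that the tag was actually substituted.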
Environment File Not Found
+ source env.sh
/tmp/kubeflow-aws/scripts/kfctl.sh: line 485: env.sh: No such file or directory
This error occurs when you run generate or apply for platform or k8s from a directory that does not contain env.sh. Run the command from within the ${KFAPP} folder.
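A small guard like the following can catch this before kfctl.sh is invoked. It is only a sketch: the helper name has_env_file is hypothetical, and it assumes that the presence of env.sh is enough to identify the application folder.

```shell
# has_env_file is a hypothetical helper: it succeeds when the given directory
# (default: the current directory) contains an env.sh file.
has_env_file() {
  [ -f "${1:-.}/env.sh" ]
}

# Check the directory named by KFAPP, or the current directory if KFAPP is unset.
if has_env_file "${KFAPP:-.}"; then
  echo "env.sh found in ${KFAPP:-.}"
else
  echo "env.sh not found in ${KFAPP:-.}; change into your application folder first" >&2
fi
```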
kfapp already exists
+ echo 'Directory kfapp already exists'
Directory kfapp already exists
+ exit 1
This happens if you initialized your configuration with invalid arguments and then reran the command with correct arguments. The kfapp folder already exists from the first attempt. Delete the kfapp folder and try again.
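The cleanup can be sketched as below. The helper name remove_stale_kfapp is hypothetical, and the sketch assumes nothing in the folder needs to be kept.

```shell
# remove_stale_kfapp is a hypothetical helper: it deletes the given directory
# if it exists, and aborts if called without an argument.
remove_stale_kfapp() {
  dir="${1:?usage: remove_stale_kfapp <dir>}"
  if [ -d "${dir}" ]; then
    rm -rf -- "${dir}"
    echo "removed ${dir}"
  else
    echo "${dir} does not exist; nothing to remove"
  fi
}

# Usage (not executed here), assuming the folder is named by ${KFAPP}:
#   remove_stale_kfapp "${KFAPP}"
# Then rerun the initialization command with the corrected arguments.
```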
EKS Cluster Creation Failure
Several problems can cause cluster creation to fail. If you see errors when creating your cluster with eksctl, open the CloudFormation console and check your stacks. To recover, follow the guidance in the eksctl output logs. Once you understand the root cause, delete your cluster and rerun ${KUBEFLOW_SRC}/scripts/kfctl.sh apply platform.
Common issues:
- The default VPC limit is 5 VPCs per region
- Invalid command arguments
+ eksctl create cluster --config-file=/tmp/cluster_config.yaml
[ℹ] using region us-west-2
[ℹ] subnets for us-west-2b - public:192.168.0.0/19 private:192.168.96.0/19
[ℹ] subnets for us-west-2c - public:192.168.32.0/19 private:192.168.128.0/19
[ℹ] subnets for us-west-2d - public:192.168.64.0/19 private:192.168.160.0/19
[ℹ] nodegroup "general" will use "ami-0280ac619ed294a8a" [AmazonLinux2/1.12]
[ℹ] importing SSH public key "/Users/ubuntu/.ssh/id_rsa.pub" as "eksctl-test-cluster-nodegroup-general-11:2a:f6:ba:b0:98:da:b4:24:db:18:3d:e3:3f:f5:fb"
[ℹ] creating EKS cluster "test-cluster" in "us-west-2" region
[ℹ] will create a CloudFormation stack for cluster itself and 1 nodegroup stack(s)
[ℹ] if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --name=test-cluster'
[ℹ] building cluster stack "eksctl-test-cluster-cluster"
[✖] unexpected status "ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-test-cluster-cluster"
[ℹ] fetching stack events in attempt to troubleshoot the root cause of the failure
[ℹ] AWS::CloudFormation::Stack/eksctl-test-cluster-cluster: ROLLBACK_IN_PROGRESS – "The following resource(s) failed to create: [InternetGateway, ServiceRole, NATIP, VPC]. . Rollback requested by user."
[✖] AWS::EC2::EIP/NATIP: CREATE_FAILED – "Resource creation cancelled"
[✖] AWS::IAM::Role/ServiceRole: CREATE_FAILED – "Resource creation cancelled"
[ℹ] AWS::EC2::EIP/NATIP: CREATE_IN_PROGRESS – "Resource creation Initiated"
[✖] AWS::EC2::VPC/VPC: CREATE_FAILED – "The maximum number of VPCs has been reached. (Service: AmazonEC2; Status Code: 400; Error Code: VpcLimitExceeded; Request ID: xxxxxxxxxx)"
[ℹ] AWS::IAM::Role/ServiceRole: CREATE_IN_PROGRESS – "Resource creation Initiated"
[ℹ] AWS::EC2::EIP/NATIP: CREATE_IN_PROGRESS
[✖] AWS::EC2::InternetGateway/InternetGateway: CREATE_FAILED – "The maximum number of internet gateways has been reached. (Service: AmazonEC2; Status Code: 400; Error Code: InternetGatewayLimitExceeded; Request ID: 7b3c9620-d1fa-4893-9e91-fb94eb3f2ef3)"
[ℹ] AWS::EC2::VPC/VPC: CREATE_IN_PROGRESS
[ℹ] AWS::IAM::Role/ServiceRole: CREATE_IN_PROGRESS
[ℹ] AWS::EC2::InternetGateway/InternetGateway: CREATE_IN_PROGRESS
[ℹ] AWS::CloudFormation::Stack/eksctl-test-cluster-cluster: CREATE_IN_PROGRESS – "User Initiated"
[ℹ] 1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ] to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=test-cluster'
[✖] waiting for CloudFormation stack "eksctl-test-cluster-cluster" to reach "CREATE_COMPLETE" status: ResourceNotReady: failed waiting for successful resource state
[✖] failed to create cluster "test-cluster"
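The VpcLimitExceeded failure in the log above can be checked for before creating a cluster. The following is a sketch under assumptions: the helper names are hypothetical, the limit of 5 is the default named in the list of common issues (an account's quota may differ), and count_vpcs needs the AWS CLI with configured credentials.

```shell
# Default per-region VPC quota from the list of common issues; an account's
# actual quota may have been raised.
VPC_LIMIT=5

# has_vpc_headroom is a hypothetical helper: it succeeds when the current VPC
# count ($1) is below the limit ($2), so one more VPC can be created.
has_vpc_headroom() {
  [ "$1" -lt "$2" ]
}

# count_vpcs is a hypothetical wrapper around the AWS CLI that prints the
# number of VPCs in a region. It is only defined here, not called.
count_vpcs() {
  aws ec2 describe-vpcs --region "$1" --query 'length(Vpcs)' --output text
}

# Usage (not executed here):
#   if has_vpc_headroom "$(count_vpcs us-west-2)" "${VPC_LIMIT}"; then
#     echo "room for another VPC"
#   else
#     echo "VPC limit reached; delete an unused VPC or request a quota increase"
#   fi
# After a failed attempt, the eksctl log above suggests cleaning up with:
#   eksctl delete cluster --region=us-west-2 --name=test-cluster
```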
Resource Not Found in delete all
+ kubectl get ns/kubeflow
Error from server (NotFound): namespaces "kubeflow" not found
+ kubectl get ns/kubeflow
Error from server (NotFound): namespaces "kubeflow" not found
+ echo 'namespace kubeflow successfully deleted.'
You can ignore any Kubernetes “resource not found” errors that occur during the deletion phase.
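If you wrap the deletion step in your own script, a check like this can separate the ignorable errors from real ones. It is a sketch only: the helper name is hypothetical, and it assumes the error text contains "(NotFound)" as in the output above.

```shell
# is_not_found_error is a hypothetical helper: it succeeds when a kubectl
# error message reports a missing resource, which is safe to ignore on delete.
is_not_found_error() {
  case "$1" in
    *"(NotFound)"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Classify the message shown in the output above.
msg='Error from server (NotFound): namespaces "kubeflow" not found'
if is_not_found_error "${msg}"; then
  echo "ignorable: resource is already gone"
fi
```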
InvalidParameterException in UpdateCluster
+ logging_components='"api","audit","authenticator","controllerManager","scheduler"'
++ aws eks update-cluster-config --name benchmark-0402222-sunday-satur --region us-west-2 --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
An error occurred (InvalidParameterException) when calling the UpdateClusterConfig operation: No changes needed for the logging config provided
The Amazon EKS UpdateClusterConfig API operation fails if you pass invalid parameters, which includes a request that changes nothing. For example, if logging is already enabled on your EKS cluster and you deploy Kubeflow on that existing cluster with logging enabled as well, you get this error.
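When the failure is only the "no changes" case, it can be treated as harmless. The sketch below shows one way to recognize it and to inspect the current logging settings. Both helper names are hypothetical, and show_cluster_logging needs the AWS CLI with configured credentials.

```shell
# is_no_change_error is a hypothetical helper: it succeeds when the message is
# the "nothing to update" variant of InvalidParameterException shown above.
is_no_change_error() {
  case "$1" in
    *"No changes needed for the logging config provided"*) return 0 ;;
    *) return 1 ;;
  esac
}

# show_cluster_logging is a hypothetical wrapper that prints the logging
# configuration of a cluster ($1) in a region ($2). Defined here, not called.
show_cluster_logging() {
  aws eks describe-cluster --name "$1" --region "$2" \
    --query 'cluster.logging.clusterLogging'
}

# Usage (not executed here), with a placeholder cluster name:
#   show_cluster_logging my-cluster us-west-2
```

Checking the current configuration first shows whether the log types you want are already enabled, in which case the update call can be skipped.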
FSX Mount Failure
Mounting command: mount
Mounting arguments: -t lustre fs-0xxxxx2a216cf.us-west-2.amazonaws.com@tcp:/fsx /var/lib/kubelet/pods/224c2c96-5a91-11e9-b7e6-0a2a42c99f84/volumes/kubernetes.io~csi/fsx-static/mount
Output: mount.lustre: Can't parse NID 'fs-0xxxxx2a216cf.us-west-2.amazonaws.com@tcp:/fsx'
This mount helper should only be invoked via the mount (8) command,
e.g. mount -t lustre dev dir
usage: mount.lustre [-fhnvV] [-o <mntopt>] <device> <mountpt>
<device>: the disk device, or for a client:
<mgsnid>[:<altmgsnid>...]:/<filesystem>[/<subdir>]
<filesystem>: name of the Lustre filesystem (e.g. lustre1)
<mountpt>: filesystem mountpoint (e.g. /mnt/lustre)
-f|--fake: fake mount (updates /etc/mtab)
This error means the Amazon FSx dnsName is incorrect. In the output above, the name is missing the fsx segment before the region. Delete the pod that uses this persistent volume claim, then delete the PV and PVC, correct the dnsName, and reapply the PV and PVC.
kubectl delete pod ${pod_using_pvc}
ks delete default -c ${COMPONENT}
ks param set ${COMPONENT} dnsName fs-0xxxxx2a216cf.fsx.us-west-2.amazonaws.com
ks apply default -c ${COMPONENT}
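A quick shape check on the dnsName before applying it can catch this mistake. This is a loose sketch: the helper name is hypothetical, and the pattern is inferred from the corrected value in the commands above (fs-, then an ID, then .fsx., a region, and .amazonaws.com), not from a formal specification.

```shell
# looks_like_fsx_dns_name is a hypothetical helper: it succeeds when the value
# has the general form fs-<id>.fsx.<region>.amazonaws.com.
looks_like_fsx_dns_name() {
  case "$1" in
    fs-*.fsx.*.amazonaws.com) return 0 ;;
    *) return 1 ;;
  esac
}

# The two values from the text above: the failing one and the corrected one.
for name in \
  fs-0xxxxx2a216cf.us-west-2.amazonaws.com \
  fs-0xxxxx2a216cf.fsx.us-west-2.amazonaws.com
do
  if looks_like_fsx_dns_name "${name}"; then
    echo "ok:      ${name}"
  else
    echo "suspect: ${name}"
  fi
done
```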
Incompatible eksctl version
If you see the error below when you run apply platform, your eksctl CLI version is not compatible with the eksctl.io version in cluster_config.yaml. Upgrade eksctl and try again. v1alpha5 was introduced in eksctl 0.1.31.
We are working with the eksctl team to make sure feature releases support backward compatibility for at least one version.
loading config file "${KUBEFLOW_SRC}/${KFAPP}/aws_config/cluster_config.yaml": no kind "ClusterConfig" is registered for version "eksctl.io/v1alpha5" in scheme "k8s.io/client-go/kubernetes/scheme/register.go:60"
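To check whether an installed eksctl is new enough for v1alpha5, its version can be compared against 0.1.31. The sketch below only implements the comparison. The helper name version_ge is hypothetical, it assumes plain dotted numeric versions with up to three components, and extracting the version string from your eksctl installation is left out because the output format of the version command varies between releases.

```shell
# version_ge is a hypothetical helper: it succeeds when version $1 is greater
# than or equal to version $2. Missing components are treated as 0.
version_ge() {
  # Split each version on dots into major, minor, and patch.
  IFS=. read -r a_major a_minor a_patch <<EOF
$1
EOF
  IFS=. read -r b_major b_minor b_patch <<EOF
$2
EOF
  # Compare component by component, most significant first.
  [ "${a_major:-0}" -gt "${b_major:-0}" ] && return 0
  [ "${a_major:-0}" -lt "${b_major:-0}" ] && return 1
  [ "${a_minor:-0}" -gt "${b_minor:-0}" ] && return 0
  [ "${a_minor:-0}" -lt "${b_minor:-0}" ] && return 1
  [ "${a_patch:-0}" -ge "${b_patch:-0}" ]
}

# Minimum eksctl version that understands eksctl.io/v1alpha5, per the text above.
MIN_EKSCTL_VERSION=0.1.31

# Usage (not executed here), with the installed version filled in by hand:
#   if version_ge 0.1.30 "${MIN_EKSCTL_VERSION}"; then
#     echo "eksctl is new enough"
#   else
#     echo "upgrade eksctl"
#   fi
```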