End-to-end Kubeflow on AWS
Running Kubeflow using AWS services
This guide describes how to deploy Kubeflow using AWS services such as EKS and Cognito. It consists of three parts: deploying the Kubernetes infrastructure, deploying Kubeflow, and finally deploying models using KFServing.
The target audience is a member of an SRE team that builds this platform and provides a dashboard to data scientists. The data scientists, in turn, can run their training workflows in their dedicated namespace and serve their models via a public endpoint.
AWS services used
- Managed Kubernetes (EKS) started with eksctl
- Kubernetes nodegroups (in EC2 auto-scaling groups) managed by eksctl
- ALB for istio-ingressgateway in front of all virtual services
- Cognito for user and API authentication
- Certificate Manager for SSL certificates
- Route53 to manage the domain
Prerequisites
Command-line access to an AWS account is required, so make sure you’re able to execute AWS CLI commands. Install the following programs on the system from which you provision the infrastructure (a laptop or a configuration management tool):
- eksctl
- kubectl
- istioctl
- kn
- kfctl
Deploy the Kubernetes cluster
This step is only required once, when building the infra for the platform.
Create a cluster.yaml file:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: aiplatform
  region: eu-west-1
nodeGroups:
  - name: ng
    desiredCapacity: 6
    instanceType: m5.xlarge
Then create the cluster using eksctl:
eksctl create cluster -f cluster.yaml
This starts one CloudFormation stack for the EKS control plane and one stack for each nodegroup (a single nodegroup in our case). You can follow the progress of the creation on the CloudFormation page in the console.
The cluster is ready when kubectl reports that the nodes are Ready:
kubectl get nodes
NAME                                           STATUS   ROLES    AGE   VERSION
ip-192-168-10-217.eu-west-1.compute.internal   Ready    <none>   18d   v1.14.7-eks-1861c5
ip-192-168-28-92.eu-west-1.compute.internal    Ready    <none>   18d   v1.14.7-eks-1861c5
ip-192-168-51-201.eu-west-1.compute.internal   Ready    <none>   18d   v1.14.7-eks-1861c5
ip-192-168-63-25.eu-west-1.compute.internal    Ready    <none>   18d   v1.14.7-eks-1861c5
ip-192-168-68-104.eu-west-1.compute.internal   Ready    <none>   18d   v1.14.7-eks-1861c5
ip-192-168-77-56.eu-west-1.compute.internal    Ready    <none>   18d   v1.14.7-eks-1861c5
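The readiness check above can also be scripted. The following is a minimal sketch that assumes the default column layout of kubectl get nodes shown above; the helper name count_ready_nodes is hypothetical, and the expected count of 6 simply mirrors the desiredCapacity in cluster.yaml.

```shell
#!/bin/sh
# Count the nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes` output on stdin, so it can be tried offline.
count_ready_nodes() {
  awk 'NR > 1 && $2 == "Ready" { n++ } END { print n + 0 }'
}

# Hypothetical usage against a live cluster (not executed here):
#   ready=$(kubectl get nodes | count_ready_nodes)
#   [ "$ready" -eq 6 ] && echo "all nodes are Ready"
```

Matching the STATUS column exactly means that nodes reported as NotReady are not counted.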
If you’d like to change the nodegroup scaling, there are two options: either via the EC2 auto-scaling group or using eksctl:
eksctl scale nodegroup --cluster=aiplatform --nodes=4 ng
Deploy the kubernetes dashboard
To deploy the Kubernetes dashboard as described in the AWS guide for deploying the Kubernetes web UI, first download and install the metrics server:
wget https://api.github.com/repos/kubernetes-sigs/metrics-server/tarball/v0.3.6
tar zxvf v0.3.6
kubectl apply -f kubernetes-sigs-metrics-server-d1f4f6f/deploy/1.8+
Validate:
kubectl get deployment metrics-server -n kube-system
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server   1/1     1            1           18d
To install the dashboard and create a user to access it, first define an eks-admin service account and cluster role binding in a file named eks-admin-service-account.yaml:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: eks-admin
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: eks-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: eks-admin
    namespace: kube-system
Then deploy the dashboard and apply the file:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-rc5/aio/deploy/recommended.yaml
kubectl apply -f eks-admin-service-account.yaml
To access the kubernetes dashboard bring it to your localhost with a proxy:
kubectl proxy
Then open the Kubernetes dashboard UI in your browser through the proxy.
Exposing the kubernetes dashboard via an istio virtual service is not recommended.
To log in, get the token using the following command:
kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep eks-admin | awk '{print $1}')
For more information, see the AWS documentation on creating and managing EKS clusters.
Deploy Kubeflow
In this section you will prepare the ecosystem required by kubeflow, and you will configure the kfctl.yaml file with the custom information for your environment.
Cognito and certificates
Route53
It is handy to have a domain managed by Route53 to deal with all the DNS records you will have to add (wildcard for istio-ingressgateway, validation for the certificate manager, etc).
If your domain.com zone is not managed by Route53, you need to delegate the management of a subdomain to a Route53 hosted zone. In our example we have delegated the subdomain platform.domain.com. To do that, create a new hosted zone platform.domain.com, copy the NS entries that are created for it, and create these NS records in the domain.com zone.
In the following example, domain.com is hosted at GoDaddy and has no subdomain there. We’d like to create a subdomain that uses Amazon Route53 as its DNS service; see the Route53 documentation for details. If you already have a subdomain in your domain service, you can use Route53 for it as well; the Route53 documentation covers that case too.
Route53 creates four name servers for the hosted zone, and you need to configure them in your domain service. Add NS (name server) records whose key is the subdomain name platform and whose value is a name server from Route53.
Note: domain providers have different settings, so check the guidance from your own provider.
For Cognito to use a custom domain name, an A record is required that resolves platform.domain.com as the root domain; this can also be a Route53 alias to the ALB. Use an arbitrary IP address for now, and update the value once the ALB has been created.
If you’re not using Route53, you can point that A record anywhere.
The remaining record sets in the hosted zone will be created in the following sections of this guide.
Certificate Manager
Create two certificates in Certificate Manager for *.platform.domain.com, one in N. Virginia (us-east-1) and one in the region of your choice. Cognito requires a certificate in N. Virginia in order to have a custom domain for a user pool. The second certificate is required by the ingress-gateway when the platform does not run in N. Virginia; in our example it runs in Dublin (eu-west-1). To validate both certificates, you will be asked to create one record in the hosted zone we created above.
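The two certificate requests can also be made from the command line. The sketch below only prints the aws acm request-certificate commands so that you can review them first; the helper name request_cert_cmd is hypothetical, and the regions follow this guide’s example (us-east-1 for Cognito, eu-west-1 for the platform).

```shell
#!/bin/sh
# Print the AWS CLI command that requests a DNS-validated wildcard
# certificate for a domain ($1) in a region ($2).
# Printing instead of running keeps the sketch safe to try.
request_cert_cmd() {
  printf 'aws acm request-certificate --domain-name "*.%s" --validation-method DNS --region %s\n' "$1" "$2"
}

# One certificate for the Cognito custom domain and one for the ALB.
for region in us-east-1 eu-west-1; do
  request_cert_cmd platform.domain.com "$region"
done
```

DNS validation is assumed here because the validation record can be added to the hosted zone created above.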
Cognito
Create a user pool in Cognito. Type a pool name and choose Review defaults and Create pool.
Create some users in Users and groups; these are the users who will log in to the central dashboard.
Add an App client with any name and the default options.
In the App client settings, select the Authorization code grant flow and the email, openid, aws.cognito.signin.user.admin and profile scopes.
Use https://kubeflow.platform.domain.com/oauth2/idpresponse in the Callback URL(s).
In the Domain name, choose Use your domain, type auth.platform.domain.com and select the *.platform.domain.com AWS managed certificate you’ve created in N. Virginia. Creating the domain takes up to 15 minutes.
When the domain is created, it returns the Alias target CloudFront address, for which you need to create an A record auth.platform.domain.com in the hosted zone.
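Take note of the following 5 values:
- The ARN of the certificate from Certificate Manager (certArn). Use the certificate from the region where the platform runs (eu-west-1 in the manifest example in this section), because the ALB can only use a certificate from its own region; the N. Virginia certificate is the one selected for the Cognito domain.
- The Pool ARN (cognitoUserPoolArn) of the user pool, found in the Cognito general settings.
- The App client id (cognitoAppClientId), found in Cognito App clients.
- The domain auth.platform.domain.com as the cognitoUserPoolDomain.
- The IAM role name(s) of the created nodegroup(s), listed by the following command (set AWS_CLUSTER_NAME to the cluster name first, aiplatform in this guide):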
aws iam list-roles \
| jq -r ".Roles[] \
| select(.RoleName \
| startswith(\"eksctl-$AWS_CLUSTER_NAME\") and contains(\"NodeInstanceRole\")) \
.RoleName"
Download and edit the kfctl manifest file:
wget https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_aws_cognito.v1.2.0.yaml
At the end of the file is the KfAwsPlugin plugin section. In its cognito spec, replace the four values you recorded above, and list the nodegroup role names under roles.
  - kind: KfAwsPlugin
    metadata:
      name: aws
    spec:
      auth:
        cognito:
          certArn: arn:aws:acm:eu-west-1:xxxxx:certificate/xxxxxxxxxxxxx-xxxx
          cognitoAppClientId: xxxxxbxxxxxx
          cognitoUserPoolArn: arn:aws:cognito-idp:eu-west-1:xxxxx:userpool/eu-west-1_xxxxxx
          cognitoUserPoolDomain: auth.platform.domain.com
      region: eu-west-1
      roles:
        - eksctl-aiplatform-aws-nodegroup-ng-NodeInstanceRole-xxxxx
Now you can build the manifests and then deploy them:
kfctl build -f kfctl_aws_cognito.v1.2.0.yaml -V
kfctl apply -f kfctl_aws_cognito.v1.2.0.yaml -V
That shouldn’t take long. There shouldn’t be any errors, and when it’s ready you can validate that the kubeflow namespace exists.
At this point you will also have an ALB, which takes around 3 minutes to be ready. When it’s ready, copy the DNS name of that load balancer and create 2 CNAME entries pointing to it in Route53:
*.platform.domain.com
*.default.platform.domain.com
Also remember to update the A record for platform.domain.com to use the ALB DNS name.
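If you prefer the command line to the console, the CNAME entries can be described as a Route53 change batch. This is a minimal sketch: the helper name cname_change_batch, the TTL of 300 seconds, and the HOSTED_ZONE_ID and ALB_DNS variables are assumptions made for illustration.

```shell
#!/bin/sh
# Emit a Route53 change batch that upserts one CNAME record.
# $1 = record name, $2 = target (the DNS name of the ALB).
# The TTL of 300 seconds is an arbitrary choice for this sketch.
cname_change_batch() {
  cat <<EOF
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$1",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{ "Value": "$2" }]
      }
    }
  ]
}
EOF
}

# Hypothetical usage (not executed here), once per CNAME entry:
#   aws route53 change-resource-record-sets \
#     --hosted-zone-id "$HOSTED_ZONE_ID" \
#     --change-batch "$(cname_change_batch '*.platform.domain.com' "$ALB_DNS")"
```

UPSERT is used so that running the command again updates the record rather than failing because it already exists.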
Here’s the full snapshot of record sets in your hosted zone.
The central dashboard should now be available at https://kubeflow.platform.domain.com. The first visit redirects to Cognito for login.
Deploy knative
Download the knative manifests from https://github.com/kubeflow/manifests/tree/master/knative
Edit the config-domain ConfigMap in the file knative-serving-install/base/config-map.yaml so that it uses your own domain in place of example.com:
apiVersion: v1
data:
  platform.domain.com: ""
kind: ConfigMap
metadata:
  labels:
    serving.knative.dev/release: "v0.8.0"
  name: config-domain
  namespace: knative-serving
Build and apply knative:
cd knative/knative-serving-crds/base
kustomize build . | kubectl apply -f -
cd -
cd knative/knative-serving-install/base
kustomize build . | kubectl apply -f -
cd -
That will create a knative-serving namespace with all 6 pods running:
NAME                                READY   STATUS    RESTARTS   AGE
activator-7746448cf9-ggk98          2/2     Running   2          18d
autoscaler-548ccfcc57-zsfpw         2/2     Running   2          18d
autoscaler-hpa-669647f4f4-mx5q7     1/1     Running   0          18d
controller-655b8c8fb8-g89x7         1/1     Running   0          18d
networking-istio-75ff868647-k95mz   1/1     Running   0          18d
webhook-5846486ff4-4ltjq            1/1     Running   0          18d
Deploy kfserving
Install KFServing using the manifest file:
kubectl apply -f https://raw.githubusercontent.com/kubeflow/kfserving/master/install/v0.4.1/kfserving.yaml
That will create a kfserving-system namespace with one pod running.
Deploy models
Deploy a TensorFlow, a PyTorch and a scikit-learn model using KFServing:
kubectl apply -f https://raw.githubusercontent.com/kubeflow/kfserving/master/docs/samples/v1alpha2/tensorflow/tensorflow.yaml
kubectl apply -f https://raw.githubusercontent.com/kubeflow/kfserving/master/docs/samples/v1alpha2/pytorch/pytorch.yaml
kubectl apply -f https://raw.githubusercontent.com/kubeflow/kfserving/master/docs/samples/v1alpha2/sklearn/sklearn.yaml
Validate that all three inference services are available:
kubectl get inferenceservice
or alternatively through the knative cli:
kn service list
NAME                                   URL                                                                       LATEST                                       AGE     CONDITIONS   READY   REASON
pytorch-cifar10-predictor-default      http://pytorch-cifar10-predictor-default.default.platform.domain.com      pytorch-cifar10-predictor-default-vfz8r      18d     3 OK / 3     True
sklearn-iris-predictor-default         http://sklearn-iris-predictor-default.default.platform.domain.com         sklearn-iris-predictor-default-pbx2x         6d22h   3 OK / 3     True
tensorflow-flowers-predictor-default   http://tensorflow-flowers-predictor-default.default.platform.domain.com   tensorflow-flowers-predictor-default-6zp4q   18d     3 OK / 3     True
This loads each model from Google Cloud Storage and serves it through the same istio ingress-gateway. To test an inference request, post one of the model’s example data points to its endpoint, using the cookie from a browser session that has logged in to the central dashboard:
POST https://sklearn-iris-predictor-default.default.platform.domain.com/v1/models/sklearn-iris:predict HTTP/1.1
Host: sklearn-iris-predictor-default.default.platform.domain.com
Content-Type: application/json
Cookie: AWSELBAuthSessionCookie-0=TBLc8+Mz0hSZp...

{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
That request runs the inference and returns the classes for the two data points:
{"predictions": [1, 1]}
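The same request can be sent with curl. The sketch below builds the endpoint URL from the naming pattern visible in the service list above; the helper name predict_url, the COOKIE variable and the file name iris-input.json are assumptions for illustration.

```shell
#!/bin/sh
# Build the prediction URL for a model served by KFServing, following the
# <model>-predictor-default.<namespace>.<domain> pattern listed above.
# $1 = model name, $2 = namespace, $3 = domain.
predict_url() {
  printf 'https://%s-predictor-default.%s.%s/v1/models/%s:predict\n' "$1" "$2" "$3" "$1"
}

# Hypothetical usage (not executed here): COOKIE holds the value of the
# AWSELBAuthSessionCookie-0 cookie copied from the browser, and
# iris-input.json contains the JSON body shown above.
#   curl -s \
#     -H "Content-Type: application/json" \
#     -H "Cookie: AWSELBAuthSessionCookie-0=$COOKIE" \
#     -d @iris-input.json \
#     "$(predict_url sklearn-iris default platform.domain.com)"
```

Keeping the URL construction in one helper makes it easy to target the PyTorch or TensorFlow service by changing only the model name.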
Store models in S3 bucket
Copy the models to S3:
gsutil -m cp -r gs://kfserving-samples/models/tensorflow/flowers s3://domain.com-models/flowers
Create a Kubernetes secret to access the S3 bucket by creating a kfserving-s3-secret.yaml file:
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
  annotations:
    serving.kubeflow.org/s3-endpoint: s3.eu-west-1.amazonaws.com
    serving.kubeflow.org/s3-usehttps: "1"
    serving.kubeflow.org/s3-verifyssl: "1"
    serving.kubeflow.org/s3-region: eu-west-1
type: Opaque
data:
  # echo -ne "AKIAxxx" | base64
  awsAccessKeyID: QUtJQVhxxxVXVjQ=
  awsSecretAccessKey: QzR0UnxxxVNOd0NQQQ==
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa
secrets:
  - name: mysecret
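The values under data in the secret must be base64-encoded without a trailing newline, as the comment in the file hints. A minimal sketch of that encoding step follows; the helper name encode_secret_value is hypothetical and AKIAEXAMPLE is a placeholder, not a real key.

```shell
#!/bin/sh
# Base64-encode a value for the `data:` section of a Kubernetes Secret.
# printf '%s' avoids encoding a trailing newline, and tr removes the line
# breaks that base64 may insert into long output.
encode_secret_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

encode_secret_value 'AKIAEXAMPLE'   # placeholder value, not a real key
echo
```

Encoding a stray newline is an easy mistake to make with a plain echo, and it would result in credentials that S3 rejects.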
Then change the inference service accordingly by creating a tensorflow.yaml file:
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "tensorflow-flowers"
spec:
  default:
    predictor:
      serviceAccountName: sa
      tensorflow:
        storageUri: "s3://domain.com-models/flowers"
Apply the changes:
kubectl apply -f kfserving-s3-secret.yaml
kubectl apply -f tensorflow.yaml
Summary and access
Overview of the installed components, endpoints and the tools used:
Debug
Custom domain is not a valid subdomain
Route53 needs an A record that resolves the root domain, so add this record to the hosted zone. If you missed this step, see the Route53 section.
Last modified 04.05.2021: refactor and refresh aws docs (#2688) (ef4cda60)