Monitor Cloud IAP Setup

Instructions for monitoring and troubleshooting Cloud IAP

Cloud Identity-Aware Proxy (Cloud IAP) is the recommended solution for accessing your Kubeflow deployment from outside the cluster when running Kubeflow on Google Cloud Platform (GCP).

This document is a step-by-step guide to ensuring that your IAP-secured endpoint is available, and to debugging problems that may cause the endpoint to be unavailable.

Introduction

When deploying Kubeflow using the deployment UI or the command-line interface, you choose the authentication method you want to use. One of the options is Cloud IAP. This document assumes that you have already deployed Kubeflow.

Kubeflow uses the Let’s Encrypt service to provide an SSL certificate for the Kubeflow UI.

Cloud IAP gives you the following benefits:

  • Users can log in using their GCP accounts.
  • You benefit from Google’s security expertise to protect your sensitive workloads.

Monitoring your Cloud IAP setup

Follow these instructions to monitor your Cloud IAP setup and troubleshoot any problems:

  • Examine the Ingress and its Google Cloud load balancer to make sure it is available:
    kubectl -n istio-system describe ingress
    Name: envoy-ingress
    Namespace: kubeflow
    Address: 35.244.132.160
    Default backend: default-http-backend:80 (10.20.0.10:8080)
    Events:
      Type     Reason     Age                 From                     Message
      ----     ------     ----                ----                     -------
      Normal   ADD        12m                 loadbalancer-controller  kubeflow/envoy-ingress
      Warning  Translate  12m (x10 over 12m)  loadbalancer-controller  error while evaluating the ingress spec: could not find service "kubeflow/envoy"
      Warning  Translate  12m (x2 over 12m)   loadbalancer-controller  error while evaluating the ingress spec: error getting BackendConfig for port "8080" on service "kubeflow/envoy", err: no BackendConfig for service port exists.
      Warning  Sync       12m                 loadbalancer-controller  Error during sync: Error running backend syncing routine: received errors when updating backend service: googleapi: Error 400: The resource 'projects/code-search-demo/global/backendServices/k8s-be-32230--bee2fc38fcd6383f' is not ready, resourceNotReady
    googleapi: Error 400: The resource 'projects/code-search-demo/global/backendServices/k8s-be-32230--bee2fc38fcd6383f' is not ready, resourceNotReady
      Normal   CREATE     11m                 loadbalancer-controller  ip: 35.244.132.160
    ...

Any problems with creating the load balancer are reported as Kubernetes events in the results of the above describe command.

  • If the address isn’t set, there was a problem creating the load balancer.

  • The CREATE event indicates the load balancer was successfully created on the specified IP address.

  • The most common error is running out of GCP quota. To fix this problem, you must either increase the quota for the relevant resource on your GCP project or delete some existing resources.
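
The address check described above can also be scripted. The following is a minimal sketch, assuming the Ingress is named envoy-ingress and lives in the istio-system namespace, as in the commands in this guide. The function names ingress_address and check_ingress_address are illustrative and not part of Kubeflow.

```shell
# Print the external IP of the Ingress, or nothing if none is assigned yet.
# Assumption: the Ingress is named envoy-ingress in the istio-system namespace.
ingress_address() {
  kubectl -n istio-system get ingress envoy-ingress \
    -o jsonpath='{.status.loadBalancer.ingress[*].ip}'
}

# Report the address, or list the Warning events for the Ingress if it is missing.
check_ingress_address() {
  addr="$(ingress_address)"
  if [ -z "$addr" ]; then
    echo "no address assigned; Warning events for envoy-ingress:" >&2
    kubectl -n istio-system get events \
      --field-selector involvedObject.name=envoy-ingress,type=Warning >&2
    return 1
  fi
  echo "load balancer address: $addr"
}
```

Calling check_ingress_address returns a non-zero status while the address is unset, so it can be used as a condition in a retry loop.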

  • Check that the SSL certificate was provisioned:
    kubectl -n istio-system get certificate envoy-ingress-tls -o yaml
    apiVersion: certmanager.k8s.io/v1alpha1
    kind: Certificate
    metadata:
      creationTimestamp: 2019-04-02T22:49:43Z
      generation: 1
      labels:
        kustomize.component: iap-ingress
      name: envoy-ingress-tls
      namespace: istio-system
      resourceVersion: "4803"
      selfLink: /apis/certmanager.k8s.io/v1alpha1/namespaces/kubeflow/certificates/envoy-ingress-tls
      uid: 9b137b29-5599-11e9-a223-42010a8e020c
    spec:
      acme:
        config:
        - domains:
          - mykubeflow.endpoints.myproject.cloud.goog
          http01:
            ingress: envoy-ingress
      commonName: kf-vmaster-n01.endpoints.kubeflow-ci-deployment.cloud.goog
      dnsNames:
      - mykubeflow.endpoints.myproject.cloud.goog
      issuerRef:
        kind: ClusterIssuer
        name: letsencrypt-prod
      secretName: envoy-ingress-tls
    status:
      acme:
        order:
          url: https://acme-v02.api.letsencrypt.org/acme/order/54483154/382580193
      conditions:
      - lastTransitionTime: 2019-04-02T23:00:28Z
        message: Certificate issued successfully
        reason: CertIssued
        status: "True"
        type: Ready
      - lastTransitionTime: null
        message: Order validated
        reason: OrderValidated
        status: "False"
        type: ValidateFailed

It can take around 10 minutes to provision a certificate after the creation of the load balancer.

The most recent condition should be Certificate issued successfully.
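
Because provisioning can take several minutes, it may be convenient to poll the Ready condition instead of rereading the YAML by hand. This is a sketch only, assuming the certificate is named envoy-ingress-tls in the istio-system namespace as in the command above. The function names and the default of 20 polls at 30-second intervals (about 10 minutes) are illustrative choices.

```shell
# Print the status ("True" or "False") of the certificate's Ready condition.
cert_ready_status() {
  kubectl -n istio-system get certificate envoy-ingress-tls \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Poll until the certificate is ready.
# $1: number of polls (default 20), $2: seconds between polls (default 30).
wait_for_certificate() {
  attempts="${1:-20}"
  delay="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$(cert_ready_status)" = "True" ]; then
      echo "certificate is ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "certificate not ready after $attempts attempts" >&2
  return 1
}
```

If wait_for_certificate gives up, the conditions in the YAML output are the place to look for the reason.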

The most common error is running out of Let’s Encrypt quota. Let’s Encrypt enforces a quota of 5 duplicate certificates per week.

The easiest fix for quota issues is to pick a different hostname by recreating and redeploying Kubeflow with a different name.

For example, if you originally ran the following kfctl init command:

    kfctl init myapp --project=myproject --config=myconfig -V

Then rerun kfctl init with a different name that you haven’t used before:

    kfctl init myapp-unique --project=myproject --config=myconfig -V

  • Wait for the load balancer to report the backends as healthy:
    kubectl describe -n istio-system ingress envoy-ingress
    ...
    Annotations:
      kubernetes.io/ingress.global-static-ip-name: kubeflow-ip
      kubernetes.io/tls-acme: true
      certmanager.k8s.io/issuer: letsencrypt-prod
      ingress.kubernetes.io/backends: {"k8s-be-31380--5e1566252944dfdb":"HEALTHY","k8s-be-32133--5e1566252944dfdb":"HEALTHY"}
    ...

Both backends should be reported as healthy. It can take several minutes for the load balancer to consider the backends healthy.

The service with port 31380 is the one that handles Kubeflow traffic. (31380 is the default port of the service istio-ingressgateway.)
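
To pick out unhealthy backends without scanning the annotation by eye, the JSON value can be filtered with standard text tools. The sketch below assumes the flat {"name":"STATE",...} shape shown in the example output. The function names unhealthy_backends and backends_annotation are hypothetical helpers.

```shell
# Read the backends annotation JSON on stdin and print the name of every
# backend whose state is anything other than HEALTHY.
# Assumption: the value is a flat JSON object of name/state string pairs.
unhealthy_backends() {
  tr -d '{}"' | tr ',' '\n' | awk -F: 'NF == 2 && $2 != "HEALTHY" { print $1 }'
}

# Fetch the annotation from the Ingress. Dots inside the annotation key
# are escaped so that jsonpath treats the key as a single field.
backends_annotation() {
  kubectl -n istio-system get ingress envoy-ingress \
    -o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}'
}
```

Running backends_annotation | unhealthy_backends prints nothing when every backend is healthy.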

If the backend is unhealthy, check the pods in istio-system:

  • Run kubectl get pods -n istio-system to list the pods.
  • The istio-ingressgateway-XX pods should be running.
  • Check the logs of backend-updater-0, ingress-bootstrap-XX, and iap-enabler-XX for errors.
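
Since the XX suffixes differ per deployment, one way to collect these logs is to match pods by name prefix. This is a rough sketch: show_logs_for is a hypothetical helper, and the 50-line tail is an arbitrary choice.

```shell
# Print the last 50 log lines of every pod in istio-system whose name
# starts with the given prefix (for example, iap-enabler).
show_logs_for() {
  prefix="$1"
  kubectl -n istio-system get pods -o name |
    grep "^pod/$prefix" |
    while read -r pod; do
      echo "== $pod =="
      # stdin is redirected so the command cannot consume the pod list
      kubectl -n istio-system logs "$pod" --tail=50 </dev/null
    done
}
```

For example, the following loop covers the three components named above:

```shell
for p in backend-updater ingress-bootstrap iap-enabler; do show_logs_for "$p"; done
```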

  • Now that the certificate exists, the Ingress resource should report that it is serving on HTTPS:
    kubectl -n istio-system get ingress
    NAME            HOSTS                                       ADDRESS          PORTS     AGE
    envoy-ingress   mykubeflow.endpoints.myproject.cloud.goog   35.244.132.159   80, 443   1d

If you don’t see port 443, look at the Ingress events using kubectl describe to see if there are any errors.

  • Try accessing Cloud IAP at the fully qualified domain name in your web browser:
    https://<your-fully-qualified-domain-name>

If you get SSL errors when you log in, this typically means that your SSL certificate is still propagating. Wait a few minutes and try again. SSL propagation can take up to 10 minutes.

If you do not see a login prompt and you get a 404 error, the configuration of Cloud IAP is not yet complete. Keep retrying for up to 10 minutes.
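
The same check can be made from a terminal with curl. The sketch below rests on assumptions: the helper names are hypothetical, and the reading of a 302 response as the redirect to the login prompt is an expectation about how the endpoint behaves for unauthenticated requests, not something stated in this guide. curl reports the code 000 when it receives no HTTP response, for example after a failed SSL handshake.

```shell
# Print the HTTP status code returned by the endpoint.
# $1: fully qualified domain name of the deployment.
probe_endpoint() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "https://$1/"
}

# Map a status code to the troubleshooting advice in this section.
# The 302 interpretation is an assumption, see the text above.
describe_probe() {
  case "$1" in
    000) echo "no HTTP response: possible SSL or DNS problem, retry in a few minutes" ;;
    404) echo "404: Cloud IAP configuration is not complete yet, keep retrying" ;;
    302) echo "302: redirect, probably to the login prompt" ;;
    *)   echo "HTTP $1: inspect the response manually" ;;
  esac
}
```

For example, using the hostname from the sample output:

```shell
describe_probe "$(probe_endpoint mykubeflow.endpoints.myproject.cloud.goog)"
```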

  • If you get the error Error: redirect_uri_mismatch after logging in, this means the list of OAuth authorized redirect URIs does not include your domain.

The full error message looks like the following example and includes the relevant links:

    The redirect URI in the request, https://mykubeflow.endpoints.myproject.cloud.goog/_gcp_gatekeeper/authenticate, does not match the ones authorized for the OAuth client.
    To update the authorized redirect URIs, visit: https://console.developers.google.com/apis/credentials/oauthclient/22222222222-7meeee7a9a76jvg54j0g2lv8lrsb4l8g.apps.googleusercontent.com?project=22222222222

Follow the link in the error message to find the OAuth credential being used and add the redirect URI listed in the error message to the list of authorized URIs. For more information, read the guide to setting up OAuth for Cloud IAP.
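
The redirect URI in the example error message is the deployment hostname followed by the path /_gcp_gatekeeper/authenticate. Assuming that pattern holds for your hostname (it is inferred from the example, so the URI printed in your own error message takes precedence), a small helper can print the value to add. The name iap_redirect_uri is illustrative.

```shell
# Print the redirect URI for a given hostname, following the pattern
# seen in the example error message. $1: fully qualified domain name.
iap_redirect_uri() {
  printf 'https://%s/_gcp_gatekeeper/authenticate\n' "$1"
}
```

Compare the printed value with the URI in the error message before adding it to the OAuth client.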

Expiry of the SSL certificate from Let’s Encrypt

Kubeflow runs an agent in your cluster to renew the Let’s Encrypt certificate automatically. You don’t need to take any action. For more information, see the Let’s Encrypt documentation.

For questions and support about the certificate, visit Let’s Encrypt support.