Monitor Cloud IAP Setup

Instructions for monitoring and troubleshooting Cloud IAP

Cloud Identity-Aware Proxy (Cloud IAP) is the recommended solution for accessing your Kubeflow deployment from outside the cluster, when running Kubeflow on Google Cloud.

This document is a step-by-step guide to ensuring that your IAP-secured endpoint is available, and to debugging problems that may cause the endpoint to be unavailable.

Introduction

When deploying Kubeflow using the command-line interface, you choose the authentication method you want to use. One of the options is Cloud IAP. This document assumes that you have already deployed Kubeflow.

Kubeflow uses the Google-managed certificate to provide an SSL certificate for the Kubeflow Ingress.

Cloud IAP gives you the following benefits:

  • Users can log in in using their Google Cloud accounts.
  • You benefit from Google’s security expertise to protect your sensitive workloads.

Monitoring your Cloud IAP setup

Follow these instructions to monitor your Cloud IAP setup and troubleshoot any problems:

  1. Examine the Ingress and Google Cloud Build (GCB) load balancer to make sure it is available:

    1. kubectl -n istio-system describe ingress
    2. Name: envoy-ingress
    3. Namespace: kubeflow
    4. Address: 35.244.132.160
    5. Default backend: default-http-backend:80 (10.20.0.10:8080)
    6. Annotations:
    7. ...
    8. Events:
    9. Type Reason Age From Message
    10. ---- ------ ---- ---- -------
    11. Normal ADD 12m loadbalancer-controller kubeflow/envoy-ingress
    12. Warning Translate 12m (x10 over 12m) loadbalancer-controller error while evaluating the ingress spec: could not find service "kubeflow/envoy"
    13. Warning Translate 12m (x2 over 12m) loadbalancer-controller error while evaluating the ingress spec: error getting BackendConfig for port "8080" on service "kubeflow/envoy", err: no BackendConfig for service port exists.
    14. Warning Sync 12m loadbalancer-controller Error during sync: Error running backend syncing routine: received errors when updating backend service: googleapi: Error 400: The resource 'projects/code-search-demo/global/backendServices/k8s-be-32230--bee2fc38fcd6383f' is not ready, resourceNotReady
    15. googleapi: Error 400: The resource 'projects/code-search-demo/global/backendServices/k8s-be-32230--bee2fc38fcd6383f' is not ready, resourceNotReady
    16. Normal CREATE 11m loadbalancer-controller ip: 35.244.132.160
    17. ...

    There should be an annotation indicating that we are using managed certificate:

    1. annotation:
    2. networking.gke.io/managed-certificates: gke-certificate

    Any problems with creating the load balancer are reported as Kubernetes events in the results of the above describe command.

    • If the address isn’t set then there was a problem creating the load balancer.

    • The CREATE event indicates the load balancer was successfully created on the specified IP address.

    • The most common error is running out of Google Cloud resource quota. To fix this problem, you must either increase the quota for the relevant resource on your Google Cloud project or delete some existing resources.

  2. Verify that a managed certificate resource is generated:

    1. kubectl describe -n istio-system managedcertificate gke-certificate

    The status field should have information about the current status of the Certificate. Eventually, certificate status should be Active.

  3. Wait for the load balancer to report the back ends as healthy:

    1. kubectl describe -n istio-system ingress envoy-ingress
    2. ...
    3. Annotations:
    4. kubernetes.io/ingress.global-static-ip-name: kubeflow-ip
    5. kubernetes.io/tls-acme: true
    6. certmanager.k8s.io/issuer: letsencrypt-prod
    7. ingress.kubernetes.io/backends: {"k8s-be-31380--5e1566252944dfdb":"HEALTHY","k8s-be-32133--5e1566252944dfdb":"HEALTHY"}
    8. ...

    Both backends should be reported as healthy. It can take several minutes for the load balancer to consider the back ends healthy.

    The service with port 31380 is the one that handles Kubeflow traffic. (31380 is the default port of the service istio-ingressgateway.)

    If the backend is unhealthy, check the pods in istio-system:

    • kubectl get pods -n istio-system
    • The istio-ingressgateway-XX pods should be running
    • Check the logs of pod backend-updater-0, iap-enabler-XX to see if there is any error
    • Follow the steps here to check the load balancer and backend service on Google Cloud.
  4. Try accessing Cloud IAP at the fully qualified domain name in your web browser:

    1. https://<your-fully-qualified-domain-name>

    If you get SSL errors when you log in, this typically means that your SSL certificate is still propagating. Wait a few minutes and try again. SSL propagation can take up to 10 minutes.

    If you do not see a login prompt and you get a 404 error, the configuration of Cloud IAP is not yet complete. Keep retrying for up to 10 minutes.

  5. If you get an error Error: redirect_uri_mismatch after logging in, this means the list of OAuth authorized redirect URIs does not include your domain.

    The full error message looks like the following example and includes the relevant links:

    1. The redirect URI in the request, https://mykubeflow.endpoints.myproject.cloud.goog/_gcp_gatekeeper/authenticate, does not match the ones authorized for the OAuth client.
    2. To update the authorized redirect URIs, visit: https://console.developers.google.com/apis/credentials/oauthclient/22222222222-7meeee7a9a76jvg54j0g2lv8lrsb4l8g.apps.googleusercontent.com?project=22222222222

    Follow the link in the error message to find the OAuth credential being used and add the redirect URI listed in the error message to the list of authorized URIs. For more information, read the guide to setting up OAuth for Cloud IAP.

Next steps

Last modified 17.05.2021: (gke) Full Kubeflow Deployment guide on GKE (revision for Kubeflow 1.3) (#2686) (f50693a3)