NVIDIA TensorRT Inference Server
Model serving using TRT Inference Server
NVIDIA TensorRT Inference Server is a REST and GRPC service for deep-learning inferencing of TensorRT, TensorFlow and Caffe2 models. The server is optimized to deploy machine and deep learning algorithms on both GPUs and CPUs at scale.
These instructions detail how to set up a GKE cluster suitable for running the NVIDIA TensorRT Inference Server and how to use the io.ksonnet.pkg.nvidia-inference-server prototype to generate Kubernetes YAML and deploy to that cluster.
For more information on the NVIDIA TensorRT Inference Server see the NVIDIA TensorRT Inference Server User Guide and the NVIDIA TensorRT Inference Server Clients open-source repository.
Setup
Please refer to the guide to deploying Kubeflow on GCP.
NVIDIA TensorRT Inference Server Image
The docker image for the NVIDIA TensorRT Inference Server is available on the NVIDIA GPU Cloud. Below you will add a Kubernetes secret to allow you to pull this image. As initialization you must first register at NVIDIA GPU Cloud and follow the directions to obtain your API key. You can confirm the key is correct by attempting to log in to the registry and checking that you can pull the inference server image. See Pull an Image from a Private Registry for more information about using a private registry.
$ docker login nvcr.io
Username: $oauthtoken
Password: <your-api-key>
Now use the NVIDIA GPU Cloud API key from above to create a Kubernetes secret called ngc. This secret allows Kubernetes to pull the inference server image from the NVIDIA GPU Cloud registry. For docker-username you specify the value exactly as shown, including the backslash.
$ kubectl create secret docker-registry ngc --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<api-key> --docker-email=<ngc-email>
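The backslash in `--docker-username=\$oauthtoken` matters: it prevents your shell from expanding `$oauthtoken` as a (likely unset) variable, so the literal username `$oauthtoken` is stored in the secret. A quick way to see the difference:

```shell
# With the backslash, the literal username $oauthtoken is passed through.
printf '%s\n' --docker-username=\$oauthtoken
# Without it, the shell expands the (unset) variable and the username is lost.
printf '%s\n' --docker-username=$oauthtoken
```

The first command prints `--docker-username=$oauthtoken`; the second prints only `--docker-username=`.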
Model Repository
The inference server needs a repository of models that it will make available for inferencing. You can find an example repository in the open-source repo and instructions on how to create your own model repository in the NVIDIA Inference Server User Guide.
For this example you will place the model repository in a Google CloudStorage bucket.
$ gsutil mb gs://inference-server-model-store
Following these instructions download the example model repository to your system and copy it into the GCS bucket.
$ gsutil cp -r model_store gs://inference-server-model-store
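The repository you copy up follows the model repository layout described in the User Guide: one directory per model, containing a config.pbtxt and numbered version subdirectories holding the model files. A minimal local sketch of that structure (the model and file names here are illustrative, not the full example repository):

```shell
# Sketch of the model repository layout (illustrative names only).
mkdir -p model_store/resnet50_netdef/1
touch model_store/resnet50_netdef/config.pbtxt   # model configuration
touch model_store/resnet50_netdef/1/model.netdef # version-1 model file
find model_store -type f | sort
```

The `find` listing shows the two files the server looks for: the per-model configuration and the versioned model file.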
Kubernetes Generation and Deploy
This section has not yet been converted to kustomize, please refer to kubeflow/website/issues/959.
Next use ksonnet to generate Kubernetes configuration for the NVIDIA TensorRT Inference Server deployment and service. The --image option points to the NVIDIA Inference Server container in the NVIDIA GPU Cloud Registry. For the current implementation you must use the 18.08.1 container. The --modelRepositoryPath option points to the GCS bucket that contains the model repository that you set up earlier.
$ ks init my-inference-server
$ cd my-inference-server
$ ks registry add kubeflow https://github.com/kubeflow/kubeflow/tree/master/kubeflow
$ ks pkg install kubeflow/nvidia-inference-server
$ ks generate nvidia-inference-server iscomp --name=inference-server --image=nvcr.io/nvidia/inferenceserver:18.08.1-py2 --modelRepositoryPath=gs://inference-server-model-store/tf_model_store
Next deploy the service.
$ ks apply default -c iscomp
Using the TensorRT Inference Server
Now that the inference server is running you can send HTTP or GRPC requests to it to perform inferencing. By default the inferencing service is exposed with a LoadBalancer service type. Use the following to find the external IP for the inference service. In this case it is 35.232.176.113.
$ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
inference-se LoadBalancer 10.7.241.36 35.232.176.113 8000:31220/TCP,8001:32107/TCP,8002:31682/TCP 1m
kubernetes ClusterIP 10.7.240.1 <none> 443/TCP 1h
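Rather than reading the EXTERNAL-IP off the table by hand, you can extract it with a small awk program. The snippet below runs against a copy of the sample output shown above; with a live cluster you would pipe `kubectl get services` into the same awk filter.

```shell
# Extract the EXTERNAL-IP of the LoadBalancer service from sample output.
# Against a live cluster: kubectl get services | awk '$2 == "LoadBalancer" {print $4}'
services='NAME           TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)                                        AGE
inference-se   LoadBalancer   10.7.241.36   35.232.176.113   8000:31220/TCP,8001:32107/TCP,8002:31682/TCP   1m
kubernetes     ClusterIP      10.7.240.1    <none>           443/TCP                                        1h'
printf '%s\n' "$services" | awk '$2 == "LoadBalancer" {print $4}'
# 35.232.176.113
```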
The inference server exposes an HTTP endpoint on port 8000, a GRPC endpoint on port 8001, and a Prometheus metrics endpoint on port 8002. You can use curl to get the status of the inference server from the HTTP endpoint.
$ curl 35.232.176.113:8000/api/status
Follow the instructions to build the inference server example image and performance clients. You can then use these examples to send requests to the server. For example, for an image classification model use the image_client example to perform classification of an image.
$ image_client -u 35.232.176.113:8000 -m resnet50_netdef -c3 mug.jpg
Output probabilities:
batch 0: 504 (COFFEE MUG) = 0.777365267277
batch 0: 968 (CUP) = 0.213909029961
batch 0: 967 (ESPRESSO) = 0.00294389552437
Cleanup
When done use ks to remove the deployment.
$ ks delete default -c iscomp
If you created a cluster for this example then make sure to also delete it.
$ gcloud container clusters delete myinferenceserver