Deploy Custom Python Model Server with InferenceService

Deploy Custom Python Model Server with InferenceService

When out of the box model server does not fit your need, you can build your own model server using KFServer API and use the following source to serving workflow to deploy your custom models to KServe.

Setup

Install pack CLI to build your custom model server image.

Create your custom Model Server by extending KFModel

KServe.KFModel base class mainly defines three handlers preprocess, predict and postprocess, these handlers are executed in sequence, the output of the preprocess is passed to predict as the input, the predictor handler should execute the inference for your model, the postprocess handler then turns the raw prediction result into user-friendly inference response. There is an additional load handler which is used for writing custom code to load your model into the memory from local file system or remote model storage, a general good practice is to call the load handler in the model server class __init__ function, so your model is loaded on startup and ready to serve when user is making the prediction calls.

import kserve
from typing import Dict
class AlexNetModel(kserve.KFModel):
    def __init__(self, name: str):
       super().__init__(name)
       self.name = name
       self.load()
    def load(self):
        pass
    def predict(self, request: Dict) -> Dict:
        pass
if __name__ == "__main__":
    model = AlexNetModel("custom-model")
    kserve.KFServer().start([model])

Build the custom image with Buildpacks

Buildpacks allows you to transform your inference code into images that can be deployed on KServe without needing to define the Dockerfile. Buildpacks automatically determines the python application and then install the dependencies from the requirements.txt file, it looks at the Procfile to determine how to start the model server. Here we are showing how to build the serving image manually with pack, you can also choose to use kpack to run the image build on the cloud and continuously build/deploy new versions from your source git repository.

Use pack to build and push the custom model server image

pack build --builder=heroku/buildpacks:20 ${DOCKER_USER}/custom-model:v1
docker push ${DOCKER_USER}/custom-model:v1

Parallel Inference

By default the model is loaded and inference is ran in the same process as tornado http server, if you are hosting multiple models the inference can only be run for one model at a time which limits the concurrency when you share the container for the models. KServe integrates RayServe which provides a programmable API to deploy models as separate python workers so the inference can be ran in parallel.

import kserve
from typing import Dict
from ray import serve
@serve.deployment(name="custom-model", config={"num_replicas": 2})
class AlexNetModel(kserve.KFModel):
    def __init__(self):
       self.name = "custom-model"
       super().__init__(self.name)
       self.load()
    def load(self):
        pass
    def predict(self, request: Dict) -> Dict:
        pass
if __name__ == "__main__":
    kserve.KFServer().start({"custom-model": AlexNetModel})

Modify the Procfile to web: python -m model_remote and then run the above pack command, it builds the serving image which launches each model as separate python worker and tornado webserver routes to the model workers by name.

Deploy Locally and Test

Launch the docker image built from last step with buildpack.

docker run -ePORT=8080 -p8080:8080 ${DOCKER_USER}/custom-model:v1

Send a test inference request locally

curl localhost:8080/v1/models/custom-model:predict -d @./input.json
{"predictions": [[14.861763000488281, 13.94291877746582, 13.924378395080566, 12.182709693908691, 12.00634765625]]}

Deploy the Custom Predictor on KServe

Create the InferenceService

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
spec:
  predictor:
    containers:
      - name: kserve-container
        image: {username}/custom-model:v1

In the custom.yaml file edit the container image and replace {username} with your Docker Hub username.

Apply the yaml to create the InferenceService

!!! “kubectl”

kubectl apply -f custom.yaml

Expected Output

$ inferenceservice.serving.kserve.io/custom-model created

Arguments

You can supply additional command arguments on the container spec to configure the model server.

--workers: fork the specified number of model server workers(multi-processing), the default value is 1. If you start the server after model is loaded you need to make sure model object is fork friendly for multi-processing to work. Alternatively you can decorate your model server class with replicas and in this case each model server is created as a python worker independent of the server.
--http_port: the http port model server is listening on, the default port is 8080
--max_buffer_size: Max socker buffer size for tornado http client, the default limit is 10Mi.
--max_asyncio_workers: Max number of workers to spawn for python async io loop, by default it is min(32,cpu.limit + 4)

Environment Variables

You can supply additional environment variables on the container spec.

STORAGE_URI: load a model from a storage system supported by KServe e.g. pvc:// s3://. This acts the same as storageUri when using a built-in predictor. The data will be available at /mnt/models in the container. For example, the following STORAGE_URI: "pvc://my_model/model.onnx" will be accessible at /mnt/models/model.onnx

Run a prediction

The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT

MODEL_NAME=custom-model
INPUT_PATH=@./input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d $INPUT_PATH

Expected Output

*   Trying 169.47.250.204...
* TCP_NODELAY set
* Connected to 169.47.250.204 (169.47.250.204) port 80 (#0)
> POST /v1/models/custom-model:predict HTTP/1.1
> Host: custom-model.default.example.com
> User-Agent: curl/7.64.1
> Accept: */*
> Content-Length: 105339
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< content-length: 232
< content-type: text/html; charset=UTF-8
< date: Wed, 26 Feb 2020 15:19:15 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 213
<
* Connection #0 to host 169.47.250.204 left intact
{"predictions": [[14.861762046813965, 13.942917823791504, 13.9243803024292, 12.182711601257324, 12.00634765625]]}