Predict Protocol - Version 2

This document proposes a predict/inference API independent of any specific ML/DL framework and model server. The proposed APIs are able to support both easy-to-use and high-performance use cases. By implementing this protocol both inference clients and servers will increase their utility and portability by being able to operate seamlessly on platforms that have standardized around this API. This protocol is endorsed by NVIDIA Triton Inference Server, TensorFlow Serving, and ONNX Runtime Server.

For an inference server to be compliant with this protocol the server must implement all APIs described below, except where an optional feature is explicitly noted. A compliant inference server may choose to implement either or both of the HTTP/REST API and the GRPC API.

The protocol supports an extension mechanism as a required part of the API, but this document does not propose any specific extensions. Any specific extensions will be proposed separately.

HTTP/REST

A compliant server must implement the health, metadata, and inference APIs described in this section.

The HTTP/REST API uses JSON because it is widely supported and language independent. In all JSON schemas shown in this document $number, $string, $boolean, $object and $array refer to the fundamental JSON types. #optional indicates an optional JSON field.

All strings in all contexts are case-sensitive.

For KFServing the server must recognize the following URLs. The versions portion of the URL is shown as optional to allow implementations that don’t support versioning or for cases when the user does not want to specify a specific model version (in which case the server will choose a version based on its own policies).

Health:

GET v2/health/live
GET v2/health/ready
GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready

Server Metadata:

GET v2

Model Metadata:

GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]

Inference:

POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer
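
The URL templates above can be assembled mechanically. The following Python sketch is illustrative only: the helper name model_path and the choice to percent-encode the model name and version are assumptions, not part of the protocol.

```python
from typing import Optional
from urllib.parse import quote


def model_path(model_name: str, model_version: Optional[str] = None,
               suffix: str = "") -> str:
    """Build a v2 model path (hypothetical helper).

    suffix is "" for model metadata, "ready" for model health,
    or "infer" for inference.
    """
    # Percent-encoding is an assumption; the protocol does not mandate it.
    path = "v2/models/" + quote(model_name, safe="")
    if model_version is not None:
        # The versions portion is optional in every model URL.
        path += "/versions/" + quote(model_version, safe="")
    if suffix:
        path += "/" + suffix
    return path
```

For example, model_path("mymodel", suffix="infer") yields the path used in the inference request example later in this document, without the leading slash.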

Health

A health request is made with an HTTP GET to a health endpoint. The HTTP response status code indicates a boolean result for the health request. A 200 status code indicates true and a 4xx status code indicates false. The HTTP response body should be empty. There are three health APIs.

Server Live

The “server live” API indicates if the inference server is able to receive and respond to metadata and inference requests. The “server live” API can be used directly to implement the Kubernetes livenessProbe.

Server Ready

The “server ready” health API indicates if all the models are ready for inferencing. The “server ready” health API can be used directly to implement the Kubernetes readinessProbe.

Model Ready

The “model ready” health API indicates if a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies.
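
A client can turn the status code of any of the three health endpoints into a boolean. This is a minimal Python sketch; the function name health_result is hypothetical, and raising an error for codes outside 200 and 4xx is an assumption, since the protocol only defines those two cases.

```python
def health_result(status_code: int) -> bool:
    """Map an HTTP status code from a health endpoint to a boolean."""
    if status_code == 200:
        return True   # live / ready
    if 400 <= status_code <= 499:
        return False  # not live / not ready
    # Assumption: anything else (for example 5xx) is treated as
    # "unknown" rather than silently mapped to False.
    raise ValueError("unexpected health status code: %d" % status_code)
```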

Server Metadata

The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata endpoint. In the corresponding response the HTTP body contains the Server Metadata Response JSON Object or the Server Metadata Response JSON Error Object.

Server Metadata Response JSON Object

A successful server metadata request is indicated by a 200 HTTP status code. The server metadata response object, identified as $metadata_server_response, is returned in the HTTP body.

    $metadata_server_response =
    {
      "name" : $string,
      "version" : $string,
      "extensions" : [ $string, ... ]
    }
  • “name” : A descriptive name for the server.
  • “version” : The server version.
  • “extensions” : The extensions supported by the server. Currently no standard extensions are defined. Individual inference servers may define and document their own extensions.

Server Metadata Response JSON Error Object

A failed server metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_server_error_response object.

    $metadata_server_error_response =
    {
      "error": $string
    }
  • “error” : The descriptive message for the error.

Model Metadata

The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. In the corresponding response the HTTP body contains the Model Metadata Response JSON Object or the Model Metadata Response JSON Error Object. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.

Model Metadata Response JSON Object

A successful model metadata request is indicated by a 200 HTTP status code. The metadata response object, identified as $metadata_model_response, is returned in the HTTP body for every successful model metadata request.

    $metadata_model_response =
    {
      "name" : $string,
      "versions" : [ $string, ... ] #optional,
      "platform" : $string,
      "inputs" : [ $metadata_tensor, ... ],
      "outputs" : [ $metadata_tensor, ... ]
    }
  • “name” : The name of the model.
  • “versions” : The model versions that may be explicitly requested via the appropriate endpoint. Optional for servers that don’t support versions. Optional for models that don’t allow a version to be explicitly requested.
  • “platform” : The framework/backend for the model. See Platforms.
  • “inputs” : The inputs required by the model.
  • “outputs” : The outputs produced by the model.

The metadata of each model input and output tensor is described with a $metadata_tensor object.

    $metadata_tensor =
    {
      "name" : $string,
      "datatype" : $string,
      "shape" : [ $number, ... ]
    }
  • “name” : The name of the tensor.
  • “datatype” : The data-type of the tensor elements as defined in Tensor Data Types.
  • “shape” : The shape of the tensor. Variable-size dimensions are specified as -1.
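
Because variable-size dimensions are reported as -1, a client that wants to check a concrete tensor shape against the model metadata has to treat -1 as a wildcard. The protocol does not prescribe such a check; the following Python sketch, with the hypothetical name shape_matches, shows one way it could be done.

```python
from typing import Sequence


def shape_matches(metadata_shape: Sequence[int],
                  actual_shape: Sequence[int]) -> bool:
    """Return True if actual_shape is compatible with metadata_shape."""
    # The number of dimensions must agree exactly.
    if len(metadata_shape) != len(actual_shape):
        return False
    # A -1 in the metadata accepts any size in that dimension.
    return all(m == -1 or m == a
               for m, a in zip(metadata_shape, actual_shape))
```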

Model Metadata Response JSON Error Object

A failed model metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_model_error_response object.

    $metadata_model_error_response =
    {
      "error": $string
    }
  • “error” : The descriptive message for the error.

Inference

An inference request is made with an HTTP POST to an inference endpoint. In the request the HTTP body contains the Inference Request JSON Object. In the corresponding response the HTTP body contains the Inference Response JSON Object or Inference Response JSON Error Object. See Inference Request Examples for some example HTTP/REST requests and responses.

Inference Request JSON Object

The inference request object, identified as $inference_request, is required in the HTTP body of the POST request. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.

    $inference_request =
    {
      "id" : $string #optional,
      "parameters" : $parameters #optional,
      "inputs" : [ $request_input, ... ],
      "outputs" : [ $request_output, ... ] #optional
    }
  • “id” : An identifier for this request. Optional, but if specified this identifier must be returned in the response.
  • “parameters” : An object containing zero or more parameters for this inference request expressed as key/value pairs. See Parameters for more information.
  • “inputs” : The input tensors. Each input is described using the $request_input schema defined in Request Input.
  • “outputs” : The output tensors requested for this inference. Each requested output is described using the $request_output schema defined in Request Output. Optional; if not specified, all outputs produced by the model will be returned using default $request_output settings.

Request Input

The $request_input JSON describes an input to the model. If the input is batched, the shape and data must represent the full shape and contents of the entire batch.

    $request_input =
    {
      "name" : $string,
      "shape" : [ $number, ... ],
      "datatype" : $string,
      "parameters" : $parameters #optional,
      "data" : $tensor_data
    }
  • “name” : The name of the input tensor.
  • “shape” : The shape of the input tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value.
  • “datatype” : The data-type of the input tensor elements as defined in Tensor Data Types.
  • “parameters” : An object containing zero or more parameters for this input expressed as key/value pairs. See Parameters for more information.
  • “data” : The contents of the tensor. See Tensor Data for more information.

Request Output

The $request_output JSON is used to request which output tensors should be returned from the model.

    $request_output =
    {
      "name" : $string,
      "parameters" : $parameters #optional
    }
  • “name” : The name of the output tensor.
  • “parameters” : An object containing zero or more parameters for this output expressed as key/value pairs. See Parameters for more information.
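
The request schemas above map directly onto plain dictionaries. The Python sketch below builds a $request_input and an $inference_request; the helper names make_input and make_request are hypothetical, and the element-count check assumes the data is supplied in flattened form.

```python
import json
from math import prod


def make_input(name, shape, datatype, data):
    """Build a $request_input from flattened data (hypothetical helper)."""
    # For a batched input, shape and data cover the entire batch, so the
    # element count must equal the product of the dimensions.
    if len(data) != prod(shape):
        raise ValueError("%s: shape %s needs %d elements, got %d"
                         % (name, list(shape), prod(shape), len(data)))
    return {"name": name, "shape": list(shape),
            "datatype": datatype, "data": list(data)}


def make_request(inputs, output_names=None, request_id=None):
    """Build an $inference_request; optional fields are simply omitted."""
    request = {"inputs": list(inputs)}
    if request_id is not None:
        request["id"] = request_id
    if output_names:
        request["outputs"] = [{"name": n} for n in output_names]
    return request


body = json.dumps(make_request(
    [make_input("input0", [2, 2], "UINT32", [1, 2, 3, 4])],
    output_names=["output0"], request_id="42"))
```

Here body is the JSON text that would be sent as the HTTP body of the POST request.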

Inference Response JSON Object

A successful inference request is indicated by a 200 HTTP status code. The inference response object, identified as $inference_response, is returned in the HTTP body.

    $inference_response =
    {
      "model_name" : $string,
      "model_version" : $string #optional,
      "id" : $string,
      "parameters" : $parameters #optional,
      "outputs" : [ $response_output, ... ]
    }
  • “model_name” : The name of the model used for inference.
  • “model_version” : The specific model version used for inference. Inference servers that do not implement versioning should not provide this field in the response.
  • “id” : The “id” identifier given in the request, if any.
  • “parameters” : An object containing zero or more parameters for this response expressed as key/value pairs. See Parameters for more information.
  • “outputs” : The output tensors. Each output is described using the $response_output schema defined in Response Output.

Response Output

The $response_output JSON describes an output from the model. If the output is batched, the shape and data represent the full shape and contents of the entire batch.

    $response_output =
    {
      "name" : $string,
      "shape" : [ $number, ... ],
      "datatype" : $string,
      "parameters" : $parameters #optional,
      "data" : $tensor_data
    }
  • “name” : The name of the output tensor.
  • “shape” : The shape of the output tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value.
  • “datatype” : The data-type of the output tensor elements as defined in Tensor Data Types.
  • “parameters” : An object containing zero or more parameters for this output expressed as key/value pairs. See Parameters for more information.
  • “data”: The contents of the tensor. See Tensor Data for more information.

Inference Response JSON Error Object

A failed inference request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $inference_error_response object.

    $inference_error_response =
    {
      "error": $string
    }
  • “error” : The descriptive message for the error.
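
On the client side, the success and error objects can be told apart by the HTTP status code. The following Python sketch assumes the status code and body text have already been read from the HTTP response; InferenceError and parse_inference_response are hypothetical names.

```python
import json


class InferenceError(Exception):
    """Raised when the server returns an $inference_error_response."""


def parse_inference_response(status_code: int, body: str) -> dict:
    """Return the $inference_response, or raise with the server's message."""
    payload = json.loads(body)
    if status_code == 200:
        return payload
    # Any other status is treated as a failure; the body is expected to
    # carry the "error" string defined above.
    raise InferenceError(payload.get("error", "unknown error"))
```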

Inference Request Examples

The following example shows an inference request to a model with two inputs and one output. The HTTP Content-Length header gives the size of the JSON object.

    POST /v2/models/mymodel/infer HTTP/1.1
    Host: localhost:8000
    Content-Type: application/json
    Content-Length: <xx>

    {
      "id" : "42",
      "inputs" : [
        {
          "name" : "input0",
          "shape" : [ 2, 2 ],
          "datatype" : "UINT32",
          "data" : [ 1, 2, 3, 4 ]
        },
        {
          "name" : "input1",
          "shape" : [ 1 ],
          "datatype" : "BOOL",
          "data" : [ true ]
        }
      ],
      "outputs" : [
        {
          "name" : "output0"
        }
      ]
    }

For the above request the inference server must return the “output0” output tensor. Assuming the model returns a [ 3, 2 ] tensor of data type FP32 the following response would be returned.

    HTTP/1.1 200 OK
    Content-Type: application/json
    Content-Length: <yy>

    {
      "id" : "42",
      "outputs" : [
        {
          "name" : "output0",
          "shape" : [ 3, 2 ],
          "datatype" : "FP32",
          "data" : [ 1.0, 1.1, 2.0, 2.1, 3.0, 3.1 ]
        }
      ]
    }

Parameters

The $parameters JSON describes zero or more “name”/“value” pairs, where the “name” is the name of the parameter and the “value” is a $string, $number, or $boolean.

    $parameters =
    {
      $parameter, ...
    }
    $parameter = $string : $string | $number | $boolean

Currently no parameters are defined. As required, a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities.
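
Since parameter values are restricted to $string, $number, and $boolean, a client or server may want to reject nested objects, arrays, and null before using a $parameters object. This Python sketch shows one possible check; validate_parameters is a hypothetical name, and the protocol itself does not require this validation.

```python
def validate_parameters(parameters: dict) -> None:
    """Raise TypeError if a parameter value is not a JSON scalar."""
    for name, value in parameters.items():
        if not isinstance(name, str):
            raise TypeError("parameter name must be a string: %r" % (name,))
        # bool, int, and float cover $boolean and $number; str covers
        # $string. None, list, and dict are rejected.
        if not isinstance(value, (str, int, float)):
            raise TypeError("parameter %r has unsupported type %s"
                            % (name, type(value).__name__))
```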

Tensor Data

Tensor data must be presented in row-major order of the tensor elements. Element values must be given in “linear” order without any stride or padding between elements. Tensor elements may be presented in their natural multi-dimensional representation, or as a flattened one-dimensional representation.

Tensor data given explicitly is provided in a JSON array. Each element of the array may be an integer, floating-point number, string or boolean value. The server can decide to coerce each element to the required type or return an error if an unexpected value is received. Note that FP16 is problematic to communicate explicitly since there is no standard FP16 representation across backends, nor is there typically programmatic support for creating an FP16 representation of a JSON number.

For example, the 2-dimensional matrix:

    [ 1 2
      4 5 ]

Can be represented in its natural format as:

    "data" : [ [ 1, 2 ], [ 4, 5 ] ]

Or in a flattened one-dimensional representation:

    "data" : [ 1, 2, 4, 5 ]
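
The two representations are interchangeable given the tensor shape. The Python sketch below converts between them in row-major order; flatten and nest are hypothetical helper names, and nest assumes the flat list has exactly as many elements as the shape requires.

```python
from math import prod


def flatten(data):
    """Flatten nested lists into a one-dimensional row-major list."""
    if not isinstance(data, list):
        return [data]  # a scalar element (number, string, or boolean)
    flat = []
    for item in data:
        flat.extend(flatten(item))
    return flat


def nest(flat, shape):
    """Rebuild the multi-dimensional form from row-major flat data."""
    if len(shape) <= 1:
        return list(flat)
    step = prod(shape[1:])  # elements per slice of the first dimension
    return [nest(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]
```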

GRPC

The GRPC API closely follows the concepts defined in the HTTP/REST API. A compliant server must implement the health, metadata, and inference APIs described in this section.

All strings in all contexts are case-sensitive.

The GRPC definition of the service is:

    //
    // Inference Server GRPC endpoints.
    //
    service GRPCInferenceService
    {
      // Check liveness of the inference server.
      rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse) {}
      // Check readiness of the inference server.
      rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse) {}
      // Check readiness of a model in the inference server.
      rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse) {}
      // Get server metadata.
      rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse) {}
      // Get model metadata.
      rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse) {}
      // Perform inference using a specific model.
      rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse) {}
    }

Health

A health request is made using the ServerLive, ServerReady, or ModelReady endpoint. For each of these endpoints errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure.

Server Live

The ServerLive API indicates if the inference server is able to receive and respond to metadata and inference requests. The request and response messages for ServerLive are:

    message ServerLiveRequest {}

    message ServerLiveResponse
    {
      // True if the inference server is live, false if not live.
      bool live = 1;
    }

Server Ready

The ServerReady API indicates if the server is ready for inferencing. The request and response messages for ServerReady are:

    message ServerReadyRequest {}

    message ServerReadyResponse
    {
      // True if the inference server is ready, false if not ready.
      bool ready = 1;
    }

Model Ready

The ModelReady API indicates if a specific model is ready for inferencing. The request and response messages for ModelReady are:

    message ModelReadyRequest
    {
      // The name of the model to check for readiness.
      string name = 1;
      // The version of the model to check for readiness. If not given the
      // server will choose a version based on the model and internal policy.
      string version = 2;
    }

    message ModelReadyResponse
    {
      // True if the model is ready, false if not ready.
      bool ready = 1;
    }

Server Metadata

The ServerMetadata API provides information about the server. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ServerMetadata are:

    message ServerMetadataRequest {}

    message ServerMetadataResponse
    {
      // The server name.
      string name = 1;
      // The server version.
      string version = 2;
      // The extensions supported by the server.
      repeated string extensions = 3;
    }

Model Metadata

The per-model metadata API provides information about a model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelMetadata are:

    message ModelMetadataRequest
    {
      // The name of the model.
      string name = 1;
      // The version of the model. If not given the server will choose
      // a version based on the model and internal policy.
      string version = 2;
    }

    message ModelMetadataResponse
    {
      // Metadata for a tensor.
      message TensorMetadata
      {
        // The tensor name.
        string name = 1;
        // The tensor data type.
        string datatype = 2;
        // The tensor shape. A variable-size dimension is represented
        // by a -1 value.
        repeated int64 shape = 3;
      }
      // The model name.
      string name = 1;
      // The versions of the model available on the server.
      repeated string versions = 2;
      // The model's platform. See Platforms.
      string platform = 3;
      // The model's inputs.
      repeated TensorMetadata inputs = 4;
      // The model's outputs.
      repeated TensorMetadata outputs = 5;
    }

Inference

The ModelInfer API performs inference using the specified model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelInfer are:

    message ModelInferRequest
    {
      // An input tensor for an inference request.
      message InferInputTensor
      {
        // The tensor name.
        string name = 1;
        // The tensor data type.
        string datatype = 2;
        // The tensor shape.
        repeated int64 shape = 3;
        // Optional inference input tensor parameters.
        map<string, InferParameter> parameters = 4;
        // The tensor contents using a data-type format. This field must
        // not be specified if "raw" tensor contents are being used for
        // the inference request.
        InferTensorContents contents = 5;
      }
      // An output tensor requested for an inference request.
      message InferRequestedOutputTensor
      {
        // The tensor name.
        string name = 1;
        // Optional requested output tensor parameters.
        map<string, InferParameter> parameters = 2;
      }
      // The name of the model to use for inferencing.
      string model_name = 1;
      // The version of the model to use for inference. If not given the
      // server will choose a version based on the model and internal policy.
      string model_version = 2;
      // Optional identifier for the request. If specified will be
      // returned in the response.
      string id = 3;
      // Optional inference parameters.
      map<string, InferParameter> parameters = 4;
      // The input tensors for the inference.
      repeated InferInputTensor inputs = 5;
      // The requested output tensors for the inference. Optional, if not
      // specified all outputs produced by the model will be returned.
      repeated InferRequestedOutputTensor outputs = 6;
      // The data contained in an input tensor can be represented in "raw"
      // bytes form or in the repeated type that matches the tensor's data
      // type. To use the raw representation 'raw_input_contents' must be
      // initialized with data for each tensor in the same order as
      // 'inputs'. For each tensor, the size of this content must match
      // what is expected by the tensor's shape and data type. The raw
      // data must be the flattened, one-dimensional, row-major order of
      // the tensor elements without any stride or padding between the
      // elements. Note that the FP16 data type must be represented as raw
      // content as there is no specific data type for a 16-bit float
      // type.
      //
      // If this field is specified then InferInputTensor::contents must
      // not be specified for any input tensor.
      repeated bytes raw_input_contents = 7;
    }

    message ModelInferResponse
    {
      // An output tensor returned for an inference request.
      message InferOutputTensor
      {
        // The tensor name.
        string name = 1;
        // The tensor data type.
        string datatype = 2;
        // The tensor shape.
        repeated int64 shape = 3;
        // Optional output tensor parameters.
        map<string, InferParameter> parameters = 4;
        // The tensor contents using a data-type format. This field must
        // not be specified if "raw" tensor contents are being used for
        // the inference response.
        InferTensorContents contents = 5;
      }
      // The name of the model used for inference.
      string model_name = 1;
      // The version of the model used for inference.
      string model_version = 2;
      // The id of the inference request if one was specified.
      string id = 3;
      // Optional inference response parameters.
      map<string, InferParameter> parameters = 4;
      // The output tensors holding inference results.
      repeated InferOutputTensor outputs = 5;
      // The data contained in an output tensor can be represented in
      // "raw" bytes form or in the repeated type that matches the
      // tensor's data type. To use the raw representation 'raw_output_contents'
      // must be initialized with data for each tensor in the same order as
      // 'outputs'. For each tensor, the size of this content must match
      // what is expected by the tensor's shape and data type. The raw
      // data must be the flattened, one-dimensional, row-major order of
      // the tensor elements without any stride or padding between the
      // elements. Note that the FP16 data type must be represented as raw
      // content as there is no specific data type for a 16-bit float
      // type.
      //
      // If this field is specified then InferOutputTensor::contents must
      // not be specified for any output tensor.
      repeated bytes raw_output_contents = 6;
    }

Parameters

Parameters are expressed as a map of “name”/“value” pairs, where the “name” is the map key naming the parameter and the “value” is an InferParameter message holding a boolean, integer, or string.

Currently no parameters are defined. As required, a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities.

    //
    // An inference parameter value.
    //
    message InferParameter
    {
      // The parameter value can be a string, an int64, a boolean
      // or a message specific to a predefined parameter.
      oneof parameter_choice
      {
        // A boolean parameter value.
        bool bool_param = 1;
        // An int64 parameter value.
        int64 int64_param = 2;
        // A string parameter value.
        string string_param = 3;
      }
    }
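
When filling an InferParameter from a native value, a client has to pick exactly one member of the oneof. The Python sketch below chooses the field name from the value's type; parameter_field is a hypothetical helper, and it deliberately avoids depending on generated protobuf classes so that it stays self-contained.

```python
def parameter_field(value) -> str:
    """Return the InferParameter oneof field name for a Python value."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "bool_param"
    if isinstance(value, int):
        if not -2**63 <= value < 2**63:
            raise OverflowError("value does not fit in int64: %d" % value)
        return "int64_param"
    if isinstance(value, str):
        return "string_param"
    raise TypeError("unsupported parameter type: %s" % type(value).__name__)
```

With generated protobuf code, the returned name could then be used as the keyword when constructing the message; that step is omitted here.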

Tensor Data

In all representations tensor data must be flattened to a one-dimensional, row-major order of the tensor elements. Element values must be given in “linear” order without any stride or padding between elements.

Using a “raw” representation of tensors with ModelInferRequest::raw_input_contents and ModelInferResponse::raw_output_contents will typically allow higher performance due to the way protobuf allocation and reuse interacts with GRPC. For example, see https://github.com/grpc/grpc/issues/23231.
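
As an illustration of the “raw” form, the Python sketch below packs flattened tensor elements into a byte string with the standard struct module. The little-endian byte order is an assumption, since this document does not state one; pack_raw is a hypothetical helper, and BYTES tensors are left out because their elements are variable-length.

```python
import struct

# struct format characters for the fixed-size tensor data types.
_FORMATS = {
    "BOOL": "?", "UINT8": "B", "UINT16": "H", "UINT32": "I", "UINT64": "Q",
    "INT8": "b", "INT16": "h", "INT32": "i", "INT64": "q",
    "FP16": "e", "FP32": "f", "FP64": "d",
}


def pack_raw(datatype: str, flat_values) -> bytes:
    """Pack row-major flattened elements with no stride or padding."""
    code = _FORMATS[datatype]  # KeyError for BYTES or unknown types
    # "<" selects little-endian with standard sizes (an assumption here).
    return struct.pack("<%d%s" % (len(flat_values), code), *flat_values)
```

Each packed tensor would become one entry of raw_input_contents, in the same order as the corresponding entries of inputs.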

An alternative to the “raw” representation is to use InferTensorContents to represent the tensor data in a format that matches the tensor’s data type.

    //
    // The data contained in a tensor represented by the repeated type
    // that matches the tensor's data type. Protobuf oneof is not used
    // because oneofs cannot contain repeated fields.
    //
    message InferTensorContents
    {
      // Representation for BOOL data type. The size must match what is
      // expected by the tensor's shape. The contents must be the flattened,
      // one-dimensional, row-major order of the tensor elements.
      repeated bool bool_contents = 1;
      // Representation for INT8, INT16, and INT32 data types. The size
      // must match what is expected by the tensor's shape. The contents
      // must be the flattened, one-dimensional, row-major order of the
      // tensor elements.
      repeated int32 int_contents = 2;
      // Representation for INT64 data types. The size must match what
      // is expected by the tensor's shape. The contents must be the
      // flattened, one-dimensional, row-major order of the tensor elements.
      repeated int64 int64_contents = 3;
      // Representation for UINT8, UINT16, and UINT32 data types. The size
      // must match what is expected by the tensor's shape. The contents
      // must be the flattened, one-dimensional, row-major order of the
      // tensor elements.
      repeated uint32 uint_contents = 4;
      // Representation for UINT64 data types. The size must match what
      // is expected by the tensor's shape. The contents must be the
      // flattened, one-dimensional, row-major order of the tensor elements.
      repeated uint64 uint64_contents = 5;
      // Representation for FP32 data type. The size must match what is
      // expected by the tensor's shape. The contents must be the flattened,
      // one-dimensional, row-major order of the tensor elements.
      repeated float fp32_contents = 6;
      // Representation for FP64 data type. The size must match what is
      // expected by the tensor's shape. The contents must be the flattened,
      // one-dimensional, row-major order of the tensor elements.
      repeated double fp64_contents = 7;
      // Representation for BYTES data type. The size must match what is
      // expected by the tensor's shape. The contents must be the flattened,
      // one-dimensional, row-major order of the tensor elements.
      repeated bytes bytes_contents = 8;
    }
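
The comments in InferTensorContents imply a fixed mapping from tensor data type to the repeated field that carries it. The Python sketch below records that mapping as a lookup table; the names CONTENTS_FIELD and contents_field are hypothetical.

```python
# Tensor data type -> InferTensorContents field, per the comments above.
CONTENTS_FIELD = {
    "BOOL": "bool_contents",
    "INT8": "int_contents", "INT16": "int_contents", "INT32": "int_contents",
    "INT64": "int64_contents",
    "UINT8": "uint_contents", "UINT16": "uint_contents",
    "UINT32": "uint_contents",
    "UINT64": "uint64_contents",
    "FP32": "fp32_contents",
    "FP64": "fp64_contents",
    "BYTES": "bytes_contents",
}


def contents_field(datatype: str) -> str:
    """Return the typed contents field for a data type."""
    if datatype == "FP16":
        # FP16 has no typed field and must use the raw representation.
        raise ValueError("FP16 must be sent as raw contents")
    return CONTENTS_FIELD[datatype]
```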

Platforms

A platform is a string indicating a DL/ML framework or backend. Platform is returned as part of the response to a Model Metadata request but is informational only. The proposed inference APIs are generic relative to the DL/ML framework used by a model, so a client does not need to know the platform of a given model to use the API. Platform names use the format “<project>_<format>”. The following platform names are allowed:

  • tensorrt_plan : A TensorRT model encoded as a serialized engine or “plan”.
  • tensorflow_graphdef : A TensorFlow model encoded as a GraphDef.
  • tensorflow_savedmodel : A TensorFlow model encoded as a SavedModel.
  • onnx_onnxv1 : An ONNX model encoded for ONNX Runtime.
  • pytorch_torchscript : A PyTorch model encoded as TorchScript.
  • mxnet_mxnet : An MXNet model.
  • caffe2_netdef : A Caffe2 model encoded as a NetDef.

Tensor Data Types

Tensor data types are shown in the following table along with the size of each type, in bytes.

Data Type    Size (bytes)
BOOL         1
UINT8        1
UINT16       2
UINT32       4
UINT64       8
INT8         1
INT16        2
INT32        4
INT64        8
FP16         2
FP32         4
FP64         8
BYTES        Variable (max 2^32)
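
For the fixed-size types, the table gives the size of a tensor's raw contents directly: the element size multiplied by the number of elements. The Python sketch below applies that arithmetic; tensor_byte_size is a hypothetical helper, and BYTES is excluded because its element size is variable.

```python
from math import prod

# Element sizes in bytes, taken from the table above.
ELEMENT_SIZE = {
    "BOOL": 1, "UINT8": 1, "UINT16": 2, "UINT32": 4, "UINT64": 8,
    "INT8": 1, "INT16": 2, "INT32": 4, "INT64": 8,
    "FP16": 2, "FP32": 4, "FP64": 8,
}


def tensor_byte_size(datatype: str, shape) -> int:
    """Return the raw size in bytes of a fixed-size-type tensor."""
    if datatype == "BYTES":
        raise ValueError("BYTES elements are variable-size")
    return ELEMENT_SIZE[datatype] * prod(shape)
```

For instance, the [ 3, 2 ] FP32 output in the inference example occupies 4 * 6 = 24 bytes in raw form.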