diff --git a/master/404.html b/master/404.html index c9078ccf1..b8b999987 100644 --- a/master/404.html +++ b/master/404.html @@ -824,6 +824,53 @@

+ + + + + + +
  • + + + + + + + + + + +
  • + + + + diff --git a/master/admin/kubernetes_deployment/index.html b/master/admin/kubernetes_deployment/index.html index 7672aeb31..7ab7296d4 100644 --- a/master/admin/kubernetes_deployment/index.html +++ b/master/admin/kubernetes_deployment/index.html @@ -391,6 +391,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/admin/migration/index.html b/master/admin/migration/index.html index 7bb2a3c5b..53e5990b4 100644 --- a/master/admin/migration/index.html +++ b/master/admin/migration/index.html @@ -381,6 +381,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/admin/modelmesh/index.html b/master/admin/modelmesh/index.html index dad699bef..bb74bbfb3 100644 --- a/master/admin/modelmesh/index.html +++ b/master/admin/modelmesh/index.html @@ -381,6 +381,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/admin/serverless/kourier_networking/index.html b/master/admin/serverless/kourier_networking/index.html index 166c07f4e..adc2fddbe 100644 --- a/master/admin/serverless/kourier_networking/index.html +++ b/master/admin/serverless/kourier_networking/index.html @@ -395,6 +395,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/admin/serverless/serverless/index.html b/master/admin/serverless/serverless/index.html index 4327730e3..80ea8e812 100644 --- a/master/admin/serverless/serverless/index.html +++ b/master/admin/serverless/serverless/index.html @@ -401,6 +401,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/admin/serverless/servicemesh/index.html b/master/admin/serverless/servicemesh/index.html index 364c81180..07b0071da 100644 --- a/master/admin/serverless/servicemesh/index.html +++ b/master/admin/serverless/servicemesh/index.html @@ -359,6 +359,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/api/api/index.html b/master/api/api/index.html index caa7ba99c..854fc22c2 100644 --- a/master/api/api/index.html +++ b/master/api/api/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/blog/_index/index.html b/master/blog/_index/index.html index 27747562a..58d9f615b 100644 --- a/master/blog/_index/index.html +++ b/master/blog/_index/index.html @@ -355,6 +355,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/blog/articles/2021-09-27-kfserving-transition/index.html b/master/blog/articles/2021-09-27-kfserving-transition/index.html index f012bd590..ec632c02b 100644 --- a/master/blog/articles/2021-09-27-kfserving-transition/index.html +++ b/master/blog/articles/2021-09-27-kfserving-transition/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/blog/articles/2021-10-11-KServe-0.7-release/index.html b/master/blog/articles/2021-10-11-KServe-0.7-release/index.html index 44bc77bd4..331d9c453 100644 --- a/master/blog/articles/2021-10-11-KServe-0.7-release/index.html +++ b/master/blog/articles/2021-10-11-KServe-0.7-release/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/blog/articles/2022-02-18-KServe-0.8-release/index.html b/master/blog/articles/2022-02-18-KServe-0.8-release/index.html index a948cdde4..83b566553 100644 --- a/master/blog/articles/2022-02-18-KServe-0.8-release/index.html +++ b/master/blog/articles/2022-02-18-KServe-0.8-release/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/blog/articles/2022-07-21-KServe-0.9-release/index.html b/master/blog/articles/2022-07-21-KServe-0.9-release/index.html index 7cb39a78b..2459252d4 100644 --- a/master/blog/articles/2022-07-21-KServe-0.9-release/index.html +++ b/master/blog/articles/2022-07-21-KServe-0.9-release/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/blog/articles/2023-02-05-KServe-0.10-release/index.html b/master/blog/articles/2023-02-05-KServe-0.10-release/index.html index bf31ac49b..a104d9d66 100644 --- a/master/blog/articles/2023-02-05-KServe-0.10-release/index.html +++ b/master/blog/articles/2023-02-05-KServe-0.10-release/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/blog/articles/2023-10-08-KServe-0.11-release/index.html b/master/blog/articles/2023-10-08-KServe-0.11-release/index.html index d6e40e47e..4a47b84c4 100644 --- a/master/blog/articles/2023-10-08-KServe-0.11-release/index.html +++ b/master/blog/articles/2023-10-08-KServe-0.11-release/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/blog/articles/2024-05-15-KServe-0.13-release/index.html b/master/blog/articles/2024-05-15-KServe-0.13-release/index.html index 42023c10a..650d682da 100644 --- a/master/blog/articles/2024-05-15-KServe-0.13-release/index.html +++ b/master/blog/articles/2024-05-15-KServe-0.13-release/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/blog/articles/_index/index.html b/master/blog/articles/_index/index.html index 5cc0c42d4..7005790fc 100644 --- a/master/blog/articles/_index/index.html +++ b/master/blog/articles/_index/index.html @@ -355,6 +355,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/community/adopters/index.html b/master/community/adopters/index.html index 2e5008ba4..448186e6a 100644 --- a/master/community/adopters/index.html +++ b/master/community/adopters/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/community/get_involved/index.html b/master/community/get_involved/index.html index 7b7972e1f..82d4e2ea6 100644 --- a/master/community/get_involved/index.html +++ b/master/community/get_involved/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/community/presentations/index.html b/master/community/presentations/index.html index 71c39ce7a..6987cf101 100644 --- a/master/community/presentations/index.html +++ b/master/community/presentations/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/developer/debug/index.html b/master/developer/debug/index.html index 254c2195b..0bf218c45 100644 --- a/master/developer/debug/index.html +++ b/master/developer/debug/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/developer/developer/index.html b/master/developer/developer/index.html index e35dec4bb..ded3c14c9 100644 --- a/master/developer/developer/index.html +++ b/master/developer/developer/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/get_started/first_isvc/index.html b/master/get_started/first_isvc/index.html index f3ef41e83..bb4e6dae7 100644 --- a/master/get_started/first_isvc/index.html +++ b/master/get_started/first_isvc/index.html @@ -410,6 +410,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/get_started/index.html b/master/get_started/index.html index ee09f99c4..44e2d605b 100644 --- a/master/get_started/index.html +++ b/master/get_started/index.html @@ -395,6 +395,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/get_started/swagger_ui/index.html b/master/get_started/swagger_ui/index.html index a75d6deb0..6f4538951 100644 --- a/master/get_started/swagger_ui/index.html +++ b/master/get_started/swagger_ui/index.html @@ -386,6 +386,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/help/contributor/github/index.html b/master/help/contributor/github/index.html index 3273f880b..16a589540 100644 --- a/master/help/contributor/github/index.html +++ b/master/help/contributor/github/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/help/contributor/mkdocs-contributor-guide/index.html b/master/help/contributor/mkdocs-contributor-guide/index.html index c504bdabd..25ee0d7c4 100644 --- a/master/help/contributor/mkdocs-contributor-guide/index.html +++ b/master/help/contributor/mkdocs-contributor-guide/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/help/contributor/templates/template-blog/index.html b/master/help/contributor/templates/template-blog/index.html index dcefa9c0c..01ba89449 100644 --- a/master/help/contributor/templates/template-blog/index.html +++ b/master/help/contributor/templates/template-blog/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/help/contributor/templates/template-concept/index.html b/master/help/contributor/templates/template-concept/index.html index 1335a7b15..ddac594e3 100644 --- a/master/help/contributor/templates/template-concept/index.html +++ b/master/help/contributor/templates/template-concept/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/help/contributor/templates/template-procedure/index.html b/master/help/contributor/templates/template-procedure/index.html index f771035d3..6864b34ab 100644 --- a/master/help/contributor/templates/template-procedure/index.html +++ b/master/help/contributor/templates/template-procedure/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/help/contributor/templates/template-troubleshooting/index.html b/master/help/contributor/templates/template-troubleshooting/index.html index 68bbb7558..2b86ea65d 100644 --- a/master/help/contributor/templates/template-troubleshooting/index.html +++ b/master/help/contributor/templates/template-troubleshooting/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/help/style-guide/documenting-code/index.html b/master/help/style-guide/documenting-code/index.html index 2aa160850..d1e95ef08 100644 --- a/master/help/style-guide/documenting-code/index.html +++ b/master/help/style-guide/documenting-code/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/help/style-guide/style-and-formatting/index.html b/master/help/style-guide/style-and-formatting/index.html index 1669dc945..71bf4fa3e 100644 --- a/master/help/style-guide/style-and-formatting/index.html +++ b/master/help/style-guide/style-and-formatting/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/help/style-guide/voice-and-language/index.html b/master/help/style-guide/voice-and-language/index.html index 7bfa6c78e..09fe55cd9 100644 --- a/master/help/style-guide/voice-and-language/index.html +++ b/master/help/style-guide/voice-and-language/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/index.html b/master/index.html index f9ce17f47..2d54cf32b 100644 --- a/master/index.html +++ b/master/index.html @@ -356,6 +356,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/autoscaling/autoscaling/index.html b/master/modelserving/autoscaling/autoscaling/index.html index 674717e0e..446dbe8c9 100644 --- a/master/modelserving/autoscaling/autoscaling/index.html +++ b/master/modelserving/autoscaling/autoscaling/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/batcher/batcher/index.html b/master/modelserving/batcher/batcher/index.html index fb7c0bf70..f0f4ef264 100644 --- a/master/modelserving/batcher/batcher/index.html +++ b/master/modelserving/batcher/batcher/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/certificate/kserve/index.html b/master/modelserving/certificate/kserve/index.html index 35e87f75b..963bb610d 100644 --- a/master/modelserving/certificate/kserve/index.html +++ b/master/modelserving/certificate/kserve/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/control_plane/index.html b/master/modelserving/control_plane/index.html index 0a249b799..21009ba81 100644 --- a/master/modelserving/control_plane/index.html +++ b/master/modelserving/control_plane/index.html @@ -376,6 +376,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/data_plane/binary_tensor_data_extension/index.html b/master/modelserving/data_plane/binary_tensor_data_extension/index.html new file mode 100644 index 000000000..3d22614e5 --- /dev/null +++ b/master/modelserving/data_plane/binary_tensor_data_extension/index.html @@ -0,0 +1,1643 @@ + + + + + + + + + + + +Binary Tensor Data Extension - KServe Documentation Website + + + + + + + + + + + + + +

    Binary Tensor Data Extension¶

    +

    The Binary Tensor Data Extension allows clients to send and receive tensor data in a binary format in the body of an HTTP/REST request. This extension is particularly useful for sending and receiving FP16 data, because there is no specific data type for a 16-bit float type in the Open Inference Protocol, and for large tensors in high-throughput scenarios.

    +

    Overview¶

    +

    Tensor data represented as binary data is organized in little-endian byte order, row major, without stride or padding between elements. All tensor data types are representable as binary data in the native size of the data type. For the BOOL type, a true element is a single byte with value 1 and a false element is a single byte with value 0. For the BYTES type, each element is represented by a 4-byte unsigned integer giving the length, followed by the actual bytes. The binary data for a tensor is delivered in the HTTP body after the JSON object (see Examples).
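The layout rules above can be sketched with Python's standard `struct` module. This is a minimal illustration, not the KServe implementation: the `STRUCT_FORMATS` mapping and both helper functions are hypothetical names, the mapping covers only a subset of datatypes, and the sketch assumes the 4-byte BYTES length prefix is little-endian like the rest of the tensor data.

```python
import struct

# struct format characters for a few Open Inference Protocol datatypes.
# Illustrative subset only; the mapping name is hypothetical.
STRUCT_FORMATS = {
    "BOOL": "?", "INT32": "i", "UINT32": "I", "INT64": "q",
    "FP16": "e", "FP32": "f", "FP64": "d",
}


def encode_tensor(datatype: str, flat_values: list) -> bytes:
    """Pack row-major (already flattened) values as little-endian bytes."""
    # "<" selects little-endian byte order with standard sizes and no padding.
    fmt = f"<{len(flat_values)}{STRUCT_FORMATS[datatype]}"
    return struct.pack(fmt, *flat_values)


def encode_bytes_tensor(elements: list) -> bytes:
    """Pack a BYTES tensor: a 4-byte unsigned length, then the raw bytes."""
    # Assumption: the length prefix is little-endian.
    return b"".join(struct.pack("<I", len(e)) + e for e in elements)


fp16_payload = encode_tensor("FP16", [1.0, 2.0, 3.0, 4.0])  # shape [2, 2], row major
bool_payload = encode_tensor("BOOL", [True, False, True])   # shape [3]
bytes_payload = encode_bytes_tensor([b"ab", b""])
```

Each element occupies its native size, so a four-element FP16 tensor packs into 8 bytes and a three-element BOOL tensor into 3 bytes.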

    +

    The binary tensor data extension uses parameters to indicate that an input or output tensor is communicated as binary data.

    +

    The binary_data_size parameter is used in $request_input and $response_output to indicate that the input or output tensor is communicated as binary data:

    +
      +
    • "binary_data_size" : int64 parameter indicating the size of the tensor binary data, in bytes.
    • +
    +

    The binary_data parameter is used in $request_output to indicate that the output should be returned from KServe runtime +as binary data.

    +
      +
    • "binary_data" : bool parameter that is true if the output should be returned as binary data and false (or not given) if the + tensor should be returned as JSON.
    • +
    +

    The binary_data_output parameter is used in $inference_request to indicate that all outputs should be returned from KServe runtime as binary data, unless overridden by "binary_data" on a specific output.

    +
      +
    • "binary_data_output" : bool parameter that is true if all outputs should be returned as binary data and false + (or not given) if the outputs should be returned as JSON. If "binary_data" is specified on an output it overrides this setting.
    • +
    +
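The precedence between the request-level `binary_data_output` parameter and the per-output `binary_data` parameter can be sketched as follows. The function name `wants_binary` is hypothetical, and the sketch assumes both parameter sets are available as plain dictionaries.

```python
def wants_binary(request_parameters: dict, output_parameters: dict) -> bool:
    """Return True if an output should be returned as binary data."""
    if "binary_data" in output_parameters:
        # A per-output "binary_data" setting overrides the request-level default.
        return bool(output_parameters["binary_data"])
    # Otherwise fall back to "binary_data_output"; JSON is used when it is not given.
    return bool(request_parameters.get("binary_data_output", False))
```

For example, with `{"binary_data_output": True}` at the request level, an output that sets `{"binary_data": False}` is still returned as JSON.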

    When one or more tensors are communicated as binary data, the HTTP body of the request or response will contain the JSON inference request or response object, followed by the binary tensor data in the same order as the input or output tensors are specified in the JSON.

    +
      +
    • If any binary data is present in the request or response, the Inference-Header-Content-Length header must be provided to give the length of the JSON object; Content-Length continues to give the full body length (as HTTP requires).
    • +
    +
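A minimal sketch of how a client might assemble such a body and the two length headers, using only the standard library. The function `build_binary_request` and its `binary_payloads` argument (input name mapped to already-encoded bytes) are hypothetical conventions chosen for illustration, not part of the KServe client API.

```python
import copy
import json


def build_binary_request(inference_request: dict, binary_payloads: dict):
    """Return (headers, body) for a request with some inputs sent as binary."""
    request = copy.deepcopy(inference_request)
    segments = []
    # Binary segments must follow the order of the inputs in the JSON.
    for tensor in request["inputs"]:
        payload = binary_payloads.get(tensor["name"])
        if payload is None:
            continue  # this input keeps its JSON "data" field
        tensor.pop("data", None)
        tensor.setdefault("parameters", {})["binary_data_size"] = len(payload)
        segments.append(payload)
    json_bytes = json.dumps(request).encode("utf-8")
    body = json_bytes + b"".join(segments)
    headers = {
        "Content-Type": "application/octet-stream",
        # Length of the JSON object only.
        "Inference-Header-Content-Length": str(len(json_bytes)),
        # Length of the full body: JSON plus all binary segments.
        "Content-Length": str(len(body)),
    }
    return headers, body


example_request = {
    "inputs": [
        {"name": "input0", "shape": [2, 2], "datatype": "FP16"},
        {"name": "input1", "shape": [2, 2], "datatype": "UINT32",
         "data": [[1, 2], [3, 4]]},
        {"name": "input2", "shape": [3], "datatype": "BOOL"},
    ]
}
headers, body = build_binary_request(
    example_request,
    {"input0": bytes(8), "input2": b"\x01\x00\x01"},  # placeholder payloads
)
```

The difference between the two headers is the combined size of the binary segments, which is how a receiver can locate the end of the JSON object without scanning the body.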

    Examples¶

    +

    Sending and Receiving Binary Data¶

    +

    For the following request the input tensors input0 and input2 are sent as binary data while input1 is sent as non-binary data. Note that the input0 and input2 input tensors have a parameter binary_data_size which represents the size of the binary data.

    +

    The output tensor output0 must be returned as binary data because the request sets its binary_data parameter to true. Also note that the size of the JSON part is provided in the Inference-Header-Content-Length header, while the Content-Length header reflects the total body size (JSON plus binary data).

    +
    POST /v2/models/mymodel/infer HTTP/1.1
    +Host: localhost:8000
    +Content-Type: application/octet-stream
    +Inference-Header-Content-Length: <xx> # JSON length
    +Content-Length: <xx+11>     # JSON length + binary data length (in this case 8 + 3 = 11)
    +{
    +  "model_name" : "mymodel",
    +  "inputs" : [
    +    {
    +      "name" : "input0",
    +      "shape" : [ 2, 2 ],
    +      "datatype" : "FP16",
    +      "parameters" : {
    +        "binary_data_size" : 8
    +      }
    +    },
    +    {
    +      "name" : "input1",
    +      "shape" : [ 2, 2 ],
    +      "datatype" : "UINT32",
    +      "data": [[1, 2], [3, 4]]
    +    },
    +    {
    +      "name" : "input2",
    +      "shape" : [ 3 ],
    +      "datatype" : "BOOL",
    +      "parameters" : {
    +        "binary_data_size" : 3
    +      }
    +    }
    +  ],
    +  "outputs" : [
    +    {
    +      "name" : "output0",
    +      "parameters" : {
    +        "binary_data" : true
    +      }
    +    },
    +    {
    +      "name" : "output1"
    +    }
    +  ]
    +}
    +<8 bytes of data for input0 tensor>
    +<3 bytes of data for input2 tensor>
    +
    +

    Assuming the model returns a [ 3, 2 ] tensor of data type FP16 and a [2, 2] tensor of data type FP32 the following response would be returned.

    +
    HTTP/1.1 200 OK
    +Content-Type: application/octet-stream
    +Inference-Header-Content-Length: <yy>  # JSON length
    +Content-Length: <yy+12>   # JSON length + binary data length (in this case 12)
    +{
    +  "outputs" : [
    +    {
    +      "name" : "output0",
    +      "shape" : [ 3, 2 ],
    +      "datatype"  : "FP16",
    +      "parameters" : {
    +        "binary_data_size" : 12
    +      }
    +    },
    +    {
    +      "name" : "output1",
    +      "shape" : [ 2, 2 ],
    +      "datatype"  : "FP32",
    +      "data" : [[1.203, 5.403], [3.434, 34.234]]
    +    }
    +  ]
    +}
    +<12 bytes of data for output0 tensor>
    +
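On the receiving side, a response like the one above can be split into its JSON object and binary segments. The following is a hedged sketch using only the standard library: `split_binary_response` is a hypothetical helper, it assumes the headers are available as a dictionary with exactly this header spelling, and the example body is constructed locally rather than taken from a real server.

```python
import json
import struct


def split_binary_response(headers: dict, body: bytes):
    """Return (response_json, {output_name: raw_bytes}) for a binary response."""
    header_length = int(headers["Inference-Header-Content-Length"])
    response = json.loads(body[:header_length])
    offset = header_length
    raw_outputs = {}
    # Binary segments appear in the same order as the outputs in the JSON.
    for output in response["outputs"]:
        size = output.get("parameters", {}).get("binary_data_size")
        if size is None:
            continue  # this output carries its values in the JSON "data" field
        raw_outputs[output["name"]] = body[offset:offset + size]
        offset += size
    return response, raw_outputs


# Locally built example: a [3, 2] FP16 output in binary, a [2, 2] FP32 output in JSON.
example_json = json.dumps({"outputs": [
    {"name": "output0", "shape": [3, 2], "datatype": "FP16",
     "parameters": {"binary_data_size": 12}},
    {"name": "output1", "shape": [2, 2], "datatype": "FP32",
     "data": [[1.203, 5.403], [3.434, 34.234]]},
]}).encode("utf-8")
example_body = example_json + struct.pack("<6e", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)

response, raw = split_binary_response(
    {"Inference-Header-Content-Length": str(len(example_json))}, example_body
)
# Decode the little-endian FP16 values; reshape to [3, 2] as needed.
output0_values = struct.unpack("<6e", raw["output0"])
```

A real client would also need to treat header names case-insensitively, which this sketch leaves out.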
    +
    +
    +
    +
    +
    +
    import numpy as np
    +from kserve import ModelServer, InferenceRESTClient, InferRequest, InferInput
    +from kserve.protocol.infer_type import RequestedOutput
    +from kserve.inference_client import RESTConfig
    +
    +fp16_data = np.array([[1.1, 2.22], [3.345, 4.34343]], dtype=np.float16)
    +uint32_data = np.array([[1, 2], [3, 4]], dtype=np.uint32)
    +# np.bool was removed in NumPy 1.24; use np.bool_ instead
    +bool_data = np.array([True, False, True], dtype=np.bool_)
    +
    +# Create input tensor with binary data
    +input_0 = InferInput(name="input_0", datatype="FP16", shape=[2, 2])
    +input_0.set_data_from_numpy(fp16_data, binary_data=True)
    +input_1 = InferInput(name="input_1", datatype="UINT32", shape=[2, 2])
    +input_1.set_data_from_numpy(uint32_data, binary_data=False)
    +input_2 = InferInput(name="input_2", datatype="BOOL", shape=[3])
    +input_2.set_data_from_numpy(bool_data, binary_data=True)
    +
    +# Create request output
    +output_0 = RequestedOutput(name="output_0", binary_data=True)
    +output_1 = RequestedOutput(name="output_1", binary_data=False)
    +
    +# Create inference request
    +infer_request = InferRequest(
    +    model_name="mymodel",
    +    request_id="2ja0ls9j1309",
    +    infer_inputs=[input_0, input_1, input_2],
    +    requested_outputs=[output_0, output_1],
    +)
    +
    +# Create the REST client
    +config = RESTConfig(verbose=True, protocol="v2")
    +rest_client = InferenceRESTClient(config=config)
    +
    +# Send the request (await must run inside an async function or a running event loop)
    +infer_response = await rest_client.infer(
    +          "http://localhost:8000",
    +          model_name="TestModel",
    +          data=infer_request,
    +          headers={"Host": "test-server.com"},
    +          timeout=2,
    +      )
    +
    +# Read the binary data from the response
    +output_0 = infer_response.outputs[0]
    +fp16_output = output_0.as_numpy()
    +
    +# Read the non-binary data from the response
    +output_1 = infer_response.outputs[1]
    +fp32_output = output_1.data # This will return the data as a list
    +fp32_output_arr = output_1.as_numpy()
    +
    +

    Requesting All The Outputs To Be In Binary Format¶

    +

    For the following request, binary_data_output is set to true to receive all the outputs as binary data. Note that binary_data_output is set in the $inference_request parameters field, not in the $request_input parameters field. This parameter can be overridden for a specific output by setting the binary_data parameter to false in the $request_output.

    +

    POST /v2/models/mymodel/infer HTTP/1.1
    +Host: localhost:8000
    +Content-Type: application/json
    +Content-Length: <xx>   # JSON length
    +{
    +  "model_name": "mymodel",
    +  "inputs": [
    +    {
    +      "name": "input_tensor",
    +      "datatype": "FP32",
    +      "shape": [1, 2],
    +      "data": [[32.045, 399.043]]
    +    }
    +  ],
    +  "parameters": {
    +     "binary_data_output": true
    +  }
    +}
    +
    +Assuming the model returns a [ 3, 2 ] tensor of data type FP16 and a [2, 2] tensor of data type FP32 the following response would be returned.

    +
    HTTP/1.1 200 OK
    +Content-Type: application/octet-stream
    +Inference-Header-Content-Length: <yy>  # JSON length
    +Content-Length: <yy+28>   # JSON length + binary data length (in this case 12 + 16)
    +{
    +  "outputs" : [
    +    {
    +      "name" : "output_tensor0",
    +      "shape" : [ 3, 2 ],
    +      "datatype"  : "FP16",
    +      "parameters" : {
    +        "binary_data_size" : 12
    +      }
    +    },
    +    {
    +      "name" : "output_tensor1",
    +      "shape" : [ 2, 2 ],
    +      "datatype"  : "FP32",
    +      "parameters": {
    +        "binary_data_size": 16
    +      }
    +    }
    +  ]
    +}
    +<12 bytes of data for output_tensor0 tensor>
    +<16 bytes of data for output_tensor1 tensor>
    +
    +
    +
    +
    +
    +
    +
    import numpy as np
    +from kserve import ModelServer, InferenceRESTClient, InferRequest, InferInput
    +from kserve.protocol.infer_type import RequestedOutput
    +from kserve.inference_client import RESTConfig
    +
    +fp32_data = np.array([[32.045, 399.043]], dtype=np.float32)
    +
    +# Create the input tensor
    +input_0 = InferInput(name="input_0", datatype="FP32", shape=[1, 2])
    +input_0.set_data_from_numpy(fp32_data, binary_data=False)
    +
    +# Create inference request with binary_data_output set to True
    +infer_request = InferRequest(
    +    model_name="mymodel",
    +    request_id="2ja0ls9j1309",
    +    infer_inputs=[input_0],
    +    parameters={"binary_data_output": True}
    +)
    +
    +# Create the REST client
    +config = RESTConfig(verbose=True, protocol="v2")
    +rest_client = InferenceRESTClient(config=config)
    +
    +# Send the request (await must run inside an async function or a running event loop)
    +infer_response = await rest_client.infer(
    +                      "http://localhost:8000",
    +                      model_name="TestModel",
    +                      data=infer_request,
    +                      headers={"Host": "test-server.com"},
    +                      timeout=2,
    +                 )
    +
    +# Read the binary data from the response
    +output_0 = infer_response.outputs[0]
    +fp16_output = output_0.as_numpy()
    +output_1 = infer_response.outputs[1]
    +fp32_output_arr = output_1.as_numpy()
    +
    +
    +
    +
    + + + + \ No newline at end of file diff --git a/master/modelserving/data_plane/data_plane/index.html b/master/modelserving/data_plane/data_plane/index.html index 42ea4d41b..df51c4f73 100644 --- a/master/modelserving/data_plane/data_plane/index.html +++ b/master/modelserving/data_plane/data_plane/index.html @@ -409,6 +409,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/data_plane/v1_protocol/index.html b/master/modelserving/data_plane/v1_protocol/index.html index 9f4b9be5c..5135aa753 100644 --- a/master/modelserving/data_plane/v1_protocol/index.html +++ b/master/modelserving/data_plane/v1_protocol/index.html @@ -376,6 +376,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/data_plane/v2_protocol/index.html b/master/modelserving/data_plane/v2_protocol/index.html index a434f644a..9705cf87a 100644 --- a/master/modelserving/data_plane/v2_protocol/index.html +++ b/master/modelserving/data_plane/v2_protocol/index.html @@ -655,6 +655,26 @@

    +
  • + + + +
  • @@ -2803,13 +2823,13 @@

    Tensor Data Types diff --git a/master/modelserving/storage/azure/azure/index.html b/master/modelserving/storage/azure/azure/index.html index 31653cdde..281a268d3 100644 --- a/master/modelserving/storage/azure/azure/index.html +++ b/master/modelserving/storage/azure/azure/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/storage/gcs/gcs/index.html b/master/modelserving/storage/gcs/gcs/index.html index 32c8865f2..9c4725737 100644 --- a/master/modelserving/storage/gcs/gcs/index.html +++ b/master/modelserving/storage/gcs/gcs/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/storage/oci/index.html b/master/modelserving/storage/oci/index.html index 81ef93f39..e78b432a6 100644 --- a/master/modelserving/storage/oci/index.html +++ b/master/modelserving/storage/oci/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/storage/pvc/pvc/index.html b/master/modelserving/storage/pvc/pvc/index.html index 33a15adc2..a9c256c96 100644 --- a/master/modelserving/storage/pvc/pvc/index.html +++ b/master/modelserving/storage/pvc/pvc/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/storage/s3/s3/index.html b/master/modelserving/storage/s3/s3/index.html index db2c5dcfb..2e6e0f59b 100644 --- a/master/modelserving/storage/s3/s3/index.html +++ b/master/modelserving/storage/s3/s3/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/storage/storagecontainers/index.html b/master/modelserving/storage/storagecontainers/index.html index 9207d1c92..d31568743 100644 --- a/master/modelserving/storage/storagecontainers/index.html +++ b/master/modelserving/storage/storagecontainers/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/storage/uri/uri/index.html b/master/modelserving/storage/uri/uri/index.html index 71d436057..685266369 100644 --- a/master/modelserving/storage/uri/uri/index.html +++ b/master/modelserving/storage/uri/uri/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/amd/index.html b/master/modelserving/v1beta1/amd/index.html index feb2678c8..bf1a6442b 100644 --- a/master/modelserving/v1beta1/amd/index.html +++ b/master/modelserving/v1beta1/amd/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/custom/custom_model/index.html b/master/modelserving/v1beta1/custom/custom_model/index.html index e67e9aeda..f46b256f8 100644 --- a/master/modelserving/v1beta1/custom/custom_model/index.html +++ b/master/modelserving/v1beta1/custom/custom_model/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/lightgbm/index.html b/master/modelserving/v1beta1/lightgbm/index.html index 2b67dfba5..6ea8a48ed 100644 --- a/master/modelserving/v1beta1/lightgbm/index.html +++ b/master/modelserving/v1beta1/lightgbm/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/llm/huggingface/fill_mask/index.html b/master/modelserving/v1beta1/llm/huggingface/fill_mask/index.html index 8d2e37487..4c48c4589 100644 --- a/master/modelserving/v1beta1/llm/huggingface/fill_mask/index.html +++ b/master/modelserving/v1beta1/llm/huggingface/fill_mask/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/llm/huggingface/index.html b/master/modelserving/v1beta1/llm/huggingface/index.html index 6631c2be0..a476ed4e4 100644 --- a/master/modelserving/v1beta1/llm/huggingface/index.html +++ b/master/modelserving/v1beta1/llm/huggingface/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/llm/huggingface/sdk_integration/index.html b/master/modelserving/v1beta1/llm/huggingface/sdk_integration/index.html index 394e0d12a..2036db34b 100644 --- a/master/modelserving/v1beta1/llm/huggingface/sdk_integration/index.html +++ b/master/modelserving/v1beta1/llm/huggingface/sdk_integration/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/llm/huggingface/text2text_generation/index.html b/master/modelserving/v1beta1/llm/huggingface/text2text_generation/index.html index 300ac5bd4..41c1d16a3 100644 --- a/master/modelserving/v1beta1/llm/huggingface/text2text_generation/index.html +++ b/master/modelserving/v1beta1/llm/huggingface/text2text_generation/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/llm/huggingface/text_classification/index.html b/master/modelserving/v1beta1/llm/huggingface/text_classification/index.html index 5964760a7..b54a01a1e 100644 --- a/master/modelserving/v1beta1/llm/huggingface/text_classification/index.html +++ b/master/modelserving/v1beta1/llm/huggingface/text_classification/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/llm/huggingface/text_generation/index.html b/master/modelserving/v1beta1/llm/huggingface/text_generation/index.html index 813ce3e8a..e5793e383 100644 --- a/master/modelserving/v1beta1/llm/huggingface/text_generation/index.html +++ b/master/modelserving/v1beta1/llm/huggingface/text_generation/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/llm/huggingface/token_classification/index.html b/master/modelserving/v1beta1/llm/huggingface/token_classification/index.html index 8fd77a103..c7ca07b56 100644 --- a/master/modelserving/v1beta1/llm/huggingface/token_classification/index.html +++ b/master/modelserving/v1beta1/llm/huggingface/token_classification/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/llm/torchserve/accelerate/index.html b/master/modelserving/v1beta1/llm/torchserve/accelerate/index.html index fa95ac97f..8dd820ce3 100644 --- a/master/modelserving/v1beta1/llm/torchserve/accelerate/index.html +++ b/master/modelserving/v1beta1/llm/torchserve/accelerate/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/llm/vllm/index.html b/master/modelserving/v1beta1/llm/vllm/index.html index 160b96735..a45d66641 100644 --- a/master/modelserving/v1beta1/llm/vllm/index.html +++ b/master/modelserving/v1beta1/llm/vllm/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/mlflow/v2/index.html b/master/modelserving/v1beta1/mlflow/v2/index.html index c0acc3631..9f50b3dbb 100644 --- a/master/modelserving/v1beta1/mlflow/v2/index.html +++ b/master/modelserving/v1beta1/mlflow/v2/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/onnx/index.html b/master/modelserving/v1beta1/onnx/index.html index c20fc6764..96a73bd9a 100644 --- a/master/modelserving/v1beta1/onnx/index.html +++ b/master/modelserving/v1beta1/onnx/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/paddle/index.html b/master/modelserving/v1beta1/paddle/index.html index 9ec464632..81e1bdce8 100644 --- a/master/modelserving/v1beta1/paddle/index.html +++ b/master/modelserving/v1beta1/paddle/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/pmml/index.html b/master/modelserving/v1beta1/pmml/index.html index a83f57f01..02eae90e6 100644 --- a/master/modelserving/v1beta1/pmml/index.html +++ b/master/modelserving/v1beta1/pmml/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/rollout/canary-example/index.html b/master/modelserving/v1beta1/rollout/canary-example/index.html index aa729d85d..bd00bf39c 100644 --- a/master/modelserving/v1beta1/rollout/canary-example/index.html +++ b/master/modelserving/v1beta1/rollout/canary-example/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/rollout/canary/index.html b/master/modelserving/v1beta1/rollout/canary/index.html index 919bf25e0..83b1584fc 100644 --- a/master/modelserving/v1beta1/rollout/canary/index.html +++ b/master/modelserving/v1beta1/rollout/canary/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/serving_runtime/index.html b/master/modelserving/v1beta1/serving_runtime/index.html index 1c805c0e0..d6723ac38 100644 --- a/master/modelserving/v1beta1/serving_runtime/index.html +++ b/master/modelserving/v1beta1/serving_runtime/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/sklearn/v2/index.html b/master/modelserving/v1beta1/sklearn/v2/index.html index af686520f..364eefc2d 100644 --- a/master/modelserving/v1beta1/sklearn/v2/index.html +++ b/master/modelserving/v1beta1/sklearn/v2/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/spark/index.html b/master/modelserving/v1beta1/spark/index.html index 08b92b729..287809d08 100644 --- a/master/modelserving/v1beta1/spark/index.html +++ b/master/modelserving/v1beta1/spark/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/tensorflow/index.html b/master/modelserving/v1beta1/tensorflow/index.html index bd6c648d6..3dc8fa383 100644 --- a/master/modelserving/v1beta1/tensorflow/index.html +++ b/master/modelserving/v1beta1/tensorflow/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/torchserve/bert/index.html b/master/modelserving/v1beta1/torchserve/bert/index.html index f6ee9111e..52f9b6795 100644 --- a/master/modelserving/v1beta1/torchserve/bert/index.html +++ b/master/modelserving/v1beta1/torchserve/bert/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/torchserve/index.html b/master/modelserving/v1beta1/torchserve/index.html index 75de947cc..f9dbc00e6 100644 --- a/master/modelserving/v1beta1/torchserve/index.html +++ b/master/modelserving/v1beta1/torchserve/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/torchserve/metrics/index.html b/master/modelserving/v1beta1/torchserve/metrics/index.html index f4014f1f7..095adaa47 100644 --- a/master/modelserving/v1beta1/torchserve/metrics/index.html +++ b/master/modelserving/v1beta1/torchserve/metrics/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/torchserve/model-archiver/index.html b/master/modelserving/v1beta1/torchserve/model-archiver/index.html index 677f2f1fc..38e9852ae 100644 --- a/master/modelserving/v1beta1/torchserve/model-archiver/index.html +++ b/master/modelserving/v1beta1/torchserve/model-archiver/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/torchserve/model-archiver/model-archiver-image/index.html b/master/modelserving/v1beta1/torchserve/model-archiver/model-archiver-image/index.html index 9a2e6d90f..c6c702a4c 100644 --- a/master/modelserving/v1beta1/torchserve/model-archiver/model-archiver-image/index.html +++ b/master/modelserving/v1beta1/torchserve/model-archiver/model-archiver-image/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/torchserve/model-archiver/model-store/index.html b/master/modelserving/v1beta1/torchserve/model-archiver/model-store/index.html index 769638b1b..09f4760d0 100644 --- a/master/modelserving/v1beta1/torchserve/model-archiver/model-store/index.html +++ b/master/modelserving/v1beta1/torchserve/model-archiver/model-store/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/transformer/collocation/index.html b/master/modelserving/v1beta1/transformer/collocation/index.html index dbcdc6515..fb4a9e231 100644 --- a/master/modelserving/v1beta1/transformer/collocation/index.html +++ b/master/modelserving/v1beta1/transformer/collocation/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/transformer/feast/index.html b/master/modelserving/v1beta1/transformer/feast/index.html index db886a837..7f5130bcf 100644 --- a/master/modelserving/v1beta1/transformer/feast/index.html +++ b/master/modelserving/v1beta1/transformer/feast/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/transformer/torchserve_image_transformer/index.html b/master/modelserving/v1beta1/transformer/torchserve_image_transformer/index.html index fb6974f29..e7756bd0d 100644 --- a/master/modelserving/v1beta1/transformer/torchserve_image_transformer/index.html +++ b/master/modelserving/v1beta1/transformer/torchserve_image_transformer/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/triton/bert/index.html b/master/modelserving/v1beta1/triton/bert/index.html index c734b14c1..6ab466f92 100644 --- a/master/modelserving/v1beta1/triton/bert/index.html +++ b/master/modelserving/v1beta1/triton/bert/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/triton/huggingface/index.html b/master/modelserving/v1beta1/triton/huggingface/index.html index dd071e815..be70266c1 100644 --- a/master/modelserving/v1beta1/triton/huggingface/index.html +++ b/master/modelserving/v1beta1/triton/huggingface/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/triton/torchscript/index.html b/master/modelserving/v1beta1/triton/torchscript/index.html index 222373486..f7e2e62fa 100644 --- a/master/modelserving/v1beta1/triton/torchscript/index.html +++ b/master/modelserving/v1beta1/triton/torchscript/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/modelserving/v1beta1/xgboost/index.html b/master/modelserving/v1beta1/xgboost/index.html index 150e56ad1..990e65a66 100644 --- a/master/modelserving/v1beta1/xgboost/index.html +++ b/master/modelserving/v1beta1/xgboost/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/python_runtime_api/docs/api/index.html b/master/python_runtime_api/docs/api/index.html index fc880ab25..018c1e236 100644 --- a/master/python_runtime_api/docs/api/index.html +++ b/master/python_runtime_api/docs/api/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/python_runtime_api/docs/index.html b/master/python_runtime_api/docs/index.html index 10bf2a085..7056ad8ae 100644 --- a/master/python_runtime_api/docs/index.html +++ b/master/python_runtime_api/docs/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/reference/api/index.html b/master/reference/api/index.html index aba3ff0d1..af43a46fb 100644 --- a/master/reference/api/index.html +++ b/master/reference/api/index.html @@ -355,6 +355,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/reference/swagger-ui/index.html b/master/reference/swagger-ui/index.html index 4804dbeff..a7dfd0f4d 100644 --- a/master/reference/swagger-ui/index.html +++ b/master/reference/swagger-ui/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/reference/v2_inference/index.html b/master/reference/v2_inference/index.html index 4a1e8b79a..76bb47657 100644 --- a/master/reference/v2_inference/index.html +++ b/master/reference/v2_inference/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/reference/v2_inference/template/index.html b/master/reference/v2_inference/template/index.html index 5549c3819..52d2245fa 100644 --- a/master/reference/v2_inference/template/index.html +++ b/master/reference/v2_inference/template/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/KServeClient/index.html b/master/sdk_docs/docs/KServeClient/index.html index 47ac713e2..ea965d1ef 100644 --- a/master/sdk_docs/docs/KServeClient/index.html +++ b/master/sdk_docs/docs/KServeClient/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/KnativeAddressable/index.html b/master/sdk_docs/docs/KnativeAddressable/index.html index 22d7ca8d6..a2fff432a 100644 --- a/master/sdk_docs/docs/KnativeAddressable/index.html +++ b/master/sdk_docs/docs/KnativeAddressable/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/KnativeCondition/index.html b/master/sdk_docs/docs/KnativeCondition/index.html index 0ededb681..45680cdf1 100644 --- a/master/sdk_docs/docs/KnativeCondition/index.html +++ b/master/sdk_docs/docs/KnativeCondition/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/KnativeStatus/index.html b/master/sdk_docs/docs/KnativeStatus/index.html index 5d4680716..d6338340c 100644 --- a/master/sdk_docs/docs/KnativeStatus/index.html +++ b/master/sdk_docs/docs/KnativeStatus/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/KnativeURL/index.html b/master/sdk_docs/docs/KnativeURL/index.html index 8172860c5..5e93ce641 100644 --- a/master/sdk_docs/docs/KnativeURL/index.html +++ b/master/sdk_docs/docs/KnativeURL/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/KnativeVolatileTime/index.html b/master/sdk_docs/docs/KnativeVolatileTime/index.html index 18dec897b..d2983d134 100644 --- a/master/sdk_docs/docs/KnativeVolatileTime/index.html +++ b/master/sdk_docs/docs/KnativeVolatileTime/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/NetUrlUserinfo/index.html b/master/sdk_docs/docs/NetUrlUserinfo/index.html index 40738c62d..43eb9cf70 100644 --- a/master/sdk_docs/docs/NetUrlUserinfo/index.html +++ b/master/sdk_docs/docs/NetUrlUserinfo/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1Time/index.html b/master/sdk_docs/docs/V1Time/index.html index bd27dbabd..513eae125 100644 --- a/master/sdk_docs/docs/V1Time/index.html +++ b/master/sdk_docs/docs/V1Time/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1BuiltInAdapter/index.html b/master/sdk_docs/docs/V1alpha1BuiltInAdapter/index.html index caeefa4d9..b2a753c34 100644 --- a/master/sdk_docs/docs/V1alpha1BuiltInAdapter/index.html +++ b/master/sdk_docs/docs/V1alpha1BuiltInAdapter/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1ClusterServingRuntime/index.html b/master/sdk_docs/docs/V1alpha1ClusterServingRuntime/index.html index 3795bafb7..193ce668c 100644 --- a/master/sdk_docs/docs/V1alpha1ClusterServingRuntime/index.html +++ b/master/sdk_docs/docs/V1alpha1ClusterServingRuntime/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1ClusterServingRuntimeList/index.html b/master/sdk_docs/docs/V1alpha1ClusterServingRuntimeList/index.html index 98c29eb9d..cf6e7c09e 100644 --- a/master/sdk_docs/docs/V1alpha1ClusterServingRuntimeList/index.html +++ b/master/sdk_docs/docs/V1alpha1ClusterServingRuntimeList/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1Container/index.html b/master/sdk_docs/docs/V1alpha1Container/index.html index a3f0e89f3..f58a5478d 100644 --- a/master/sdk_docs/docs/V1alpha1Container/index.html +++ b/master/sdk_docs/docs/V1alpha1Container/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1InferenceGraph/index.html b/master/sdk_docs/docs/V1alpha1InferenceGraph/index.html index d1f3a1f5d..2d6782989 100644 --- a/master/sdk_docs/docs/V1alpha1InferenceGraph/index.html +++ b/master/sdk_docs/docs/V1alpha1InferenceGraph/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1InferenceGraphList/index.html b/master/sdk_docs/docs/V1alpha1InferenceGraphList/index.html index d9f8bdd6d..0a40a9ecf 100644 --- a/master/sdk_docs/docs/V1alpha1InferenceGraphList/index.html +++ b/master/sdk_docs/docs/V1alpha1InferenceGraphList/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1InferenceGraphSpec/index.html b/master/sdk_docs/docs/V1alpha1InferenceGraphSpec/index.html index fe2d8b3c8..49af9bcea 100644 --- a/master/sdk_docs/docs/V1alpha1InferenceGraphSpec/index.html +++ b/master/sdk_docs/docs/V1alpha1InferenceGraphSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1InferenceGraphStatus/index.html b/master/sdk_docs/docs/V1alpha1InferenceGraphStatus/index.html index f671a04b9..13410a9cf 100644 --- a/master/sdk_docs/docs/V1alpha1InferenceGraphStatus/index.html +++ b/master/sdk_docs/docs/V1alpha1InferenceGraphStatus/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1InferenceRouter/index.html b/master/sdk_docs/docs/V1alpha1InferenceRouter/index.html index ecb43333b..a049431a0 100644 --- a/master/sdk_docs/docs/V1alpha1InferenceRouter/index.html +++ b/master/sdk_docs/docs/V1alpha1InferenceRouter/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1InferenceStep/index.html b/master/sdk_docs/docs/V1alpha1InferenceStep/index.html index 5580f46fc..21a17cbfe 100644 --- a/master/sdk_docs/docs/V1alpha1InferenceStep/index.html +++ b/master/sdk_docs/docs/V1alpha1InferenceStep/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1InferenceTarget/index.html b/master/sdk_docs/docs/V1alpha1InferenceTarget/index.html index 072fe591d..b2712e8f2 100644 --- a/master/sdk_docs/docs/V1alpha1InferenceTarget/index.html +++ b/master/sdk_docs/docs/V1alpha1InferenceTarget/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1ServingRuntime/index.html b/master/sdk_docs/docs/V1alpha1ServingRuntime/index.html index b08424532..e207a90f5 100644 --- a/master/sdk_docs/docs/V1alpha1ServingRuntime/index.html +++ b/master/sdk_docs/docs/V1alpha1ServingRuntime/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1ServingRuntimeList/index.html b/master/sdk_docs/docs/V1alpha1ServingRuntimeList/index.html index 3822a3ed7..372b62b0a 100644 --- a/master/sdk_docs/docs/V1alpha1ServingRuntimeList/index.html +++ b/master/sdk_docs/docs/V1alpha1ServingRuntimeList/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1ServingRuntimePodSpec/index.html b/master/sdk_docs/docs/V1alpha1ServingRuntimePodSpec/index.html index 92d9cfc5c..dd6ed2dab 100644 --- a/master/sdk_docs/docs/V1alpha1ServingRuntimePodSpec/index.html +++ b/master/sdk_docs/docs/V1alpha1ServingRuntimePodSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1ServingRuntimeSpec/index.html b/master/sdk_docs/docs/V1alpha1ServingRuntimeSpec/index.html index a49bc83c5..93ddb6e69 100644 --- a/master/sdk_docs/docs/V1alpha1ServingRuntimeSpec/index.html +++ b/master/sdk_docs/docs/V1alpha1ServingRuntimeSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1StorageHelper/index.html b/master/sdk_docs/docs/V1alpha1StorageHelper/index.html index 2c7c362af..c4db5c592 100644 --- a/master/sdk_docs/docs/V1alpha1StorageHelper/index.html +++ b/master/sdk_docs/docs/V1alpha1StorageHelper/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1alpha1SupportedModelFormat/index.html b/master/sdk_docs/docs/V1alpha1SupportedModelFormat/index.html index 8323bd56f..d5367c008 100644 --- a/master/sdk_docs/docs/V1alpha1SupportedModelFormat/index.html +++ b/master/sdk_docs/docs/V1alpha1SupportedModelFormat/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1AIXExplainerSpec/index.html b/master/sdk_docs/docs/V1beta1AIXExplainerSpec/index.html index 365030150..44e99c14d 100644 --- a/master/sdk_docs/docs/V1beta1AIXExplainerSpec/index.html +++ b/master/sdk_docs/docs/V1beta1AIXExplainerSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1ARTExplainerSpec/index.html b/master/sdk_docs/docs/V1beta1ARTExplainerSpec/index.html index 0ed7c4c08..bc6b6fd2d 100644 --- a/master/sdk_docs/docs/V1beta1ARTExplainerSpec/index.html +++ b/master/sdk_docs/docs/V1beta1ARTExplainerSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1AlibiExplainerSpec/index.html b/master/sdk_docs/docs/V1beta1AlibiExplainerSpec/index.html index 94e7fad69..97112f625 100644 --- a/master/sdk_docs/docs/V1beta1AlibiExplainerSpec/index.html +++ b/master/sdk_docs/docs/V1beta1AlibiExplainerSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1Batcher/index.html b/master/sdk_docs/docs/V1beta1Batcher/index.html index 82e581615..b19e97719 100644 --- a/master/sdk_docs/docs/V1beta1Batcher/index.html +++ b/master/sdk_docs/docs/V1beta1Batcher/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1ComponentExtensionSpec/index.html b/master/sdk_docs/docs/V1beta1ComponentExtensionSpec/index.html index 93b1f00d9..5325c6fd6 100644 --- a/master/sdk_docs/docs/V1beta1ComponentExtensionSpec/index.html +++ b/master/sdk_docs/docs/V1beta1ComponentExtensionSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1ComponentStatusSpec/index.html b/master/sdk_docs/docs/V1beta1ComponentStatusSpec/index.html index fec0637d9..f778380df 100644 --- a/master/sdk_docs/docs/V1beta1ComponentStatusSpec/index.html +++ b/master/sdk_docs/docs/V1beta1ComponentStatusSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1CustomExplainer/index.html b/master/sdk_docs/docs/V1beta1CustomExplainer/index.html index d732541b1..60f6164ba 100644 --- a/master/sdk_docs/docs/V1beta1CustomExplainer/index.html +++ b/master/sdk_docs/docs/V1beta1CustomExplainer/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1CustomPredictor/index.html b/master/sdk_docs/docs/V1beta1CustomPredictor/index.html index 36f0f33b0..c3c277aaa 100644 --- a/master/sdk_docs/docs/V1beta1CustomPredictor/index.html +++ b/master/sdk_docs/docs/V1beta1CustomPredictor/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1CustomTransformer/index.html b/master/sdk_docs/docs/V1beta1CustomTransformer/index.html index 51ae28a66..54eb7d69b 100644 --- a/master/sdk_docs/docs/V1beta1CustomTransformer/index.html +++ b/master/sdk_docs/docs/V1beta1CustomTransformer/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1ExplainerConfig/index.html b/master/sdk_docs/docs/V1beta1ExplainerConfig/index.html index 2cfd45b89..c3e3bb289 100644 --- a/master/sdk_docs/docs/V1beta1ExplainerConfig/index.html +++ b/master/sdk_docs/docs/V1beta1ExplainerConfig/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1ExplainerExtensionSpec/index.html b/master/sdk_docs/docs/V1beta1ExplainerExtensionSpec/index.html index 90e003b89..5f10a511e 100644 --- a/master/sdk_docs/docs/V1beta1ExplainerExtensionSpec/index.html +++ b/master/sdk_docs/docs/V1beta1ExplainerExtensionSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1ExplainerSpec/index.html b/master/sdk_docs/docs/V1beta1ExplainerSpec/index.html index 81fa6892c..e8e35fdcf 100644 --- a/master/sdk_docs/docs/V1beta1ExplainerSpec/index.html +++ b/master/sdk_docs/docs/V1beta1ExplainerSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1ExplainersConfig/index.html b/master/sdk_docs/docs/V1beta1ExplainersConfig/index.html index 3a36091f7..7ebb4ab82 100644 --- a/master/sdk_docs/docs/V1beta1ExplainersConfig/index.html +++ b/master/sdk_docs/docs/V1beta1ExplainersConfig/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1FailureInfo/index.html b/master/sdk_docs/docs/V1beta1FailureInfo/index.html index 050c26a80..ef37308f5 100644 --- a/master/sdk_docs/docs/V1beta1FailureInfo/index.html +++ b/master/sdk_docs/docs/V1beta1FailureInfo/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1InferenceService/index.html b/master/sdk_docs/docs/V1beta1InferenceService/index.html index 89153aa1e..b4bc8f2fa 100644 --- a/master/sdk_docs/docs/V1beta1InferenceService/index.html +++ b/master/sdk_docs/docs/V1beta1InferenceService/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1InferenceServiceList/index.html b/master/sdk_docs/docs/V1beta1InferenceServiceList/index.html index 11ad3eea1..5983e3af0 100644 --- a/master/sdk_docs/docs/V1beta1InferenceServiceList/index.html +++ b/master/sdk_docs/docs/V1beta1InferenceServiceList/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1InferenceServiceSpec/index.html b/master/sdk_docs/docs/V1beta1InferenceServiceSpec/index.html index 6d39ac4c9..eceb2a42f 100644 --- a/master/sdk_docs/docs/V1beta1InferenceServiceSpec/index.html +++ b/master/sdk_docs/docs/V1beta1InferenceServiceSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1InferenceServiceStatus/index.html b/master/sdk_docs/docs/V1beta1InferenceServiceStatus/index.html index 6dfd787f7..826fc284f 100644 --- a/master/sdk_docs/docs/V1beta1InferenceServiceStatus/index.html +++ b/master/sdk_docs/docs/V1beta1InferenceServiceStatus/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1InferenceServicesConfig/index.html b/master/sdk_docs/docs/V1beta1InferenceServicesConfig/index.html index 63e963e8e..1fc6d3266 100644 --- a/master/sdk_docs/docs/V1beta1InferenceServicesConfig/index.html +++ b/master/sdk_docs/docs/V1beta1InferenceServicesConfig/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1IngressConfig/index.html b/master/sdk_docs/docs/V1beta1IngressConfig/index.html index 22ac957fb..8375339ed 100644 --- a/master/sdk_docs/docs/V1beta1IngressConfig/index.html +++ b/master/sdk_docs/docs/V1beta1IngressConfig/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1LightGBMSpec/index.html b/master/sdk_docs/docs/V1beta1LightGBMSpec/index.html index 79a5ffbd9..e6dee9031 100644 --- a/master/sdk_docs/docs/V1beta1LightGBMSpec/index.html +++ b/master/sdk_docs/docs/V1beta1LightGBMSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1LoggerSpec/index.html b/master/sdk_docs/docs/V1beta1LoggerSpec/index.html index b83735381..8f4ed60e0 100644 --- a/master/sdk_docs/docs/V1beta1LoggerSpec/index.html +++ b/master/sdk_docs/docs/V1beta1LoggerSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1ModelCopies/index.html b/master/sdk_docs/docs/V1beta1ModelCopies/index.html index ebc3bbe89..0f21fe635 100644 --- a/master/sdk_docs/docs/V1beta1ModelCopies/index.html +++ b/master/sdk_docs/docs/V1beta1ModelCopies/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1ModelFormat/index.html b/master/sdk_docs/docs/V1beta1ModelFormat/index.html index 903fb5d64..b78ac182b 100644 --- a/master/sdk_docs/docs/V1beta1ModelFormat/index.html +++ b/master/sdk_docs/docs/V1beta1ModelFormat/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1ModelRevisionStates/index.html b/master/sdk_docs/docs/V1beta1ModelRevisionStates/index.html index e5372f8fa..74dce2cad 100644 --- a/master/sdk_docs/docs/V1beta1ModelRevisionStates/index.html +++ b/master/sdk_docs/docs/V1beta1ModelRevisionStates/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1ModelSpec/index.html b/master/sdk_docs/docs/V1beta1ModelSpec/index.html index 0808dd67d..07e926075 100644 --- a/master/sdk_docs/docs/V1beta1ModelSpec/index.html +++ b/master/sdk_docs/docs/V1beta1ModelSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1ModelStatus/index.html b/master/sdk_docs/docs/V1beta1ModelStatus/index.html index 4cd8762b8..85b5165bc 100644 --- a/master/sdk_docs/docs/V1beta1ModelStatus/index.html +++ b/master/sdk_docs/docs/V1beta1ModelStatus/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1ONNXRuntimeSpec/index.html b/master/sdk_docs/docs/V1beta1ONNXRuntimeSpec/index.html index 83e165969..69cab8453 100644 --- a/master/sdk_docs/docs/V1beta1ONNXRuntimeSpec/index.html +++ b/master/sdk_docs/docs/V1beta1ONNXRuntimeSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1PMMLSpec/index.html b/master/sdk_docs/docs/V1beta1PMMLSpec/index.html index 9aae08efb..5a2884434 100644 --- a/master/sdk_docs/docs/V1beta1PMMLSpec/index.html +++ b/master/sdk_docs/docs/V1beta1PMMLSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1PaddleServerSpec/index.html b/master/sdk_docs/docs/V1beta1PaddleServerSpec/index.html index e78f59817..93d66a212 100644 --- a/master/sdk_docs/docs/V1beta1PaddleServerSpec/index.html +++ b/master/sdk_docs/docs/V1beta1PaddleServerSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1PodSpec/index.html b/master/sdk_docs/docs/V1beta1PodSpec/index.html index 9cb8d272b..dafc7c4cf 100644 --- a/master/sdk_docs/docs/V1beta1PodSpec/index.html +++ b/master/sdk_docs/docs/V1beta1PodSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1PredictorConfig/index.html b/master/sdk_docs/docs/V1beta1PredictorConfig/index.html index 4e10a2e2f..f59378301 100644 --- a/master/sdk_docs/docs/V1beta1PredictorConfig/index.html +++ b/master/sdk_docs/docs/V1beta1PredictorConfig/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1PredictorExtensionSpec/index.html b/master/sdk_docs/docs/V1beta1PredictorExtensionSpec/index.html index 18fd395c5..f4fb32943 100644 --- a/master/sdk_docs/docs/V1beta1PredictorExtensionSpec/index.html +++ b/master/sdk_docs/docs/V1beta1PredictorExtensionSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1PredictorProtocols/index.html b/master/sdk_docs/docs/V1beta1PredictorProtocols/index.html index b072e6c6e..d4ee0b6ad 100644 --- a/master/sdk_docs/docs/V1beta1PredictorProtocols/index.html +++ b/master/sdk_docs/docs/V1beta1PredictorProtocols/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1PredictorSpec/index.html b/master/sdk_docs/docs/V1beta1PredictorSpec/index.html index cf5e08469..b7c2bb559 100644 --- a/master/sdk_docs/docs/V1beta1PredictorSpec/index.html +++ b/master/sdk_docs/docs/V1beta1PredictorSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1PredictorsConfig/index.html b/master/sdk_docs/docs/V1beta1PredictorsConfig/index.html index b8729ee0d..c29115999 100644 --- a/master/sdk_docs/docs/V1beta1PredictorsConfig/index.html +++ b/master/sdk_docs/docs/V1beta1PredictorsConfig/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1SKLearnSpec/index.html b/master/sdk_docs/docs/V1beta1SKLearnSpec/index.html index 633afc9d9..1c8a00049 100644 --- a/master/sdk_docs/docs/V1beta1SKLearnSpec/index.html +++ b/master/sdk_docs/docs/V1beta1SKLearnSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1StorageSpec/index.html b/master/sdk_docs/docs/V1beta1StorageSpec/index.html index 11471fcbe..812b499e1 100644 --- a/master/sdk_docs/docs/V1beta1StorageSpec/index.html +++ b/master/sdk_docs/docs/V1beta1StorageSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1TFServingSpec/index.html b/master/sdk_docs/docs/V1beta1TFServingSpec/index.html index 8b4ad1a92..372a50505 100644 --- a/master/sdk_docs/docs/V1beta1TFServingSpec/index.html +++ b/master/sdk_docs/docs/V1beta1TFServingSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1TorchServeSpec/index.html b/master/sdk_docs/docs/V1beta1TorchServeSpec/index.html index 505f9b945..a0a3f67e8 100644 --- a/master/sdk_docs/docs/V1beta1TorchServeSpec/index.html +++ b/master/sdk_docs/docs/V1beta1TorchServeSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1TransformerConfig/index.html b/master/sdk_docs/docs/V1beta1TransformerConfig/index.html index 73bfb9c87..054fc0c3e 100644 --- a/master/sdk_docs/docs/V1beta1TransformerConfig/index.html +++ b/master/sdk_docs/docs/V1beta1TransformerConfig/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1TransformerSpec/index.html b/master/sdk_docs/docs/V1beta1TransformerSpec/index.html index 303c39194..dc964c8f7 100644 --- a/master/sdk_docs/docs/V1beta1TransformerSpec/index.html +++ b/master/sdk_docs/docs/V1beta1TransformerSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1TransformersConfig/index.html b/master/sdk_docs/docs/V1beta1TransformersConfig/index.html index 985979710..89ea75079 100644 --- a/master/sdk_docs/docs/V1beta1TransformersConfig/index.html +++ b/master/sdk_docs/docs/V1beta1TransformersConfig/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1TritonSpec/index.html b/master/sdk_docs/docs/V1beta1TritonSpec/index.html index daa2f50c9..9792f1f23 100644 --- a/master/sdk_docs/docs/V1beta1TritonSpec/index.html +++ b/master/sdk_docs/docs/V1beta1TritonSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/docs/V1beta1XGBoostSpec/index.html b/master/sdk_docs/docs/V1beta1XGBoostSpec/index.html index 52bf781c9..45dd14f06 100644 --- a/master/sdk_docs/docs/V1beta1XGBoostSpec/index.html +++ b/master/sdk_docs/docs/V1beta1XGBoostSpec/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/sdk_docs/sdk_doc/index.html b/master/sdk_docs/sdk_doc/index.html index bb26021f6..9af9cde65 100644 --- a/master/sdk_docs/sdk_doc/index.html +++ b/master/sdk_docs/sdk_doc/index.html @@ -358,6 +358,26 @@

    Open Inference Protocol (V2 Inference Protocol) +
  • + + + +
  • diff --git a/master/search/search_index.json b/master/search/search_index.json index b525289f6..e115409d3 100644 --- a/master/search/search_index.json +++ b/master/search/search_index.json @@ -1 +1 @@ -{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"","title":"Home"},{"location":"admin/kubernetes_deployment/","text":"Kubernetes Deployment Installation Guide \u00b6 KServe supports RawDeployment mode to enable InferenceService deployment with Kubernetes resources Deployment , Service , Ingress and Horizontal Pod Autoscaler . Comparing to serverless deployment it unlocks Knative limitations such as mounting multiple volumes, on the other hand Scale down and from Zero is not supported in RawDeployment mode. Kubernetes 1.22 is the minimally required version and please check the following recommended Istio versions for the corresponding Kubernetes version. Recommended Version Matrix \u00b6 Kubernetes Version Recommended Istio Version 1.27 1.18, 1.19 1.28 1.19, 1.20 1.29 1.20, 1.21 1. Install Istio \u00b6 The minimally required Istio version is 1.13 and you can refer to the Istio install guide . Once Istio is installed, create IngressClass resource for istio. apiVersion : networking.k8s.io/v1 kind : IngressClass metadata : name : istio spec : controller : istio.io/ingress-controller Note Istio ingress is recommended, but you can choose to install with other Ingress controllers and create IngressClass resource for your Ingress option. 2. Install Cert Manager \u00b6 The minimally required Cert Manager version is 1.15.0 and you can refer to Cert Manager installation guide . Note Cert manager is required to provision webhook certs for production grade installation, alternatively you can run self signed certs generation script. 3. Install KServe \u00b6 Note The default KServe deployment mode is Serverless which depends on Knative. 
The following step changes the default deployment mode to RawDeployment before installing KServe. i. Install KServe kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve.yaml Install KServe default serving runtimes: kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve-cluster-resources.yaml ii. Change default deployment mode and ingress option First in ConfigMap inferenceservice-config modify the defaultDeploymentMode in the deploy section, kubectl kubectl patch configmap/inferenceservice-config -n kserve --type = strategic -p '{\"data\": {\"deploy\": \"{\\\"defaultDeploymentMode\\\": \\\"RawDeployment\\\"}\"}}' then modify the ingressClassName in ingress section to point to IngressClass name created in step 1 . ingress : |- { \"ingressClassName\" : \"your-ingress-class\" , }","title":"Kubernetes deployment installation"},{"location":"admin/kubernetes_deployment/#kubernetes-deployment-installation-guide","text":"KServe supports RawDeployment mode to enable InferenceService deployment with Kubernetes resources Deployment , Service , Ingress and Horizontal Pod Autoscaler . Comparing to serverless deployment it unlocks Knative limitations such as mounting multiple volumes, on the other hand Scale down and from Zero is not supported in RawDeployment mode. Kubernetes 1.22 is the minimally required version and please check the following recommended Istio versions for the corresponding Kubernetes version.","title":"Kubernetes Deployment Installation Guide"},{"location":"admin/kubernetes_deployment/#recommended-version-matrix","text":"Kubernetes Version Recommended Istio Version 1.27 1.18, 1.19 1.28 1.19, 1.20 1.29 1.20, 1.21","title":"Recommended Version Matrix"},{"location":"admin/kubernetes_deployment/#1-install-istio","text":"The minimally required Istio version is 1.13 and you can refer to the Istio install guide . Once Istio is installed, create IngressClass resource for istio. 
apiVersion : networking.k8s.io/v1 kind : IngressClass metadata : name : istio spec : controller : istio.io/ingress-controller Note Istio ingress is recommended, but you can choose to install with other Ingress controllers and create IngressClass resource for your Ingress option.","title":"1. Install Istio"},{"location":"admin/kubernetes_deployment/#2-install-cert-manager","text":"The minimally required Cert Manager version is 1.15.0 and you can refer to Cert Manager installation guide . Note Cert manager is required to provision webhook certs for production grade installation, alternatively you can run self signed certs generation script.","title":"2. Install Cert Manager"},{"location":"admin/kubernetes_deployment/#3-install-kserve","text":"Note The default KServe deployment mode is Serverless which depends on Knative. The following step changes the default deployment mode to RawDeployment before installing KServe. i. Install KServe kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve.yaml Install KServe default serving runtimes: kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve-cluster-resources.yaml ii. Change default deployment mode and ingress option First in ConfigMap inferenceservice-config modify the defaultDeploymentMode in the deploy section, kubectl kubectl patch configmap/inferenceservice-config -n kserve --type = strategic -p '{\"data\": {\"deploy\": \"{\\\"defaultDeploymentMode\\\": \\\"RawDeployment\\\"}\"}}' then modify the ingressClassName in ingress section to point to IngressClass name created in step 1 . ingress : |- { \"ingressClassName\" : \"your-ingress-class\" , }","title":"3. Install KServe"},{"location":"admin/migration/","text":"Migrating from KFServing \u00b6 This doc explains how to migrate existing inference services from KFServing to KServe without downtime. 
Note The migration job will by default delete the leftover KFServing installation after migrating the inference services from serving.kubeflow.org to serving.kserve.io . Migrating from standalone KFServing \u00b6 Install KServe v0.7 using the install YAML This will not affect existing services yet. kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/install/v0.7.0/kserve.yaml Run the KServe Migration YAML This will begin the migration. Any errors here may affect your existing services. If you do not want to delete the KFServing resources after migrating, download and edit the env REMOVE_KFSERVING in the YAML before applying it If your KFServing is installed in a namespace other than kfserving-system , then download and set the env KFSERVING_NAMESPACE in the YAML before applying it kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/hack/kserve_migration/kserve_migration_job.yaml Clean up the migration resources kubectl delete ClusterRoleBinding cluster-migration-rolebinding kubectl delete ClusterRole cluster-migration-role kubectl delete ServiceAccount cluster-migration-svcaccount -n kserve Migrating from Kubeflow-based KFServing \u00b6 Install Kubeflow-based KServe 0.7 using the install YAML This will not affect existing services yet. kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/install/v0.7.0/kserve_kubeflow.yaml Run the KServe Migration YAML for Kubeflow-based installations This will begin the migration. Any errors here may affect your existing services. 
If you do not want to delete the KFServing resources after migrating, download and edit the env REMOVE_KFSERVING in the YAML before applying it kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/hack/kserve_migration/kserve_migration_job_kubeflow.yaml Clean up the migration resources kubectl delete ClusterRoleBinding cluster-migration-rolebinding kubectl delete ClusterRole cluster-migration-role kubectl delete ServiceAccount cluster-migration-svcaccount -n kubeflow Update the models web app to use the new InferenceService API group serving.kserve.io Change the deployment image to kserve/models-web-app:v0.7.0-rc0 This is a temporary fix until the next Kubeflow release includes these changes kubectl edit deployment kfserving-models-web-app -n kubeflow Update the cluster role to be able to access the new InferenceService API group serving.kserve.io Edit the apiGroups from serving.kubeflow.org to serving.kserve.io This is a temporary fix until the next Kubeflow release includes these changes kubectl edit clusterrole kfserving-models-web-app-cluster-role","title":"Migrating from KFServing"},{"location":"admin/migration/#migrating-from-kfserving","text":"This doc explains how to migrate existing inference services from KFServing to KServe without downtime. Note The migration job will by default delete the leftover KFServing installation after migrating the inference services from serving.kubeflow.org to serving.kserve.io .","title":"Migrating from KFServing"},{"location":"admin/migration/#migrating-from-standalone-kfserving","text":"Install KServe v0.7 using the install YAML This will not affect existing services yet. kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/install/v0.7.0/kserve.yaml Run the KServe Migration YAML This will begin the migration. Any errors here may affect your existing services. 
If you do not want to delete the KFServing resources after migrating, download and edit the env REMOVE_KFSERVING in the YAML before applying it If your KFServing is installed in a namespace other than kfserving-system , then download and set the env KFSERVING_NAMESPACE in the YAML before applying it kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/hack/kserve_migration/kserve_migration_job.yaml Clean up the migration resources kubectl delete ClusterRoleBinding cluster-migration-rolebinding kubectl delete ClusterRole cluster-migration-role kubectl delete ServiceAccount cluster-migration-svcaccount -n kserve","title":"Migrating from standalone KFServing"},{"location":"admin/migration/#migrating-from-kubeflow-based-kfserving","text":"Install Kubeflow-based KServe 0.7 using the install YAML This will not affect existing services yet. kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/install/v0.7.0/kserve_kubeflow.yaml Run the KServe Migration YAML for Kubeflow-based installations This will begin the migration. Any errors here may affect your existing services. 
If you do not want to delete the KFServing resources after migrating, download and edit the env REMOVE_KFSERVING in the YAML before applying it kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/hack/kserve_migration/kserve_migration_job_kubeflow.yaml Clean up the migration resources kubectl delete ClusterRoleBinding cluster-migration-rolebinding kubectl delete ClusterRole cluster-migration-role kubectl delete ServiceAccount cluster-migration-svcaccount -n kubeflow Update the models web app to use the new InferenceService API group serving.kserve.io Change the deployment image to kserve/models-web-app:v0.7.0-rc0 This is a temporary fix until the next Kubeflow release includes these changes kubectl edit deployment kfserving-models-web-app -n kubeflow Update the cluster role to be able to access the new InferenceService API group serving.kserve.io Edit the apiGroups from serving.kubeflow.org to serving.kserve.io This is a temporary fix until the next Kubeflow release includes these changes kubectl edit clusterrole kfserving-models-web-app-cluster-role","title":"Migrating from Kubeflow-based KFServing"},{"location":"admin/modelmesh/","text":"ModelMesh Installation Guide \u00b6 KServe ModelMesh installation enables high-scale, high-density and frequently-changing model serving use cases. A Kubernetes cluster is required. You will need cluster-admin authority. Additionally, kustomize and an etcd server on the Kubernetes cluster are required. 1. Standard Installation \u00b6 You can find the standard installation instructions in the ModelMesh Serving installation guide . This approach assumes you have installed the prerequisites such as etcd and S3-compatible object storage. 2. Quick Installation \u00b6 A quick installation allows you to quickly get ModelMesh Serving up and running without having to manually install the prerequisites. The steps are described in the ModelMesh Serving quick start guide . 
Note ModelMesh Serving is namespace scoped, meaning all of its components must exist within a single namespace and only one instance of ModelMesh Serving can be installed per namespace. For more details, you can check out the ModelMesh Serving getting started guide .","title":"ModelMesh installation"},{"location":"admin/modelmesh/#modelmesh-installation-guide","text":"KServe ModelMesh installation enables high-scale, high-density and frequently-changing model serving use cases. A Kubernetes cluster is required. You will need cluster-admin authority. Additionally, kustomize and an etcd server on the Kubernetes cluster are required.","title":"ModelMesh Installation Guide"},{"location":"admin/modelmesh/#1-standard-installation","text":"You can find the standard installation instructions in the ModelMesh Serving installation guide . This approach assumes you have installed the prerequisites such as etcd and S3-compatible object storage.","title":"1. Standard Installation"},{"location":"admin/modelmesh/#2-quick-installation","text":"A quick installation allows you to quickly get ModelMesh Serving up and running without having to manually install the prerequisites. The steps are described in the ModelMesh Serving quick start guide . Note ModelMesh Serving is namespace scoped, meaning all of its components must exist within a single namespace and only one instance of ModelMesh Serving can be installed per namespace. For more details, you can check out the ModelMesh Serving getting started guide .","title":"2. Quick Installation"},{"location":"admin/serverless/serverless/","text":"Serverless Installation Guide \u00b6 KServe Serverless installation enables autoscaling based on request volume and supports scale down to and from zero. It also supports revision management and canary rollout based on revisions. Kubernetes 1.28 is the minimally required version and please check the following recommended Knative, Istio versions for the corresponding Kubernetes version. 
Recommended Version Matrix \u00b6 Kubernetes Version Recommended Istio Version Recommended Knative Version 1.28 1.22 1.15 1.29 1.22,1.23 1.15,1.16 1.30 1.22,1.23 1.15,1.16 1. Install Knative Serving \u00b6 Please refer to Knative Serving install guide . Note If you are looking to use PodSpec fields such as nodeSelector, affinity or tolerations which are now supported in the v1beta1 API spec, you need to turn on the corresponding feature flags in your Knative configuration. Warning Knative 1.13.1 requires Istio 1.20+, gRPC routing does not work with previous Istio releases, see release notes . 2. Install Networking Layer \u00b6 The recommended networking layer for KServe is Istio as currently it works best with KServe, please refer to the Istio install guide . Alternatively you can also choose other networking layers like Kourier or Contour , see how to install Kourier with KServe guide . 3. Install Cert Manager \u00b6 The minimally required Cert Manager version is 1.15.0 and you can refer to Cert Manager . Note Cert manager is required to provision webhook certs for production grade installation, alternatively you can run self signed certs generation script. 4. Install KServe \u00b6 kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve.yaml 5. Install KServe Built-in ClusterServingRuntimes \u00b6 0.13.0 kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve-cluster-resources.yaml Note ClusterServingRuntimes are required to create InferenceService for built-in model serving runtimes with KServe v0.8.0 or higher.","title":"Serverless installation"},{"location":"admin/serverless/serverless/#serverless-installation-guide","text":"KServe Serverless installation enables autoscaling based on request volume and supports scale down to and from zero. It also supports revision management and canary rollout based on revisions. 
Kubernetes 1.28 is the minimally required version and please check the following recommended Knative, Istio versions for the corresponding Kubernetes version.","title":"Serverless Installation Guide"},{"location":"admin/serverless/serverless/#recommended-version-matrix","text":"Kubernetes Version Recommended Istio Version Recommended Knative Version 1.28 1.22 1.15 1.29 1.22,1.23 1.15,1.16 1.30 1.22,1.23 1.15,1.16","title":"Recommended Version Matrix"},{"location":"admin/serverless/serverless/#1-install-knative-serving","text":"Please refer to Knative Serving install guide . Note If you are looking to use PodSpec fields such as nodeSelector, affinity or tolerations which are now supported in the v1beta1 API spec, you need to turn on the corresponding feature flags in your Knative configuration. Warning Knative 1.13.1 requires Istio 1.20+, gRPC routing does not work with previous Istio releases, see release notes .","title":"1. Install Knative Serving"},{"location":"admin/serverless/serverless/#2-install-networking-layer","text":"The recommended networking layer for KServe is Istio as currently it works best with KServe, please refer to the Istio install guide . Alternatively you can also choose other networking layers like Kourier or Contour , see how to install Kourier with KServe guide .","title":"2. Install Networking Layer"},{"location":"admin/serverless/serverless/#3-install-cert-manager","text":"The minimally required Cert Manager version is 1.15.0 and you can refer to Cert Manager . Note Cert manager is required to provision webhook certs for production grade installation, alternatively you can run self signed certs generation script.","title":"3. Install Cert Manager"},{"location":"admin/serverless/serverless/#4-install-kserve","text":"kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve.yaml","title":"4. 
Install KServe"},{"location":"admin/serverless/serverless/#5-install-kserve-built-in-clusterservingruntimes","text":"0.13.0 kubectl kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve-cluster-resources.yaml Note ClusterServingRuntimes are required to create InferenceService for built-in model serving runtimes with KServe v0.8.0 or higher.","title":"5. Install KServe Built-in ClusterServingRuntimes"},{"location":"admin/serverless/kourier_networking/","text":"Deploy InferenceService with Alternative Networking Layer \u00b6 KServe creates the top level Istio Virtual Service for routing to InferenceService components based on the virtual host or path based routing. Now KServe provides an option for disabling the top level virtual service to allow configuring other networking layers Knative supports. For example, Kourier is an alternative networking layer and the following steps show how you can deploy KServe with Kourier . Install Kourier Networking Layer \u00b6 Please refer to the Serverless Installation Guide and change the second step to install Kourier instead of Istio . 
Install the Kourier networking layer: kubectl apply -f https://github.com/knative/net-kourier/releases/download/ ${ KNATIVE_VERSION } /kourier.yaml Configure Knative Serving to use Kourier: kubectl patch configmap/config-network \\ --namespace knative-serving \\ --type merge \\ --patch '{\"data\":{\"ingress-class\":\"kourier.ingress.networking.knative.dev\"}}' Verify Kourier installation: kubectl get pods -n knative-serving && kubectl get pods -n kourier-system Expected Output NAME READY STATUS RESTARTS AGE activator-77db7d9dd7-kbrgr 1 /1 Running 0 10m autoscaler-67dbf79b95-htnp9 1 /1 Running 0 10m controller-684b6bc97f-ffm58 1 /1 Running 0 10m domain-mapping-6d99d99978-ktmrf 1 /1 Running 0 10m domainmapping-webhook-5f998498b6-sddnm 1 /1 Running 0 10m net-kourier-controller-68967d76dc-ncj2n 1 /1 Running 0 10m webhook-97bdc7b4d-nr7qf 1 /1 Running 0 10m NAME READY STATUS RESTARTS AGE 3scale-kourier-gateway-54c49c8ff5-x8tgn 1 /1 Running 0 10m Edit inferenceservice-config configmap to disable Istio top level virtual host: kubectl edit configmap/inferenceservice-config --namespace kserve # Add the flag `\"disableIstioVirtualHost\": true` under the ingress section ingress : | - { \"disableIstioVirtualHost\" : true } Restart the KServe Controller kubectl rollout restart deployment kserve-controller-manager -n kserve Deploy InferenceService for Testing Kourier Gateway \u00b6 Create the InferenceService \u00b6 New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"pmml-demo\" spec : predictor : model : modelFormat : name : pmml storageUri : \"gs://kfserving-examples/models/pmml\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"pmml-demo\" spec : predictor : pmml : storageUri : gs://kfserving-examples/models/pmml kubectl apply -f pmml.yaml Expected Output $ inferenceservice.serving.kserve.io/pmml-demo created Run a Prediction \u00b6 Note that when setting INGRESS_HOST and 
INGRESS_PORT following the determining the ingress IP and ports guide you need to replace istio-ingressgateway with kourier-gateway . For example if you choose to do Port Forward for testing you need to select the kourier-gateway pod as following. kubectl port-forward --namespace kourier-system \\ $( kubectl get pod -n kourier-system -l \"app=3scale-kourier-gateway\" --output = jsonpath = \"{.items[0].metadata.name}\" ) 8080 :8080 export INGRESS_HOST = localhost export INGRESS_PORT = 8080 Make sure that you create a file named pmml-input.json with the following content, under your current terminal path. { \"instances\" : [ [ 5.1 , 3.5 , 1.4 , 0.2 ] ] } Send a prediction request to the InferenceService and check the output. MODEL_NAME = pmml-demo INPUT_PATH = @./pmml-input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice pmml-demo -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) curl -v -H \"Host: ${ SERVICE_HOSTNAME } \" -H \"Content-Type: application/json\" http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict -d $INPUT_PATH Expected Output * Trying 127 .0.0.1... 
* TCP_NODELAY set * Connected to localhost ( 127 .0.0.1 ) port 8080 ( #0) > POST /v1/models/pmml-demo:predict HTTP/1.1 > Host: pmml-demo-predictor-default.default.example.com > User-Agent: curl/7.58.0 > Accept: */* > Content-Length: 45 > Content-Type: application/x-www-form-urlencoded > * upload completely sent off: 45 out of 45 bytes < HTTP/1.1 200 OK < content-length: 144 < content-type: application/json ; charset = UTF-8 < date: Wed, 14 Sep 2022 13 :30:09 GMT < server: envoy < x-envoy-upstream-service-time: 58 < * Connection #0 to host localhost left intact { \"predictions\" : [{ \"Species\" : \"setosa\" , \"Probability_setosa\" : 1 .0, \"Probability_versicolor\" : 0 .0, \"Probability_virginica\" : 0 .0, \"Node_Id\" : \"2\" }]}","title":"Kourier Networking Layer"},{"location":"admin/serverless/kourier_networking/#deploy-inferenceservice-with-alternative-networking-layer","text":"KServe creates the top level Istio Virtual Service for routing to InferenceService components based on the virtual host or path based routing. Now KServe provides an option for disabling the top level virtual service to allow configuring other networking layers Knative supports. For example, Kourier is an alternative networking layer and the following steps show how you can deploy KServe with Kourier .","title":"Deploy InferenceService with Alternative Networking Layer"},{"location":"admin/serverless/kourier_networking/#install-kourier-networking-layer","text":"Please refer to the Serverless Installation Guide and change the second step to install Kourier instead of Istio . 
Install the Kourier networking layer: kubectl apply -f https://github.com/knative/net-kourier/releases/download/ ${ KNATIVE_VERSION } /kourier.yaml Configure Knative Serving to use Kourier: kubectl patch configmap/config-network \\ --namespace knative-serving \\ --type merge \\ --patch '{\"data\":{\"ingress-class\":\"kourier.ingress.networking.knative.dev\"}}' Verify Kourier installation: kubectl get pods -n knative-serving && kubectl get pods -n kourier-system Expected Output NAME READY STATUS RESTARTS AGE activator-77db7d9dd7-kbrgr 1 /1 Running 0 10m autoscaler-67dbf79b95-htnp9 1 /1 Running 0 10m controller-684b6bc97f-ffm58 1 /1 Running 0 10m domain-mapping-6d99d99978-ktmrf 1 /1 Running 0 10m domainmapping-webhook-5f998498b6-sddnm 1 /1 Running 0 10m net-kourier-controller-68967d76dc-ncj2n 1 /1 Running 0 10m webhook-97bdc7b4d-nr7qf 1 /1 Running 0 10m NAME READY STATUS RESTARTS AGE 3scale-kourier-gateway-54c49c8ff5-x8tgn 1 /1 Running 0 10m Edit inferenceservice-config configmap to disable Istio top level virtual host: kubectl edit configmap/inferenceservice-config --namespace kserve # Add the flag `\"disableIstioVirtualHost\": true` under the ingress section ingress : | - { \"disableIstioVirtualHost\" : true } Restart the KServe Controller kubectl rollout restart deployment kserve-controller-manager -n kserve","title":"Install Kourier Networking Layer"},{"location":"admin/serverless/kourier_networking/#deploy-inferenceservice-for-testing-kourier-gateway","text":"","title":"Deploy InferenceService for Testing Kourier Gateway"},{"location":"admin/serverless/kourier_networking/#create-the-inferenceservice","text":"New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"pmml-demo\" spec : predictor : model : modelFormat : name : pmml storageUri : \"gs://kfserving-examples/models/pmml\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"pmml-demo\" spec : predictor : pmml : 
storageUri : gs://kfserving-examples/models/pmml kubectl apply -f pmml.yaml Expected Output $ inferenceservice.serving.kserve.io/pmml-demo created","title":"Create the InferenceService"},{"location":"admin/serverless/kourier_networking/#run-a-prediction","text":"Note that when setting INGRESS_HOST and INGRESS_PORT following the determining the ingress IP and ports guide you need to replace istio-ingressgateway with kourier-gateway . For example if you choose to do Port Forward for testing you need to select the kourier-gateway pod as following. kubectl port-forward --namespace kourier-system \\ $( kubectl get pod -n kourier-system -l \"app=3scale-kourier-gateway\" --output = jsonpath = \"{.items[0].metadata.name}\" ) 8080 :8080 export INGRESS_HOST = localhost export INGRESS_PORT = 8080 Make sure that you create a file named pmml-input.json with the following content, under your current terminal path. { \"instances\" : [ [ 5.1 , 3.5 , 1.4 , 0.2 ] ] } Send a prediction request to the InferenceService and check the output. MODEL_NAME = pmml-demo INPUT_PATH = @./pmml-input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice pmml-demo -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) curl -v -H \"Host: ${ SERVICE_HOSTNAME } \" -H \"Content-Type: application/json\" http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict -d $INPUT_PATH Expected Output * Trying 127 .0.0.1... 
* TCP_NODELAY set * Connected to localhost ( 127 .0.0.1 ) port 8080 ( #0) > POST /v1/models/pmml-demo:predict HTTP/1.1 > Host: pmml-demo-predictor-default.default.example.com > User-Agent: curl/7.58.0 > Accept: */* > Content-Length: 45 > Content-Type: application/x-www-form-urlencoded > * upload completely sent off: 45 out of 45 bytes < HTTP/1.1 200 OK < content-length: 144 < content-type: application/json ; charset = UTF-8 < date: Wed, 14 Sep 2022 13 :30:09 GMT < server: envoy < x-envoy-upstream-service-time: 58 < * Connection #0 to host localhost left intact { \"predictions\" : [{ \"Species\" : \"setosa\" , \"Probability_setosa\" : 1 .0, \"Probability_versicolor\" : 0 .0, \"Probability_virginica\" : 0 .0, \"Node_Id\" : \"2\" }]}","title":"Run a Prediction"},{"location":"admin/serverless/servicemesh/","text":"Macro Syntax Error \u00b6 File : admin/serverless/servicemesh/README.md Line 68 in Markdown file: Missing end of comment tag ### Disable Top Level Virtual Service {#disable-top-level-vs}","title":"Istio Service Mesh"},{"location":"admin/serverless/servicemesh/#macro-syntax-error","text":"File : admin/serverless/servicemesh/README.md Line 68 in Markdown file: Missing end of comment tag ### Disable Top Level Virtual Service {#disable-top-level-vs}","title":"Macro Syntax Error"},{"location":"api/api/","text":"KServe API \u00b6","title":"KServe API"},{"location":"api/api/#kserve-api","text":"","title":"KServe API"},{"location":"blog/_index/","text":"","title":" index"},{"location":"blog/articles/2021-09-27-kfserving-transition/","text":"Authors \u00b6 Dan Sun and Animesh Singh on behalf of the Kubeflow Serving Working Group KFServing is now KServe \u00b6 We are excited to announce the next chapter for KFServing. In coordination with the Kubeflow Project Steering Group, the KFServing GitHub repository has now been transferred to an independent KServe GitHub organization under the stewardship of the Kubeflow Serving Working Group leads. 
The project has been rebranded from KFServing to KServe , and we are planning to graduate the project from the Kubeflow Project later this year. Developed collaboratively by Google, IBM, Bloomberg, NVIDIA, and Seldon in 2019, KFServing was published as open source in early 2019. The project sets out to provide the following features: - A simple, yet powerful, Kubernetes Custom Resource for deploying machine learning (ML) models in production across ML frameworks. - A performant, standardized inference protocol. - Serverless inference according to live traffic patterns, supporting \u201cScale-to-zero\u201d on both CPUs and GPUs. - Complete story for production ML Model Serving including prediction, pre/post-processing, explainability, and monitoring. - Support for deploying thousands of models at scale and inference graph capability for multiple models. KFServing was created to address the challenges organizations face when deploying and monitoring machine learning models in production. After publishing the open source project, we\u2019ve seen an explosion in demand for the software, leading to strong adoption and community growth. The scope of the project has since increased, and we have developed multiple components along the way, including our own growing body of documentation that needs its own website and independent GitHub organization. What's Next \u00b6 Over the coming weeks, we will be releasing KServe 0.7 outside of the Kubeflow Project and will provide more details on how to migrate from KFServing to KServe with minimal disruptions. KFServing 0.5.x/0.6.x releases will still be supported for six months after the KServe 0.7 release. We are also working on integrating core Kubeflow APIs and standards for the conformance program . For contributors, please follow the KServe developer and doc contribution guide to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users! 
KServe Key Links \u00b6 Website GitHub Slack(#kubeflow-kfserving) Contributor Acknowledgement \u00b6 We'd like to thank all the KServe contributors for this transition work! Andrews Arokiam Animesh Singh Chin Huang Dan Sun Jagadeesh Jinchi He Nick Hill Paul Van Eck Qianshan Chen Suresh Nakkiran Sukumar Gaonkar Theofilos Papapanagiotou Tommy Li Vedant Padwal Yao Xiao Yuzhui Liu","title":"KFserving Transition"},{"location":"blog/articles/2021-09-27-kfserving-transition/#authors","text":"Dan Sun and Animesh Singh on behalf of the Kubeflow Serving Working Group","title":"Authors"},{"location":"blog/articles/2021-09-27-kfserving-transition/#kfserving-is-now-kserve","text":"We are excited to announce the next chapter for KFServing. In coordination with the Kubeflow Project Steering Group, the KFServing GitHub repository has now been transferred to an independent KServe GitHub organization under the stewardship of the Kubeflow Serving Working Group leads. The project has been rebranded from KFServing to KServe , and we are planning to graduate the project from the Kubeflow Project later this year. Developed collaboratively by Google, IBM, Bloomberg, NVIDIA, and Seldon in 2019, KFServing was published as open source in early 2019. The project sets out to provide the following features: - A simple, yet powerful, Kubernetes Custom Resource for deploying machine learning (ML) models in production across ML frameworks. - A performant, standardized inference protocol. - Serverless inference according to live traffic patterns, supporting \u201cScale-to-zero\u201d on both CPUs and GPUs. - Complete story for production ML Model Serving including prediction, pre/post-processing, explainability, and monitoring. - Support for deploying thousands of models at scale and inference graph capability for multiple models. KFServing was created to address the challenges organizations face when deploying and monitoring machine learning models in production. 
After publishing the open source project, we\u2019ve seen an explosion in demand for the software, leading to strong adoption and community growth. The scope of the project has since increased, and we have developed multiple components along the way, including our own growing body of documentation that needs its own website and independent GitHub organization.","title":"KFServing is now KServe"},{"location":"blog/articles/2021-09-27-kfserving-transition/#whats-next","text":"Over the coming weeks, we will be releasing KServe 0.7 outside of the Kubeflow Project and will provide more details on how to migrate from KFServing to KServe with minimal disruptions. KFServing 0.5.x/0.6.x releases will still be supported for six months after the KServe 0.7 release. We are also working on integrating core Kubeflow APIs and standards for the conformance program . For contributors, please follow the KServe developer and doc contribution guide to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users!","title":"What's Next"},{"location":"blog/articles/2021-09-27-kfserving-transition/#kserve-key-links","text":"Website GitHub Slack(#kubeflow-kfserving)","title":"KServe Key Links"},{"location":"blog/articles/2021-09-27-kfserving-transition/#contributor-acknowledgement","text":"We'd like to thank all the KServe contributors for this transition work! Andrews Arokiam Animesh Singh Chin Huang Dan Sun Jagadeesh Jinchi He Nick Hill Paul Van Eck Qianshan Chen Suresh Nakkiran Sukumar Gaonkar Theofilos Papapanagiotou Tommy Li Vedant Padwal Yao Xiao Yuzhui Liu","title":"Contributor Acknowledgement"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/","text":"Authors \u00b6 Dan Sun , Animesh Singh , Yuzhui Liu , Vedant Padwal on behalf of the KServe Working Group. KFServing is now KServe and KServe 0.7 release is available, the release also ensures a smooth user migration experience from KFServing to KServe. 
What's Changed? \u00b6 InferenceService API group is changed from serving.kubeflow.org to serving.kserve.io #1826 , the migration job is created for smooth transition. Python SDK name is changed from kfserving to kserve . KServe Installation manifests #1824 . Models-web-app is separated out of the kserve repository to models-web-app . Docs and examples are moved to separate repository website . KServe images are migrated to kserve docker hub account. v1alpha2 API group is deprecated #1850 . \ud83c\udf08 What's New? \u00b6 ModelMesh project is joining KServe under repository modelmesh-serving ! ModelMesh is designed for high-scale, high-density and frequently-changing model use cases. ModelMesh intelligently loads and unloads AI models to and from memory to strike an intelligent trade-off between responsiveness to users and computational footprint. To learn more about ModelMesh features and components, check out the ModelMesh announcement blog and Join talk at #KubeCon NA to get a deeper dive into ModelMesh and KServe . (Alpha feature) Raw Kubernetes deployment support, Istio/Knative dependency is now optional and please follow the guide to install and turn on RawDeployment mode. KServe now has its own documentation website temporarily hosted on website . Support v1 crd and webhook configuration for Kubernetes 1.22 #1837 . Triton model serving runtime now defaults to 21.09 version #1840 . \ud83d\udc1e What's Fixed? \u00b6 Bug fix for Azure blob storage #1845 . Tar/Zip support for all storage options #1836 . Fix AWS_REGION env variable and add AWS_CA_BUNDLE for S3 #1780 . Torchserve custom package install fix #1619 . Join the community \u00b6 Visit our Website or GitHub Join the Slack(#kubeflow-kfserving) Attend a Biweekly community meeting on Wednesday 9am PST Contribute at developer and doc contribution guide to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users! 
Contributors \u00b6 We would like to thank everyone for their efforts on v0.7 Andrews Arokiam Animesh Singh Chin Huang Dan Sun Jagadeesh Jinchi He Nick Hill Paul Van Eck Qianshan Chen Suresh Nakkiran Sukumar Gaonkar Theofilos Papapanagiotou Tommy Li Vedant Padwal Yao Xiao Yuzhui Liu","title":"KServe 0.7 Release"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/#authors","text":"Dan Sun , Animesh Singh , Yuzhui Liu , Vedant Padwal on behalf of the KServe Working Group. KFServing is now KServe and KServe 0.7 release is available, the release also ensures a smooth user migration experience from KFServing to KServe.","title":"Authors"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/#whats-changed","text":"InferenceService API group is changed from serving.kubeflow.org to serving.kserve.io #1826 , the migration job is created for smooth transition. Python SDK name is changed from kfserving to kserve . KServe Installation manifests #1824 . Models-web-app is separated out of the kserve repository to models-web-app . Docs and examples are moved to separate repository website . KServe images are migrated to kserve docker hub account. v1alpha2 API group is deprecated #1850 .","title":"What's Changed?"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/#whats-new","text":"ModelMesh project is joining KServe under repository modelmesh-serving ! ModelMesh is designed for high-scale, high-density and frequently-changing model use cases. ModelMesh intelligently loads and unloads AI models to and from memory to strike an intelligent trade-off between responsiveness to users and computational footprint. To learn more about ModelMesh features and components, check out the ModelMesh announcement blog and Join talk at #KubeCon NA to get a deeper dive into ModelMesh and KServe . (Alpha feature) Raw Kubernetes deployment support, Istio/Knative dependency is now optional and please follow the guide to install and turn on RawDeployment mode. 
KServe now has its own documentation website temporarily hosted on website . Support v1 crd and webhook configuration for Kubernetes 1.22 #1837 . Triton model serving runtime now defaults to 21.09 version #1840 .","title":"\ud83c\udf08 What's New?"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/#whats-fixed","text":"Bug fix for Azure blob storage #1845 . Tar/Zip support for all storage options #1836 . Fix AWS_REGION env variable and add AWS_CA_BUNDLE for S3 #1780 . Torchserve custom package install fix #1619 .","title":"\ud83d\udc1e What's Fixed?"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/#join-the-community","text":"Visit our Website or GitHub Join the Slack(#kubeflow-kfserving) Attend a Biweekly community meeting on Wednesday 9am PST Contribute at developer and doc contribution guide to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users!","title":"Join the community"},{"location":"blog/articles/2021-10-11-KServe-0.7-release/#contributors","text":"We would like to thank everyone for their efforts on v0.7 Andrews Arokiam Animesh Singh Chin Huang Dan Sun Jagadeesh Jinchi He Nick Hill Paul Van Eck Qianshan Chen Suresh Nakkiran Sukumar Gaonkar Theofilos Papapanagiotou Tommy Li Vedant Padwal Yao Xiao Yuzhui Liu","title":"Contributors"},{"location":"blog/articles/2022-02-18-KServe-0.8-release/","text":"Macro Syntax Error \u00b6 File : blog/articles/2022-02-18-KServe-0.8-release.md Line 67 in Markdown file: unexpected '.' - --model_name={{.Name}}","title":"KServe 0.8 Release"},{"location":"blog/articles/2022-02-18-KServe-0.8-release/#macro-syntax-error","text":"File : blog/articles/2022-02-18-KServe-0.8-release.md Line 67 in Markdown file: unexpected '.' 
- --model_name={{.Name}}","title":"Macro Syntax Error"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/","text":"Announcing: KServe v0.9.0 \u00b6 Today, we are pleased to announce the v0.9.0 release of KServe! KServe has now fully onboarded to LF AI & Data Foundation as an Incubation Project ! In this release we are excited to introduce the new InferenceGraph feature, which the community has long requested. Also, continuing the effort from the last release to unify the InferenceService API for deploying models on KServe and ModelMesh, ModelMesh is now fully compatible with the KServe InferenceService API! Introduce InferenceGraph \u00b6 ML inference systems are getting bigger and more complex. They often consist of many models that together make a single prediction. Common use cases are image classification and multi-stage natural language processing pipelines. For example, an image classification pipeline needs to run a top-level classification first, then further downstream classification based on the previous prediction results. KServe has the unique strength to build the distributed inference graph with its native integration of InferenceServices, standard inference protocol for chaining models and serverless auto-scaling capabilities. KServe leverages these strengths to build the InferenceGraph and enable users to deploy complex ML Inference pipelines to production in a declarative and scalable way. InferenceGraph is made up of a list of routing nodes with each node consisting of a set of routing steps. Each step can either route to an InferenceService or another node defined on the graph which makes the InferenceGraph highly composable. The graph router is deployed behind an HTTP endpoint and can be scaled dynamically based on request volume. The InferenceGraph supports four different types of routing nodes: Sequence , Switch , Ensemble , Splitter . 
Sequence Node : It allows users to define multiple Steps with InferenceServices or Nodes as routing targets in a sequence. The Steps are executed in sequence, and the request/response from the previous step can be passed to the next step as input based on configuration. Switch Node : It allows users to define routing conditions and select a Step to execute if it matches the condition. The response is returned as soon as it finds the first step that matches the condition. If no condition is matched, the graph returns the original request. Ensemble Node : A model ensemble requires scoring each model separately and then combining the results into a single prediction response. You can then use different combination methods to produce the final result. Multiple classification trees, for example, are commonly combined using a \"majority vote\" method. Multiple regression trees are often combined using various averaging techniques. Splitter Node : It allows users to split the traffic to multiple targets using a weighted distribution. 
apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"cat-dog-classifier\" spec : predictor : pytorch : resources : requests : cpu : 100m storageUri : gs://kfserving-examples/models/torchserve/cat_dog_classification --- apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"dog-breed-classifier\" spec : predictor : pytorch : resources : requests : cpu : 100m storageUri : gs://kfserving-examples/models/torchserve/dog_breed_classification --- apiVersion : \"serving.kserve.io/v1alpha1\" kind : \"InferenceGraph\" metadata : name : \"dog-breed-pipeline\" spec : nodes : root : routerType : Sequence steps : - serviceName : cat-dog-classifier name : cat_dog_classifier # step name - serviceName : dog-breed-classifier name : dog_breed_classifier data : $request condition : \"[@this].#(predictions.0==\\\"dog\\\")\" Currently InferenceGraph is supported with the Serverless deployment mode. You can try it out following the tutorial . InferenceService API for ModelMesh \u00b6 The InferenceService CRD is now the primary interface for interacting with ModelMesh. Some changes were made to the InferenceService spec to better facilitate ModelMesh\u2019s needs. Storage Spec \u00b6 To unify how model storage is defined for both single and multi-model serving, a new storage spec was added to the predictor model spec. With this storage spec, users can specify a key inside a common secret holding config/credentials for each of the storage backends from which models can be loaded. Example: storage : key : localMinIO # Credential key for the destination storage in the common secret path : sklearn # Model path inside the bucket # schemaPath: null # Optional schema files for payload schema parameters : # Parameters to override the default values inside the common secret. bucket : example-models Learn more here . 
Model Status \u00b6 For further alignment between ModelMesh and KServe, some additions to the InferenceService status were made. There is now a Model Status section which contains information about the model loaded in the predictor. New fields include: states - State information of the predictor's model. activeModelState - The state of the model currently being served by the predictor's endpoints. targetModelState - This will be set only when transitionStatus is not UpToDate , meaning that the target model differs from the currently-active model. transitionStatus - Indicates state of the predictor relative to its current spec. modelCopies - Model copy information of the predictor's model. lastFailureInfo - Details about the most recent error associated with this predictor. Not all of the contained fields will necessarily have a value. Deploying on ModelMesh \u00b6 For deploying InferenceServices on ModelMesh, the ModelMesh and KServe controllers will still require that the user specifies the serving.kserve.io/deploymentMode: ModelMesh annotation. A complete example of an InferenceService with the new storage spec is shown below: apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-tensorflow-mnist annotations : serving.kserve.io/deploymentMode : ModelMesh spec : predictor : model : modelFormat : name : tensorflow storage : key : localMinIO path : tensorflow/mnist.savedmodel Other New Features: \u00b6 Support serving MLFlow model format via MLServer serving runtime. Support unified autoscaling target and metric fields for InferenceService components with both Serverless and RawDeployment mode. Support InferenceService ingress class and URL domain template configuration for RawDeployment mode. ModelMesh now has a default OpenVINO Model Server ServingRuntime. What\u2019s Changed? \u00b6 The KServe controller manager is changed from StatefulSet to Deployment to support HA mode. 
log4j security vulnerability fix Upgrade TorchServe serving runtime to 0.6.0 Update MLServer serving runtime to 1.0.0 Check out the full release notes for KServe and ModelMesh for more details. Join the community \u00b6 Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . View our community GitHub repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thank you for contributing or checking out KServe! \u2013 The KServe Working Group","title":"KServe 0.9 Release"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#announcing-kserve-v090","text":"Today, we are pleased to announce the v0.9.0 release of KServe! KServe has now fully onboarded to LF AI & Data Foundation as an Incubation Project ! In this release we are excited to introduce the new InferenceGraph feature, which the community has long requested. Also, continuing the effort from the last release to unify the InferenceService API for deploying models on KServe and ModelMesh, ModelMesh is now fully compatible with the KServe InferenceService API!","title":"Announcing: KServe v0.9.0"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#introduce-inferencegraph","text":"ML inference systems are getting bigger and more complex. They often consist of many models that together make a single prediction. Common use cases are image classification and multi-stage natural language processing pipelines. For example, an image classification pipeline needs to run a top-level classification first, then further downstream classification based on the previous prediction results. KServe has the unique strength to build the distributed inference graph with its native integration of InferenceServices, standard inference protocol for chaining models and serverless auto-scaling capabilities. 
KServe leverages these strengths to build the InferenceGraph and enable users to deploy complex ML Inference pipelines to production in a declarative and scalable way. InferenceGraph is made up of a list of routing nodes with each node consisting of a set of routing steps. Each step can either route to an InferenceService or another node defined on the graph which makes the InferenceGraph highly composable. The graph router is deployed behind an HTTP endpoint and can be scaled dynamically based on request volume. The InferenceGraph supports four different types of routing nodes: Sequence , Switch , Ensemble , Splitter . Sequence Node : It allows users to define multiple Steps with InferenceServices or Nodes as routing targets in a sequence. The Steps are executed in sequence, and the request/response from the previous step can be passed to the next step as input based on configuration. Switch Node : It allows users to define routing conditions and select a Step to execute if it matches the condition. The response is returned as soon as it finds the first step that matches the condition. If no condition is matched, the graph returns the original request. Ensemble Node : A model ensemble requires scoring each model separately and then combining the results into a single prediction response. You can then use different combination methods to produce the final result. Multiple classification trees, for example, are commonly combined using a \"majority vote\" method. Multiple regression trees are often combined using various averaging techniques. Splitter Node : It allows users to split the traffic to multiple targets using a weighted distribution. 
apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"cat-dog-classifier\" spec : predictor : pytorch : resources : requests : cpu : 100m storageUri : gs://kfserving-examples/models/torchserve/cat_dog_classification --- apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"dog-breed-classifier\" spec : predictor : pytorch : resources : requests : cpu : 100m storageUri : gs://kfserving-examples/models/torchserve/dog_breed_classification --- apiVersion : \"serving.kserve.io/v1alpha1\" kind : \"InferenceGraph\" metadata : name : \"dog-breed-pipeline\" spec : nodes : root : routerType : Sequence steps : - serviceName : cat-dog-classifier name : cat_dog_classifier # step name - serviceName : dog-breed-classifier name : dog_breed_classifier data : $request condition : \"[@this].#(predictions.0==\\\"dog\\\")\" Currently InferenceGraph is supported with the Serverless deployment mode. You can try it out following the tutorial .","title":"Introduce InferenceGraph"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#inferenceservice-api-for-modelmesh","text":"The InferenceService CRD is now the primary interface for interacting with ModelMesh. Some changes were made to the InferenceService spec to better facilitate ModelMesh\u2019s needs.","title":"InferenceService API for ModelMesh"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#storage-spec","text":"To unify how model storage is defined for both single and multi-model serving, a new storage spec was added to the predictor model spec. With this storage spec, users can specify a key inside a common secret holding config/credentials for each of the storage backends from which models can be loaded. 
Example: storage : key : localMinIO # Credential key for the destination storage in the common secret path : sklearn # Model path inside the bucket # schemaPath: null # Optional schema files for payload schema parameters : # Parameters to override the default values inside the common secret. bucket : example-models Learn more here .","title":"Storage Spec"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#model-status","text":"For further alignment between ModelMesh and KServe, some additions to the InferenceService status were made. There is now a Model Status section which contains information about the model loaded in the predictor. New fields include: states - State information of the predictor's model. activeModelState - The state of the model currently being served by the predictor's endpoints. targetModelState - This will be set only when transitionStatus is not UpToDate , meaning that the target model differs from the currently-active model. transitionStatus - Indicates state of the predictor relative to its current spec. modelCopies - Model copy information of the predictor's model. lastFailureInfo - Details about the most recent error associated with this predictor. Not all of the contained fields will necessarily have a value.","title":"Model Status"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#deploying-on-modelmesh","text":"For deploying InferenceServices on ModelMesh, the ModelMesh and KServe controllers will still require that the user specifies the serving.kserve.io/deploymentMode: ModelMesh annotation. 
A complete example of an InferenceService with the new storage spec is shown below: apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : example-tensorflow-mnist annotations : serving.kserve.io/deploymentMode : ModelMesh spec : predictor : model : modelFormat : name : tensorflow storage : key : localMinIO path : tensorflow/mnist.savedmodel","title":"Deploying on ModelMesh"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#other-new-features","text":"Support serving MLFlow model format via MLServer serving runtime. Support unified autoscaling target and metric fields for InferenceService components with both Serverless and RawDeployment mode. Support InferenceService ingress class and URL domain template configuration for RawDeployment mode. ModelMesh now has a default OpenVINO Model Server ServingRuntime.","title":"Other New Features:"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#whats-changed","text":"The KServe controller manager is changed from StatefulSet to Deployment to support HA mode. log4j security vulnerability fix Upgrade TorchServe serving runtime to 0.6.0 Update MLServer serving runtime to 1.0.0 Check out the full release notes for KServe and ModelMesh for more details.","title":"What\u2019s Changed?"},{"location":"blog/articles/2022-07-21-KServe-0.9-release/#join-the-community","text":"Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . View our community GitHub repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thank you for contributing or checking out KServe! \u2013 The KServe Working Group","title":"Join the community"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/","text":"Announcing: KServe v0.10.0 \u00b6 We are excited to announce the KServe 0.10 release. 
In this release we have enabled more KServe networking options, improved KServe telemetry for supported serving runtimes and increased support coverage for Open(aka v2) inference protocol for both standard and ModelMesh InferenceService. KServe Networking Options \u00b6 Istio is now optional for both Serverless and RawDeployment mode. Please see the alternative networking guide for how you can enable other ingress options supported by Knative with Serverless mode. For Istio users, if you want to turn on full service mesh mode to secure InferenceService with mutual TLS and enable the traffic policies, please read the service mesh setup guideline . KServe Telemetry for Serving Runtimes \u00b6 We have instrumented additional latency metrics in KServe Python ServingRuntimes for preprocess , predict and postprocess handlers. In Serverless mode we have extended Knative queue-proxy to enable metrics aggregation for both metrics exposed in queue-proxy and kserve-container from each ServingRuntime . Please read the prometheus metrics setup guideline for how to enable the metrics scraping and aggregations. Open(v2) Inference Protocol Support Coverage \u00b6 As there have been increasing adoptions for KServe v2 Inference Protocol from AMD Inference ServingRuntime which supports FPGAs and OpenVINO which now provides KServe REST and gRPC compatible API, in the issue we have proposed to rename to KServe Open Inference Protocol . In KServe 0.10, we have added Open(v2) inference protocol support for KServe custom runtimes. Now, you can enable v2 REST/gRPC for both custom transformer and predictor with images built by implementing KServe Python SDK API. gRPC enables high performance inference data plane as it is built on top of HTTP/2 and binary data transportation which is more efficient to send over the wire compared to REST. Please see the detailed example for transformer and predictor . from kserve import Model def image_transform ( byte_array ): image_processing = transforms . 
Compose ([ transforms . ToTensor (), transforms . Normalize (( 0.1307 ,), ( 0.3081 ,)) ]) image = Image . open ( io . BytesIO ( byte_array )) tensor = image_processing ( image ) . numpy () return tensor class CustomModel ( Model ): def predict ( self , request : InferRequest , headers : Dict [ str , str ]) -> InferResponse : input_tensors = [ image_transform ( instance ) for instance in request . inputs [ 0 ] . data ] input_tensors = np . asarray ( input_tensors ) output = self . model ( input_tensors ) torch . nn . functional . softmax ( output , dim = 1 ) values , top_5 = torch . topk ( output , 5 ) result = values . flatten () . tolist () response_id = generate_uuid () infer_output = InferOutput ( name = \"output-0\" , shape = list ( values . shape ), datatype = \"FP32\" , data = result ) infer_response = InferResponse ( model_name = self . name , infer_outputs = [ infer_output ], response_id = response_id ) return infer_response class CustomTransformer ( Model ): def preprocess ( self , request : InferRequest , headers : Dict [ str , str ]) -> InferRequest : input_tensors = [ image_transform ( instance ) for instance in request . inputs [ 0 ] . data ] input_tensors = np . asarray ( input_tensors ) infer_inputs = [ InferInput ( name = \"INPUT__0\" , datatype = 'FP32' , shape = list ( input_tensors . shape ), data = input_tensors )] infer_request = InferRequest ( model_name = self . model_name , infer_inputs = infer_inputs ) return infer_request You can use the same Python API type InferRequest and InferResponse for both REST and gRPC protocol. KServe handles the underlying decoding and encoding according to the protocol. Warning A new headers argument is added to the custom handlers to pass http/gRPC headers or other metadata. You can also use this as context dict to pass data between handlers. If you have existing custom transformer or predictor, the headers argument is now required to add to the preprocess , predict and postprocess handlers. 
Please check the following matrix for supported ModelFormats and ServingRuntimes . Model Format v1 Open(v2) REST/gRPC Tensorflow \u2705 TFServing \u2705 Triton PyTorch \u2705 TorchServe \u2705 TorchServe TorchScript \u2705 TorchServe \u2705 Triton ONNX \u274c \u2705 Triton Scikit-learn \u2705 KServe \u2705 MLServer XGBoost \u2705 KServe \u2705 MLServer LightGBM \u2705 KServe \u2705 MLServer MLFlow \u274c \u2705 MLServer Custom \u2705 KServe \u2705 KServe Multi-Arch Image Support \u00b6 KServe control plane images kserve-controller , kserve/agent , kserve/router are now supported for multiple architectures: ppc64le , arm64 , amd64 , s390x . KServe Storage Credentials Support \u00b6 Currently, AWS users need to create a secret with long term/static IAM credentials for downloading models stored in S3. Security best practice is to use IAM role for service account(IRSA) which enables automatic credential rotation and fine-grained access control, see how to setup IRSA . Support Azure Blobs with managed identity . ModelMesh updates \u00b6 ModelMesh has continued to integrate itself as KServe's multi-model serving backend, introducing improvements and features that better align the two projects. For example, it now supports ClusterServingRuntimes, allowing use of cluster-scoped ServingRuntimes, originally introduced in KServe 0.8. Additionally, ModelMesh introduced support for TorchServe enabling users to serve arbitrary PyTorch models (e.g. eager-mode) in the context of distributed-multi-model serving. Other limitations have been addressed as well, such as adding support for BYTES/string type tensors when using the REST inference API for inference requests that require them. Other Changes: \u00b6 For a complete change list please read the release notes from KServe v0.10 and ModelMesh v0.10 . Join the community \u00b6 Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . 
View our community github repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thanks for all the contributors who have made the commits to 0.10 release! Steve Larkin Stephan Schielke Curtis Maddalozzo Zhongcheng Lao Dimitris Aragiorgis Pan Li tjandy98 Sukumar Gaonkar Rachit Chauhan Rafael Vasquez Tim Kleinloog Christian Kadner ddelange Lize Cai sangjune.park Suresh Nakkeran Konstantinos Messis Matt Rose Alexa Griffith Jagadeesh J Alex Lembiyeuski Yuki Iwai Andrews Arokiam Xin Fu adilhusain-s Pranav Pandit C1berwiz dilverse Yuan Tang Dan Sun Nick Hill The KServe Working Group","title":"KServe 0.10 Release"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#announcing-kserve-v0100","text":"We are excited to announce KServe 0.10 release. In this release we have enabled more KServe networking options, improved KServe telemetry for supported serving runtimes and increased support coverage for Open(aka v2) inference protocol for both standard and ModelMesh InferenceService.","title":"Announcing: KServe v0.10.0"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#kserve-networking-options","text":"Istio is now optional for both Serverless and RawDeployment mode. Please see the alternative networking guide for how you can enable other ingress options supported by Knative with Serverless mode. For Istio users, if you want to turn on full service mesh mode to secure InferenceService with mutual TLS and enable the traffic policies, please read the service mesh setup guideline .","title":"KServe Networking Options"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#kserve-telemetry-for-serving-runtimes","text":"We have instrumented additional latency metrics in KServe Python ServingRuntimes for preprocess , predict and postprocess handlers. 
In Serverless mode we have extended Knative queue-proxy to enable metrics aggregation for both metrics exposed in queue-proxy and kserve-container from each ServingRuntime . Please read the prometheus metrics setup guideline for how to enable the metrics scraping and aggregations.","title":"KServe Telemetry for Serving Runtimes"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#openv2-inference-protocol-support-coverage","text":"As there have been increasing adoptions for KServe v2 Inference Protocol from AMD Inference ServingRuntime which supports FPGAs and OpenVINO which now provides KServe REST and gRPC compatible API, in the issue we have proposed to rename to KServe Open Inference Protocol . In KServe 0.10, we have added Open(v2) inference protocol support for KServe custom runtimes. Now, you can enable v2 REST/gRPC for both custom transformer and predictor with images built by implementing KServe Python SDK API. gRPC enables high performance inference data plane as it is built on top of HTTP/2 and binary data transportation which is more efficient to send over the wire compared to REST. Please see the detailed example for transformer and predictor . from kserve import Model def image_transform ( byte_array ): image_processing = transforms . Compose ([ transforms . ToTensor (), transforms . Normalize (( 0.1307 ,), ( 0.3081 ,)) ]) image = Image . open ( io . BytesIO ( byte_array )) tensor = image_processing ( image ) . numpy () return tensor class CustomModel ( Model ): def predict ( self , request : InferRequest , headers : Dict [ str , str ]) -> InferResponse : input_tensors = [ image_transform ( instance ) for instance in request . inputs [ 0 ] . data ] input_tensors = np . asarray ( input_tensors ) output = self . model ( input_tensors ) torch . nn . functional . softmax ( output , dim = 1 ) values , top_5 = torch . topk ( output , 5 ) result = values . flatten () . 
tolist () response_id = generate_uuid () infer_output = InferOutput ( name = \"output-0\" , shape = list ( values . shape ), datatype = \"FP32\" , data = result ) infer_response = InferResponse ( model_name = self . name , infer_outputs = [ infer_output ], response_id = response_id ) return infer_response class CustomTransformer ( Model ): def preprocess ( self , request : InferRequest , headers : Dict [ str , str ]) -> InferRequest : input_tensors = [ image_transform ( instance ) for instance in request . inputs [ 0 ] . data ] input_tensors = np . asarray ( input_tensors ) infer_inputs = [ InferInput ( name = \"INPUT__0\" , datatype = 'FP32' , shape = list ( input_tensors . shape ), data = input_tensors )] infer_request = InferRequest ( model_name = self . model_name , infer_inputs = infer_inputs ) return infer_request You can use the same Python API type InferRequest and InferResponse for both REST and gRPC protocol. KServe handles the underlying decoding and encoding according to the protocol. Warning A new headers argument is added to the custom handlers to pass http/gRPC headers or other metadata. You can also use this as context dict to pass data between handlers. If you have existing custom transformer or predictor, the headers argument is now required to add to the preprocess , predict and postprocess handlers. Please check the following matrix for supported ModelFormats and ServingRuntimes . 
Model Format v1 Open(v2) REST/gRPC Tensorflow \u2705 TFServing \u2705 Triton PyTorch \u2705 TorchServe \u2705 TorchServe TorchScript \u2705 TorchServe \u2705 Triton ONNX \u274c \u2705 Triton Scikit-learn \u2705 KServe \u2705 MLServer XGBoost \u2705 KServe \u2705 MLServer LightGBM \u2705 KServe \u2705 MLServer MLFlow \u274c \u2705 MLServer Custom \u2705 KServe \u2705 KServe","title":"Open(v2) Inference Protocol Support Coverage"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#multi-arch-image-support","text":"KServe control plane images kserve-controller , kserve/agent , kserve/router are now supported for multiple architectures: ppc64le , arm64 , amd64 , s390x .","title":"Multi-Arch Image Support"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#kserve-storage-credentials-support","text":"Currently, AWS users need to create a secret with long term/static IAM credentials for downloading models stored in S3. Security best practice is to use IAM role for service account(IRSA) which enables automatic credential rotation and fine-grained access control, see how to setup IRSA . Support Azure Blobs with managed identity .","title":"KServe Storage Credentials Support"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#modelmesh-updates","text":"ModelMesh has continued to integrate itself as KServe's multi-model serving backend, introducing improvements and features that better align the two projects. For example, it now supports ClusterServingRuntimes, allowing use of cluster-scoped ServingRuntimes, originally introduced in KServe 0.8. Additionally, ModelMesh introduced support for TorchServe enabling users to serve arbitrary PyTorch models (e.g. eager-mode) in the context of distributed-multi-model serving. 
Other limitations have been addressed as well, such as adding support for BYTES/string type tensors when using the REST inference API for inference requests that require them.","title":"ModelMesh updates"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#other-changes","text":"For a complete change list please read the release notes from KServe v0.10 and ModelMesh v0.10 .","title":"Other Changes:"},{"location":"blog/articles/2023-02-05-KServe-0.10-release/#join-the-community","text":"Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . View our community github repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thanks for all the contributors who have made the commits to 0.10 release! Steve Larkin Stephan Schielke Curtis Maddalozzo Zhongcheng Lao Dimitris Aragiorgis Pan Li tjandy98 Sukumar Gaonkar Rachit Chauhan Rafael Vasquez Tim Kleinloog Christian Kadner ddelange Lize Cai sangjune.park Suresh Nakkeran Konstantinos Messis Matt Rose Alexa Griffith Jagadeesh J Alex Lembiyeuski Yuki Iwai Andrews Arokiam Xin Fu adilhusain-s Pranav Pandit C1berwiz dilverse Yuan Tang Dan Sun Nick Hill The KServe Working Group","title":"Join the community"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/","text":"Announcing: KServe v0.11 \u00b6 We are excited to announce the release of KServe 0.11, in this release we introduced Large Language Model (LLM) runtimes, made enhancements to the KServe control plane, Python SDK Open Inference Protocol support and dependency management. For ModelMesh we have added features PVC, HPA, payload logging to ensure feature parity with KServe. Here is a summary of the key changes: KServe Core Inference Enhancements \u00b6 Support path based routing which is served as an alternative way to the host based routing, the URL of the InferenceService could look like http:///serving// . 
Please refer to the doc for how to enable path based routing. Introduced priority field for Serving Runtime custom resource to handle the case when you have multiple serving runtimes which support the same model formats, see more details from the serving runtime doc . Introduced Custom Storage Container CRD to allow customized implementations with supported storage URI prefixes, example use cases are private model registry integration: apiVersion : \"serving.kserve.io/v1alpha1\" kind : ClusterStorageContainer metadata : name : default spec : container : name : storage-initializer image : kserve/model-registry:latest resources : requests : memory : 100Mi cpu : 100m limits : memory : 1Gi cpu : \"1\" supportedUriFormats : - prefix : model-registry:// Inference Graph enhancements for improving the API spec to support pod affinity and resource requirement fields. Dependency field with options Soft and Hard is introduced to handle error responses from the inference steps to decide whether to short-circuit the request in case of errors, see the following example with hard dependency with the node steps: apiVersion : serving.kserve.io/v1alpha1 kind : InferenceGraph metadata : name : graph_with_switch_node spec : nodes : root : routerType : Sequence steps : - name : \"rootStep1\" nodeName : node1 dependency : Hard - name : \"rootStep2\" serviceName : {{ success_200_isvc_id }} node1 : routerType : Switch steps : - name : \"node1Step1\" serviceName : {{ error_404_isvc_id }} condition : \"[@this].#(decision_picker==ERROR)\" dependency : Hard For more details please refer to the issue . Improved InferenceService debugging experience by adding the aggregated RoutesReady status and LastDeploymentReady condition to the InferenceService Status to differentiate the endpoint and deployment status. This applies to the serverless mode and for more details refer to the API docs . Enhanced Python SDK Dependency Management \u00b6 KServe has adopted poetry to manage python dependencies. 
You can now install the KServe SDK with locked dependencies using poetry install . While pip install still works, we highly recommend using poetry to ensure predictable dependency management. The KServe SDK is also slimmed down by making the cloud storage dependency optional, if you require storage dependency for custom serving runtimes you can still install with pip install kserve[storage] . KServe Python Runtimes Improvements \u00b6 KServe Python Runtimes including sklearnserver , lgbserver , xgbserver now support the open inference protocol for both REST and gRPC. Logging improvements including adding Uvicorn access logging and a default KServe logger. Postprocess handler has been aligned with open inference protocol, simplifying the underlying transportation protocol complexities. LLM Runtimes \u00b6 TorchServe LLM Runtime \u00b6 KServe now integrates with TorchServe 0.8, offering the support for LLM models that may not fit onto a single GPU. Huggingface Accelerate and Deepspeed are available options to split the model into multiple partitions over multiple GPUs. You can see the detailed example for how to serve the LLM on KServe with TorchServe runtime. vLLM Runtime \u00b6 Serving LLM models can be surprisingly slow even on high end GPUs, vLLM is a fast and easy-to-use LLM inference engine. It can achieve 10x-20x higher throughput than Huggingface transformers. It supports continuous batching for increased throughput and GPU utilization, paged attention to address the memory bottleneck where in the autoregressive decoding process all the attention key value tensors(KV Cache) are kept in the GPU memory to generate next tokens. In the example we show how to deploy vLLM on KServe and expect further integration in KServe 0.12 with proposed generate endpoint for open inference protocol. 
ModelMesh Updates \u00b6 Storing Models on Kubernetes Persistent Volumes (PVC) \u00b6 ModelMesh now allows to directly mount model files onto serving runtimes pods using Kubernetes Persistent Volumes . Depending on the selected storage solution this approach can significantly reduce latency when deploying new predictors, potentially remove the need for additional S3 cloud object storage like AWS S3, GCS, or Azure Blob Storage altogether. Horizontal Pod Autoscaling (HPA) \u00b6 Kubernetes Horizontal Pod Autoscaling can now be used at the serving runtime pod level. With HPA enabled, the ModelMesh controller no longer manages the number of replicas. Instead, a HorizontalPodAutoscaler automatically updates the serving runtime deployment with the number of Pods to best match the demand. Model Metrics, Metrics Dashboard, Payload Event Logging \u00b6 ModelMesh v0.11 introduces a new configuration option to emit a subset of useful metrics at the individual model level. These metrics can help identify outlier or \"heavy hitter\" models and consequently fine-tune the deployments of those inference services, like allocating more resources or increasing the number of replicas for improved responsiveness or avoid frequent cache misses. A new Grafana dashboard was added to display the comprehensive set of Prometheus metrics like model loading and unloading rates, internal queuing delays, capacity and usage, cache state, etc. to monitor the general health of the ModelMesh Serving deployment. The new PayloadProcessor interface can be implemented to log prediction requests and responses, to create data sinks for data visualization, for model quality assessment, or for drift and outlier detection by external monitoring systems. What's Changed? \u00b6 To allow longer InferenceService name due to DNS max length limits from issue , the Default suffix in the inference service component(predictor/transformer/explainer) name has been removed for newly created InferenceServices. 
This affects the client that is using the component url directly instead of the top level InferenceService url. Status.address.url is now consistent for both serverless and raw deployment mode, the url path portion is dropped in serverless mode. Raw bytes are now accepted in v1 protocol, setting the right content-type header to application/json is required to recognize and decode the json payload if content-type is specified. curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test. ${ CUSTOM_DOMAIN } /v1/models/sklearn-iris:predict -d @./iris-input.json For a complete change list please read the release notes from KServe v0.11 and ModelMesh v0.11 . Join the community \u00b6 Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . View our community github repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thanks for all the contributors who have made the commits to 0.11 release! The KServe Working Group","title":"KServe 0.11 Release"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#announcing-kserve-v011","text":"We are excited to announce the release of KServe 0.11, in this release we introduced Large Language Model (LLM) runtimes, made enhancements to the KServe control plane, Python SDK Open Inference Protocol support and dependency management. For ModelMesh we have added features PVC, HPA, payload logging to ensure feature parity with KServe. Here is a summary of the key changes:","title":"Announcing: KServe v0.11"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#kserve-core-inference-enhancements","text":"Support path based routing which is served as an alternative way to the host based routing, the URL of the InferenceService could look like http:///serving// . Please refer to the doc for how to enable path based routing. 
Introduced priority field for Serving Runtime custom resource to handle the case when you have multiple serving runtimes which support the same model formats, see more details from the serving runtime doc . Introduced Custom Storage Container CRD to allow customized implementations with supported storage URI prefixes, example use cases are private model registry integration: apiVersion : \"serving.kserve.io/v1alpha1\" kind : ClusterStorageContainer metadata : name : default spec : container : name : storage-initializer image : kserve/model-registry:latest resources : requests : memory : 100Mi cpu : 100m limits : memory : 1Gi cpu : \"1\" supportedUriFormats : - prefix : model-registry:// Inference Graph enhancements for improving the API spec to support pod affinity and resource requirement fields. Dependency field with options Soft and Hard is introduced to handle error responses from the inference steps to decide whether to short-circuit the request in case of errors, see the following example with hard dependency with the node steps: apiVersion : serving.kserve.io/v1alpha1 kind : InferenceGraph metadata : name : graph_with_switch_node spec : nodes : root : routerType : Sequence steps : - name : \"rootStep1\" nodeName : node1 dependency : Hard - name : \"rootStep2\" serviceName : {{ success_200_isvc_id }} node1 : routerType : Switch steps : - name : \"node1Step1\" serviceName : {{ error_404_isvc_id }} condition : \"[@this].#(decision_picker==ERROR)\" dependency : Hard For more details please refer to the issue . Improved InferenceService debugging experience by adding the aggregated RoutesReady status and LastDeploymentReady condition to the InferenceService Status to differentiate the endpoint and deployment status. 
This applies to the serverless mode and for more details refer to the API docs .","title":"KServe Core Inference Enhancements"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#enhanced-python-sdk-dependency-management","text":"KServe has adopted poetry to manage python dependencies. You can now install the KServe SDK with locked dependencies using poetry install . While pip install still works, we highly recommend using poetry to ensure predictable dependency management. The KServe SDK is also slimmed down by making the cloud storage dependency optional, if you require storage dependency for custom serving runtimes you can still install with pip install kserve[storage] .","title":"Enhanced Python SDK Dependency Management"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#kserve-python-runtimes-improvements","text":"KServe Python Runtimes including sklearnserver , lgbserver , xgbserver now support the open inference protocol for both REST and gRPC. Logging improvements including adding Uvicorn access logging and a default KServe logger. Postprocess handler has been aligned with open inference protocol, simplifying the underlying transportation protocol complexities.","title":"KServe Python Runtimes Improvements"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#llm-runtimes","text":"","title":"LLM Runtimes"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#torchserve-llm-runtime","text":"KServe now integrates with TorchServe 0.8, offering the support for LLM models that may not fit onto a single GPU. Huggingface Accelerate and Deepspeed are available options to split the model into multiple partitions over multiple GPUs. 
You can see the detailed example for how to serve the LLM on KServe with TorchServe runtime.","title":"TorchServe LLM Runtime"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#vllm-runtime","text":"Serving LLM models can be surprisingly slow even on high end GPUs, vLLM is a fast and easy-to-use LLM inference engine. It can achieve 10x-20x higher throughput than Huggingface transformers. It supports continuous batching for increased throughput and GPU utilization, paged attention to address the memory bottleneck where in the autoregressive decoding process all the attention key value tensors(KV Cache) are kept in the GPU memory to generate next tokens. In the example we show how to deploy vLLM on KServe and expect further integration in KServe 0.12 with proposed generate endpoint for open inference protocol.","title":"vLLM Runtime"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#modelmesh-updates","text":"","title":"ModelMesh Updates"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#storing-models-on-kubernetes-persistent-volumes-pvc","text":"ModelMesh now allows to directly mount model files onto serving runtimes pods using Kubernetes Persistent Volumes . Depending on the selected storage solution this approach can significantly reduce latency when deploying new predictors, potentially remove the need for additional S3 cloud object storage like AWS S3, GCS, or Azure Blob Storage altogether.","title":"Storing Models on Kubernetes Persistent Volumes (PVC)"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#horizontal-pod-autoscaling-hpa","text":"Kubernetes Horizontal Pod Autoscaling can now be used at the serving runtime pod level. With HPA enabled, the ModelMesh controller no longer manages the number of replicas. 
Instead, a HorizontalPodAutoscaler automatically updates the serving runtime deployment with the number of Pods to best match the demand.","title":"Horizontal Pod Autoscaling (HPA)"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#model-metrics-metrics-dashboard-payload-event-logging","text":"ModelMesh v0.11 introduces a new configuration option to emit a subset of useful metrics at the individual model level. These metrics can help identify outlier or \"heavy hitter\" models and consequently fine-tune the deployments of those inference services, like allocating more resources or increasing the number of replicas for improved responsiveness or avoid frequent cache misses. A new Grafana dashboard was added to display the comprehensive set of Prometheus metrics like model loading and unloading rates, internal queuing delays, capacity and usage, cache state, etc. to monitor the general health of the ModelMesh Serving deployment. The new PayloadProcessor interface can be implemented to log prediction requests and responses, to create data sinks for data visualization, for model quality assessment, or for drift and outlier detection by external monitoring systems.","title":"Model Metrics, Metrics Dashboard, Payload Event Logging"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#whats-changed","text":"To allow longer InferenceService name due to DNS max length limits from issue , the Default suffix in the inference service component(predictor/transformer/explainer) name has been removed for newly created InferenceServices. This affects the client that is using the component url directly instead of the top level InferenceService url. Status.address.url is now consistent for both serverless and raw deployment mode, the url path portion is dropped in serverless mode. Raw bytes are now accepted in v1 protocol, setting the right content-type header to application/json is required to recognize and decode the json payload if content-type is specified. 
curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test. ${ CUSTOM_DOMAIN } /v1/models/sklearn-iris:predict -d @./iris-input.json For a complete change list please read the release notes from KServe v0.11 and ModelMesh v0.11 .","title":"What's Changed?"},{"location":"blog/articles/2023-10-08-KServe-0.11-release/#join-the-community","text":"Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . View our community github repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thanks for all the contributors who have made the commits to 0.11 release! The KServe Working Group","title":"Join the community"},{"location":"blog/articles/2024-05-15-KServe-0.13-release/","text":"From Serverless Predictive Inference to Generative Inference: Introducing KServe v0.13 \u00b6 We are excited to unveil KServe v0.13, marking a significant leap forward in evolving cloud native model serving to meet the demands of Generative AI inference. This release is highlighted by three pivotal updates: enhanced Hugging Face runtime, robust vLLM backend support for Generative Models, and the integration of OpenAI protocol standards. Below is a summary of the key changes. Enhanced Hugging Face Runtime Support \u00b6 KServe v0.13 enriches its Hugging Face runtime and now supports running Hugging Face models out-of-the-box. KServe v0.13 implements a KServe Hugging Face Serving Runtime , kserve-huggingfaceserver . With this implementation, KServe can now automatically infer a task from model architecture and select the optimized serving runtime. Currently supported tasks include sequence classification, token classification, fill mask, text generation, and text to text generation. Here is an example to serve BERT model by deploying an Inference Service with Hugging Face runtime for classification task. 
apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : huggingface-bert spec : predictor : model : modelFormat : name : huggingface args : - --model_name=bert - --model_id=bert-base-uncased - --tensor_input_names=input_ids resources : limits : cpu : \"1\" memory : 2Gi nvidia.com/gpu : \"1\" requests : cpu : 100m memory : 2Gi nvidia.com/gpu : \"1\" You can also deploy BERT on the more optimized inference runtime like Triton using Hugging Face Runtime for pre/post processing, see more details here . vLLM support \u00b6 Version 0.13 introduces dedicated runtime support for vLLM , for enhanced transformer model serving. This support now includes auto-mapping vLLMs as the backend for supported tasks, streamlining the deployment process and optimizing performance. If vLLM does not support a particular task, it will default to the Hugging Face backend. See example below. apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : huggingface-llama3 spec : predictor : model : modelFormat : name : huggingface args : - --model_name=llama3 - --model_id=meta-llama/meta-llama-3-8b-instruct resources : limits : cpu : \"6\" memory : 24Gi nvidia.com/gpu : \"1\" requests : cpu : \"6\" memory : 24Gi nvidia.com/gpu : \"1\" See more details in our updated docs to Deploy the Llama3 model with Hugging Face LLM Serving Runtime . Additionally, if the Hugging Face backend is preferred over vLLM, vLLM auto-mapping can be disabled with the --backend=huggingface arg. OpenAI Schema Integration \u00b6 Embracing the OpenAI protocol, KServe v0.13 now supports three specific endpoints for generative transformer models: /openai/v1/completions /openai/v1/chat/completions /openai/v1/models These endpoints are useful for generative transformer models, which take in messages and return a model-generated message output. The chat completions endpoint is designed for easily handling multi-turn conversations, while still being useful for single-turn tasks. 
The completions endpoint is now a legacy endpoint that differs from the chat completions endpoint in that the interface for completions is a freeform text string called a prompt . Read more about the chat completions and completions endpoints in the OpenAI API docs. This update fosters a standardized approach to transformer model serving, ensuring compatibility with a broader spectrum of models and tools, and enhances the platform's versatility. The API can be directly used with OpenAI's client libraries or third-party tools, like LangChain or LlamaIndex. Future Plan \u00b6 Support other tasks like text embeddings #3572 . Support more LLM backend options in the future, such as TensorRT-LLM. Enrich text generation metrics for Throughput(tokens/sec), TTFT(Time to first token) #3461 . KEDA integration for token based LLM Autoscaling #3561 . Other Changes \u00b6 This release also includes several enhancements and changes: What's New? \u00b6 Async streaming support for v1 endpoints #3402 . Support for .json and .ubj model formats in the XGBoost server image #3546 . Enhanced flexibility in KServe by allowing the configuration of multiple domains for an inference service #2747 . Enhanced the manager setup to dynamically adapt based on available CRDs, improving operational flexibility and reliability across different deployment environments #3470 . What's Changed? \u00b6 Removed Seldon Alibi dependency #3380 . Removal of conversion webhook from manifests. #3344 . For complete details on the new features and updates, visit our official release notes . Join the community \u00b6 Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . View our community github repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thanks for all the contributors who have made the commits to 0.13 release! 
The KServe Project","title":"KServe 0.13 Release"},{"location":"blog/articles/2024-05-15-KServe-0.13-release/#from-serverless-predictive-inference-to-generative-inference-introducing-kserve-v013","text":"We are excited to unveil KServe v0.13, marking a significant leap forward in evolving cloud native model serving to meet the demands of Generative AI inference. This release is highlighted by three pivotal updates: enhanced Hugging Face runtime, robust vLLM backend support for Generative Models, and the integration of OpenAI protocol standards. Below is a summary of the key changes.","title":"From Serverless Predictive Inference to Generative Inference: Introducing KServe v0.13"},{"location":"blog/articles/2024-05-15-KServe-0.13-release/#enhanced-hugging-face-runtime-support","text":"KServe v0.13 enriches its Hugging Face runtime and now supports running Hugging Face models out-of-the-box. KServe v0.13 implements a KServe Hugging Face Serving Runtime , kserve-huggingfaceserver . With this implementation, KServe can now automatically infer a task from model architecture and select the optimized serving runtime. Currently supported tasks include sequence classification, token classification, fill mask, text generation, and text to text generation. Here is an example to serve BERT model by deploying an Inference Service with Hugging Face runtime for classification task. 
apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : huggingface-bert spec : predictor : model : modelFormat : name : huggingface args : - --model_name=bert - --model_id=bert-base-uncased - --tensor_input_names=input_ids resources : limits : cpu : \"1\" memory : 2Gi nvidia.com/gpu : \"1\" requests : cpu : 100m memory : 2Gi nvidia.com/gpu : \"1\" You can also deploy BERT on the more optimized inference runtime like Triton using Hugging Face Runtime for pre/post processing, see more details here .","title":"Enhanced Hugging Face Runtime Support"},{"location":"blog/articles/2024-05-15-KServe-0.13-release/#vllm-support","text":"Version 0.13 introduces dedicated runtime support for vLLM , for enhanced transformer model serving. This support now includes auto-mapping vLLMs as the backend for supported tasks, streamlining the deployment process and optimizing performance. If vLLM does not support a particular task, it will default to the Hugging Face backend. See example below. apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : huggingface-llama3 spec : predictor : model : modelFormat : name : huggingface args : - --model_name=llama3 - --model_id=meta-llama/meta-llama-3-8b-instruct resources : limits : cpu : \"6\" memory : 24Gi nvidia.com/gpu : \"1\" requests : cpu : \"6\" memory : 24Gi nvidia.com/gpu : \"1\" See more details in our updated docs to Deploy the Llama3 model with Hugging Face LLM Serving Runtime . 
Additionally, if the Hugging Face backend is preferred over vLLM, vLLM auto-mapping can be disabled with the --backend=huggingface arg.","title":"vLLM support"},{"location":"blog/articles/2024-05-15-KServe-0.13-release/#openai-schema-integration","text":"Embracing the OpenAI protocol, KServe v0.13 now supports three specific endpoints for generative transformer models: /openai/v1/completions /openai/v1/chat/completions /openai/v1/models These endpoints are useful for generative transformer models, which take in messages and return a model-generated message output. The chat completions endpoint is designed for easily handling multi-turn conversations, while still being useful for single-turn tasks. The completions endpoint is now a legacy endpoint that differs from the chat completions endpoint in that the interface for completions is a freeform text string called a prompt . Read more about the chat completions and completions endpoints in the OpenAI API docs. This update fosters a standardized approach to transformer model serving, ensuring compatibility with a broader spectrum of models and tools, and enhances the platform's versatility. The API can be directly used with OpenAI's client libraries or third-party tools, like LangChain or LlamaIndex.","title":"OpenAI Schema Integration"},{"location":"blog/articles/2024-05-15-KServe-0.13-release/#future-plan","text":"Support other tasks like text embeddings #3572 . Support more LLM backend options in the future, such as TensorRT-LLM. Enrich text generation metrics for Throughput(tokens/sec), TTFT(Time to first token) #3461 . KEDA integration for token based LLM Autoscaling #3561 .","title":"Future Plan"},{"location":"blog/articles/2024-05-15-KServe-0.13-release/#other-changes","text":"This release also includes several enhancements and changes:","title":"Other Changes"},{"location":"blog/articles/2024-05-15-KServe-0.13-release/#whats-new","text":"Async streaming support for v1 endpoints #3402 . 
Support for .json and .ubj model formats in the XGBoost server image #3546 . Enhanced flexibility in KServe by allowing the configuration of multiple domains for an inference service #2747 . Enhanced the manager setup to dynamically adapt based on available CRDs, improving operational flexibility and reliability across different deployment environments #3470 .","title":"What's New?"},{"location":"blog/articles/2024-05-15-KServe-0.13-release/#whats-changed","text":"Removed Seldon Alibi dependency #3380 . Removal of conversion webhook from manifests. #3344 . For complete details on the new features and updates, visit our official release notes .","title":"What's Changed?"},{"location":"blog/articles/2024-05-15-KServe-0.13-release/#join-the-community","text":"Visit our Website or GitHub Join the Slack ( #kserve ) Attend our community meeting by subscribing to the KServe calendar . View our community github repository to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption! Thanks for all the contributors who have made the commits to 0.13 release! The KServe Project","title":"Join the community"},{"location":"blog/articles/_index/","text":"","title":" index"},{"location":"community/adopters/","text":"Adopters of KServe \u00b6 This page contains a list of organizations who are using KServe either in production, or providing integrations or deployment options with their Cloud or product offerings. If you'd like to be included here, please send a pull request which modifies this file. Please keep the list in alphabetical order. 
Organization Contact Advanced Micro Devices Varun Sharma Alauda Wu Yi Amazon Web Services Ellis Tarn Bloomberg Dan Sun Cars24 Swapnesh Khare Charmed Kubeflow from Canonical Daniela Plasencia Cisco Krishna Durai Cloudera Zoram Thanga CoreWeave Peter Salanki Deeploy Tim Kleinloog Gojek Willem Pienaar Halodoc ID Joinal Ahmed Hewlett Packard Enterprise (HPE) Jerry Harrow Hypermode Kevin Mingtarja IBM Nick Hill Inspur Qingshan Chen Intuit Rachit Chauhan Kubeflow on Google Cloud James Liu Max Kelsen Jacob O'Farrell Naver Mark Winter Nuance Jeff Griffith NVIDIA David Goodwin One Convergence Subra Ongole PITS Global Data Recovery Services Pheianox Red Hat Taneem Ibrahim Seldon Alex Housley Patterson Consulting Josh Patterson Samsung SDS Hanbae Seo Striveworks Jordan Yono Upstage JuHyung Son Zillow Peilun Li","title":"Adopters"},{"location":"community/adopters/#adopters-of-kserve","text":"This page contains a list of organizations who are using KServe either in production, or providing integrations or deployment options with their Cloud or product offerings. If you'd like to be included here, please send a pull request which modifies this file. Please keep the list in alphabetical order. 
Organization Contact Advanced Micro Devices Varun Sharma Alauda Wu Yi Amazon Web Services Ellis Tarn Bloomberg Dan Sun Cars24 Swapnesh Khare Charmed Kubeflow from Canonical Daniela Plasencia Cisco Krishna Durai Cloudera Zoram Thanga CoreWeave Peter Salanki Deeploy Tim Kleinloog Gojek Willem Pienaar Halodoc ID Joinal Ahmed Hewlett Packard Enterprise (HPE) Jerry Harrow Hypermode Kevin Mingtarja IBM Nick Hill Inspur Qingshan Chen Intuit Rachit Chauhan Kubeflow on Google Cloud James Liu Max Kelsen Jacob O'Farrell Naver Mark Winter Nuance Jeff Griffith NVIDIA David Goodwin One Convergence Subra Ongole PITS Global Data Recovery Services Pheianox Red Hat Taneem Ibrahim Seldon Alex Housley Patterson Consulting Josh Patterson Samsung SDS Hanbae Seo Striveworks Jordan Yono Upstage JuHyung Son Zillow Peilun Li","title":"Adopters of KServe"},{"location":"community/get_involved/","text":"How to Get Involved \u00b6 Welcome to the KServe community! Feel free to ask questions, engage in discussions, or get involved in the KServe's development. KServe, as an open-source project, thrives on the active participation of its community. Let's work together to make machine learning model serving effortless. Join us! How do you want to get involved? \u00b6 Ask Questions \u00b6 For the fastest response, you can ask questions on the #kserve channel of the CNCF Slack . To Join the channel, Create your CNCF Slack account and Search for the #kserve channel or join via this link . If you prefer to use GitHub discussions, you can join the KServe discussions . Bug Reports and Feature Requests \u00b6 We use GitHub Issues to track bug reports and feature requests. Please file your issues and feature requests in the KServe main repository . For Documentation related issues, please use the KServe Website repository . 
For Open Inference Protocol (V2) related issues and feature requests, please use Open Inference Protocol repository A good bug report should include: Description: Clearly state what you were trying to accomplish and what behavior you observed instead Versions: Specify the versions of relevant components KServe version Knative version (If using Serverless) Kubeflow version (If used with Kubeflow) Kubernetes version Cloud provider details (if using a cloud provider, indicate which one) Relevant resource yaml, HTTP requests, or log lines Vulnerability Reports \u00b6 We strongly encourage you to report security vulnerabilities privately, before disclosing them in any public forums. Only the active maintainers and KServe security group members will receive the reported security vulnerabilities and the issues are treated as top priority. You can use the following ways to report security vulnerabilities privately: Using our private security mailing list: kserve-security@lists.lfaidata.foundation . Using the KServe repository GitHub Security Advisory Become a Contributor \u00b6 This is the place to start your journey as a contributor\u2014whether it's enhancing code, improving documentation. KServe welcomes your contribution! If you're interested in becoming a KServe contributor, you'll want to check out our developer guide . Communication Channels \u00b6 Much of the community meets on the CNCF Slack , using the following channels: #kserve : General discussion about KServe usage #kserve-contributors : General discussion channel for folks contributing to the KServe project in any capacity #kserve-oip-collaboration : Discussion area for Open Inference Protocol and API standardization Community Meetings \u00b6 We have public KServe WG biweekly community meetings on Wed 9AM US/Pacific and a public monthly Open Inference Protocol WG meeting on Wed 10AM US/Pacific. KServe WG Meeting agendas and notes can be accessed in the working group document . 
Open Inference Protocol WG meeting minutes from the monthly work group sessions can be accessed in the working group document . You can access the meeting recordings on the community calendar by clicking on the respective date's event details. Stay tuned for new events by subscribing to the community calendar ( iCal export file ).","title":"How to Get Involved"},{"location":"community/get_involved/#how-to-get-involved","text":"Welcome to the KServe community! Feel free to ask questions, engage in discussions, or get involved in KServe's development. KServe, as an open-source project, thrives on the active participation of its community. Let's work together to make machine learning model serving effortless. Join us!","title":"How to Get Involved"},{"location":"community/get_involved/#how-do-you-want-to-get-involved","text":"","title":"How do you want to get involved?"},{"location":"community/get_involved/#ask-questions","text":"For the fastest response, you can ask questions on the #kserve channel of the CNCF Slack . To join the channel, create your CNCF Slack account and search for the #kserve channel, or join via this link . If you prefer to use GitHub discussions, you can join the KServe discussions .","title":"Ask Questions"},{"location":"community/get_involved/#bug-reports-and-feature-requests","text":"We use GitHub Issues to track bug reports and feature requests. Please file your issues and feature requests in the KServe main repository . For documentation-related issues, please use the KServe Website repository . 
For Open Inference Protocol (V2) related issues and feature requests, please use the Open Inference Protocol repository . A good bug report should include: Description: Clearly state what you were trying to accomplish and what behavior you observed instead Versions: Specify the versions of relevant components KServe version Knative version (If using Serverless) Kubeflow version (If used with Kubeflow) Kubernetes version Cloud provider details (if using a cloud provider, indicate which one) Relevant resource yaml, HTTP requests, or log lines","title":"Bug Reports and Feature Requests"},{"location":"community/get_involved/#vulnerability-reports","text":"We strongly encourage you to report security vulnerabilities privately, before disclosing them in any public forums. Only the active maintainers and KServe security group members will receive the reported security vulnerabilities and the issues are treated as top priority. You can use the following ways to report security vulnerabilities privately: Using our private security mailing list: kserve-security@lists.lfaidata.foundation . Using the KServe repository GitHub Security Advisory","title":"Vulnerability Reports"},{"location":"community/get_involved/#become-a-contributor","text":"This is the place to start your journey as a contributor, whether it's enhancing code or improving documentation. KServe welcomes your contribution! 
If you're interested in becoming a KServe contributor, you'll want to check out our developer guide .","title":"Become a Contributor"},{"location":"community/get_involved/#communication-channels","text":"Much of the community meets on the CNCF Slack , using the following channels: #kserve : General discussion about KServe usage #kserve-contributors : General discussion channel for folks contributing to the KServe project in any capacity #kserve-oip-collaboration : Discussion area for Open Inference Protocol and API standardization","title":"Communication Channels"},{"location":"community/get_involved/#community-meetings","text":"We have public KServe WG biweekly community meetings on Wed 9AM US/Pacific and a public monthly Open Inference Protocol WG meeting on Wed 10AM US/Pacific. KServe WG Meeting agendas and notes can be accessed in the working group document . Open Inference Protocol WG meeting minutes from the monthly work group sessions can be accessed in the working group document . You can access the meeting recordings on the community calendar by clicking on the respective date's event details. Stay tuned for new events by subscribing to the community calendar ( iCal export file ).","title":"Community Meetings"},{"location":"community/presentations/","text":"KServe (Formerly KFServing) Presentations and Demos \u00b6 This page contains a list of presentations and demos. If you'd like to add a presentation or demo here, please send a pull request. 
Presentation/Demo Presenters Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kubernetes David Gray Engaging the KServe Community, The Impact of Integrating a Solutions with Standardized CNCF Projects Adam Tetelman, Taneem Ibrahim, Johnu George, Tessa Pham, Andreea Munteanu Advancing Cloud Native AI Innovation Through Open Collaboration Yuan Tang Unlocking Potential of Large Models in Production Yuan Tang, Adam Tetelman WG Serving: Accelerating AI/ML Inference Workloads on Kubernetes Yuan Tang, Eduardo Arango Gutierrez Best Practices for Deploying LLM Inference, RAG and Fine Tuning Pipelines Meenakshi Kaushik, Shiva Krishna Merla From Bash Scripts to Kubeflow and GitOps: Our Journey to Operationalizing ML at Scale Luca Grazioli, Dennis Ohrndorf Production-Ready AI Platform on Kubernetes Yuan Tang Fortifying AI Security in Kubernetes with Confidential Containers Suraj Deshmukh, Pradipta Banerjee Platform Building Blocks: How to Build ML Infrastructure with CNCF Projects Yuzhui Liu, Leon Zhou Distributed Machine Learning Patterns from Manning Publications Yuan Tang KubeCon 2019: Introducing KFServing: Serverless Model Serving on Kubernetes Dan Sun, Ellis Tarn KubeCon 2019: Advanced Model Inferencing Leveraging KNative, Istio & Kubeflow Serving Animesh Singh, Clive Cox KubeflowDojo: KFServing - Production Model Serving Platform Animesh Singh, Tommy Li NVIDIA: Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing Dan Sun, David Goodwin KF Community: KFServing - Enabling Serverless Workloads Across Model Frameworks Ellis Tarn KubeflowDojo: Demo - KFServing End to End through Notebook Animesh Singh, Tommy Li KubeflowDojo: Demo - KFServing with Kafka and Kubeflow Pipelines Animesh Singh Anchor MLOps Podcast: Serving Models with KFServing David Aponte, Demetrios Brinkmann Kubeflow 101: What is KFServing? 
Stephanie Wong ICML 2020, Workshop on Challenges in Deploying and Monitoring Machine Learning Systems : Serverless inferencing on Kubernetes Clive Cox Serverless Practitioners Summit 2020: Serverless Machine Learning Inference with KFServing Clive Cox, Yuzhui Liu MLOps Meetup: KServe Live Coding Session Theofilos Papapanagiotou KubeCon AI Days 2021: Serving Machine Learning Models at Scale Using KServe Yuzhui Liu KubeCon 2021: Serving Machine Learning Models at Scale Using KServe Animesh Singh KubeCon China 2021: Accelerate Federated Learning Model Deployment with KServe Fangchi Wang & Jiahao Chen KubeCon AI Days 2022: Exploring ML Model Serving with KServe Alexa Nicole Griffith KubeCon AI Days 2022: Enhancing the Performance Testing Process for gRPC Model Inferencing at Scale Ted Chang, Paul Van Eck KubeCon Edge Days 2022: Model Serving at the Edge Made Easier Paul Van Eck KnativeCon 2022: How We Built an ML inference Platform with Knative Dan Sun KubeCon EU 2023: The state and future of cloud native model serving Dan Sun, Theofilos Papapanagiotou Kubeflow Summit 2023: Scale your Models to Zero with Knative and KServe Jooho Lee Kubeflow Summit 2023: What to choose? ModelMesh vs Model Serving? Vaibhav Jain","title":"Demos and Presentations"},{"location":"community/presentations/#kserveformally-kfserving-presentations-and-demoes","text":"This page contains a list of presentations and demos. If you'd like to add a presentation or demo here, please send a pull request. 
Presentation/Demo Presenters Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kubernetes David Gray Engaging the KServe Community, The Impact of Integrating a Solutions with Standardized CNCF Projects Adam Tetelman, Taneem Ibrahim, Johnu George, Tessa Pham, Andreea Munteanu Advancing Cloud Native AI Innovation Through Open Collaboration Yuan Tang Unlocking Potential of Large Models in Production Yuan Tang, Adam Tetelman WG Serving: Accelerating AI/ML Inference Workloads on Kubernetes Yuan Tang, Eduardo Arango Gutierrez Best Practices for Deploying LLM Inference, RAG and Fine Tuning Pipelines Meenakshi Kaushik, Shiva Krishna Merla From Bash Scripts to Kubeflow and GitOps: Our Journey to Operationalizing ML at Scale Luca Grazioli, Dennis Ohrndorf Production-Ready AI Platform on Kubernetes Yuan Tang Fortifying AI Security in Kubernetes with Confidential Containers Suraj Deshmukh, Pradipta Banerjee Platform Building Blocks: How to Build ML Infrastructure with CNCF Projects Yuzhui Liu, Leon Zhou Distributed Machine Learning Patterns from Manning Publications Yuan Tang KubeCon 2019: Introducing KFServing: Serverless Model Serving on Kubernetes Dan Sun, Ellis Tarn KubeCon 2019: Advanced Model Inferencing Leveraging KNative, Istio & Kubeflow Serving Animesh Singh, Clive Cox KubeflowDojo: KFServing - Production Model Serving Platform Animesh Singh, Tommy Li NVIDIA: Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing Dan Sun, David Goodwin KF Community: KFServing - Enabling Serverless Workloads Across Model Frameworks Ellis Tarn KubeflowDojo: Demo - KFServing End to End through Notebook Animesh Singh, Tommy Li KubeflowDojo: Demo - KFServing with Kafka and Kubeflow Pipelines Animesh Singh Anchor MLOps Podcast: Serving Models with KFServing David Aponte, Demetrios Brinkmann Kubeflow 101: What is KFServing? 
Stephanie Wong ICML 2020, Workshop on Challenges in Deploying and Monitoring Machine Learning Systems : Serverless inferencing on Kubernetes Clive Cox Serverless Practitioners Summit 2020: Serverless Machine Learning Inference with KFServing Clive Cox, Yuzhui Liu MLOps Meetup: KServe Live Coding Session Theofilos Papapanagiotou KubeCon AI Days 2021: Serving Machine Learning Models at Scale Using KServe Yuzhui Liu KubeCon 2021: Serving Machine Learning Models at Scale Using KServe Animesh Singh KubeCon China 2021: Accelerate Federated Learning Model Deployment with KServe Fangchi Wang & Jiahao Chen KubeCon AI Days 2022: Exploring ML Model Serving with KServe Alexa Nicole Griffith KubeCon AI Days 2022: Enhancing the Performance Testing Process for gRPC Model Inferencing at Scale Ted Chang, Paul Van Eck KubeCon Edge Days 2022: Model Serving at the Edge Made Easier Paul Van Eck KnativeCon 2022: How We Built an ML inference Platform with Knative Dan Sun KubeCon EU 2023: The state and future of cloud native model serving Dan Sun, Theofilos Papapanagiotou Kubeflow Summit 2023: Scale your Models to Zero with Knative and KServe Jooho Lee Kubeflow Summit 2023: What to choose? ModelMesh vs Model Serving? Vaibhav Jain","title":"KServe(Formally KFServing) Presentations and Demoes"},{"location":"developer/debug/","text":"KServe Debugging Guide \u00b6 Debug KServe InferenceService Status \u00b6 You deployed an InferenceService to KServe, but it is not in ready state. Go through this step by step guide to understand what failed. kubectl get inferenceservices sklearn-iris NAME URL READY DEFAULT TRAFFIC CANARY TRAFFIC AGE model-example False 1m IngressNotConfigured \u00b6 If you see IngressNotConfigured error, this indicates Istio Ingress Gateway probes are failing. 
kubectl get ksvc NAME URL LATESTCREATED LATESTREADY READY REASON sklearn-iris-predictor-default http://sklearn-iris-predictor-default.default.example.com sklearn-iris-predictor-default-jk794 mnist-sample-predictor-default-jk794 Unknown IngressNotConfigured You can then check Knative networking-istio pod logs for more details. kubectl logs -l app = networking-istio -n knative-serving If you are seeing HTTP 403, then you may have Istio RBAC turned on which blocks the probes to your service. { \"level\" : \"error\" , \"ts\" : \"2020-03-26T19:12:00.749Z\" , \"logger\" : \"istiocontroller.ingress-controller.status-manager\" , \"caller\" : \"ingress/status.go:366\" , \"msg\" : \"Probing of http://flowers-sample-predictor-default.kubeflow-jeanarmel-luce.example.com:80/ failed, IP: 10.0.0.29:80, ready: false, error: unexpected status code: want [200], got 403 (depth: 0)\" , \"commit\" : \"6b0e5c6\" , \"knative.dev/controller\" : \"ingress-controller\" , \"stacktrace\" : \"knative.dev/serving/pkg/reconciler/ingress.(*StatusProber).processWorkItem\\n\\t/home/prow/go/src/knative.dev/serving/pkg/reconciler/ingress/status.go:366\\nknative.dev/serving/pkg/reconciler/ingress.(*StatusProber).Start.func1\\n\\t/home/prow/go/src/knative.dev/serving/pkg/reconciler/ingress/status.go:268\" } RevisionMissing Error \u00b6 If you see RevisionMissing error, then your service pods are not in ready state. Knative Service creates Knative Revision which represents a snapshot of the InferenceService code and configuration. 
Storage Initializer fails to download model \u00b6 kubectl get revision $( kubectl get configuration sklearn-iris-predictor-default --output jsonpath = \"{.status.latestCreatedRevisionName}\" ) NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON sklearn-iris-predictor-default-csjpw sklearn-iris-predictor-default sklearn-iris-predictor-default-csjpw 2 Unknown Deploying If you see READY status in Unknown error, this usually indicates that the KServe Storage Initializer init container fails to download the model and you can check the init container logs to see why it fails, note that the pod scales down after sometime if the init container fails . kubectl get pod -l serving.kserve.io/inferenceservice = sklearn-iris NAME READY STATUS RESTARTS AGE sklearn-iris-predictor-default-29jks-deployment-5f7d4b9996hzrnc 0 /3 Init:Error 1 10s kubectl logs -l model = sklearn-iris -c storage-initializer [ I 200517 03 :56:19 initializer-entrypoint:13 ] Initializing, args: src_uri [ gs://kfserving-examples/models/sklearn/iris-1 ] dest_path [ [ /mnt/models ] [ I 200517 03 :56:19 storage:35 ] Copying contents of gs://kfserving-examples/models/sklearn/iris-1 to local Traceback ( most recent call last ) : File \"/storage-initializer/scripts/initializer-entrypoint\" , line 14 , in kserve.Storage.download ( src_uri, dest_path ) File \"/usr/local/lib/python3.7/site-packages/kfserving/storage.py\" , line 48 , in download Storage._download_gcs ( uri, out_dir ) File \"/usr/local/lib/python3.7/site-packages/kfserving/storage.py\" , line 116 , in _download_gcs The path or model %s does not exist. \" % (uri)) RuntimeError: Failed to fetch model. The path or model gs://kfserving-examples/models/sklearn/iris-1 does not exist. 
[I 200517 03:40:19 initializer-entrypoint:13] Initializing, args: src_uri [gs://kfserving-examples/models/sklearn/iris] dest_path[ [/mnt/models] [I 200517 03:40:19 storage:35] Copying contents of gs://kfserving-examples/models/sklearn/iris to local [I 200517 03:40:20 storage:111] Downloading: /mnt/models/model.joblib [I 200517 03:40:20 storage:60] Successfully copied gs://kfserving-examples/models/sklearn/iris to /mnt/models Inference Service in OOM status \u00b6 If you see ExitCode137 from the revision status, this means the revision has failed and this usually happens when the inference service pod is out of memory. To address it, you might need to bump up the memory limit of the InferenceService . kubectl get revision $( kubectl get configuration sklearn-iris-predictor-default --output jsonpath = \"{.status.latestCreatedRevisionName}\" ) NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON sklearn-iris-predictor-default-84bzf sklearn-iris-predictor-default sklearn-iris-predictor-default-84bzf 8 False ExitCode137s Inference Service fails to start \u00b6 If you see other exit codes from the revision status you can further check the pod status. kubectl get pods -l serving.kserve.io/inferenceservice = sklearn-iris sklearn-iris-predictor-default-rvhmk-deployment-867c6444647tz7n 1 /3 CrashLoopBackOff 3 80s If you see the CrashLoopBackOff , then check the kserve-container log to see more details where it fails, the error log is usually propagated on revision container status also. 
kubectl logs sklearn-iris-predictor-default-rvhmk-deployment-867c6444647tz7n kserve-container [ I 200517 04 :58:21 storage:35 ] Copying contents of /mnt/models to local Traceback ( most recent call last ) : File \"/usr/local/lib/python3.7/runpy.py\" , line 193 , in _run_module_as_main \"__main__\" , mod_spec ) File \"/usr/local/lib/python3.7/runpy.py\" , line 85 , in _run_code exec ( code, run_globals ) File \"/sklearnserver/sklearnserver/__main__.py\" , line 33 , in model.load () File \"/sklearnserver/sklearnserver/model.py\" , line 36 , in load model_file = next ( path for path in paths if os.path.exists ( path )) StopIteration Inference Service cannot fetch docker images from AWS ECR \u00b6 If you don't see the inference service created at all for custom images from private registries (such as AWS ECR), it might be that the Knative Serving Controller fails to authenticate itself against the registry. failed to resolve image to digest: failed to fetch image information: unsupported status code 401 ; body: Not Authorized You can verify that this is actually the case by spinning up a pod that uses your image. The pod should be able to fetch it, if the correct IAM roles are attached, while Knative is not able to. To circumvent this issue you can either skip tag resolution or provide certificates for your registry as detailed in the official knative docs . kubectl -n knative-serving edit configmap config-deployment The resultant yaml will look like something below. 
apiVersion : v1 kind : ConfigMap metadata : name : config-deployment namespace : knative-serving data : # List of repositories for which tag to digest resolving should be skipped (for AWS ECR: {account_id}.dkr.ecr.{region}.amazonaws.com) registriesSkippingTagResolving : registry.example.com Debug KServe Request flow \u00b6 +----------------------+ +-----------------------+ +--------------------------+ |Istio Virtual Service | |Istio Virtual Service | | K8S Service | | | | | | | |sklearn-iris | |sklearn-iris-predictor | | sklearn-iris-predictor | | +------->|-default +----->| -default-$revision | | | | | | | |KServe Route | |Knative Route | | Knative Revision Service | +----------------------+ +-----------------------+ +------------+-------------+ Knative Ingress Gateway Knative Local Gateway Kube Proxy (Istio gateway) (Istio gateway) | | | +-------------------------------------------------------+ | | Knative Revision Pod | | | | | | +-------------------+ +-----------------+ | | | | | | | | | | |kserve-container |<-----+ Queue Proxy | |<------------------+ | | | | | | | +-------------------+ +--------------^--+ | | | | +-----------------------^-------------------------------+ | scale deployment | +--------+--------+ | pull metrics | Knative | | | Autoscaler |----------- | KPA/HPA | +-----------------+ 1.Traffic arrives through Knative Ingress/Local Gateway for external/internal traffic \u00b6 Istio Gateway resource describes the edge of the mesh receiving incoming or outgoing HTTP/TCP connections. The specification describes a set of ports that should be exposed and the type of protocol to use. If you are using Standalone mode, it installs the Gateway in knative-serving namespace, if you are using Kubeflow KServe (KServe installed with Kubeflow), it installs the Gateway in kubeflow namespace e.g on GCP the gateway is protected behind IAP with Istio authentication policy . 
kubectl get gateway knative-ingress-gateway -n knative-serving -oyaml kind : Gateway metadata : labels : networking.knative.dev/ingress-provider : istio serving.knative.dev/release : v0.12.1 name : knative-ingress-gateway namespace : knative-serving spec : selector : istio : ingressgateway servers : - hosts : - '*' port : name : http number : 80 protocol : HTTP - hosts : - '*' port : name : https number : 443 protocol : HTTPS tls : mode : SIMPLE privateKey : /etc/istio/ingressgateway-certs/tls.key serverCertificate : /etc/istio/ingressgateway-certs/tls.crt The InferenceService request routes to the Istio Ingress Gateway by matching the host and port from the url, by default http is configured, you can configure HTTPS with TLS certificates . 2. KServe Istio virtual service to route for predictor, transformer, explainer. \u00b6 kubectl get vs sklearn-iris -oyaml apiVersion : networking.istio.io/v1alpha3 kind : VirtualService metadata : name : sklearn-iris namespace : default gateways : - knative-serving/knative-local-gateway - knative-serving/knative-ingress-gateway hosts : - sklearn-iris.default.svc.cluster.local - sklearn-iris.default.example.com http : - headers : request : set : Host : sklearn-iris-predictor-default.default.svc.cluster.local match : - authority : regex : ^sklearn-iris\\.default(\\.svc(\\.cluster\\.local)?)?(?::\\d{1,5})?$ gateways : - knative-serving/knative-local-gateway - authority : regex : ^sklearn-iris\\.default\\.example\\.com(?::\\d{1,5})?$ gateways : - knative-serving/knative-ingress-gateway route : - destination : host : knative-local-gateway.istio-system.svc.cluster.local port : number : 80 weight : 100 KServe creates the routing rule which by default routes to Predictor if you only have Predictor specified on InferenceService . When Transformer and Explainer are specified on InferenceService the routing rule configures the traffic to route to Transformer or Explainer based on the verb. 
The request then routes to the second level Knative created virtual service via local gateway with the matching host header. 3. Knative Istio virtual service to route the inference request to the latest ready revision. \u00b6 kubectl get vs sklearn-iris-predictor-default-ingress -oyaml apiVersion : networking.istio.io/v1alpha3 kind : VirtualService metadata : name : sklearn-iris-predictor-default-mesh namespace : default spec : gateways : - knative-serving/knative-ingress-gateway - knative-serving/knative-local-gateway hosts : - sklearn-iris-predictor-default.default - sklearn-iris-predictor-default.default.example.com - sklearn-iris-predictor-default.default.svc - sklearn-iris-predictor-default.default.svc.cluster.local http : - match : - authority : prefix : sklearn-iris-predictor-default.default gateways : - knative-serving/knative-local-gateway - authority : prefix : sklearn-iris-predictor-default.default.svc gateways : - knative-serving/knative-local-gateway - authority : prefix : sklearn-iris-predictor-default.default gateways : - knative-serving/knative-local-gateway retries : {} route : - destination : host : sklearn-iris-predictor-default-00001.default.svc.cluster.local port : number : 80 headers : request : set : Knative-Serving-Namespace : default Knative-Serving-Revision : sklearn-iris-predictor-default-00001 weight : 100 - match : - authority : prefix : sklearn-iris-predictor-default.default.example.com gateways : - knative-serving/knative-ingress-gateway retries : {} route : - destination : host : sklearn-iris-predictor-default-00001.default.svc.cluster.local port : number : 80 headers : request : set : Knative-Serving-Namespace : default Knative-Serving-Revision : sklearn-iris-predictor-default-00001 weight : 100 The destination here is the k8s Service for the latest ready Knative Revision and it is reconciled by Knative every time user rolls out a new revision. 
When a new revision is rolled out and in ready state, the old revision is then scaled down, after configured revision GC time the revision resource is garbage collected if the revision no longer has traffic referenced. 4. Kubernetes Service routes the requests to the queue proxy sidecar of the inference service pod on port 8012 . \u00b6 kubectl get svc sklearn-iris-predictor-default-fhmjk-private -oyaml apiVersion : v1 kind : Service metadata : name : sklearn-iris-predictor-default-fhmjk-private namespace : default spec : clusterIP : 10.105.186.18 ports : - name : http port : 80 protocol : TCP targetPort : 8012 - name : queue-metrics port : 9090 protocol : TCP targetPort : queue-metrics - name : http-usermetric port : 9091 protocol : TCP targetPort : http-usermetric - name : http-queueadm port : 8022 protocol : TCP targetPort : 8022 selector : serving.knative.dev/revisionUID : a8f1eafc-3c64-4930-9a01-359f3235333a sessionAffinity : None type : ClusterIP 5. The queue proxy routes to kserve container with max concurrent requests configured with ContainerConcurrency . \u00b6 If the queue proxy has more requests than it can handle, the Knative Autoscaler creates more pods to handle additional requests. 6. Finally The queue proxy routes traffic to the kserve-container for processing the inference requests. \u00b6","title":"Debugging guide"},{"location":"developer/debug/#kserve-debugging-guide","text":"","title":"KServe Debugging Guide"},{"location":"developer/debug/#debug-kserve-inferenceservice-status","text":"You deployed an InferenceService to KServe, but it is not in ready state. Go through this step by step guide to understand what failed. kubectl get inferenceservices sklearn-iris NAME URL READY DEFAULT TRAFFIC CANARY TRAFFIC AGE model-example False 1m","title":"Debug KServe InferenceService Status"},{"location":"developer/debug/#ingressnotconfigured","text":"If you see IngressNotConfigured error, this indicates Istio Ingress Gateway probes are failing. 
kubectl get ksvc NAME URL LATESTCREATED LATESTREADY READY REASON sklearn-iris-predictor-default http://sklearn-iris-predictor-default.default.example.com sklearn-iris-predictor-default-jk794 mnist-sample-predictor-default-jk794 Unknown IngressNotConfigured You can then check Knative networking-istio pod logs for more details. kubectl logs -l app = networking-istio -n knative-serving If you are seeing HTTP 403, then you may have Istio RBAC turned on which blocks the probes to your service. { \"level\" : \"error\" , \"ts\" : \"2020-03-26T19:12:00.749Z\" , \"logger\" : \"istiocontroller.ingress-controller.status-manager\" , \"caller\" : \"ingress/status.go:366\" , \"msg\" : \"Probing of http://flowers-sample-predictor-default.kubeflow-jeanarmel-luce.example.com:80/ failed, IP: 10.0.0.29:80, ready: false, error: unexpected status code: want [200], got 403 (depth: 0)\" , \"commit\" : \"6b0e5c6\" , \"knative.dev/controller\" : \"ingress-controller\" , \"stacktrace\" : \"knative.dev/serving/pkg/reconciler/ingress.(*StatusProber).processWorkItem\\n\\t/home/prow/go/src/knative.dev/serving/pkg/reconciler/ingress/status.go:366\\nknative.dev/serving/pkg/reconciler/ingress.(*StatusProber).Start.func1\\n\\t/home/prow/go/src/knative.dev/serving/pkg/reconciler/ingress/status.go:268\" }","title":"IngressNotConfigured"},{"location":"developer/debug/#revisionmissing-error","text":"If you see RevisionMissing error, then your service pods are not in ready state. 
Knative Service creates Knative Revision which represents a snapshot of the InferenceService code and configuration.","title":"RevisionMissing Error"},{"location":"developer/debug/#storage-initializer-fails-to-download-model","text":"kubectl get revision $( kubectl get configuration sklearn-iris-predictor-default --output jsonpath = \"{.status.latestCreatedRevisionName}\" ) NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON sklearn-iris-predictor-default-csjpw sklearn-iris-predictor-default sklearn-iris-predictor-default-csjpw 2 Unknown Deploying If you see READY status in Unknown error, this usually indicates that the KServe Storage Initializer init container fails to download the model and you can check the init container logs to see why it fails, note that the pod scales down after sometime if the init container fails . kubectl get pod -l serving.kserve.io/inferenceservice = sklearn-iris NAME READY STATUS RESTARTS AGE sklearn-iris-predictor-default-29jks-deployment-5f7d4b9996hzrnc 0 /3 Init:Error 1 10s kubectl logs -l model = sklearn-iris -c storage-initializer [ I 200517 03 :56:19 initializer-entrypoint:13 ] Initializing, args: src_uri [ gs://kfserving-examples/models/sklearn/iris-1 ] dest_path [ [ /mnt/models ] [ I 200517 03 :56:19 storage:35 ] Copying contents of gs://kfserving-examples/models/sklearn/iris-1 to local Traceback ( most recent call last ) : File \"/storage-initializer/scripts/initializer-entrypoint\" , line 14 , in kserve.Storage.download ( src_uri, dest_path ) File \"/usr/local/lib/python3.7/site-packages/kfserving/storage.py\" , line 48 , in download Storage._download_gcs ( uri, out_dir ) File \"/usr/local/lib/python3.7/site-packages/kfserving/storage.py\" , line 116 , in _download_gcs The path or model %s does not exist. \" % (uri)) RuntimeError: Failed to fetch model. The path or model gs://kfserving-examples/models/sklearn/iris-1 does not exist. 
[I 200517 03:40:19 initializer-entrypoint:13] Initializing, args: src_uri [gs://kfserving-examples/models/sklearn/iris] dest_path[ [/mnt/models] [I 200517 03:40:19 storage:35] Copying contents of gs://kfserving-examples/models/sklearn/iris to local [I 200517 03:40:20 storage:111] Downloading: /mnt/models/model.joblib [I 200517 03:40:20 storage:60] Successfully copied gs://kfserving-examples/models/sklearn/iris to /mnt/models","title":"Storage Initializer fails to download model"},{"location":"developer/debug/#inference-service-in-oom-status","text":"If you see ExitCode137 from the revision status, this means the revision has failed and this usually happens when the inference service pod is out of memory. To address it, you might need to bump up the memory limit of the InferenceService . kubectl get revision $( kubectl get configuration sklearn-iris-predictor-default --output jsonpath = \"{.status.latestCreatedRevisionName}\" ) NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON sklearn-iris-predictor-default-84bzf sklearn-iris-predictor-default sklearn-iris-predictor-default-84bzf 8 False ExitCode137s","title":"Inference Service in OOM status"},{"location":"developer/debug/#inference-service-fails-to-start","text":"If you see other exit codes from the revision status you can further check the pod status. kubectl get pods -l serving.kserve.io/inferenceservice = sklearn-iris sklearn-iris-predictor-default-rvhmk-deployment-867c6444647tz7n 1 /3 CrashLoopBackOff 3 80s If you see the CrashLoopBackOff , then check the kserve-container log to see more details where it fails, the error log is usually propagated on revision container status also. 
kubectl logs sklearn-iris-predictor-default-rvhmk-deployment-867c6444647tz7n kserve-container [ I 200517 04 :58:21 storage:35 ] Copying contents of /mnt/models to local Traceback ( most recent call last ) : File \"/usr/local/lib/python3.7/runpy.py\" , line 193 , in _run_module_as_main \"__main__\" , mod_spec ) File \"/usr/local/lib/python3.7/runpy.py\" , line 85 , in _run_code exec ( code, run_globals ) File \"/sklearnserver/sklearnserver/__main__.py\" , line 33 , in model.load () File \"/sklearnserver/sklearnserver/model.py\" , line 36 , in load model_file = next ( path for path in paths if os.path.exists ( path )) StopIteration","title":"Inference Service fails to start"},{"location":"developer/debug/#inference-service-cannot-fetch-docker-images-from-aws-ecr","text":"If you don't see the inference service created at all for custom images from private registries (such as AWS ECR), it might be that the Knative Serving Controller fails to authenticate itself against the registry. failed to resolve image to digest: failed to fetch image information: unsupported status code 401 ; body: Not Authorized You can verify that this is actually the case by spinning up a pod that uses your image. The pod should be able to fetch it, if the correct IAM roles are attached, while Knative is not able to. To circumvent this issue you can either skip tag resolution or provide certificates for your registry as detailed in the official knative docs . kubectl -n knative-serving edit configmap config-deployment The resultant yaml will look like something below. 
apiVersion : v1 kind : ConfigMap metadata : name : config-deployment namespace : knative-serving data : # List of repositories for which tag to digest resolving should be skipped (for AWS ECR: {account_id}.dkr.ecr.{region}.amazonaws.com) registriesSkippingTagResolving : registry.example.com","title":"Inference Service cannot fetch docker images from AWS ECR"},{"location":"developer/debug/#debug-kserve-request-flow","text":"+----------------------+ +-----------------------+ +--------------------------+ |Istio Virtual Service | |Istio Virtual Service | | K8S Service | | | | | | | |sklearn-iris | |sklearn-iris-predictor | | sklearn-iris-predictor | | +------->|-default +----->| -default-$revision | | | | | | | |KServe Route | |Knative Route | | Knative Revision Service | +----------------------+ +-----------------------+ +------------+-------------+ Knative Ingress Gateway Knative Local Gateway Kube Proxy (Istio gateway) (Istio gateway) | | | +-------------------------------------------------------+ | | Knative Revision Pod | | | | | | +-------------------+ +-----------------+ | | | | | | | | | | |kserve-container |<-----+ Queue Proxy | |<------------------+ | | | | | | | +-------------------+ +--------------^--+ | | | | +-----------------------^-------------------------------+ | scale deployment | +--------+--------+ | pull metrics | Knative | | | Autoscaler |----------- | KPA/HPA | +-----------------+","title":"Debug KServe Request flow"},{"location":"developer/debug/#1traffic-arrives-through-knative-ingresslocal-gateway-for-externalinternal-traffic","text":"Istio Gateway resource describes the edge of the mesh receiving incoming or outgoing HTTP/TCP connections. The specification describes a set of ports that should be exposed and the type of protocol to use. 
If you are using Standalone mode, it installs the Gateway in knative-serving namespace, if you are using Kubeflow KServe (KServe installed with Kubeflow), it installs the Gateway in kubeflow namespace e.g on GCP the gateway is protected behind IAP with Istio authentication policy . kubectl get gateway knative-ingress-gateway -n knative-serving -oyaml kind : Gateway metadata : labels : networking.knative.dev/ingress-provider : istio serving.knative.dev/release : v0.12.1 name : knative-ingress-gateway namespace : knative-serving spec : selector : istio : ingressgateway servers : - hosts : - '*' port : name : http number : 80 protocol : HTTP - hosts : - '*' port : name : https number : 443 protocol : HTTPS tls : mode : SIMPLE privateKey : /etc/istio/ingressgateway-certs/tls.key serverCertificate : /etc/istio/ingressgateway-certs/tls.crt The InferenceService request routes to the Istio Ingress Gateway by matching the host and port from the url, by default http is configured, you can configure HTTPS with TLS certificates .","title":"1.Traffic arrives through Knative Ingress/Local Gateway for external/internal traffic"},{"location":"developer/debug/#2-kserve-istio-virtual-service-to-route-for-predictor-transformer-explainer","text":"kubectl get vs sklearn-iris -oyaml apiVersion : networking.istio.io/v1alpha3 kind : VirtualService metadata : name : sklearn-iris namespace : default gateways : - knative-serving/knative-local-gateway - knative-serving/knative-ingress-gateway hosts : - sklearn-iris.default.svc.cluster.local - sklearn-iris.default.example.com http : - headers : request : set : Host : sklearn-iris-predictor-default.default.svc.cluster.local match : - authority : regex : ^sklearn-iris\\.default(\\.svc(\\.cluster\\.local)?)?(?::\\d{1,5})?$ gateways : - knative-serving/knative-local-gateway - authority : regex : ^sklearn-iris\\.default\\.example\\.com(?::\\d{1,5})?$ gateways : - knative-serving/knative-ingress-gateway route : - destination : host : 
knative-local-gateway.istio-system.svc.cluster.local port : number : 80 weight : 100 KServe creates the routing rule which by default routes to Predictor if you only have Predictor specified on InferenceService . When Transformer and Explainer are specified on InferenceService the routing rule configures the traffic to route to Transformer or Explainer based on the verb. The request then routes to the second level Knative created virtual service via local gateway with the matching host header.","title":"2. KServe Istio virtual service to route for predictor, transformer, explainer."},{"location":"developer/debug/#3-knative-istio-virtual-service-to-route-the-inference-request-to-the-latest-ready-revision","text":"kubectl get vs sklearn-iris-predictor-default-ingress -oyaml apiVersion : networking.istio.io/v1alpha3 kind : VirtualService metadata : name : sklearn-iris-predictor-default-mesh namespace : default spec : gateways : - knative-serving/knative-ingress-gateway - knative-serving/knative-local-gateway hosts : - sklearn-iris-predictor-default.default - sklearn-iris-predictor-default.default.example.com - sklearn-iris-predictor-default.default.svc - sklearn-iris-predictor-default.default.svc.cluster.local http : - match : - authority : prefix : sklearn-iris-predictor-default.default gateways : - knative-serving/knative-local-gateway - authority : prefix : sklearn-iris-predictor-default.default.svc gateways : - knative-serving/knative-local-gateway - authority : prefix : sklearn-iris-predictor-default.default gateways : - knative-serving/knative-local-gateway retries : {} route : - destination : host : sklearn-iris-predictor-default-00001.default.svc.cluster.local port : number : 80 headers : request : set : Knative-Serving-Namespace : default Knative-Serving-Revision : sklearn-iris-predictor-default-00001 weight : 100 - match : - authority : prefix : sklearn-iris-predictor-default.default.example.com gateways : - knative-serving/knative-ingress-gateway retries : 
{} route : - destination : host : sklearn-iris-predictor-default-00001.default.svc.cluster.local port : number : 80 headers : request : set : Knative-Serving-Namespace : default Knative-Serving-Revision : sklearn-iris-predictor-default-00001 weight : 100 The destination here is the k8s Service for the latest ready Knative Revision and it is reconciled by Knative every time user rolls out a new revision. When a new revision is rolled out and in ready state, the old revision is then scaled down, after configured revision GC time the revision resource is garbage collected if the revision no longer has traffic referenced.","title":"3. Knative Istio virtual service to route the inference request to the latest ready revision."},{"location":"developer/debug/#4-kubernetes-service-routes-the-requests-to-the-queue-proxy-sidecar-of-the-inference-service-pod-on-port-8012","text":"kubectl get svc sklearn-iris-predictor-default-fhmjk-private -oyaml apiVersion : v1 kind : Service metadata : name : sklearn-iris-predictor-default-fhmjk-private namespace : default spec : clusterIP : 10.105.186.18 ports : - name : http port : 80 protocol : TCP targetPort : 8012 - name : queue-metrics port : 9090 protocol : TCP targetPort : queue-metrics - name : http-usermetric port : 9091 protocol : TCP targetPort : http-usermetric - name : http-queueadm port : 8022 protocol : TCP targetPort : 8022 selector : serving.knative.dev/revisionUID : a8f1eafc-3c64-4930-9a01-359f3235333a sessionAffinity : None type : ClusterIP","title":"4. Kubernetes Service routes the requests to the queue proxy sidecar of the inference service pod on port 8012."},{"location":"developer/debug/#5-the-queue-proxy-routes-to-kserve-container-with-max-concurrent-requests-configured-with-containerconcurrency","text":"If the queue proxy has more requests than it can handle, the Knative Autoscaler creates more pods to handle additional requests.","title":"5. 
The queue proxy routes to kserve container with max concurrent requests configured with ContainerConcurrency."},{"location":"developer/debug/#6-finally-the-queue-proxy-routes-traffic-to-the-kserve-container-for-processing-the-inference-requests","text":"","title":"6. Finally The queue proxy routes traffic to the kserve-container for processing the inference requests."},{"location":"developer/developer/","text":"Development \u00b6 This doc explains how to setup a development environment so you can get started contributing . Prerequisites \u00b6 Follow the instructions below to set up your development environment. Once you meet these requirements, you can make changes and deploy your own version of kserve ! Before submitting a PR, see also CONTRIBUTING.md . Install requirements \u00b6 You must install these tools: go : KServe controller is written in Go and requires Go 1.20.0+. git : For source control. Go Module : Go's new dependency management system. ko : For development. kubectl : For managing development environments. kustomize To customize YAMLs for different environments, requires v5.0.0+. yq yq is used in the project makefiles to parse and display YAML output, requires yq 4.* . Install Knative on a Kubernetes cluster \u00b6 KServe currently requires Knative Serving for auto-scaling, canary rollout, Istio for traffic routing and ingress. To install Knative components on your Kubernetes cluster, follow the installation guide or alternatively, use the Knative Operators to manage your installation. Observability, tracing and logging are optional but are often very valuable tools for troubleshooting difficult issues, they can be installed via the directions here . If you start from scratch, KServe requires Kubernetes 1.25+, Knative 1.7+, Istio 1.15+. If you already have Istio or Knative (e.g. from a Kubeflow install) then you don't need to install them explicitly, as long as version dependencies are satisfied. 
Note On a local environment, when using minikube or kind as Kubernetes cluster, there has been a reported issue that knative quickstart bootstrap does not work as expected. It is recommended to follow the installation manual from knative using yaml or using knative operator for a better result. Setup your environment \u00b6 To start your environment you'll need to set these environment variables (we recommend adding them to your .bashrc ): GOPATH : If you don't have one, simply pick a directory and add export GOPATH=... $GOPATH/bin on PATH : This is so that tooling installed via go get will work properly. KO_DEFAULTPLATFORMS : If you are using M1 Mac book the value is linux/arm64 . KO_DOCKER_REPO : The docker repository to which developer images should be pushed (e.g. docker.io/ ). Note : Set up a docker repository for pushing images. You can use any container image registry by adjusting the authentication methods and repository paths mentioned in the sections below. Google Container Registry quickstart Docker Hub quickstart Azure Container Registry quickstart Note if you are using docker hub to store your images your KO_DOCKER_REPO variable should be docker.io/ . Currently Docker Hub doesn't let you create subdirs under your username. .bashrc example: export GOPATH = \" $HOME /go\" export PATH = \" ${ PATH } : ${ GOPATH } /bin\" export KO_DOCKER_REPO = 'docker.io/' Checkout your fork \u00b6 The Go tools require that you clone the repository to the src/github.com/kserve/kserve directory in your GOPATH . To check out this repository: Create your own fork of this repo Clone it to your machine: mkdir -p ${ GOPATH } /src/github.com/kserve cd ${ GOPATH } /src/github.com/kserve git clone git@github.com: ${ YOUR_GITHUB_USERNAME } /kserve.git cd kserve git remote add upstream git@github.com:kserve/kserve.git git remote set-url --push upstream no_push Adding the upstream remote sets you up nicely for regularly syncing your fork . 
Once you reach this point you are ready to do a full build and deploy as described below. Deploy KServe \u00b6 Check Knative Serving installation \u00b6 Once you've set up your development environment , you can verify the installation with the following: Success $ kubectl -n knative-serving get pods NAME READY STATUS RESTARTS AGE activator-77784645fc-t2pjf 1 /1 Running 0 11d autoscaler-6fddf74d5-z2fzf 1 /1 Running 0 11d autoscaler-hpa-5bf4476cc5-tsbw6 1 /1 Running 0 11d controller-7b8cd7f95c-6jxxj 1 /1 Running 0 11d istio-webhook-866c5bc7f8-t5ztb 1 /1 Running 0 11d networking-istio-54fb8b5d4b-xznwd 1 /1 Running 0 11d webhook-5f5f7bd9b4-cv27c 1 /1 Running 0 11d $ kubectl get gateway -n knative-serving NAME AGE knative-ingress-gateway 11d knative-local-gateway 11d $ kubectl get svc -n istio-system NAME TYPE CLUSTER-IP EXTERNAL-IP PORT ( S ) AGE istio-ingressgateway LoadBalancer 10 .101.196.89 X.X.X.X 15021 :31101/TCP,80:31781/TCP,443:30372/TCP,15443:31067/TCP 11d istiod ClusterIP 10 .101.116.203 15010 /TCP,15012/TCP,443/TCP,15014/TCP,853/TCP 11d Deploy KServe from master branch \u00b6 We suggest using cert manager for provisioning the certificates for the webhook server. Other solutions should also work as long as they put the certificates in the desired location. You can follow the cert manager documentation to install it. If you don't want to install cert manager, you can set the KSERVE_ENABLE_SELF_SIGNED_CA environment variable to true. Setting KSERVE_ENABLE_SELF_SIGNED_CA runs a script that creates a self-signed CA and patches it into the webhook config. export KSERVE_ENABLE_SELF_SIGNED_CA = true After that, run the following command to deploy KServe . You can skip the step above if cert manager is already installed. make deploy Optionally, you can change the CPU and memory limits when deploying KServe . 
export KSERVE_CONTROLLER_CPU_LIMIT = export KSERVE_CONTROLLER_MEMORY_LIMIT = make deploy Expected Output $ kubectl get pods -n kserve -l control-plane = kserve-controller-manager NAME READY STATUS RESTARTS AGE kserve-controller-manager-0 2/2 Running 0 13m Note By default, this installs to the kserve namespace with the published controller manager image from the master branch. Deploy KServe with your own version \u00b6 Run the following command to deploy the KServe controller and model agent with your local changes. make deploy-dev Note deploy-dev builds the image from your local code, publishes it to KO_DOCKER_REPO and deploys the kserve-controller-manager and model agent with the image digest to your cluster for testing. Please also ensure you are logged in to KO_DOCKER_REPO from your client machine. Run the following command to deploy the model server with your local changes. make deploy-dev-sklearn make deploy-dev-xgb Run the following command to deploy the explainer with your local changes. make deploy-dev-alibi Run the following command to deploy the storage initializer with your local changes. make deploy-dev-storageInitializer Warning The deploy command publishes the image to KO_DOCKER_REPO with the version latest and changes the InferenceService configmap to point to the newly built image sha. The built image is only for development and testing purposes. The current limitation is that it changes only the impacted image and resets all other images, including the kserve-controller-manager , to the default ones. Smoke test after deployment \u00b6 Run the following command to smoke test the deployment: kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/docs/samples/v1beta1/tensorflow/tensorflow.yaml You should see the model serving deployment running under default or your specified namespace. 
$ kubectl get pods -n default -l serving.kserve.io/inferenceservice=flower-sample Expected Output NAME READY STATUS RESTARTS AGE flower-sample-default-htz8r-deployment-8fd979f9b-w2qbv 3/3 Running 0 10s Running unit/integration tests \u00b6 kserve-controller-manager has a few integration tests which require a mock apiserver and etcd; these are installed along with kubebuilder . To run all unit/integration tests: make test Run e2e tests locally \u00b6 To set up from local code, do: ./hack/quick_install.sh make undeploy make deploy-dev Go to python/kserve and install the kserve Python SDK dependencies: pip3 install -e . [ test ] Then go to test/e2e . Run kubectl create namespace kserve-ci-e2e-test For KIND/minikube: Run export KSERVE_INGRESS_HOST_PORT=localhost:8080 In a different window run kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80 Note that not all tests will pass, as the pytorch test requires a GPU. These tests show up as pending pods at the end, or you can add a marker to skip them. Run pytest > testresults.txt Tests may not clean up. To re-run, first do kubectl delete namespace kserve-ci-e2e-test , recreate the namespace and run again. Iterating \u00b6 As you make changes to the code-base, there are two special cases to be aware of: If you change an input to generated code , then you must run make manifests . Inputs include: API type definitions in apis/serving Manifests or kustomize patches stored in config . To generate the KServe python/go clients, you should run make generate . If you want to add new dependencies , then you add the imports and the specific version of the dependency module in go.mod . When it encounters an import of a package not provided by any module in go.mod , the go command automatically looks up the module containing the package and adds it to go.mod using the latest version. 
If you want to upgrade the dependency , then run the go get command, e.g. go get golang.org/x/text to upgrade to the latest version, or go get golang.org/x/text@v0.3.0 to upgrade to a specific version. make deploy-dev Contribute to the code \u00b6 See the guidelines for contributing a feature or contributing to an existing issue Releases \u00b6 Please check out the documentation here to understand the release schedule and process. Feedback \u00b6 The best place to provide feedback about the KServe code is via a GitHub issue. See creating a GitHub issue for guidelines on submitting bugs and feature requests.","title":"How to contribute"},{"location":"developer/developer/#development","text":"This doc explains how to set up a development environment so you can get started contributing .","title":"Development"},{"location":"developer/developer/#prerequisites","text":"Follow the instructions below to set up your development environment. Once you meet these requirements, you can make changes and deploy your own version of kserve ! Before submitting a PR, see also CONTRIBUTING.md .","title":"Prerequisites"},{"location":"developer/developer/#install-requirements","text":"You must install these tools: go : KServe controller is written in Go and requires Go 1.20.0+. git : For source control. Go Module : Go's new dependency management system. ko : For development. kubectl : For managing development environments. kustomize To customize YAMLs for different environments, requires v5.0.0+. yq yq is used in the project makefiles to parse and display YAML output, requires yq 4.* .","title":"Install requirements"},{"location":"developer/developer/#install-knative-on-a-kubernetes-cluster","text":"KServe currently requires Knative Serving for auto-scaling, canary rollout, Istio for traffic routing and ingress. To install Knative components on your Kubernetes cluster, follow the installation guide or alternatively, use the Knative Operators to manage your installation. 
Observability, tracing and logging are optional but are often very valuable tools for troubleshooting difficult issues, they can be installed via the directions here . If you start from scratch, KServe requires Kubernetes 1.25+, Knative 1.7+, Istio 1.15+. If you already have Istio or Knative (e.g. from a Kubeflow install) then you don't need to install them explicitly, as long as version dependencies are satisfied. Note On a local environment, when using minikube or kind as Kubernetes cluster, there has been a reported issue that knative quickstart bootstrap does not work as expected. It is recommended to follow the installation manual from knative using yaml or using knative operator for a better result.","title":"Install Knative on a Kubernetes cluster"},{"location":"developer/developer/#setup-your-environment","text":"To start your environment you'll need to set these environment variables (we recommend adding them to your .bashrc ): GOPATH : If you don't have one, simply pick a directory and add export GOPATH=... $GOPATH/bin on PATH : This is so that tooling installed via go get will work properly. KO_DEFAULTPLATFORMS : If you are using M1 Mac book the value is linux/arm64 . KO_DOCKER_REPO : The docker repository to which developer images should be pushed (e.g. docker.io/ ). Note : Set up a docker repository for pushing images. You can use any container image registry by adjusting the authentication methods and repository paths mentioned in the sections below. Google Container Registry quickstart Docker Hub quickstart Azure Container Registry quickstart Note if you are using docker hub to store your images your KO_DOCKER_REPO variable should be docker.io/ . Currently Docker Hub doesn't let you create subdirs under your username. 
.bashrc example: export GOPATH = \" $HOME /go\" export PATH = \" ${ PATH } : ${ GOPATH } /bin\" export KO_DOCKER_REPO = 'docker.io/'","title":"Setup your environment"},{"location":"developer/developer/#checkout-your-fork","text":"The Go tools require that you clone the repository to the src/github.com/kserve/kserve directory in your GOPATH . To check out this repository: Create your own fork of this repo Clone it to your machine: mkdir -p ${ GOPATH } /src/github.com/kserve cd ${ GOPATH } /src/github.com/kserve git clone git@github.com: ${ YOUR_GITHUB_USERNAME } /kserve.git cd kserve git remote add upstream git@github.com:kserve/kserve.git git remote set-url --push upstream no_push Adding the upstream remote sets you up nicely for regularly syncing your fork . Once you reach this point you are ready to do a full build and deploy as described below.","title":"Checkout your fork"},{"location":"developer/developer/#deploy-kserve","text":"","title":"Deploy KServe"},{"location":"developer/developer/#check-knative-serving-installation","text":"Once you've set up your development environment , you can verify the installation with the following: Success $ kubectl -n knative-serving get pods NAME READY STATUS RESTARTS AGE activator-77784645fc-t2pjf 1 /1 Running 0 11d autoscaler-6fddf74d5-z2fzf 1 /1 Running 0 11d autoscaler-hpa-5bf4476cc5-tsbw6 1 /1 Running 0 11d controller-7b8cd7f95c-6jxxj 1 /1 Running 0 11d istio-webhook-866c5bc7f8-t5ztb 1 /1 Running 0 11d networking-istio-54fb8b5d4b-xznwd 1 /1 Running 0 11d webhook-5f5f7bd9b4-cv27c 1 /1 Running 0 11d $ kubectl get gateway -n knative-serving NAME AGE knative-ingress-gateway 11d knative-local-gateway 11d $ kubectl get svc -n istio-system NAME TYPE CLUSTER-IP EXTERNAL-IP PORT ( S ) AGE istio-ingressgateway LoadBalancer 10 .101.196.89 X.X.X.X 15021 :31101/TCP,80:31781/TCP,443:30372/TCP,15443:31067/TCP 11d istiod ClusterIP 10 .101.116.203 15010 /TCP,15012/TCP,443/TCP,15014/TCP,853/TCP 11d","title":"Check Knative Serving 
installation"},{"location":"developer/developer/#deploy-kserve-from-master-branch","text":"We suggest using cert manager for provisioning the certificates for the webhook server. Other solutions should also work as long as they put the certificates in the desired location. You can follow the cert manager documentation to install it. If you don't want to install cert manager, you can set the KSERVE_ENABLE_SELF_SIGNED_CA environment variable to true. Setting KSERVE_ENABLE_SELF_SIGNED_CA runs a script that creates a self-signed CA and patches it into the webhook config. export KSERVE_ENABLE_SELF_SIGNED_CA = true After that, run the following command to deploy KServe . You can skip the step above if cert manager is already installed. make deploy Optionally, you can change the CPU and memory limits when deploying KServe . export KSERVE_CONTROLLER_CPU_LIMIT = export KSERVE_CONTROLLER_MEMORY_LIMIT = make deploy Expected Output $ kubectl get pods -n kserve -l control-plane = kserve-controller-manager NAME READY STATUS RESTARTS AGE kserve-controller-manager-0 2/2 Running 0 13m Note By default, this installs to the kserve namespace with the published controller manager image from the master branch.","title":"Deploy KServe from master branch"},{"location":"developer/developer/#deploy-kserve-with-your-own-version","text":"Run the following command to deploy the KServe controller and model agent with your local changes. make deploy-dev Note deploy-dev builds the image from your local code, publishes it to KO_DOCKER_REPO and deploys the kserve-controller-manager and model agent with the image digest to your cluster for testing. Please also ensure you are logged in to KO_DOCKER_REPO from your client machine. Run the following command to deploy the model server with your local changes. make deploy-dev-sklearn make deploy-dev-xgb Run the following command to deploy the explainer with your local changes. make deploy-dev-alibi Run the following command to deploy the storage initializer with your local changes. 
make deploy-dev-storageInitializer Warning The deploy command publishes the image to KO_DOCKER_REPO with the version latest and changes the InferenceService configmap to point to the newly built image sha. The built image is only for development and testing purposes. The current limitation is that it changes the impacted image and resets all other images, including the kserve-controller-manager, to use the default ones.","title":"Deploy KServe with your own version"},{"location":"developer/developer/#smoke-test-after-deployment","text":"Run the following command to smoke test the deployment: kubectl apply -f https://raw.githubusercontent.com/kserve/kserve/master/docs/samples/v1beta1/tensorflow/tensorflow.yaml You should see the model serving deployment running under the default namespace or the namespace you specified. $ kubectl get pods -n default -l serving.kserve.io/inferenceservice=flower-sample Expected Output NAME READY STATUS RESTARTS AGE flower-sample-default-htz8r-deployment-8fd979f9b-w2qbv 3/3 Running 0 10s","title":"Smoke test after deployment"},{"location":"developer/developer/#running-unitintegration-tests","text":"kserve-controller-manager has a few integration tests which require a mock apiserver and etcd; these get installed along with kubebuilder . To run all unit/integration tests: make test","title":"Running unit/integration tests"},{"location":"developer/developer/#run-e2e-tests-locally","text":"To set up from local code, do: ./hack/quick_install.sh make undeploy make deploy-dev Go to python/kserve and install the kserve python sdk deps: pip3 install -e . [ test ] Then go to test/e2e . Run kubectl create namespace kserve-ci-e2e-test For KIND/minikube: Run export KSERVE_INGRESS_HOST_PORT=localhost:8080 In a different window run kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80 Note that not all tests will pass, as the pytorch test requires a GPU. These will show as pending pods at the end, or you can add a marker to skip the test. 
Run pytest > testresults.txt Tests may not clean up. To re-run, first do kubectl delete namespace kserve-ci-e2e-test , recreate namespace and run again.","title":"Run e2e tests locally"},{"location":"developer/developer/#iterating","text":"As you make changes to the code-base, there are two special cases to be aware of: If you change an input to generated code , then you must run make manifests . Inputs include: API type definitions in apis/serving Manifests or kustomize patches stored in config . To generate the KServe python/go clients, you should run make generate . If you want to add new dependencies , then you add the imports and the specific version of the dependency module in go.mod . When it encounters an import of a package not provided by any module in go.mod , the go command automatically looks up the module containing the package and adds it to go.mod using the latest version. If you want to upgrade the dependency , then you run go get command e.g go get golang.org/x/text to upgrade to the latest version, go get golang.org/x/text@v0.3.0 to upgrade to a specific version. make deploy-dev","title":"Iterating"},{"location":"developer/developer/#contribute-to-the-code","text":"See the guidelines for contributing a feature contributing to an existing issue","title":"Contribute to the code"},{"location":"developer/developer/#releases","text":"Please check out the documentation here to understand the release schedule and process.","title":"Releases"},{"location":"developer/developer/#feedback","text":"The best place to provide feedback about the KServe code is via a Github issue. See creating a Github issue for guidelines on submitting bugs and feature requests.","title":"Feedback"},{"location":"get_started/","text":"Getting Started with KServe \u00b6 Before you begin \u00b6 Warning KServe Quickstart Environments are for experimentation use only. 
For production installation, see our Administrator's Guide. Before you can get started with a KServe Quickstart deployment, you must install kind and the Kubernetes CLI. Install Kind (Kubernetes in Docker) \u00b6 You can use kind (Kubernetes in Docker) to run a local Kubernetes cluster with Docker container nodes. Install the Kubernetes CLI \u00b6 The Kubernetes CLI ( kubectl ) allows you to run commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs. Install the KServe \"Quickstart\" environment \u00b6 After installing kind, create a kind cluster with: kind create cluster Then run: kubectl config get-contexts It should list the contexts you have; one of them should be kind-kind . Then run: kubectl config use-context kind-kind to use this context. You can then get started with a local deployment of KServe by using the KServe Quick installation script on Kind : curl -s \"https://raw.githubusercontent.com/kserve/kserve/release-0.13/hack/quick_install.sh\" | bash or install via our published Helm Charts: helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version v0.13.0 helm install kserve oci://ghcr.io/kserve/charts/kserve --version v0.13.0","title":"KServe Quickstart"},{"location":"get_started/#getting-started-with-kserve","text":"","title":"Getting Started with KServe"},{"location":"get_started/#before-you-begin","text":"Warning KServe Quickstart Environments are for experimentation use only. 
For production installation, see our Administrator's Guide. Before you can get started with a KServe Quickstart deployment, you must install kind and the Kubernetes CLI.","title":"Before you begin"},{"location":"get_started/#install-kind-kubernetes-in-docker","text":"You can use kind (Kubernetes in Docker) to run a local Kubernetes cluster with Docker container nodes.","title":"Install Kind (Kubernetes in Docker)"},{"location":"get_started/#install-the-kubernetes-cli","text":"The Kubernetes CLI ( kubectl ) allows you to run commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs.","title":"Install the Kubernetes CLI"},{"location":"get_started/#install-the-kserve-quickstart-environment","text":"After installing kind, create a kind cluster with: kind create cluster Then run: kubectl config get-contexts It should list the contexts you have; one of them should be kind-kind . Then run: kubectl config use-context kind-kind to use this context. You can then get started with a local deployment of KServe by using the KServe Quick installation script on Kind : curl -s \"https://raw.githubusercontent.com/kserve/kserve/release-0.13/hack/quick_install.sh\" | bash or install via our published Helm Charts: helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version v0.13.0 helm install kserve oci://ghcr.io/kserve/charts/kserve --version v0.13.0","title":"Install the KServe \"Quickstart\" environment"},{"location":"get_started/first_isvc/","text":"Run your first InferenceService \u00b6 In this tutorial, you will deploy an InferenceService with a predictor that will load a scikit-learn model trained with the iris dataset. This dataset has three output classes: Iris Setosa, Iris Versicolour, and Iris Virginica. You will then send an inference request to your deployed model in order to get a prediction for the class of iris plant your request corresponds to. 
Since your model is being deployed as an InferenceService, not a raw Kubernetes Service, you just need to provide the storage location of the model and it gets some super powers out of the box . 1. Create a namespace \u00b6 First, create a namespace to use for deploying KServe resources: kubectl create namespace kserve-test 2. Create an InferenceService \u00b6 Next, define a new InferenceService YAML for the model and apply it to the cluster. A new predictor schema was introduced in v0.8.0 . New InferenceServices should be deployed using the new schema. The old schema is provided as reference. New Schema Old Schema kubectl apply -n kserve-test -f - < \"./iris-input.json\" { \"instances\": [ [6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6] ] } EOF Depending on your setup, use one of the following commands to curl the InferenceService : Real DNS Magic DNS From Ingress gateway with HOST Header From local cluster gateway If you have configured the DNS, you can directly curl the InferenceService with the URL obtained from the status print. e.g curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test. ${ CUSTOM_DOMAIN } /v1/models/sklearn-iris:predict -d @./iris-input.json If you don't want to go through the trouble to get a real domain, you can instead use \"magic\" dns xip.io . The key is to get the external IP for your cluster. kubectl get svc istio-ingressgateway --namespace istio-system Look for the EXTERNAL-IP column's value(in this case 35.237.217.209) NAME TYPE CLUSTER-IP EXTERNAL-IP PORT ( S ) AGE istio-ingressgateway LoadBalancer 10 .51.253.94 35 .237.217.209 Next step is to setting up the custom domain: kubectl edit cm config-domain --namespace knative-serving Now in your editor, change example.com to {{external-ip}}.xip.io (make sure to replace {{external-ip}} with the IP you found earlier). 
With the change applied you can now directly curl the URL curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test.35.237.217.209.xip.io/v1/models/sklearn-iris:predict -d @./iris-input.json If you do not have DNS, you can still curl with the ingress gateway external IP using the HOST Header. SERVICE_HOSTNAME = $( kubectl get inferenceservice sklearn-iris -n kserve-test -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) curl -v -H \"Host: ${ SERVICE_HOSTNAME } \" -H \"Content-Type: application/json\" \"http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/sklearn-iris:predict\" -d @./iris-input.json If you are calling from in cluster you can curl with the internal url with host {{ InferenceServiceName }}. curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test/v1/models/sklearn-iris:predict -d @./iris-input.json You should see two predictions returned (i.e. {\"predictions\": [1, 1]} ). Both sets of data points sent for inference correspond to the flower with index 1 . In this case, the model predicts that both flowers are \"Iris Versicolour\". 6. 
Run performance test (optional) \u00b6 If you want to load test the deployed model, try deploying the following Kubernetes Job to drive load to the model: # use kubectl create instead of apply because the job template is using generateName which doesn't work with kubectl apply kubectl create -f https://raw.githubusercontent.com/kserve/kserve/release-0.11/docs/samples/v1beta1/sklearn/v1/perf.yaml -n kserve-test Execute the following command to view output: kubectl logs load-test8b58n-rgfxr -n kserve-test Expected Output Requests [ total, rate, throughput ] 30000 , 500 .02, 499 .99 Duration [ total, attack, wait ] 1m0s, 59 .998s, 3 .336ms Latencies [ min, mean, 50 , 90 , 95 , 99 , max ] 1 .743ms, 2 .748ms, 2 .494ms, 3 .363ms, 4 .091ms, 7 .749ms, 46 .354ms Bytes In [ total, mean ] 690000 , 23 .00 Bytes Out [ total, mean ] 2460000 , 82 .00 Success [ ratio ] 100 .00% Status Codes [ code:count ] 200 :30000 Error Set:","title":"First InferenceService"},{"location":"get_started/first_isvc/#run-your-first-inferenceservice","text":"In this tutorial, you will deploy an InferenceService with a predictor that will load a scikit-learn model trained with the iris dataset. This dataset has three output classes: Iris Setosa, Iris Versicolour, and Iris Virginica. You will then send an inference request to your deployed model in order to get a prediction for the class of iris plant your request corresponds to. Since your model is being deployed as an InferenceService, not a raw Kubernetes Service, you just need to provide the storage location of the model and it gets some super powers out of the box .","title":"Run your first InferenceService"},{"location":"get_started/first_isvc/#1-create-a-namespace","text":"First, create a namespace to use for deploying KServe resources: kubectl create namespace kserve-test","title":"1. 
Create a namespace"},{"location":"get_started/first_isvc/#2-create-an-inferenceservice","text":"Next, define a new InferenceService YAML for the model and apply it to the cluster. A new predictor schema was introduced in v0.8.0 . New InferenceServices should be deployed using the new schema. The old schema is provided as reference. New Schema Old Schema kubectl apply -n kserve-test -f - < \"./iris-input.json\" { \"instances\": [ [6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6] ] } EOF Depending on your setup, use one of the following commands to curl the InferenceService : Real DNS Magic DNS From Ingress gateway with HOST Header From local cluster gateway If you have configured the DNS, you can directly curl the InferenceService with the URL obtained from the status print. e.g curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test. ${ CUSTOM_DOMAIN } /v1/models/sklearn-iris:predict -d @./iris-input.json If you don't want to go through the trouble to get a real domain, you can instead use \"magic\" dns xip.io . The key is to get the external IP for your cluster. kubectl get svc istio-ingressgateway --namespace istio-system Look for the EXTERNAL-IP column's value(in this case 35.237.217.209) NAME TYPE CLUSTER-IP EXTERNAL-IP PORT ( S ) AGE istio-ingressgateway LoadBalancer 10 .51.253.94 35 .237.217.209 Next step is to setting up the custom domain: kubectl edit cm config-domain --namespace knative-serving Now in your editor, change example.com to {{external-ip}}.xip.io (make sure to replace {{external-ip}} with the IP you found earlier). With the change applied you can now directly curl the URL curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test.35.237.217.209.xip.io/v1/models/sklearn-iris:predict -d @./iris-input.json If you do not have DNS, you can still curl with the ingress gateway external IP using the HOST Header. 
SERVICE_HOSTNAME = $( kubectl get inferenceservice sklearn-iris -n kserve-test -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) curl -v -H \"Host: ${ SERVICE_HOSTNAME } \" -H \"Content-Type: application/json\" \"http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/sklearn-iris:predict\" -d @./iris-input.json If you are calling from in cluster you can curl with the internal url with host {{ InferenceServiceName }}. curl -v -H \"Content-Type: application/json\" http://sklearn-iris.kserve-test/v1/models/sklearn-iris:predict -d @./iris-input.json You should see two predictions returned (i.e. {\"predictions\": [1, 1]} ). Both sets of data points sent for inference correspond to the flower with index 1 . In this case, the model predicts that both flowers are \"Iris Versicolour\".","title":"5. Perform inference"},{"location":"get_started/first_isvc/#6-run-performance-test-optional","text":"If you want to load test the deployed model, try deploying the following Kubernetes Job to drive load to the model: # use kubectl create instead of apply because the job template is using generateName which doesn't work with kubectl apply kubectl create -f https://raw.githubusercontent.com/kserve/kserve/release-0.11/docs/samples/v1beta1/sklearn/v1/perf.yaml -n kserve-test Execute the following command to view output: kubectl logs load-test8b58n-rgfxr -n kserve-test Expected Output Requests [ total, rate, throughput ] 30000 , 500 .02, 499 .99 Duration [ total, attack, wait ] 1m0s, 59 .998s, 3 .336ms Latencies [ min, mean, 50 , 90 , 95 , 99 , max ] 1 .743ms, 2 .748ms, 2 .494ms, 3 .363ms, 4 .091ms, 7 .749ms, 46 .354ms Bytes In [ total, mean ] 690000 , 23 .00 Bytes Out [ total, mean ] 2460000 , 82 .00 Success [ ratio ] 100 .00% Status Codes [ code:count ] 200 :30000 Error Set:","title":"6. 
Run performance test (optional)"},{"location":"get_started/swagger_ui/","text":"InferenceService Swagger UI \u00b6 KServe ModelServer is built on top of FastAPI , which brings out-of-box support for OpenAPI specification and Swagger UI . Swagger UI allows visualizing and interacting with the KServe InferenceService API directly in the browser , making it easy for exploring the endpoints and validating the outputs without using any command-line tool. Enable Swagger UI \u00b6 Warning Be careful when enabling this for your production InferenceService deployments since the endpoint does not require authentication at this time. Currently, POST request only work for v2 endpoints in the UI. To enable, simply add an extra argument to the InferenceService YAML example from First Inference chapter: kubectl apply -n kserve-test -f - <.github.io/docs/ Where is your Github handle. After a few moments, your changes should be available for public preview at the link provided by MkDocs! This means you can rapidly prototype and share your changes before making a PR! Navigation \u00b6 Navigation in MkDocs uses the \"mkdocs.yml\" file (found in the /docs directory) to organize navigation. For more in-depth information on Navigation, see: https://www.mkdocs.org/user-guide/writing-your-docs/#configure-pages-and-navigation and https://squidfunk.github.io/mkdocs-material/setup/setting-up-navigation/ Content Tabs \u00b6 Content tabs are handy way to organize lots of information in a visually pleasing way. Some documentation from https://squidfunk.github.io/mkdocs-material/reference/content-tabs/#usage is reproduced here: Grouping Code blocks Grouping other content Code blocks are one of the primary targets to be grouped, and can be considered a special case of content tabs, as tabs with a single code block are always rendered without horizontal spacing. 
Example: === \"C\" ``` c #include int main(void) { printf(\"Hello world!\\n\"); return 0; } ``` === \"C++\" ``` c++ #include int main(void) { std::cout << \"Hello world!\" << std::endl; return 0; } ``` Result: C C++ #include int main ( void ) { printf ( \"Hello world! \\n \" ); return 0 ; } #include int main ( void ) { std :: cout << \"Hello world!\" << std :: endl ; return 0 ; } When a content tab contains more than one code block, it is rendered with horizontal spacing. Vertical spacing is never added, but can be achieved by nesting tabs in other blocks. Example: === \"Unordered list\" * Sed sagittis eleifend rutrum * Donec vitae suscipit est * Nulla tempor lobortis orci === \"Ordered list\" 1. Sed sagittis eleifend rutrum 2. Donec vitae suscipit est 3. Nulla tempor lobortis orci Result: Unordered list Ordered list Sed sagittis eleifend rutrum Donec vitae suscipit est Nulla tempor lobortis orci Sed sagittis eleifend rutrum Donec vitae suscipit est Nulla tempor lobortis orci For more information, see: https://squidfunk.github.io/mkdocs-material/reference/content-tabs/#usage File Includes (Content Reuse) \u00b6 KServe strives to reduce duplicative effort by reusing commonly used bits of information, see the docs/snippet directory for some examples. Snippets does not require a specific extension, and as long as a valid file name is specified, it will attempt to process it. Snippets can handle recursive file inclusion. And if Snippets encounters the same file in the current stack, it will avoid re-processing it in order to avoid an infinite loop (or crash on hitting max recursion depth). For more info, see: https://facelessuser.github.io/pymdown-extensions/extensions/snippets/ Admonitions \u00b6 We use the following admonition boxes only. Use admonitions sparingly; too many admonitions can be distracting. Admonitions Formatting Note A Note contains information that is useful, but not essential. A reader can skip a note without bypassing required information. 
If the information suggests an action to take, use a tip instead. Tip A Tip suggests a helpful, but not mandatory, action to take. Warning A Warning draws attention to potential trouble. !!! note A Note contains information that is useful, but not essential. A reader can skip a note without bypassing required information. If the information suggests an action to take, use a tip instead. !!! tip A Tip suggests a helpful, but not mandatory, action to take. !!! warning A Warning draws attention to potential trouble. Icons and Emojis \u00b6 Material for MkDocs supports using Material Icons and Emojis using easy shortcodes. Emojis Formatting :taco: To search a database of Icons and Emojis (all of which can be used on kserve.io), as well as usage information, see: https://squidfunk.github.io/mkdocs-material/reference/icons-emojis/#search Redirects \u00b6 The KServe site uses mkdocs-redirects to \"redirect\" users from a page that may no longer exist (or has been moved) to their desired location. Adding redirects to the KServe site is done in one centralized place, docs/config/redirects.yml . The format is shown here: plugins: redirects: redirect_maps: ... path_to_old_or_moved_URL : path_to_new_URL","title":"MkDocs Contributions"},{"location":"help/contributor/mkdocs-contributor-guide/#mkdocs-contributions","text":"This is a temporary home for contribution guidelines for the MkDocs branch. When MkDocs becomes \"main\" this will be moved to the appropriate place on the website.","title":"MkDocs Contributions"},{"location":"help/contributor/mkdocs-contributor-guide/#install-material-for-mkdocs","text":"kserve.io uses Material for MkDocs to render documentation. Material for MkDocs is Python-based and uses pip to install most of its required packages as well as optional add-ons (which we use). You can choose to install MkDocs locally or using a Docker image. 
pip actually comes pre-installed with Python so it is included in many operating systems (like MacOSx or Ubuntu) but if you don\u2019t have Python, you can install it here: https://www.python.org For some (e.g. folks using RHEL), you may have to use pip3. pip pip3 pip install mkdocs-material mike More detailed instructions can be found here: https://squidfunk.github.io/mkdocs-material/getting-started/#installation pip3 install mkdocs-material mike More detailed instructions can be found here: https://squidfunk.github.io/mkdocs-material/getting-started/#installation","title":"Install Material for MkDocs"},{"location":"help/contributor/mkdocs-contributor-guide/#install-kserve-specific-extensions","text":"KServe uses a number of extensions to MkDocs which can also be installed using pip. If you used pip to install, run the following: pip pip3 pip install mkdocs-material-extensions mkdocs-macros-plugin mkdocs-exclude mkdocs-awesome-pages-plugin mkdocs-redirects pip3 install mkdocs-material-extensions mkdocs-macros-plugin mkdocs-exclude mkdocs-awesome-pages-plugin mkdocs-redirects","title":"Install KServe-Specific Extensions"},{"location":"help/contributor/mkdocs-contributor-guide/#install-dependencies-in-requirementstxt-file","text":"Navigate to root folder and run below command to install required packages and libraries specified in the requirements.txt file. pip pip3 pip install -r requirements.txt pip3 install -r requirements.txt","title":"Install Dependencies in Requirements.txt file"},{"location":"help/contributor/mkdocs-contributor-guide/#setting-up-local-preview","text":"Once you have installed Material for MkDocs and all of the extensions, head over to and clone the repo. In your terminal, find your way over to the location of the cloned repo. Once you are in the main folder and run: Local Preview Local Preview w/ Dirty Reload Local Preview including Blog and Community Site mkdocs serve If you\u2019re only changing a single page in the /docs/ folder (i.e. 
not the homepage or mkdocs.yml) adding the flag --dirtyreload will make the site rebuild super crazy insta-fast. mkdocs serve --dirtyreload First, install the necessary extensions: npm install -g postcss postcss-cli autoprefixer http-server Once you have those npm packages installed, run: ./hack/build-with-blog.sh serve Note Unfortunately, there aren\u2019t live previews for this version of the local preview. After awhile, your terminal should spit out: INFO - Documentation built in 13 .54 seconds [ I 210519 10 :47:10 server:335 ] Serving on http://127.0.0.1:8000 [ I 210519 10 :47:10 handlers:62 ] Start watching changes [ I 210519 10 :47:10 handlers:64 ] Start detecting changes Now access http://127.0.0.1:8000 and you should see the site is built! \ud83c\udf89 Anytime you change any file in your /docs/ repo and hit save, the site will automatically rebuild itself to reflect your changes!","title":"Setting Up Local Preview"},{"location":"help/contributor/mkdocs-contributor-guide/#setting-up-public-preview","text":"If, for whatever reason, you want to share your work before submitting a PR (where Netlify would generate a preview for you), you can deploy your changes as a Github Page easily using the following command: mkdocs gh-deploy --force INFO - Documentation built in 14 .29 seconds WARNING - Version check skipped: No version specified in previous deployment. INFO - Your documentation should shortly be available at: https://.github.io/docs/ Where is your Github handle. After a few moments, your changes should be available for public preview at the link provided by MkDocs! This means you can rapidly prototype and share your changes before making a PR!","title":"Setting Up \"Public\" Preview"},{"location":"help/contributor/mkdocs-contributor-guide/#navigation","text":"Navigation in MkDocs uses the \"mkdocs.yml\" file (found in the /docs directory) to organize navigation. 
For more in-depth information on Navigation, see: https://www.mkdocs.org/user-guide/writing-your-docs/#configure-pages-and-navigation and https://squidfunk.github.io/mkdocs-material/setup/setting-up-navigation/","title":"Navigation"},{"location":"help/contributor/mkdocs-contributor-guide/#content-tabs","text":"Content tabs are a handy way to organize lots of information in a visually pleasing way. Some documentation from https://squidfunk.github.io/mkdocs-material/reference/content-tabs/#usage is reproduced here: Grouping Code blocks Grouping other content Code blocks are one of the primary targets to be grouped, and can be considered a special case of content tabs, as tabs with a single code block are always rendered without horizontal spacing. Example: === \"C\" ``` c #include <stdio.h> int main(void) { printf(\"Hello world!\\n\"); return 0; } ``` === \"C++\" ``` c++ #include <iostream> int main(void) { std::cout << \"Hello world!\" << std::endl; return 0; } ``` Result: C C++ #include <stdio.h> int main ( void ) { printf ( \"Hello world! \\n \" ); return 0 ; } #include <iostream> int main ( void ) { std :: cout << \"Hello world!\" << std :: endl ; return 0 ; } When a content tab contains more than one code block, it is rendered with horizontal spacing. Vertical spacing is never added, but can be achieved by nesting tabs in other blocks. Example: === \"Unordered list\" * Sed sagittis eleifend rutrum * Donec vitae suscipit est * Nulla tempor lobortis orci === \"Ordered list\" 1. Sed sagittis eleifend rutrum 2. Donec vitae suscipit est 3. 
Nulla tempor lobortis orci Result: Unordered list Ordered list Sed sagittis eleifend rutrum Donec vitae suscipit est Nulla tempor lobortis orci Sed sagittis eleifend rutrum Donec vitae suscipit est Nulla tempor lobortis orci For more information, see: https://squidfunk.github.io/mkdocs-material/reference/content-tabs/#usage","title":"Content Tabs"},{"location":"help/contributor/mkdocs-contributor-guide/#file-includes-content-reuse","text":"KServe strives to reduce duplicative effort by reusing commonly used bits of information; see the docs/snippet directory for some examples. Snippets does not require a specific extension; as long as a valid file name is specified, it will attempt to process it. Snippets can handle recursive file inclusion. If Snippets encounters the same file in the current stack, it will avoid re-processing it in order to avoid an infinite loop (or crash on hitting max recursion depth). For more info, see: https://facelessuser.github.io/pymdown-extensions/extensions/snippets/","title":"File Includes (Content Reuse)"},{"location":"help/contributor/mkdocs-contributor-guide/#admonitions","text":"We use the following admonition boxes only. Use admonitions sparingly; too many admonitions can be distracting. Admonitions Formatting Note A Note contains information that is useful, but not essential. A reader can skip a note without bypassing required information. If the information suggests an action to take, use a tip instead. Tip A Tip suggests a helpful, but not mandatory, action to take. Warning A Warning draws attention to potential trouble. !!! note A Note contains information that is useful, but not essential. A reader can skip a note without bypassing required information. If the information suggests an action to take, use a tip instead. !!! tip A Tip suggests a helpful, but not mandatory, action to take. !!! 
warning A Warning draws attention to potential trouble.","title":"Admonitions"},{"location":"help/contributor/mkdocs-contributor-guide/#icons-and-emojis","text":"Material for MkDocs supports using Material Icons and Emojis using easy shortcodes. Emojis Formatting :taco: To search a database of Icons and Emojis (all of which can be used on kserve.io), as well as usage information, see: https://squidfunk.github.io/mkdocs-material/reference/icons-emojis/#search","title":"Icons and Emojis"},{"location":"help/contributor/mkdocs-contributor-guide/#redirects","text":"The KServe site uses mkdocs-redirects to \"redirect\" users from a page that may no longer exist (or has been moved) to their desired location. Adding redirects to the KServe site is done in one centralized place, docs/config/redirects.yml . The format is shown here: plugins: redirects: redirect_maps: ... path_to_old_or_moved_URL : path_to_new_URL","title":"Redirects"},{"location":"help/contributor/templates/template-blog/","text":"Blog template instructions \u00b6 An example template with best practices that you can use to start drafting an entry to post on the KServe blog. 
Copy a version of this template without the instructions Include a commented-out table with tracking info about reviews and approvals: | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | | | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | --> Blog content body \u00b6 Example step/section 1: \u00b6 Example step/section 2: \u00b6 Example step/section 3: \u00b6 Example section about results \u00b6 Further reading \u00b6 About the author \u00b6 Copy the template \u00b6 | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | | | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | --> # ## Blog content body ### Example step/section 1: ### Example step/section 2: ### Example step/section 3: ### Example section about results ## Further reading ## About the author ","title":"Blog template instructions"},{"location":"help/contributor/templates/template-blog/#blog-template-instructions","text":"An example template with best-practices that you can use to start drafting an entry to post on the KServe blog. Copy a version of this template without the instructions Include a commented-out table with tracking info about reviews and approvals: | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | | | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | -->","title":"Blog template instructions"},{"location":"help/contributor/templates/template-blog/#blog-content-body","text":" ","title":"Blog content body"},{"location":"help/contributor/templates/template-blog/#example-stepsection-1","text":"","title":"Example step/section 1:"},{"location":"help/contributor/templates/template-blog/#example-stepsection-2","text":"","title":"Example step/section 2:"},{"location":"help/contributor/templates/template-blog/#example-stepsection-3","text":"","title":"Example step/section 3:"},{"location":"help/contributor/templates/template-blog/#example-section-about-results","text":"","title":"Example section about results"},{"location":"help/contributor/templates/template-blog/#further-reading","text":"","title":"Further 
reading"},{"location":"help/contributor/templates/template-blog/#about-the-author","text":"","title":"About the author"},{"location":"help/contributor/templates/template-blog/#copy-the-template","text":" | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | | | YYYY-MM-DD | :+1:, :monocle_face:, :-1: | --> # ## Blog content body ### Example step/section 1: ### Example step/section 2: ### Example step/section 3: ### Example section about results ## Further reading ## About the author ","title":"Copy the template"},{"location":"help/contributor/templates/template-concept/","text":"Concept Template \u00b6 Use this template when writing conceptual topics. Conceptual topics explain how things work or what things mean. They provide helpful context to readers. They do not include procedures. Template \u00b6 The following template includes the standard sections that should appear in conceptual topics, including a topic introduction sentence, an overview, and placeholders for additional sections and subsections. Copy and paste the markdown from the template to use it in your topic. This topic describes... Write a sentence or two that describes the topic itself, not the subject of the topic. The goal of the topic sentence is to help readers understand if this topic is for them. For example, \"This topic describes what KServe is and how it works.\" ## Overview Write a few sentences describing the subject of the topic. ## Section Title Write a sentence or two to describe the content in this section. Create more sections as necessary. Optionally, add two or more subsections to each section. Do not skip header levels: H2 >> H3, not H2 >> H4. ### Subsection Title Write a sentence or two to describe the content in this section. ### Subsection Title Write a sentence or two to describe the content in this section. Conceptual Content Samples \u00b6 This section provides common content types that appear in conceptual topics. Copy and paste the markdown to use it in your topic. 
Table \u00b6 Introduce the table with a sentence. For example, \u201cThe following table lists which features are available to a KServe supported ML framework.\u201d Markdown Table Template \u00b6 Header 1 Header 2 Data1 Data2 Data3 Data4 Ordered List \u00b6 Write a sentence or two to introduce the content of the list. For example, \u201cIf you want to fix or add content to a past release, you can find the source files in the following folders.\u201d. Optionally, include bold lead-ins before each list item. Markdown Ordered List Templates \u00b6 Item 1 Item 2 Item 3 Lead-in description: Item 1 Lead-in description: Item 2 Lead-in description: Item 3 Unordered List \u00b6 Write a sentence or two to introduce the content of the list. For example, \u201cYour own path to becoming a KServe contributor can begin in any of the following components:\u201d. Optionally, include bold lead-ins before each list item. Markdown Unordered List Template \u00b6 List item List item List item Lead-in : List item Lead-in : List item Lead-in : List item Note \u00b6 Ensure the text beneath the note is indented as much as note is. Note This is a note. Warning \u00b6 If the note regards an issue that could lead to data loss, the note should be a warning. Warning This is a warning.","title":"Concept Template"},{"location":"help/contributor/templates/template-concept/#concept-template","text":"Use this template when writing conceptual topics. Conceptual topics explain how things work or what things mean. They provide helpful context to readers. They do not include procedures.","title":"Concept Template"},{"location":"help/contributor/templates/template-concept/#template","text":"The following template includes the standard sections that should appear in conceptual topics, including a topic introduction sentence, an overview, and placeholders for additional sections and subsections. Copy and paste the markdown from the template to use it in your topic. This topic describes... 
Write a sentence or two that describes the topic itself, not the subject of the topic. The goal of the topic sentence is to help readers understand if this topic is for them. For example, \"This topic describes what KServe is and how it works.\" ## Overview Write a few sentences describing the subject of the topic. ## Section Title Write a sentence or two to describe the content in this section. Create more sections as necessary. Optionally, add two or more subsections to each section. Do not skip header levels: H2 >> H3, not H2 >> H4. ### Subsection Title Write a sentence or two to describe the content in this section. ### Subsection Title Write a sentence or two to describe the content in this section.","title":"Template"},{"location":"help/contributor/templates/template-concept/#conceptual-content-samples","text":"This section provides common content types that appear in conceptual topics. Copy and paste the markdown to use it in your topic.","title":"Conceptual Content Samples"},{"location":"help/contributor/templates/template-concept/#table","text":"Introduce the table with a sentence. For example, \u201cThe following table lists which features are available to a KServe supported ML framework.\u201d","title":"Table"},{"location":"help/contributor/templates/template-concept/#markdown-table-template","text":"Header 1 Header 2 Data1 Data2 Data3 Data4","title":"Markdown Table Template"},{"location":"help/contributor/templates/template-concept/#ordered-list","text":"Write a sentence or two to introduce the content of the list. For example, \u201cIf you want to fix or add content to a past release, you can find the source files in the following folders.\u201d. 
Optionally, include bold lead-ins before each list item.","title":"Ordered List"},{"location":"help/contributor/templates/template-concept/#markdown-ordered-list-templates","text":"Item 1 Item 2 Item 3 Lead-in description: Item 1 Lead-in description: Item 2 Lead-in description: Item 3","title":"Markdown Ordered List Templates"},{"location":"help/contributor/templates/template-concept/#unordered-list","text":"Write a sentence or two to introduce the content of the list. For example, \u201cYour own path to becoming a KServe contributor can begin in any of the following components:\u201d. Optionally, include bold lead-ins before each list item.","title":"Unordered List"},{"location":"help/contributor/templates/template-concept/#markdown-unordered-list-template","text":"List item List item List item Lead-in : List item Lead-in : List item Lead-in : List item","title":"Markdown Unordered List Template"},{"location":"help/contributor/templates/template-concept/#note","text":"Ensure the text beneath the note is indented as much as note is. Note This is a note.","title":"Note"},{"location":"help/contributor/templates/template-concept/#warning","text":"If the note regards an issue that could lead to data loss, the note should be a warning. Warning This is a warning.","title":"Warning"},{"location":"help/contributor/templates/template-procedure/","text":"Procedure template \u00b6 Use this template when writing procedural (how-to) topics. Procedural topics include detailed steps to perform a task as well as some context about the task. Template \u00b6 The following template includes the standard sections that should appear in procedural topics, including a topic sentence, an overview section, and sections for each task within the procedure. Copy and paste the markdown from the template to use it in your topic. This topic describes... Write a sentence or two that describes the topic itself, not the subject of the topic. 
The goal of the topic sentence is to help readers understand if this topic is for them. For example, \"This topic instructs how to serve a TensorFlow model.\" ## Overview Write a few sentences to describe the subject of the topic, if useful. For example, if the topic is about configuring a broker, you might provide some useful context about brokers. If there are multiple tasks in the procedure and they must be completed in order, create an ordered list that contains each task in the topic. Use bullets for sub-tasks. Include anchor links to the headings for each task. To [task]: 1. [Name of Task 1 (for example, Apply default configuration)](#task-1) 1. [Optional: Name of Task 2](#task-2) !!! note Unless the number of tasks in the procedure is particularly high, do not use numbered lead-ins in the task headings. For example, instead of \"Task 1: Apply default configuration\", use \"Apply default configuration\". ## Prerequisites Use one of the following formats for the Prerequisites section. ### Formatting for two or more prerequisites If there are two or more prerequisites, use the following format. Include links for more information, if necessary. Before you [task], you must have/do: * Prerequisite. See [Link](). * Prerequisite. See [Link](). For example: Before you deploy PyTorch model, you must have: * KServe. See [Installing the KServe](link-to-that-topic). * An Apache Kafka cluster. See [Link to Instructions to Download](link-to-that-topic). ### Format for one prerequisite If there is one prerequisite, use the following format. Include a link for more information, if necessary. Before you [task], you must have/do [prerequisite]. See [Link](link). For example: Before you create the `InferenceService`, you must have a Kubernetes cluster with KServe installed and DNS configured. See the [installation instructions](../../../install/README.md) if you need to create one. ## Task 1 Write a few sentences to describe the task and provide additional context on the task. 
!!! note When writing a single-step procedure, write the step in one sentence and make it a bullet. The signposting is important given readers are strongly inclined to look for numbered steps and bullet points when searching for instructions. If possible, expand the procedure to include at least one more step. Few procedures truly require a single step. [Task]: 1. Step 1 1. Step 2 ## Optional: Task 2 If the task is optional, put \"Optional:\" in the heading. Write a few sentences to describe the task and provide additional context on the task. [Task]: 1. Step 1 2. Step 2 Procedure Content Samples \u00b6 This section provides common content types that appear in procedural topics. Copy and paste the markdown to use it in your topic. \u201cFill-in-the-Fields\u201d Table \u00b6 Where the reader must enter many values in, for example, a YAML file, use a table within the procedure as follows: Open the YAML file. Key1 : Value1 Key2 : Value2 metadata : annotations : # case-sensitive Key3 : Value3 Key4 : Value4 Key5 : Value5 spec : # Configuration specific to this broker. config : Key6 : Value6 Change the relevant values to your needs, using the following table as a guide. Key Value Type Description Key1 String Description Key2 Integer Description Key3 String Description Key4 String Description Key5 Float Description Key6 String Description Table \u00b6 Introduce the table with a sentence. For example, \u201cThe following table lists which features are available to a KServe supported ML framework. Markdown Table Template \u00b6 Header 1 Header 2 Data1 Data2 Data3 Data4 Ordered List \u00b6 Write a sentence or two to introduce the content of the list. For example, \u201cIf you want to fix or add content to a past release, you can find the source files in the following folders.\u201d. Optionally, include bold lead-ins before each list item. 
Markdown Ordered List Templates \u00b6 Item 1 Item 2 Item 3 Lead-in description: Item 1 Lead-in description: Item 2 Lead-in description: Item 3 Unordered List \u00b6 Write a sentence or two to introduce the content of the list. For example, \u201cYour own path to becoming a KServe contributor can begin in any of the following components:\u201d. Optionally, include bold lead-ins before each list item. Markdown Unordered List Template \u00b6 List item List item List item Lead-in : List item Lead-in : List item Lead-in : List item Note \u00b6 Ensure the text beneath the note is indented as much as note is. Note This is a note. Warning \u00b6 If the note regards an issue that could lead to data loss, the note should be a warning. Warning This is a warning. Markdown Embedded Image \u00b6 The following is an embedded image reference in markdown. Tabs \u00b6 Place multiple versions of the same procedure (such as a CLI procedure vs a YAML procedure) within tabs. Indent the opening tabs tags 3 spaces to make the tabs display properly. == \"tab1 name\" This is a stem: 1. This is a step. ``` This is some code. ``` 1. This is another step. == \"tab2 name\" This is a stem: 1. This is a step. ``` This is some code. ``` 1. This is another step. Documenting Code and Code Snippets \u00b6 For instructions on how to format code and code snippets, see the Style Guide.","title":"Procedure template"},{"location":"help/contributor/templates/template-procedure/#procedure-template","text":"Use this template when writing procedural (how-to) topics. Procedural topics include detailed steps to perform a task as well as some context about the task.","title":"Procedure template"},{"location":"help/contributor/templates/template-procedure/#template","text":"The following template includes the standard sections that should appear in procedural topics, including a topic sentence, an overview section, and sections for each task within the procedure. 
Copy and paste the markdown from the template to use it in your topic. This topic describes... Write a sentence or two that describes the topic itself, not the subject of the topic. The goal of the topic sentence is to help readers understand if this topic is for them. For example, \"This topic instructs how to serve a TensorFlow model.\" ## Overview Write a few sentences to describe the subject of the topic, if useful. For example, if the topic is about configuring a broker, you might provide some useful context about brokers. If there are multiple tasks in the procedure and they must be completed in order, create an ordered list that contains each task in the topic. Use bullets for sub-tasks. Include anchor links to the headings for each task. To [task]: 1. [Name of Task 1 (for example, Apply default configuration)](#task-1) 1. [Optional: Name of Task 2](#task-2) !!! note Unless the number of tasks in the procedure is particularly high, do not use numbered lead-ins in the task headings. For example, instead of \"Task 1: Apply default configuration\", use \"Apply default configuration\". ## Prerequisites Use one of the following formats for the Prerequisites section. ### Formatting for two or more prerequisites If there are two or more prerequisites, use the following format. Include links for more information, if necessary. Before you [task], you must have/do: * Prerequisite. See [Link](). * Prerequisite. See [Link](). For example: Before you deploy PyTorch model, you must have: * KServe. See [Installing the KServe](link-to-that-topic). * An Apache Kafka cluster. See [Link to Instructions to Download](link-to-that-topic). ### Format for one prerequisite If there is one prerequisite, use the following format. Include a link for more information, if necessary. Before you [task], you must have/do [prerequisite]. See [Link](link). For example: Before you create the `InferenceService`, you must have a Kubernetes cluster with KServe installed and DNS configured. 
See the [installation instructions](../../../install/README.md) if you need to create one. ## Task 1 Write a few sentences to describe the task and provide additional context on the task. !!! note When writing a single-step procedure, write the step in one sentence and make it a bullet. The signposting is important given readers are strongly inclined to look for numbered steps and bullet points when searching for instructions. If possible, expand the procedure to include at least one more step. Few procedures truly require a single step. [Task]: 1. Step 1 1. Step 2 ## Optional: Task 2 If the task is optional, put \"Optional:\" in the heading. Write a few sentences to describe the task and provide additional context on the task. [Task]: 1. Step 1 2. Step 2","title":"Template"},{"location":"help/contributor/templates/template-procedure/#procedure-content-samples","text":"This section provides common content types that appear in procedural topics. Copy and paste the markdown to use it in your topic.","title":"Procedure Content Samples"},{"location":"help/contributor/templates/template-procedure/#fill-in-the-fields-table","text":"Where the reader must enter many values in, for example, a YAML file, use a table within the procedure as follows: Open the YAML file. Key1 : Value1 Key2 : Value2 metadata : annotations : # case-sensitive Key3 : Value3 Key4 : Value4 Key5 : Value5 spec : # Configuration specific to this broker. config : Key6 : Value6 Change the relevant values to your needs, using the following table as a guide. Key Value Type Description Key1 String Description Key2 Integer Description Key3 String Description Key4 String Description Key5 Float Description Key6 String Description","title":"\u201cFill-in-the-Fields\u201d Table"},{"location":"help/contributor/templates/template-procedure/#table","text":"Introduce the table with a sentence. 
For example, \u201cThe following table lists which features are available to a KServe supported ML framework.\u201d","title":"Table"},{"location":"help/contributor/templates/template-procedure/#markdown-table-template","text":"Header 1 Header 2 Data1 Data2 Data3 Data4","title":"Markdown Table Template"},{"location":"help/contributor/templates/template-procedure/#ordered-list","text":"Write a sentence or two to introduce the content of the list. For example, \u201cIf you want to fix or add content to a past release, you can find the source files in the following folders.\u201d. Optionally, include bold lead-ins before each list item.","title":"Ordered List"},{"location":"help/contributor/templates/template-procedure/#markdown-ordered-list-templates","text":"Item 1 Item 2 Item 3 Lead-in description: Item 1 Lead-in description: Item 2 Lead-in description: Item 3","title":"Markdown Ordered List Templates"},{"location":"help/contributor/templates/template-procedure/#unordered-list","text":"Write a sentence or two to introduce the content of the list. For example, \u201cYour own path to becoming a KServe contributor can begin in any of the following components:\u201d. Optionally, include bold lead-ins before each list item.","title":"Unordered List"},{"location":"help/contributor/templates/template-procedure/#markdown-unordered-list-template","text":"List item List item List item Lead-in : List item Lead-in : List item Lead-in : List item","title":"Markdown Unordered List Template"},{"location":"help/contributor/templates/template-procedure/#note","text":"Ensure the text beneath the note is indented as much as the note is. Note This is a note.","title":"Note"},{"location":"help/contributor/templates/template-procedure/#warning","text":"If the note regards an issue that could lead to data loss, the note should be a warning. 
Warning This is a warning.","title":"Warning"},{"location":"help/contributor/templates/template-procedure/#markdown-embedded-image","text":"The following is an embedded image reference in markdown.","title":"Markdown Embedded Image"},{"location":"help/contributor/templates/template-procedure/#tabs","text":"Place multiple versions of the same procedure (such as a CLI procedure vs a YAML procedure) within tabs. Indent the opening tabs tags 3 spaces to make the tabs display properly. == \"tab1 name\" This is a stem: 1. This is a step. ``` This is some code. ``` 1. This is another step. == \"tab2 name\" This is a stem: 1. This is a step. ``` This is some code. ``` 1. This is another step.","title":"Tabs"},{"location":"help/contributor/templates/template-procedure/#documenting-code-and-code-snippets","text":"For instructions on how to format code and code snippets, see the Style Guide.","title":"Documenting Code and Code Snippets"},{"location":"help/contributor/templates/template-troubleshooting/","text":"Troubleshooting template \u00b6 When writing guidance to help users troubleshoot specific errors, each entry must include: Error Description: To describe the error very briefly so that users can search for it easily. Symptom: To describe the error in a way that helps users to diagnose their issue. Include error messages or anything else users might see if they encounter this error. Explanation (or cause): To inform users about why they are seeing this error. This can be omitted if the cause of the error is unknown. Solution: To inform the user about how to fix the error. Example Troubleshooting Table \u00b6 Troubleshooting \u00b6 | Error Description | |----------|------------| | Symptom | During the event something breaks. | | Cause | The thing is broken. | | Solution | To solve this issue, do the following: 1. This. 2. That. 
|","title":"Troubleshooting template"},{"location":"help/contributor/templates/template-troubleshooting/#troubleshooting-template","text":"When writing guidance to help users troubleshoot specific errors, each entry must include: Error Description: To describe the error very briefly so that users can search for it easily. Symptom: To describe the error in a way that helps users to diagnose their issue. Include error messages or anything else users might see if they encounter this error. Explanation (or cause): To inform users about why they are seeing this error. This can be omitted if the cause of the error is unknown. Solution: To inform the user about how to fix the error.","title":"Troubleshooting template"},{"location":"help/contributor/templates/template-troubleshooting/#example-troubleshooting-table","text":"","title":"Example Troubleshooting Table"},{"location":"help/contributor/templates/template-troubleshooting/#troubleshooting","text":"| Error Description | |----------|------------| | Symptom | During the event something breaks. | | Cause | The thing is broken. | | Solution | To solve this issue, do the following: 1. This. 2. That. |","title":"Troubleshooting"},{"location":"help/style-guide/documenting-code/","text":"Documenting Code \u00b6 Words requiring code formatting \u00b6 Apply code formatting only to special-purpose text: Filenames Path names Fields and values from a YAML file Any text that goes into a CLI CLI names Specify the programming language \u00b6 Specify the language your code is in as part of the code block Specify non-language specific code, like CLI commands, with ```bash. See the following examples for formatting. Correct Incorrect Correct Formatting Incorrect Formatting package main import \"fmt\" func main () { fmt . 
Println ( \"hello world\" ) } package main import \"fmt\" func main () { fmt.Println ( \"hello world\" ) } ```go package main import \"fmt\" func main() { fmt.Println(\"hello world\") } ``` ```bash package main import \"fmt\" func main() { fmt.Println(\"hello world\") } ``` Documenting YAML \u00b6 When documenting YAML, use two steps. Use step 1 to create the YAML file, and step 2 to apply the YAML file. Use kubectl apply for files/objects that the user creates: it works for both \u201ccreate\u201d and \u201cupdate\u201d, and the source of truth is their local files. Use kubectl edit for files which are shipped as part of the KServe software, like the KServe ConfigMaps. Write ```yaml at the beginning of your code block if you are typing YAML code as part of a CLI command. Correct Incorrect Creating or updating a resource: Create a YAML file using the following template: # YAML FILE CONTENTS Apply the YAML file by running the command: kubectl apply -f .yaml Where is the name of the file you created in the previous step. Editing a ConfigMap: kubectl -n edit configmap Example 1: cat < is\u2026\" Single variable \u00b6 Correct Incorrect kubectl get isvc Where is the name of your InferenceService. kubectl get isvc { SERVICE_NAME } {SERVICE_NAME} = The name of your service Multiple variables \u00b6 Correct Incorrect kn create service --revision-name Where: is the name of your Knative Service. is the desired name of your revision. kn create service --revision-name Where is the name of your Knative Service. Where is the desired name of your revision. 
CLI output \u00b6 CLI Output should include the custom css \"{ .bash .no-copy }\" in place of \"bash\" which removes the \"Copy to clipboard button\" on the right side of the code block Correct Incorrect Correct Formatting Incorrect Formatting ```{ .bash .no-copy } ``` ```bash ```","title":"Documenting Code"},{"location":"help/style-guide/documenting-code/#documenting-code","text":"","title":"Documenting Code"},{"location":"help/style-guide/documenting-code/#words-requiring-code-formatting","text":"Apply code formatting only to special-purpose text: Filenames Path names Fields and values from a YAML file Any text that goes into a CLI CLI names","title":"Words requiring code formatting"},{"location":"help/style-guide/documenting-code/#specify-the-programming-language","text":"Specify the language your code is in as part of the code block Specify non-language specific code, like CLI commands, with ```bash. See the following examples for formatting. Correct Incorrect Correct Formatting Incorrect Formatting package main import \"fmt\" func main () { fmt . Println ( \"hello world\" ) } package main import \"fmt\" func main () { fmt.Println ( \"hello world\" ) } ```go package main import \"fmt\" func main() { fmt.Println(\"hello world\") } ``` ```bash package main import \"fmt\" func main() { fmt.Println(\"hello world\") } ```","title":"Specify the programming language"},{"location":"help/style-guide/documenting-code/#documenting-yaml","text":"When documenting YAML, use two steps. Use step 1 to create the YAML file, and step 2 to apply the YAML file. Use kubectl apply for files/objects that the user creates: it works for both \u201ccreate\u201d and \u201cupdate\u201d, and the source of truth is their local files. Use kubectl edit for files which are shipped as part of the KServe software, like the KServe ConfigMaps. Write ```yaml at the beginning of your code block if you are typing YAML code as part of a CLI command. 
Correct Incorrect Creating or updating a resource: Create a YAML file using the following template: # YAML FILE CONTENTS Apply the YAML file by running the command: kubectl apply -f .yaml Where is the name of the file you created in the previous step. Editing a ConfigMap: kubectl -n edit configmap Example 1: cat < is\u2026\"","title":"Referencing variables in code blocks"},{"location":"help/style-guide/documenting-code/#single-variable","text":"Correct Incorrect kubectl get isvc Where is the name of your InferenceService. kubectl get isvc { SERVICE_NAME } {SERVICE_NAME} = The name of your service","title":"Single variable"},{"location":"help/style-guide/documenting-code/#multiple-variables","text":"Correct Incorrect kn create service --revision-name Where: is the name of your Knative Service. is the desired name of your revision. kn create service --revision-name Where is the name of your Knative Service. Where is the desired name of your revision.","title":"Multiple variables"},{"location":"help/style-guide/documenting-code/#cli-output","text":"CLI Output should include the custom css \"{ .bash .no-copy }\" in place of \"bash\" which removes the \"Copy to clipboard button\" on the right side of the code block Correct Incorrect Correct Formatting Incorrect Formatting ```{ .bash .no-copy } ``` ```bash ```","title":"CLI output"},{"location":"help/style-guide/style-and-formatting/","text":"Formatting standards and conventions \u00b6 Titles and headings \u00b6 Use sentence case for titles and headings \u00b6 Only capitalize proper nouns, acronyms, and the first word of the heading. 
Correct Incorrect ## Configure the feature ## Configure the Feature ### Using feature ### Using Feature ### Using HTTPS ### Using https Do not use code formatting inside headings \u00b6 Correct Incorrect ## Configure the class annotation ## Configure the `class` annotation Use imperatives for headings of procedures \u00b6 For consistency, brevity, and to better signpost where action is expected of the reader, make procedure headings imperatives. Correct Incorrect ## Install KServe ## Installation of KServe ### Configure DNS ### Configuring DNS ## Verify the installation ## How to verify the installation Links \u00b6 Describe what the link targets \u00b6 Correct Incorrect For an explanation of what makes a good hyperlink, see this this article . See this article here . Write links in Markdown, not HTML \u00b6 Correct Incorrect [Kafka Broker](../kafka-broker/README.md) Kafka Broker [Kafka Broker](../kafka-broker/README.md){target=_blank} Kafka Broker Include the .md extension in internal links \u00b6 Correct Incorrect [Setting up a custom domain](../serving/using-a-custom-domain.md) [Setting up a custom domain](../serving/using-a-custom-domain) Link to files, not folders \u00b6 Correct Incorrect [Kafka Broker](../kafka-broker/README.md) [Kafka Broker](../kafka-broker/) Ensure the letter case is correct \u00b6 Correct Incorrect [Kafka Broker](../kafka-broker/README.md) [Kafka Broker](../kafka-broker/readme.md) Formatting \u00b6 Use nonbreaking spaces in units of measurement other than percent \u00b6 For most units of measurement, when you specify a number with the unit, use a nonbreaking space between the number and the unit. Don't use spacing when the unit of measurement is percent. 
Correct Incorrect 3   GB 3 GB 4   CPUs 4 CPUs 14% 14   % Use bold for user interface elements \u00b6 Correct Incorrect Click Fork Click \"Fork\" Select Other Select \"Other\" Use tables for definition lists \u00b6 When listing terms and their definitions, use table formatting instead of definition list formatting. Correct Incorrect |Value |Description | |------|---------------------| |Value1|Description of Value1| |Value2|Description of Value2| Value1 : Description of Value1 Value2 : Description of Value2 General style \u00b6 Use upper camel case for KServe API objects \u00b6 Correct Incorrect Explainers explainers Transformer transformer InferenceService Inference Service Only use parentheses for acronym explanations \u00b6 Put an acronym inside parentheses after its explanation. Don\u2019t use parentheses for anything else. Parenthetical statements especially should be avoided because readers skip them. If something is important enough to be in the sentence, it should be fully part of that sentence. Correct Incorrect Custom Resource Definition (CRD) Check your CLI (you should see it there) Knative Serving creates a Revision Knative creates a Revision (a stateless, snapshot in time of your code and configuration) Use the international standard for punctuation inside quotes \u00b6 Correct Incorrect Events are recorded with an associated \"stage\". Events are recorded with an associated \"stage.\" The copy is called a \"fork\". The copy is called a \"fork.\"","title":"Formatting standards and conventions"},{"location":"help/style-guide/style-and-formatting/#formatting-standards-and-conventions","text":"","title":"Formatting standards and conventions"},{"location":"help/style-guide/style-and-formatting/#titles-and-headings","text":"","title":"Titles and headings"},{"location":"help/style-guide/style-and-formatting/#use-sentence-case-for-titles-and-headings","text":"Only capitalize proper nouns, acronyms, and the first word of the heading. 
Correct Incorrect ## Configure the feature ## Configure the Feature ### Using feature ### Using Feature ### Using HTTPS ### Using https","title":"Use sentence case for titles and headings"},{"location":"help/style-guide/style-and-formatting/#do-not-use-code-formatting-inside-headings","text":"Correct Incorrect ## Configure the class annotation ## Configure the `class` annotation","title":"Do not use code formatting inside headings"},{"location":"help/style-guide/style-and-formatting/#use-imperatives-for-headings-of-procedures","text":"For consistency, brevity, and to better signpost where action is expected of the reader, make procedure headings imperatives. Correct Incorrect ## Install KServe ## Installation of KServe ### Configure DNS ### Configuring DNS ## Verify the installation ## How to verify the installation","title":"Use imperatives for headings of procedures"},{"location":"help/style-guide/style-and-formatting/#links","text":"","title":"Links"},{"location":"help/style-guide/style-and-formatting/#describe-what-the-link-targets","text":"Correct Incorrect For an explanation of what makes a good hyperlink, see this article . 
See this article here .","title":"Describe what the link targets"},{"location":"help/style-guide/style-and-formatting/#write-links-in-markdown-not-html","text":"Correct Incorrect [Kafka Broker](../kafka-broker/README.md) Kafka Broker [Kafka Broker](../kafka-broker/README.md){target=_blank} Kafka Broker","title":"Write links in Markdown, not HTML"},{"location":"help/style-guide/style-and-formatting/#include-the-md-extension-in-internal-links","text":"Correct Incorrect [Setting up a custom domain](../serving/using-a-custom-domain.md) [Setting up a custom domain](../serving/using-a-custom-domain)","title":"Include the .md extension in internal links"},{"location":"help/style-guide/style-and-formatting/#link-to-files-not-folders","text":"Correct Incorrect [Kafka Broker](../kafka-broker/README.md) [Kafka Broker](../kafka-broker/)","title":"Link to files, not folders"},{"location":"help/style-guide/style-and-formatting/#ensure-the-letter-case-is-correct","text":"Correct Incorrect [Kafka Broker](../kafka-broker/README.md) [Kafka Broker](../kafka-broker/readme.md)","title":"Ensure the letter case is correct"},{"location":"help/style-guide/style-and-formatting/#formatting","text":"","title":"Formatting"},{"location":"help/style-guide/style-and-formatting/#use-nonbreaking-spaces-in-units-of-measurement-other-than-percent","text":"For most units of measurement, when you specify a number with the unit, use a nonbreaking space between the number and the unit. Don't use spacing when the unit of measurement is percent. 
Correct Incorrect 3   GB 3 GB 4   CPUs 4 CPUs 14% 14   %","title":"Use nonbreaking spaces in units of measurement other than percent"},{"location":"help/style-guide/style-and-formatting/#use-bold-for-user-interface-elements","text":"Correct Incorrect Click Fork Click \"Fork\" Select Other Select \"Other\"","title":"Use bold for user interface elements"},{"location":"help/style-guide/style-and-formatting/#use-tables-for-definition-lists","text":"When listing terms and their definitions, use table formatting instead of definition list formatting. Correct Incorrect |Value |Description | |------|---------------------| |Value1|Description of Value1| |Value2|Description of Value2| Value1 : Description of Value1 Value2 : Description of Value2","title":"Use tables for definition lists"},{"location":"help/style-guide/style-and-formatting/#general-style","text":"","title":"General style"},{"location":"help/style-guide/style-and-formatting/#use-upper-camel-case-for-kserve-api-objects","text":"Correct Incorrect Explainers explainers Transformer transformer InferenceService Inference Service","title":"Use upper camel case for KServe API objects"},{"location":"help/style-guide/style-and-formatting/#only-use-parentheses-for-acronym-explanations","text":"Put an acronym inside parentheses after its explanation. Don\u2019t use parentheses for anything else. Parenthetical statements especially should be avoided because readers skip them. If something is important enough to be in the sentence, it should be fully part of that sentence. 
Correct Incorrect Custom Resource Definition (CRD) Check your CLI (you should see it there) Knative Serving creates a Revision Knative creates a Revision (a stateless, snapshot in time of your code and configuration)","title":"Only use parentheses for acronym explanations"},{"location":"help/style-guide/style-and-formatting/#use-the-international-standard-for-punctuation-inside-quotes","text":"Correct Incorrect Events are recorded with an associated \"stage\". Events are recorded with an associated \"stage.\" The copy is called a \"fork\". The copy is called a \"fork.\"","title":"Use the international standard for punctuation inside quotes"},{"location":"help/style-guide/voice-and-language/","text":"Voice and language \u00b6 Use present tense \u00b6 Correct Incorrect This command starts a proxy. This command will start a proxy. Use active voice \u00b6 Correct Incorrect You can explore the API using a browser. The API can be explored using a browser. The YAML file specifies the replica count. The replica count is specified in the YAML file. Use simple and direct language \u00b6 Use simple and direct language. Avoid using unnecessary words, such as \"please\". Correct Incorrect To create a ReplicaSet , ... In order to create a ReplicaSet , ... See the configuration file. Please see the configuration file. View the Pods. With this next command, we'll view the Pods. Address the reader as \"you\", not \"we\" \u00b6 Correct Incorrect You can create a Deployment by ... We can create a Deployment by ... In the preceding output, you can see... In the preceding output, we can see ... This page teaches you how to use pods. In this page, we are going to learn about pods. Avoid jargon, idioms, and Latin \u00b6 Some readers speak English as a second language. Avoid jargon, idioms, and Latin to help make their understanding easier. Correct Incorrect Internally, ... Under the hood, ... Create a new cluster. Turn up a new cluster. Initially, ... Out of the box, ... For example, ... 
e.g., ... Enter through the gateway ... Enter via the gateway ... Avoid statements about the future \u00b6 Avoid making promises or giving hints about the future. If you need to talk about a feature in development, add a boilerplate under the front matter that identifies the information accordingly. Avoid statements that will soon be out of date \u00b6 Avoid using wording that becomes outdated quickly like \"currently\" and \"new\". A feature that is new today is not new for long. Correct Incorrect In version 1.4, ... In the current version, ... The Federation feature provides ... The new Federation feature provides ... Avoid words that assume a specific level of understanding \u00b6 Avoid words such as \"just\", \"simply\", \"easy\", \"easily\", or \"simple\". These words do not add value. Correct Incorrect Include one command in ... Include just one command in ... Run the container ... Simply run the container ... You can remove ... You can easily remove ... These steps ... These simple steps ...","title":"Voice and language"},{"location":"help/style-guide/voice-and-language/#voice-and-language","text":"","title":"Voice and language"},{"location":"help/style-guide/voice-and-language/#use-present-tense","text":"Correct Incorrect This command starts a proxy. This command will start a proxy.","title":"Use present tense"},{"location":"help/style-guide/voice-and-language/#use-active-voice","text":"Correct Incorrect You can explore the API using a browser. The API can be explored using a browser. The YAML file specifies the replica count. The replica count is specified in the YAML file.","title":"Use active voice"},{"location":"help/style-guide/voice-and-language/#use-simple-and-direct-language","text":"Use simple and direct language. Avoid using unnecessary words, such as \"please\". Correct Incorrect To create a ReplicaSet , ... In order to create a ReplicaSet , ... See the configuration file. Please see the configuration file. View the Pods. 
With this next command, we'll view the Pods.","title":"Use simple and direct language"},{"location":"help/style-guide/voice-and-language/#address-the-reader-as-you-not-we","text":"Correct Incorrect You can create a Deployment by ... We can create a Deployment by ... In the preceding output, you can see... In the preceding output, we can see ... This page teaches you how to use pods. In this page, we are going to learn about pods.","title":"Address the reader as \"you\", not \"we\""},{"location":"help/style-guide/voice-and-language/#avoid-jargon-idioms-and-latin","text":"Some readers speak English as a second language. Avoid jargon, idioms, and Latin to help make their understanding easier. Correct Incorrect Internally, ... Under the hood, ... Create a new cluster. Turn up a new cluster. Initially, ... Out of the box, ... For example, ... e.g., ... Enter through the gateway ... Enter via the gateway ...","title":"Avoid jargon, idioms, and Latin"},{"location":"help/style-guide/voice-and-language/#avoid-statements-about-the-future","text":"Avoid making promises or giving hints about the future. If you need to talk about a feature in development, add a boilerplate under the front matter that identifies the information accordingly.","title":"Avoid statements about the future"},{"location":"help/style-guide/voice-and-language/#avoid-statements-that-will-soon-be-out-of-date","text":"Avoid using wording that becomes outdated quickly like \"currently\" and \"new\". A feature that is new today is not new for long. Correct Incorrect In version 1.4, ... In the current version, ... The Federation feature provides ... The new Federation feature provides ...","title":"Avoid statements that will soon be out of date"},{"location":"help/style-guide/voice-and-language/#avoid-words-that-assume-a-specific-level-of-understanding","text":"Avoid words such as \"just\", \"simply\", \"easy\", \"easily\", or \"simple\". These words do not add value. 
Correct Incorrect Include one command in ... Include just one command in ... Run the container ... Simply run the container ... You can remove ... You can easily remove ... These steps ... These simple steps ...","title":"Avoid words that assume a specific level of understanding"},{"location":"modelserving/control_plane/","text":"Control Plane \u00b6 KServe Control Plane : Responsible for reconciling the InferenceService custom resources. It creates the Knative serverless deployment for the predictor, transformer, and explainer to enable autoscaling based on the incoming request workload, including scaling down to zero when no traffic is received. When raw deployment mode is enabled, the control plane creates the Kubernetes Deployment, Service, Ingress, and HPA. Control Plane Components \u00b6 KServe Controller : Responsible for creating the service and ingress resources, the model server container, and the model agent container for request/response logging, batching, and model pulling. Ingress Gateway : Gateway for routing external or internal requests. In Serverless Mode: Knative Serving Controller : Responsible for service revision management and for creating network routing resources and the serverless container with a queue proxy that exposes traffic metrics and enforces the concurrency limit. Knative Activator : Brings back scaled-to-zero pods and forwards requests. Knative Autoscaler (KPA) : Watches traffic flow to the application and scales replicas up or down based on configured metrics.","title":"Model Serving Control Plane"},{"location":"modelserving/control_plane/#control-plane","text":"KServe Control Plane : Responsible for reconciling the InferenceService custom resources. It creates the Knative serverless deployment for the predictor, transformer, and explainer to enable autoscaling based on the incoming request workload, including scaling down to zero when no traffic is received. 
When raw deployment mode is enabled, the control plane creates the Kubernetes Deployment, Service, Ingress, and HPA.","title":"Control Plane"},{"location":"modelserving/control_plane/#control-plane-components","text":"KServe Controller : Responsible for creating the service and ingress resources, the model server container, and the model agent container for request/response logging, batching, and model pulling. Ingress Gateway : Gateway for routing external or internal requests. In Serverless Mode: Knative Serving Controller : Responsible for service revision management and for creating network routing resources and the serverless container with a queue proxy that exposes traffic metrics and enforces the concurrency limit. Knative Activator : Brings back scaled-to-zero pods and forwards requests. Knative Autoscaler (KPA) : Watches traffic flow to the application and scales replicas up or down based on configured metrics.","title":"Control Plane Components"},{"location":"modelserving/servingruntimes/","text":"Macro Syntax Error \u00b6 File : modelserving/servingruntimes.md Line 83 in Markdown file: unexpected '.' > **Note:** `ServingRuntimes` support the use of template variables of the form `{{.Variable}}` inside the container spec. These should map to fields inside an","title":"Serving Runtimes"},{"location":"modelserving/servingruntimes/#macro-syntax-error","text":"File : modelserving/servingruntimes.md Line 83 in Markdown file: unexpected '.' > **Note:** `ServingRuntimes` support the use of template variables of the form `{{.Variable}}` inside the container spec. These should map to fields inside an","title":"Macro Syntax Error"},{"location":"modelserving/autoscaling/autoscaling/","text":"Autoscale InferenceService with inference workload \u00b6 InferenceService with target concurrency \u00b6 Create InferenceService \u00b6 Apply the TensorFlow example CR with the scaling target set to 1. 
The autoscaling.knative.dev/target annotation is a soft limit rather than a strictly enforced limit: a sudden burst of requests can exceed this value. The scaleTarget and scaleMetric fields were introduced in version 0.9 of KServe and should be available in both the new and old schema. They are the preferred way of defining autoscaling options. New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : scaleTarget : 1 scaleMetric : concurrency model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" annotations : autoscaling.knative.dev/target : \"1\" spec : predictor : tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" Apply the autoscale.yaml to create the autoscaling InferenceService. kubectl kubectl apply -f autoscale.yaml Expected Output $ inferenceservice.serving.kserve.io/flowers-sample created Predict InferenceService with concurrent requests \u00b6 The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT . Send traffic in 30-second spurts, maintaining 5 in-flight requests. 
MODEL_NAME = flowers-sample INPUT_PATH = input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice $MODEL_NAME -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 30s -c 5 -m POST -host ${ SERVICE_HOSTNAME } -D $INPUT_PATH http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict Expected Output Summary: Total: 30 .0193 secs Slowest: 10 .1458 secs Fastest: 0 .0127 secs Average: 0 .0364 secs Requests/sec: 137 .4449 Total data: 1019122 bytes Size/request: 247 bytes Response time histogram: 0 .013 [ 1 ] | 1 .026 [ 4120 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 2 .039 [ 0 ] | 3 .053 [ 0 ] | 4 .066 [ 0 ] | 5 .079 [ 0 ] | 6 .093 [ 0 ] | 7 .106 [ 0 ] | 8 .119 [ 0 ] | 9 .133 [ 0 ] | 10 .146 [ 5 ] | Latency distribution: 10 % in 0 .0178 secs 25 % in 0 .0188 secs 50 % in 0 .0199 secs 75 % in 0 .0210 secs 90 % in 0 .0231 secs 95 % in 0 .0328 secs 99 % in 0 .1501 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0002 secs, 0 .0127 secs, 10 .1458 secs DNS-lookup: 0 .0002 secs, 0 .0000 secs, 0 .1502 secs req write: 0 .0000 secs, 0 .0000 secs, 0 .0020 secs resp wait: 0 .0360 secs, 0 .0125 secs, 9 .9791 secs resp read: 0 .0001 secs, 0 .0000 secs, 0 .0021 secs Status code distribution: [ 200 ] 4126 responses Check the number of running pods. KServe uses the Knative Serving autoscaler, which scales on the average number of in-flight requests per pod, known as concurrency. Because the scaling target is set to 1 and the service is loaded with 5 concurrent requests, the autoscaler tries to scale up to 5 pods. Notice that 5 requests on the histogram take around 10 seconds. That is the cold start cost of initially spawning the pods and downloading the model so that it is ready to serve. 
The cold start may take longer if the serving image is not cached on the node where the pod is scheduled, because the image must be pulled first. $ kubectl get pods NAME READY STATUS RESTARTS AGE flowers-sample-default-7kqt6-deployment-75d577dcdb-sr5wd 3 /3 Running 0 42s flowers-sample-default-7kqt6-deployment-75d577dcdb-swnk5 3 /3 Running 0 62s flowers-sample-default-7kqt6-deployment-75d577dcdb-t2njf 3 /3 Running 0 62s flowers-sample-default-7kqt6-deployment-75d577dcdb-vdlp9 3 /3 Running 0 64s flowers-sample-default-7kqt6-deployment-75d577dcdb-vm58d 3 /3 Running 0 42s Check Dashboard \u00b6 If configured, view the Knative Serving Scaling dashboards. kubectl kubectl port-forward --namespace knative-monitoring $( kubectl get pods --namespace knative-monitoring --selector = app = grafana --output = jsonpath = \"{.items..metadata.name}\" ) 3000 InferenceService with target QPS \u00b6 Create the InferenceService \u00b6 Apply the same TensorFlow example CR. kubectl kubectl apply -f autoscale.yaml Expected Output $ inferenceservice.serving.kserve.io/flowers-sample created Predict InferenceService with target QPS \u00b6 The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT . Send 30 seconds of traffic, maintaining 50 QPS. 
MODEL_NAME = flowers-sample INPUT_PATH = input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice $MODEL_NAME -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 30s -q 50 -m POST -host ${ SERVICE_HOSTNAME } -D $INPUT_PATH http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict Expected Output Summary: Total: 30 .0264 secs Slowest: 10 .8113 secs Fastest: 0 .0145 secs Average: 0 .0731 secs Requests/sec: 683 .5644 Total data: 5069675 bytes Size/request: 247 bytes Response time histogram: 0 .014 [ 1 ] | 1 .094 [ 20474 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 2 .174 [ 0 ] | 3 .254 [ 0 ] | 4 .333 [ 0 ] | 5 .413 [ 0 ] | 6 .493 [ 0 ] | 7 .572 [ 0 ] | 8 .652 [ 0 ] | 9 .732 [ 0 ] | 10 .811 [ 50 ] | Latency distribution: 10 % in 0 .0284 secs 25 % in 0 .0334 secs 50 % in 0 .0408 secs 75 % in 0 .0527 secs 90 % in 0 .0765 secs 95 % in 0 .0949 secs 99 % in 0 .1334 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0001 secs, 0 .0145 secs, 10 .8113 secs DNS-lookup: 0 .0000 secs, 0 .0000 secs, 0 .0196 secs req write: 0 .0000 secs, 0 .0000 secs, 0 .0031 secs resp wait: 0 .0728 secs, 0 .0144 secs, 10 .7688 secs resp read: 0 .0000 secs, 0 .0000 secs, 0 .0031 secs Status code distribution: [ 200 ] 20525 responses Check the number of running pods. The service is loaded with 50 requests per second, and the dashboard shows that it reaches an average concurrency of 10, so the autoscaler tries to scale up to 10 pods. Check Dashboard \u00b6 If configured, view the Knative Serving Scaling dashboards. 
kubectl port-forward --namespace knative-monitoring $( kubectl get pods --namespace knative-monitoring --selector = app = grafana --output = jsonpath = \"{.items..metadata.name}\" ) 3000 The autoscaler calculates average concurrency over a 60-second stable window, so it takes a minute to stabilize at the desired concurrency level. It also calculates concurrency over a 6-second panic window and enters panic mode if that window reaches 2x the target concurrency. The dashboard shows the autoscaler entering panic mode, in which it operates on the shorter, more sensitive window. Once the panic conditions are no longer met for 60 seconds, the autoscaler returns to the 60-second stable window. Autoscaling on GPU! \u00b6 Autoscaling on GPU metrics is hard. However, Knative's concurrency-based autoscaler makes scaling on GPU effective. Create the InferenceService with GPU resource \u00b6 Apply the TensorFlow GPU example CR. New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample-gpu\" spec : predictor : scaleTarget : 1 scaleMetric : concurrency model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" runtimeVersion : \"2.6.2-gpu\" resources : limits : nvidia.com/gpu : 1 apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample-gpu\" annotations : autoscaling.knative.dev/target : \"1\" spec : predictor : tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" runtimeVersion : \"2.6.2-gpu\" resources : limits : nvidia.com/gpu : 1 Apply the autoscale-gpu.yaml . kubectl kubectl apply -f autoscale-gpu.yaml Predict InferenceService with concurrent requests \u00b6 The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT . Send 30 seconds of traffic, maintaining 5 in-flight requests. 
MODEL_NAME = flowers-sample-gpu INPUT_PATH = input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice $MODEL_NAME -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 30s -c 5 -m POST -host ${ SERVICE_HOSTNAME } -D $INPUT_PATH http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict Expected Output Summary: Total: 30 .0152 secs Slowest: 9 .7581 secs Fastest: 0 .0142 secs Average: 0 .0350 secs Requests/sec: 142 .9942 Total data: 948532 bytes Size/request: 221 bytes Response time histogram: 0 .014 [ 1 ] | 0 .989 [ 4286 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 1 .963 [ 0 ] | 2 .937 [ 0 ] | 3 .912 [ 0 ] | 4 .886 [ 0 ] | 5 .861 [ 0 ] | 6 .835 [ 0 ] | 7 .809 [ 0 ] | 8 .784 [ 0 ] | 9 .758 [ 5 ] | Latency distribution: 10 % in 0 .0181 secs 25 % in 0 .0189 secs 50 % in 0 .0198 secs 75 % in 0 .0210 secs 90 % in 0 .0230 secs 95 % in 0 .0276 secs 99 % in 0 .0511 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0000 secs, 0 .0142 secs, 9 .7581 secs DNS-lookup: 0 .0000 secs, 0 .0000 secs, 0 .0291 secs req write: 0 .0000 secs, 0 .0000 secs, 0 .0023 secs resp wait: 0 .0348 secs, 0 .0141 secs, 9 .7158 secs resp read: 0 .0001 secs, 0 .0000 secs, 0 .0021 secs Status code distribution: [ 200 ] 4292 responses Autoscaling Customization \u00b6 Autoscaling with ContainerConcurrency \u00b6 ContainerConcurrency determines the number of simultaneous requests that each replica of the InferenceService can process at any given time. It is a hard limit: when concurrency reaches the limit, surplus requests are buffered and must wait until enough capacity is free to execute them. 
New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : containerConcurrency : 10 model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : containerConcurrency : 10 tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" Apply the autoscale-custom.yaml . kubectl kubectl apply -f autoscale-custom.yaml Enable scale down to zero \u00b6 By default, KServe sets minReplicas to 1. To enable scaling down to zero, which is especially useful for use cases like serving on GPUs, set minReplicas to 0 so that the pods automatically scale down to zero when no traffic is received. New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : minReplicas : 0 model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : minReplicas : 0 tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" Apply the scale-down-to-zero.yaml . kubectl kubectl apply -f scale-down-to-zero.yaml Autoscaling configuration at component level \u00b6 Autoscaling options can also be configured at the component level. This allows more flexibility in the autoscaling configuration. In a typical deployment, transformers may require a different autoscaling configuration than a predictor. This feature lets you scale individual components as required. 
New Schema Old Schema apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : torch-transformer spec : predictor : scaleTarget : 2 scaleMetric : concurrency model : modelFormat : name : pytorch storageUri : gs://kfserving-examples/models/torchserve/image_classifier transformer : scaleTarget : 8 scaleMetric : rps containers : - image : kserve/image-transformer:latest name : kserve-container command : - \"python\" - \"-m\" - \"model\" args : - --model_name - mnist apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : torch-transformer spec : predictor : scaleTarget : 2 scaleMetric : concurrency pytorch : storageUri : gs://kfserving-examples/models/torchserve/image_classifier transformer : scaleTarget : 8 scaleMetric : rps containers : - image : kserve/image-transformer:latest name : kserve-container command : - \"python\" - \"-m\" - \"model\" args : - --model_name - mnist Apply the autoscale-adv.yaml to create the autoscaling InferenceService. The default for scaleMetric is concurrency . Possible values are concurrency , rps , cpu , and memory .","title":"Inference Autoscaling"},{"location":"modelserving/autoscaling/autoscaling/#autoscale-inferenceservice-with-inference-workload","text":"","title":"Autoscale InferenceService with inference workload"},{"location":"modelserving/autoscaling/autoscaling/#inferenceservice-with-target-concurrency","text":"","title":"InferenceService with target concurrency"},{"location":"modelserving/autoscaling/autoscaling/#create-inferenceservice","text":"Apply the TensorFlow example CR with the scaling target set to 1. The autoscaling.knative.dev/target annotation is a soft limit rather than a strictly enforced limit: a sudden burst of requests can exceed this value. The scaleTarget and scaleMetric fields were introduced in version 0.9 of KServe and should be available in both the new and old schema. They are the preferred way of defining autoscaling options. 
New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : scaleTarget : 1 scaleMetric : concurrency model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" annotations : autoscaling.knative.dev/target : \"1\" spec : predictor : tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" Apply the autoscale.yaml to create the autoscaling InferenceService. kubectl kubectl apply -f autoscale.yaml Expected Output $ inferenceservice.serving.kserve.io/flowers-sample created","title":"Create InferenceService"},{"location":"modelserving/autoscaling/autoscaling/#predict-inferenceservice-with-concurrent-requests","text":"The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT . Send traffic in 30-second spurts, maintaining 5 in-flight requests. 
MODEL_NAME = flowers-sample INPUT_PATH = input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice $MODEL_NAME -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 30s -c 5 -m POST -host ${ SERVICE_HOSTNAME } -D $INPUT_PATH http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict Expected Output Summary: Total: 30 .0193 secs Slowest: 10 .1458 secs Fastest: 0 .0127 secs Average: 0 .0364 secs Requests/sec: 137 .4449 Total data: 1019122 bytes Size/request: 247 bytes Response time histogram: 0 .013 [ 1 ] | 1 .026 [ 4120 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 2 .039 [ 0 ] | 3 .053 [ 0 ] | 4 .066 [ 0 ] | 5 .079 [ 0 ] | 6 .093 [ 0 ] | 7 .106 [ 0 ] | 8 .119 [ 0 ] | 9 .133 [ 0 ] | 10 .146 [ 5 ] | Latency distribution: 10 % in 0 .0178 secs 25 % in 0 .0188 secs 50 % in 0 .0199 secs 75 % in 0 .0210 secs 90 % in 0 .0231 secs 95 % in 0 .0328 secs 99 % in 0 .1501 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0002 secs, 0 .0127 secs, 10 .1458 secs DNS-lookup: 0 .0002 secs, 0 .0000 secs, 0 .1502 secs req write: 0 .0000 secs, 0 .0000 secs, 0 .0020 secs resp wait: 0 .0360 secs, 0 .0125 secs, 9 .9791 secs resp read: 0 .0001 secs, 0 .0000 secs, 0 .0021 secs Status code distribution: [ 200 ] 4126 responses Check the number of running pods. KServe uses the Knative Serving autoscaler, which scales on the average number of in-flight requests per pod, known as concurrency. Because the scaling target is set to 1 and the service is loaded with 5 concurrent requests, the autoscaler tries to scale up to 5 pods. Notice that 5 requests on the histogram take around 10 seconds. That is the cold start cost of initially spawning the pods and downloading the model so that it is ready to serve. 
The cold start may take longer if the serving image is not cached on the node where the pod is scheduled, because the image must be pulled first. $ kubectl get pods NAME READY STATUS RESTARTS AGE flowers-sample-default-7kqt6-deployment-75d577dcdb-sr5wd 3 /3 Running 0 42s flowers-sample-default-7kqt6-deployment-75d577dcdb-swnk5 3 /3 Running 0 62s flowers-sample-default-7kqt6-deployment-75d577dcdb-t2njf 3 /3 Running 0 62s flowers-sample-default-7kqt6-deployment-75d577dcdb-vdlp9 3 /3 Running 0 64s flowers-sample-default-7kqt6-deployment-75d577dcdb-vm58d 3 /3 Running 0 42s","title":"Predict InferenceService with concurrent requests"},{"location":"modelserving/autoscaling/autoscaling/#check-dashboard","text":"If configured, view the Knative Serving Scaling dashboards. kubectl kubectl port-forward --namespace knative-monitoring $( kubectl get pods --namespace knative-monitoring --selector = app = grafana --output = jsonpath = \"{.items..metadata.name}\" ) 3000","title":"Check Dashboard"},{"location":"modelserving/autoscaling/autoscaling/#inferenceservice-with-target-qps","text":"","title":"InferenceService with target QPS"},{"location":"modelserving/autoscaling/autoscaling/#create-the-inferenceservice","text":"Apply the same TensorFlow example CR. kubectl kubectl apply -f autoscale.yaml Expected Output $ inferenceservice.serving.kserve.io/flowers-sample created","title":"Create the InferenceService"},{"location":"modelserving/autoscaling/autoscaling/#predict-inferenceservice-with-target-qps","text":"The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT . Send 30 seconds of traffic, maintaining 50 QPS. 
MODEL_NAME = flowers-sample INPUT_PATH = input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice $MODEL_NAME -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 30s -q 50 -m POST -host ${ SERVICE_HOSTNAME } -D $INPUT_PATH http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict Expected Output Summary: Total: 30 .0264 secs Slowest: 10 .8113 secs Fastest: 0 .0145 secs Average: 0 .0731 secs Requests/sec: 683 .5644 Total data: 5069675 bytes Size/request: 247 bytes Response time histogram: 0 .014 [ 1 ] | 1 .094 [ 20474 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 2 .174 [ 0 ] | 3 .254 [ 0 ] | 4 .333 [ 0 ] | 5 .413 [ 0 ] | 6 .493 [ 0 ] | 7 .572 [ 0 ] | 8 .652 [ 0 ] | 9 .732 [ 0 ] | 10 .811 [ 50 ] | Latency distribution: 10 % in 0 .0284 secs 25 % in 0 .0334 secs 50 % in 0 .0408 secs 75 % in 0 .0527 secs 90 % in 0 .0765 secs 95 % in 0 .0949 secs 99 % in 0 .1334 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0001 secs, 0 .0145 secs, 10 .8113 secs DNS-lookup: 0 .0000 secs, 0 .0000 secs, 0 .0196 secs req write: 0 .0000 secs, 0 .0000 secs, 0 .0031 secs resp wait: 0 .0728 secs, 0 .0144 secs, 10 .7688 secs resp read: 0 .0000 secs, 0 .0000 secs, 0 .0031 secs Status code distribution: [ 200 ] 20525 responses Check the number of running pods. The service is loaded with 50 requests per second, and the dashboard shows that it reaches an average concurrency of 10, so the autoscaler tries to scale up to 10 pods.","title":"Predict InferenceService with target QPS"},{"location":"modelserving/autoscaling/autoscaling/#check-dashboard_1","text":"If configured, view the Knative Serving Scaling dashboards. 
kubectl port-forward --namespace knative-monitoring $( kubectl get pods --namespace knative-monitoring --selector = app = grafana --output = jsonpath = \"{.items..metadata.name}\" ) 3000 The autoscaler calculates average concurrency over a 60-second stable window, so it takes a minute to stabilize at the desired concurrency level. It also calculates concurrency over a 6-second panic window and enters panic mode if that window reaches 2x the target concurrency. The dashboard shows the autoscaler entering panic mode, in which it operates on the shorter, more sensitive window. Once the panic conditions are no longer met for 60 seconds, the autoscaler returns to the 60-second stable window.","title":"Check Dashboard"},{"location":"modelserving/autoscaling/autoscaling/#autoscaling-on-gpu","text":"Autoscaling on GPU metrics is hard. However, Knative's concurrency-based autoscaler makes scaling on GPU effective.","title":"Autoscaling on GPU!"},{"location":"modelserving/autoscaling/autoscaling/#create-the-inferenceservice-with-gpu-resource","text":"Apply the TensorFlow GPU example CR. New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample-gpu\" spec : predictor : scaleTarget : 1 scaleMetric : concurrency model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" runtimeVersion : \"2.6.2-gpu\" resources : limits : nvidia.com/gpu : 1 apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample-gpu\" annotations : autoscaling.knative.dev/target : \"1\" spec : predictor : tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" runtimeVersion : \"2.6.2-gpu\" resources : limits : nvidia.com/gpu : 1 Apply the autoscale-gpu.yaml . 
kubectl kubectl apply -f autoscale-gpu.yaml","title":"Create the InferenceService with GPU resource"},{"location":"modelserving/autoscaling/autoscaling/#predict-inferenceservice-with-concurrent-requests_1","text":"The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT Send 30 seconds of traffic maintaining 5 in-flight requests. MODEL_NAME = flowers-sample-gpu INPUT_PATH = input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice $MODEL_NAME -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 30s -c 5 -m POST -host ${ SERVICE_HOSTNAME } -D $INPUT_PATH http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict Expected Output Summary: Total: 30 .0152 secs Slowest: 9 .7581 secs Fastest: 0 .0142 secs Average: 0 .0350 secs Requests/sec: 142 .9942 Total data: 948532 bytes Size/request: 221 bytes Response time histogram: 0 .014 [ 1 ] | 0 .989 [ 4286 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 1 .963 [ 0 ] | 2 .937 [ 0 ] | 3 .912 [ 0 ] | 4 .886 [ 0 ] | 5 .861 [ 0 ] | 6 .835 [ 0 ] | 7 .809 [ 0 ] | 8 .784 [ 0 ] | 9 .758 [ 5 ] | Latency distribution: 10 % in 0 .0181 secs 25 % in 0 .0189 secs 50 % in 0 .0198 secs 75 % in 0 .0210 secs 90 % in 0 .0230 secs 95 % in 0 .0276 secs 99 % in 0 .0511 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0000 secs, 0 .0142 secs, 9 .7581 secs DNS-lookup: 0 .0000 secs, 0 .0000 secs, 0 .0291 secs req write: 0 .0000 secs, 0 .0000 secs, 0 .0023 secs resp wait: 0 .0348 secs, 0 .0141 secs, 9 .7158 secs resp read: 0 .0001 secs, 0 .0000 secs, 0 .0021 secs Status code distribution: [ 200 ] 4292 responses","title":"Predict InferenceService with concurrent requests"},{"location":"modelserving/autoscaling/autoscaling/#autoscaling-customization","text":"","title":"Autoscaling 
Customization"},{"location":"modelserving/autoscaling/autoscaling/#autoscaling-with-containerconcurrency","text":"ContainerConcurrency determines the number of simultaneous requests that can be processed by each replica of the InferenceService at any given time, it is a hard limit and if the concurrency reaches the hard limit surplus requests will be buffered and must wait until enough capacity is free to execute the requests. New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : containerConcurrency : 10 model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : containerConcurrency : 10 tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" Apply the autoscale-custom.yaml . kubectl kubectl apply -f autoscale-custom.yaml","title":"Autoscaling with ContainerConcurrency"},{"location":"modelserving/autoscaling/autoscaling/#enable-scale-down-to-zero","text":"KServe by default sets minReplicas to 1, if you want to enable scaling down to zero especially for use cases like serving on GPUs you can set minReplicas to 0 so that the pods automatically scale down to zero when no traffic is received. New Schema Old Schema apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : minReplicas : 0 model : modelFormat : name : tensorflow storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" apiVersion : \"serving.kserve.io/v1beta1\" kind : \"InferenceService\" metadata : name : \"flowers-sample\" spec : predictor : minReplicas : 0 tensorflow : storageUri : \"gs://kfserving-examples/models/tensorflow/flowers\" Apply the scale-down-to-zero.yaml . 
kubectl kubectl apply -f scale-down-to-zero.yaml","title":"Enable scale down to zero"},{"location":"modelserving/autoscaling/autoscaling/#autoscaling-configuration-at-component-level","text":"Autoscaling options can also be configured at the component level. This allows more flexibility in terms of the autoscaling configuration. In a typical deployment, transformers may require a different autoscaling configuration than a predictor. This feature allows the user to scale individual components as required. New Schema Old Schema apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : torch-transformer spec : predictor : scaleTarget : 2 scaleMetric : concurrency model : modelFormat : name : pytorch storageUri : gs://kfserving-examples/models/torchserve/image_classifier transformer : scaleTarget : 8 scaleMetric : rps containers : - image : kserve/image-transformer:latest name : kserve-container command : - \"python\" - \"-m\" - \"model\" args : - --model_name - mnist apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : torch-transformer spec : predictor : scaleTarget : 2 scaleMetric : concurrency pytorch : storageUri : gs://kfserving-examples/models/torchserve/image_classifier transformer : scaleTarget : 8 scaleMetric : rps containers : - image : kserve/image-transformer:latest name : kserve-container command : - \"python\" - \"-m\" - \"model\" args : - --model_name - mnist Apply the autoscale-adv.yaml to create the Autoscale InferenceService. The default for scaleMetric is concurrency and possible values are concurrency , rps , cpu and memory .","title":"Autoscaling configuration at component level"},{"location":"modelserving/batcher/batcher/","text":"Inference Batcher \u00b6 This doc explains how to run batch prediction for any ML framework (TensorFlow, PyTorch, ...) without decreasing performance. 
This batcher is implemented in the KServe model agent sidecar, so requests first hit the agent sidecar; when a batch prediction is triggered, the batched request is sent to the model server container for inference. A webhook injects the model agent container into the InferenceService pod to do the batching when the batcher is enabled. Go channels transfer data between the HTTP request handler and the batcher goroutines. Currently batching is implemented only for the KServe v1 HTTP protocol; gRPC is not supported yet. When the number of instances (for example, the number of pictures) reaches maxBatchSize or the waiting time reaches maxLatency , a batch prediction is triggered. Example \u00b6 We first create a PyTorch predictor with a batcher. The maxLatency is set to a large value (500 milliseconds) so that we can observe the batching process. New Schema Old Schema apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : \"torchserve\" spec : predictor : minReplicas : 1 timeout : 60 batcher : maxBatchSize : 32 maxLatency : 500 model : modelFormat : name : pytorch storageUri : gs://kfserving-examples/models/torchserve/image_classifier/v1 apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : \"torchserve\" spec : predictor : minReplicas : 1 timeout : 60 batcher : maxBatchSize : 32 maxLatency : 500 pytorch : storageUri : gs://kfserving-examples/models/torchserve/image_classifier/v1 maxBatchSize : the max batch size for triggering a prediction. maxLatency : the max latency for triggering a prediction (in milliseconds). timeout : the timeout for calling the predictor service (in seconds). All of the fields below have default values in the code, so setting them is optional. maxBatchSize : 32. maxLatency : 500. timeout : 60. kubectl kubectl create -f pytorch-batcher.yaml We can now send requests to the PyTorch model using hey. 
The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT MODEL_NAME = mnist INPUT_PATH = @./input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice torchserve -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 10s -c 5 -m POST -host \" ${ SERVICE_HOSTNAME } \" -H \"Content-Type: application/json\" -D ./input.json \"http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict\" The request will go to the model agent container first, the batcher in sidecar container batches the requests and send the inference request to the predictor container. Note If the interval of sending the two requests is less than maxLatency , the returned batchId will be the same. Expected Output Summary: Total: 10 .5361 secs Slowest: 0 .5759 secs Fastest: 0 .4983 secs Average: 0 .5265 secs Requests/sec: 9 .4912 Total data: 24100 bytes Size/request: 241 bytes Response time histogram: 0 .498 [ 1 ] | \u25a0 0 .506 [ 0 ] | 0 .514 [ 44 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 0 .522 [ 21 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 0 .529 [ 4 ] | \u25a0\u25a0\u25a0\u25a0 0 .537 [ 5 ] | \u25a0\u25a0\u25a0\u25a0\u25a0 0 .545 [ 4 ] | \u25a0\u25a0\u25a0\u25a0 0 .553 [ 0 ] | 0 .560 [ 7 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 0 .568 [ 4 ] | \u25a0\u25a0\u25a0\u25a0 0 .576 [ 10 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 Latency distribution: 10 % in 0 .5100 secs 25 % in 0 .5118 secs 50 % in 0 .5149 secs 75 % in 0 .5406 secs 90 % in 0 .5706 secs 95 % in 0 .5733 secs 99 % in 0 .5759 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0004 secs, 0 .4983 secs, 0 .5759 secs DNS-lookup: 0 .0001 secs, 0 .0000 secs, 0 .0015 secs req write: 0 
.0002 secs, 0 .0000 secs, 0 .0076 secs resp wait: 0 .5257 secs, 0 .4981 secs, 0 .5749 secs resp read: 0 .0001 secs, 0 .0000 secs, 0 .0009 secs Status code distribution: [ 200 ] 100 responses","title":"Inference Batcher"},{"location":"modelserving/batcher/batcher/#inference-batcher","text":"This doc explains how to run batch prediction for any ML framework (TensorFlow, PyTorch, ...) without decreasing performance. This batcher is implemented in the KServe model agent sidecar, so requests first hit the agent sidecar; when a batch prediction is triggered, the batched request is sent to the model server container for inference. A webhook injects the model agent container into the InferenceService pod to do the batching when the batcher is enabled. Go channels transfer data between the HTTP request handler and the batcher goroutines. Currently batching is implemented only for the KServe v1 HTTP protocol; gRPC is not supported yet. When the number of instances (for example, the number of pictures) reaches maxBatchSize or the waiting time reaches maxLatency , a batch prediction is triggered.","title":"Inference Batcher"},{"location":"modelserving/batcher/batcher/#example","text":"We first create a PyTorch predictor with a batcher. The maxLatency is set to a large value (500 milliseconds) so that we can observe the batching process. 
New Schema Old Schema apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : \"torchserve\" spec : predictor : minReplicas : 1 timeout : 60 batcher : maxBatchSize : 32 maxLatency : 500 model : modelFormat : name : pytorch storageUri : gs://kfserving-examples/models/torchserve/image_classifier/v1 apiVersion : serving.kserve.io/v1beta1 kind : InferenceService metadata : name : \"torchserve\" spec : predictor : minReplicas : 1 timeout : 60 batcher : maxBatchSize : 32 maxLatency : 500 pytorch : storageUri : gs://kfserving-examples/models/torchserve/image_classifier/v1 maxBatchSize : the max batch size for triggering a prediction. maxLatency : the max latency for triggering a prediction (in milliseconds). timeout : the timeout for calling the predictor service (in seconds). All of the fields below have default values in the code, so setting them is optional. maxBatchSize : 32. maxLatency : 500. timeout : 60. kubectl kubectl create -f pytorch-batcher.yaml We can now send requests to the PyTorch model using hey. The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT MODEL_NAME = mnist INPUT_PATH = @./input.json SERVICE_HOSTNAME = $( kubectl get inferenceservice torchserve -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) hey -z 10s -c 5 -m POST -host \" ${ SERVICE_HOSTNAME } \" -H \"Content-Type: application/json\" -D ./input.json \"http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict\" The request goes to the model agent container first; the batcher in the sidecar container batches the requests and sends the inference request to the predictor container. Note If the interval between two requests is less than maxLatency , the returned batchId will be the same. 
Expected Output Summary: Total: 10 .5361 secs Slowest: 0 .5759 secs Fastest: 0 .4983 secs Average: 0 .5265 secs Requests/sec: 9 .4912 Total data: 24100 bytes Size/request: 241 bytes Response time histogram: 0 .498 [ 1 ] | \u25a0 0 .506 [ 0 ] | 0 .514 [ 44 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 0 .522 [ 21 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 0 .529 [ 4 ] | \u25a0\u25a0\u25a0\u25a0 0 .537 [ 5 ] | \u25a0\u25a0\u25a0\u25a0\u25a0 0 .545 [ 4 ] | \u25a0\u25a0\u25a0\u25a0 0 .553 [ 0 ] | 0 .560 [ 7 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 0 .568 [ 4 ] | \u25a0\u25a0\u25a0\u25a0 0 .576 [ 10 ] | \u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0\u25a0 Latency distribution: 10 % in 0 .5100 secs 25 % in 0 .5118 secs 50 % in 0 .5149 secs 75 % in 0 .5406 secs 90 % in 0 .5706 secs 95 % in 0 .5733 secs 99 % in 0 .5759 secs Details ( average, fastest, slowest ) : DNS+dialup: 0 .0004 secs, 0 .4983 secs, 0 .5759 secs DNS-lookup: 0 .0001 secs, 0 .0000 secs, 0 .0015 secs req write: 0 .0002 secs, 0 .0000 secs, 0 .0076 secs resp wait: 0 .5257 secs, 0 .4981 secs, 0 .5749 secs resp read: 0 .0001 secs, 0 .0000 secs, 0 .0009 secs Status code distribution: [ 200 ] 100 responses","title":"Example"},{"location":"modelserving/certificate/kserve/","text":"KServe with Self Signed Certificate Model Registry \u00b6 If you are using a model registry with a self-signed certificate, you must either skip ssl verify or apply the appropriate CA bundle to the storage-initializer to create a connection with the registry. 
This document explains three methods that can be used in KServe, described below: Configure CA bundle for storage-initializer Global configuration Namespace scope configuration (Using storage-config Secret) json annotation Skip SSL Verification (NOTE) This is only available for RawDeployment and ServerlessDeployment . For ModelMesh, add the CA bundle content to the certificate parameter in storage-config Configure CA bundle for storage-initializer \u00b6 Global Configuration \u00b6 KServe uses the inferenceservice-config ConfigMap for default configuration. If you want to add a cabundle cert for every inference service, you can set caBundleConfigMapName in the ConfigMap. Before updating the ConfigMap, you have to create a ConfigMap for the CA bundle certificate in the namespace where the KServe controller is running, and the data key in the ConfigMap must be cabundle.crt . Create CA ConfigMap with the CA bundle cert kubectl create configmap cabundle --from-file=/path/to/cabundle.crt kubectl get configmap cabundle -o yaml apiVersion: v1 data: cabundle.crt: XXXXX kind: ConfigMap metadata: name: cabundle namespace: kserve Update inferenceservice-config ConfigMap storageInitializer: |- { ... \"caBundleConfigMapName\": \"cabundle\", ... } After you update this configuration, restart the KServe controller pod to pick up the change. When you create an inference service, the CA bundle is copied to your user namespace and attached to the storage-initializer container. Using storage-config Secret \u00b6 If you want to apply the cabundle only to a specific inferenceservice, you can use a specific annotation or variable ( cabundle_configmap ) on the storage-config Secret used by the inferenceservice. In this case, you have to create the cabundle ConfigMap in the user namespace before you create the inferenceservice. 
Create a ConfigMap with the cabundle cert kubectl create configmap local-cabundle --from-file=/path/to/cabundle.crt kubectl get configmap local-cabundle -o yaml apiVersion: v1 data: cabundle.crt: XXXXX kind: ConfigMap metadata: name: local-cabundle namespace: kserve-demo Add an annotation serving.kserve.io/s3-cabundle-configmap to storage-config Secret apiVersion: v1 data: AWS_ACCESS_KEY_ID: VEhFQUNDRVNTS0VZ AWS_SECRET_ACCESS_KEY: VEhFUEFTU1dPUkQ= kind: Secret metadata: annotations: serving.kserve.io/s3-cabundle-configmap: local-cabundle ... name: storage-config namespace: kserve-demo type: Opaque Or, set a variable cabundle_configmap to storage-config Secret apiVersion: v1 stringData: localMinIO: | { \"type\": \"s3\", .... \"cabundle_configmap\": \"local-cabundle\" } kind: Secret metadata: name: storage-config namespace: kserve-demo type: Opaque Skip SSL Verification \u00b6 For testing purposes or when there is no cabundle, you can easily create an SSL connection by disabling SSL verification. This can also be done by adding an annotation or setting a variable in the storage-config Secret. 
\"verify_ssl\": \"0\" # 1 is true, 0 is false (You can set True/true/False/false too) } kind: Secret metadata: name: storage-config namespace: kserve-demo type: Opaque Full Demo Scripts","title":"CA Certificate"},{"location":"modelserving/certificate/kserve/#kserve-with-self-signed-certificate-model-registry","text":"If you are using a model registry with a self-signed certificate, you must either skip ssl verify or apply the appropriate CA bundle to the storage-initializer to create a connection with the registry. This document explains three methods that can be used in KServe, described below: Configure CA bundle for storage-initializer Global configuration Namespace scope configuration(Using storage-config Secret) json annotation Skip SSL Verification (NOTE) This is only available for RawDeployment and ServerlessDeployment . For modelmesh, you should add ca bundle content into certificate parameter in storage-config","title":"KServe with Self Signed Certificate Model Registry"},{"location":"modelserving/certificate/kserve/#configure-ca-bundle-for-storage-initializer","text":"","title":"Configure CA bundle for storage-initializer"},{"location":"modelserving/certificate/kserve/#global-configuration","text":"KServe use inferenceservice-config ConfigMap for default configuration. If you want to add cabundle cert for every inference service, you can set caBundleConfigMapName in the ConfigMap. Before updating the ConfigMap, you have to create a ConfigMap for CA bundle certificate in the namespace that KServe controller is running and the data key in the ConfigMap must be cabundle.crt . Create CA ConfigMap with the CA bundle cert kubectl create configmap cabundle --from-file=/path/to/cabundle.crt kubectl get configmap cabundle -o yaml apiVersion: v1 data: cabundle.crt: XXXXX kind: ConfigMap metadata: name: cabundle namespace: kserve Update inferenceservice-config ConfigMap storageInitializer: |- { ... \"caBundleConfigMapName\": \"cabundle\", ... 
} After you update this configuration, restart the KServe controller pod to pick up the change. When you create an inference service, the CA bundle is copied to your user namespace and attached to the storage-initializer container.","title":"Global Configuration"},{"location":"modelserving/certificate/kserve/#using-storage-config-secret","text":"If you want to apply the cabundle only to a specific inferenceservice, you can use a specific annotation or variable ( cabundle_configmap ) on the storage-config Secret used by the inferenceservice. In this case, you have to create the cabundle ConfigMap in the user namespace before you create the inferenceservice. Create a ConfigMap with the cabundle cert kubectl create configmap local-cabundle --from-file=/path/to/cabundle.crt kubectl get configmap local-cabundle -o yaml apiVersion: v1 data: cabundle.crt: XXXXX kind: ConfigMap metadata: name: local-cabundle namespace: kserve-demo Add an annotation serving.kserve.io/s3-cabundle-configmap to storage-config Secret apiVersion: v1 data: AWS_ACCESS_KEY_ID: VEhFQUNDRVNTS0VZ AWS_SECRET_ACCESS_KEY: VEhFUEFTU1dPUkQ= kind: Secret metadata: annotations: serving.kserve.io/s3-cabundle-configmap: local-cabundle ... name: storage-config namespace: kserve-demo type: Opaque Or, set a variable cabundle_configmap to storage-config Secret apiVersion: v1 stringData: localMinIO: | { \"type\": \"s3\", .... \"cabundle_configmap\": \"local-cabundle\" } kind: Secret metadata: name: storage-config namespace: kserve-demo type: Opaque","title":"Using storage-config Secret"},{"location":"modelserving/certificate/kserve/#skip-ssl-verification","text":"For testing purposes or when there is no cabundle, you can easily create an SSL connection by disabling SSL verification. This can also be done by adding an annotation or setting a variable in the storage-config Secret. 
Add an annotation( serving.kserve.io/s3-verifyssl ) to storage-config Secret apiVersion: v1 data: AWS_ACCESS_KEY_ID: VEhFQUNDRVNTS0VZ AWS_SECRET_ACCESS_KEY: VEhFUEFTU1dPUkQ= kind: Secret metadata: annotations: serving.kserve.io/s3-verifyssl: \"0\" # 1 is true, 0 is false ... name: storage-config namespace: kserve-demo type: Opaque Or, set a variable ( verify_ssl ) to storage-config Secret apiVersion: v1 stringData: localMinIO: | { \"type\": \"s3\", ... \"verify_ssl\": \"0\" # 1 is true, 0 is false (You can set True/true/False/false too) } kind: Secret metadata: name: storage-config namespace: kserve-demo type: Opaque Full Demo Scripts","title":"Skip SSL Verification"},{"location":"modelserving/data_plane/data_plane/","text":"Data Plane \u00b6 The InferenceService Data Plane architecture consists of a static graph of components which coordinate requests for a single model. Advanced features such as Ensembling, A/B testing, and Multi-Arm-Bandits should compose InferenceServices together. Introduction \u00b6 KServe's data plane protocol introduces an inference API that is independent of any specific ML/DL framework and model server. This allows for quick iterations and consistency across Inference Services and supports both easy-to-use and high-performance use cases. By implementing this protocol both inference clients and servers will increase their utility and portability by operating seamlessly on platforms that have standardized around this API. Kserve's inference protocol is endorsed by NVIDIA Triton Inference Server, TensorFlow Serving, and TorchServe. Note: Protocol V2 uses /infer instead of :predict Concepts \u00b6 Component : Each endpoint is composed of multiple components: \"predictor\", \"explainer\", and \"transformer\". The only required component is the predictor, which is the core of the system. As KServe evolves, we plan to increase the number of supported components to enable use cases like Outlier Detection. 
Predictor : The predictor is the workhorse of the InferenceService. It is simply a model and a model server that makes it available at a network endpoint. Explainer : The explainer enables an optional alternate data plane that provides model explanations in addition to predictions. Users may define their own explanation container, which is configured with relevant environment variables such as the prediction endpoint. For common use cases, KServe provides out-of-the-box explainers like Alibi. Transformer : The transformer enables users to define pre- and post-processing steps for the prediction and explanation workflows. Like the explainer, it is configured with relevant environment variables too. For common use cases, KServe provides out-of-the-box transformers like Feast. Data Plane V1 & V2 \u00b6 KServe supports two versions of its data plane, V1 and V2. The V1 protocol offers a standard prediction workflow with HTTP/REST. The second version of the data-plane protocol addresses several issues found with the V1 data-plane protocol, including performance and generality across a large number of model frameworks and servers. Protocol V2 expands the capabilities of V1 by adding gRPC APIs. 
Main changes \u00b6 V2 does not currently support the explain endpoint V2 added Server Readiness/Liveness/Metadata endpoints V2 endpoint paths contain / instead of : V2 renamed :predict endpoint to /infer V2 allows for model versions in the request path (optional) V1 APIs \u00b6 API Verb Path List Models GET /v1/models Model Ready GET /v1/models/ Predict POST /v1/models/:predict Explain POST /v1/models/:explain V2 APIs \u00b6 API Verb Path Inference POST v2/models/[/versions/]/infer Model Metadata GET v2/models/[/versions/] Server Readiness GET v2/health/ready Server Liveness GET v2/health/live Server Metadata GET v2 Model Readiness GET v2/models/[/versions/ ]/ready ** path contents in [] are optional Please see V1 Protocol and V2 Protocol documentation for more information.","title":"Model Serving Data Plane"},{"location":"modelserving/data_plane/data_plane/#data-plane","text":"The InferenceService Data Plane architecture consists of a static graph of components which coordinate requests for a single model. Advanced features such as Ensembling, A/B testing, and Multi-Arm-Bandits should compose InferenceServices together.","title":"Data Plane"},{"location":"modelserving/data_plane/data_plane/#introduction","text":"KServe's data plane protocol introduces an inference API that is independent of any specific ML/DL framework and model server. This allows for quick iterations and consistency across Inference Services and supports both easy-to-use and high-performance use cases. By implementing this protocol both inference clients and servers will increase their utility and portability by operating seamlessly on platforms that have standardized around this API. Kserve's inference protocol is endorsed by NVIDIA Triton Inference Server, TensorFlow Serving, and TorchServe. 
Note: Protocol V2 uses /infer instead of :predict","title":"Introduction"},{"location":"modelserving/data_plane/data_plane/#concepts","text":"Component : Each endpoint is composed of multiple components: \"predictor\", \"explainer\", and \"transformer\". The only required component is the predictor, which is the core of the system. As KServe evolves, we plan to increase the number of supported components to enable use cases like Outlier Detection. Predictor : The predictor is the workhorse of the InferenceService. It is simply a model and a model server that makes it available at a network endpoint. Explainer : The explainer enables an optional alternate data plane that provides model explanations in addition to predictions. Users may define their own explanation container, which is configured with relevant environment variables such as the prediction endpoint. For common use cases, KServe provides out-of-the-box explainers like Alibi. Transformer : The transformer enables users to define pre- and post-processing steps for the prediction and explanation workflows. Like the explainer, it is configured with relevant environment variables too. For common use cases, KServe provides out-of-the-box transformers like Feast.","title":"Concepts"},{"location":"modelserving/data_plane/data_plane/#data-plane-v1-v2","text":"KServe supports two versions of its data plane, V1 and V2. The V1 protocol offers a standard prediction workflow with HTTP/REST. The second version of the data-plane protocol addresses several issues found with the V1 data-plane protocol, including performance and generality across a large number of model frameworks and servers. 
Protocol V2 expands the capabilities of V1 by adding gRPC APIs.","title":"Data Plane V1 & V2"},{"location":"modelserving/data_plane/data_plane/#main-changes","text":"V2 does not currently support the explain endpoint V2 added Server Readiness/Liveness/Metadata endpoints V2 endpoint paths contain / instead of : V2 renamed :predict endpoint to /infer V2 allows for model versions in the request path (optional)","title":"Main changes"},{"location":"modelserving/data_plane/data_plane/#v1-apis","text":"API Verb Path List Models GET /v1/models Model Ready GET /v1/models/ Predict POST /v1/models/:predict Explain POST /v1/models/:explain","title":"V1 APIs"},{"location":"modelserving/data_plane/data_plane/#v2-apis","text":"API Verb Path Inference POST v2/models/[/versions/]/infer Model Metadata GET v2/models/[/versions/] Server Readiness GET v2/health/ready Server Liveness GET v2/health/live Server Metadata GET v2 Model Readiness GET v2/models/[/versions/ ]/ready ** path contents in [] are optional Please see V1 Protocol and V2 Protocol documentation for more information.","title":"V2 APIs"},{"location":"modelserving/data_plane/v1_protocol/","text":"Data Plane (V1) \u00b6 KServe's V1 protocol offers a standardized prediction workflow across all model frameworks. This protocol version is still supported, but it is recommended that users migrate to the V2 protocol for better performance and standardization among serving runtimes. However, if a use case requires a more flexible schema than protocol v2 provides, v1 protocol is still an option. API Verb Path Request Payload Response Payload List Models GET /v1/models {\"models\": []} Model Ready GET /v1/models/ {\"name\": ,\"ready\": $bool} Predict POST /v1/models/:predict {\"instances\": []} ** {\"predictions\": []} Explain POST /v1/models/:explain {\"instances\": []} ** {\"predictions\": [], \"explanations\": []} ** = payload is optional Note: The response payload in V1 protocol is not strictly enforced. 
A custom server can define and return its own response payload. We encourage using the KServe defined response payload for consistency. API Definitions \u00b6 API Definition Predict The \"predict\" API performs inference on a model. The response is the prediction result. All InferenceServices speak the Tensorflow V1 HTTP API . Explain The \"explain\" API is an optional component that provides model explanations in addition to predictions. The standardized explainer interface is identical to the Tensorflow V1 HTTP API with the addition of an \":explain\" verb. Model Ready The \u201cmodel ready\u201d health API indicates if a specific model is ready for inferencing. If the model(s) is downloaded and ready to serve requests, the model ready endpoint returns the list of accessible (s). List Models The \"models\" API exposes a list of models in the model registry.","title":"V1 Inference Protocol"},{"location":"modelserving/data_plane/v1_protocol/#data-plane-v1","text":"KServe's V1 protocol offers a standardized prediction workflow across all model frameworks. This protocol version is still supported, but it is recommended that users migrate to the V2 protocol for better performance and standardization among serving runtimes. However, if a use case requires a more flexible schema than protocol v2 provides, v1 protocol is still an option. API Verb Path Request Payload Response Payload List Models GET /v1/models {\"models\": []} Model Ready GET /v1/models/ {\"name\": ,\"ready\": $bool} Predict POST /v1/models/:predict {\"instances\": []} ** {\"predictions\": []} Explain POST /v1/models/:explain {\"instances\": []} ** {\"predictions\": [], \"explanations\": []} ** = payload is optional Note: The response payload in V1 protocol is not strictly enforced. A custom server can define and return its own response payload. 
We encourage using the KServe defined response payload for consistency.","title":"Data Plane (V1)"},{"location":"modelserving/data_plane/v1_protocol/#api-definitions","text":"API Definition Predict The \"predict\" API performs inference on a model. The response is the prediction result. All InferenceServices speak the Tensorflow V1 HTTP API . Explain The \"explain\" API is an optional component that provides model explanations in addition to predictions. The standardized explainer interface is identical to the Tensorflow V1 HTTP API with the addition of an \":explain\" verb. Model Ready The \u201cmodel ready\u201d health API indicates if a specific model is ready for inferencing. If the model(s) is downloaded and ready to serve requests, the model ready endpoint returns the list of accessible (s). List Models The \"models\" API exposes a list of models in the model registry.","title":"API Definitions"},{"location":"modelserving/data_plane/v2_protocol/","text":"Open Inference Protocol (V2 Inference Protocol) \u00b6 For an inference server to be compliant with this protocol the server must implement the health, metadata, and inference V2 APIs . Optional features that are explicitly noted are not required. A compliant inference server may choose to implement the HTTP/REST API and/or the GRPC API . Check the model serving runtime table / the protocolVersion field in the runtime YAML to ensure V2 protocol is supported for model serving runtime that you are using. Note: For all API descriptions on this page, all strings in all contexts are case-sensitive. The V2 protocol supports an extension mechanism as a required part of the API, but this document does not propose any specific extensions. Any specific extensions will be proposed separately. Note on changes between V1 & V2 \u00b6 V2 protocol does not currently support the explain endpoint like V1 protocol does. If this is a feature you wish to have in the V2 protocol, please submit a github issue . 
HTTP/REST \u00b6 The HTTP/REST API uses JSON because it is widely supported and language independent. In all JSON schemas shown in this document $number, $string, $boolean, $object and $array refer to the fundamental JSON types. #optional indicates an optional JSON field. See also: The HTTP/REST endpoints are defined in rest_predict_v2.yaml API Verb Path Request Payload Response Payload Inference POST v2/models/ [/versions/]/infer $inference_request $inference_response Model Metadata GET v2/models/[/versions/] $metadata_model_response Server Ready GET v2/health/ready $ready_server_response Server Live GET v2/health/live $live_server_response Server Metadata GET v2 $metadata_server_response Model Ready GET v2/models/[/versions/ ]/ready $ready_model_response ** path contents in [] are optional For more information regarding payload contents, see Payload Contents . The versions portion of the Path URLs (in [] ) is shown as optional to allow implementations that don\u2019t support versioning or for cases when the user does not want to specify a specific model version (in which case the server will choose a version based on its own policies). For example, if a model does not implement a version, the Model Metadata request path could look like v2/models/my_model . If the model has been configured to implement a version, the request path could look something like v2/models/my_model/versions/v10 , where the version of the model is v10. API Definitions \u00b6 API Definition Inference The /infer endpoint performs inference on a model. The response is the prediction result. Model Metadata The \"model metadata\" API is a per-model endpoint that returns details about the model passed in the path. Server Ready The \u201cserver ready\u201d health API indicates if all the models are ready for inferencing. 
The \u201cserver ready\u201d health API can be used directly to implement the Kubernetes readinessProbe. Server Live The \u201cserver live\u201d health API indicates if the inference server is able to receive and respond to metadata and inference requests. The \u201cserver live\u201d API can be used directly to implement the Kubernetes livenessProbe. Server Metadata The \"server metadata\" API returns details describing the server. Model Ready The \u201cmodel ready\u201d health API indicates if a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL. Health/Readiness/Liveness Probes \u00b6 The Model Readiness probe answers the question \"Did the model download and is it able to serve requests?\" and responds with the available model name(s). The Server Readiness/Liveness probes answer the question \"Is my service and its infrastructure running, healthy, and able to receive and process requests?\" To read more about liveness and readiness probe concepts, visit the Configure Liveness, Readiness and Startup Probes Kubernetes documentation. Payload Contents \u00b6 Model Ready \u00b6 The model ready endpoint returns the readiness probe response for the model along with the name of the model. Model Ready Response JSON Object \u00b6 $ready_model_response = { \"name\" : $string, \"ready\": $bool } Server Ready \u00b6 The server ready endpoint returns the readiness probe response for the server. Server Ready Response JSON Object \u00b6 $ready_server_response = { \"ready\" : $bool, } Server Live \u00b6 The server live endpoint returns the liveness probe response for the server. Server Live Response JSON Object \u00b6 $live_server_response = { \"live\" : $bool, } Server Metadata \u00b6 The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata endpoint. 
In the corresponding response the HTTP body contains the Server Metadata Response JSON Object or the Server Metadata Response JSON Error Object . Server Metadata Response JSON Object \u00b6 A successful server metadata request is indicated by a 200 HTTP status code. The server metadata response object, identified as $metadata_server_response , is returned in the HTTP body. $metadata_server_response = { \"name\" : $string, \"version\" : $string, \"extensions\" : [ $string, ... ] } \u201cname\u201d : A descriptive name for the server. \"version\" : The server version. \u201cextensions\u201d : The extensions supported by the server. Currently, no standard extensions are defined. Individual inference servers may define and document their own extensions. Server Metadata Response JSON Error Object \u00b6 A failed server metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_server_error_response object. $metadata_server_error_response = { \"error\": $string } \u201cerror\u201d : The descriptive message for the error. Model Metadata \u00b6 The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. In the corresponding response the HTTP body contains the Model Metadata Response JSON Object or the Model Metadata Response JSON Error Object . The model name and (optionally) version must be available in the URL. 
If a version is not provided the server may choose a version based on its own policies or return an error. Model Metadata Response JSON Object \u00b6 A successful model metadata request is indicated by a 200 HTTP status code. The metadata response object, identified as $metadata_model_response , is returned in the HTTP body for every successful model metadata request. $metadata_model_response = { \"name\" : $string, \"versions\" : [ $string, ... ] #optional, \"platform\" : $string, \"inputs\" : [ $metadata_tensor, ... ], \"outputs\" : [ $metadata_tensor, ... ] } \u201cname\u201d : The name of the model. \"versions\" : The model versions that may be explicitly requested via the appropriate endpoint. Optional for servers that don\u2019t support versions. Optional for models that don\u2019t allow a version to be explicitly requested. \u201cplatform\u201d : The framework/backend for the model. See Platforms . \u201cinputs\u201d : The inputs required by the model. \u201coutputs\u201d : The outputs produced by the model. Each model input and output tensors\u2019 metadata is described with a $metadata_tensor object . $metadata_tensor = { \"name\" : $string, \"datatype\" : $string, \"shape\" : [ $number, ... ] } \u201cname\u201d : The name of the tensor. \"datatype\" : The data-type of the tensor elements as defined in Tensor Data Types . \"shape\" : The shape of the tensor. Variable-size dimensions are specified as -1. Model Metadata Response JSON Error Object \u00b6 A failed model metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_model_error_response object. $metadata_model_error_response = { \"error\": $string } \u201cerror\u201d : The descriptive message for the error. Inference \u00b6 An inference request is made with an HTTP POST to an inference endpoint. In the request the HTTP body contains the Inference Request JSON Object . 
In the corresponding response the HTTP body contains the Inference Response JSON Object or Inference Response JSON Error Object . See Inference Request Examples for some example HTTP/REST requests and responses. Inference Request JSON Object \u00b6 The inference request object, identified as $inference_request , is required in the HTTP body of the POST request. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error. $inference_request = { \"id\" : $string #optional, \"parameters\" : $parameters #optional, \"inputs\" : [ $request_input, ... ], \"outputs\" : [ $request_output, ... ] #optional } \"id\" : An identifier for this request. Optional, but if specified this identifier must be returned in the response. \"parameters\" : An object containing zero or more parameters for this inference request expressed as key/value pairs. See Parameters for more information. \"inputs\" : The input tensors. Each input is described using the $request_input schema defined in Request Input . \"outputs\" : The output tensors requested for this inference. Each requested output is described using the $request_output schema defined in Request Output . Optional, if not specified all outputs produced by the model will be returned using default $request_output settings. Request Input \u00b6 The $inference_request_input JSON describes an input to the model. If the input is batched, the shape and data must represent the full shape and contents of the entire batch. $inference_request_input = { \"name\" : $string, \"shape\" : [ $number, ... ], \"datatype\" : $string, \"parameters\" : $parameters #optional, \"data\" : $tensor_data } \"name\" : The name of the input tensor. \"shape\" : The shape of the input tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value. 
\"datatype\" : The data-type of the input tensor elements as defined in Tensor Data Types . \"parameters\" : An object containing zero or more parameters for this input expressed as key/value pairs. See Parameters for more information. \u201cdata\u201d: The contents of the tensor. See Tensor Data for more information. Request Output \u00b6 The $request_output JSON is used to request which output tensors should be returned from the model. $inference_request_output = { \"name\" : $string, \"parameters\" : $parameters #optional, } \"name\" : The name of the output tensor. \"parameters\" : An object containing zero or more parameters for this output expressed as key/value pairs. See Parameters for more information. Inference Response JSON Object \u00b6 A successful inference request is indicated by a 200 HTTP status code. The inference response object, identified as $inference_response , is returned in the HTTP body. $inference_response = { \"model_name\" : $string, \"model_version\" : $string #optional, \"id\" : $string, \"parameters\" : $parameters #optional, \"outputs\" : [ $response_output, ... ] } \"model_name\" : The name of the model used for inference. \"model_version\" : The specific model version used for inference. Inference servers that do not implement versioning should not provide this field in the response. \"id\" : The \"id\" identifier given in the request, if any. \"parameters\" : An object containing zero or more parameters for this response expressed as key/value pairs. See Parameters for more information. \"outputs\" : The output tensors. Each output is described using the $response_output schema defined in Response Output . Response Output \u00b6 The $response_output JSON describes an output from the model. If the output is batched, the shape and data represents the full shape of the entire batch. $response_output = { \"name\" : $string, \"shape\" : [ $number, ... 
], \"datatype\" : $string, \"parameters\" : $parameters #optional, \"data\" : $tensor_data } \"name\" : The name of the output tensor. \"shape\" : The shape of the output tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value. \"datatype\" : The data-type of the output tensor elements as defined in Tensor Data Types . \"parameters\" : An object containing zero or more parameters for this output expressed as key/value pairs. See Parameters for more information. \u201cdata\u201d: The contents of the tensor. See Tensor Data for more information. Inference Response JSON Error Object \u00b6 A failed inference request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $inference_error_response object. $inference_error_response = { \"error\": } \u201cerror\u201d : The descriptive message for the error. Parameters \u00b6 The $parameters JSON describes zero or more \u201cname\u201d/\u201dvalue\u201d pairs, where the \u201cname\u201d is the name of the parameter and the \u201cvalue\u201d is a $string, $number, or $boolean. $parameters = { $parameter, ... } $parameter = $string : $string | $number | $boolean Currently no parameters are defined. As required, a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities. Tensor Data \u00b6 Tensor data must be presented in row-major order of the tensor elements. Element values must be given in \"linear\" order without any stride or padding between elements. Tensor elements may be presented in their natural multi-dimensional representation, or as a flattened one-dimensional representation. Tensor data given explicitly is provided in a JSON array. Each element of the array may be an integer, floating-point number, string or boolean value. 
The server can decide to coerce each element to the required type or return an error if an unexpected value is received. Note that fp16 and bf16 are problematic to communicate explicitly since there is not a standard fp16/bf16 representation across backends nor typically the programmatic support to create the fp16/bf16 representation for a JSON number. For example, the 2-dimensional matrix: [ 1 2 4 5 ] Can be represented in its natural format as: \"data\" : [ [ 1, 2 ], [ 4, 5 ] ] Or in a flattened one-dimensional representation: \"data\" : [ 1, 2, 4, 5 ] Tensor Data Types \u00b6 Tensor data types are shown in the following table along with the size of each type, in bytes. Data Type Size (bytes) BOOL 1 UINT8 1 UINT16 2 UINT32 4 UINT64 8 INT8 1 INT16 2 INT32 4 INT64 8 FP16 2 FP32 4 FP64 8 BYTES Variable (max 2^32) --- Inference Request Examples \u00b6 The following example shows an inference request to a model with two inputs and one output. The HTTP Content-Length header gives the size of the JSON object. POST /v2/models/mymodel/infer HTTP/1.1 Host: localhost:8000 Content-Type: application/json Content-Length: { \"id\" : \"42\", \"inputs\" : [ { \"name\" : \"input0\", \"shape\" : [ 2, 2 ], \"datatype\" : \"UINT32\", \"data\" : [ 1, 2, 3, 4 ] }, { \"name\" : \"input1\", \"shape\" : [ 1 ], \"datatype\" : \"BOOL\", \"data\" : [ true ] } ], \"outputs\" : [ { \"name\" : \"output0\" } ] } For the above request the inference server must return the \u201coutput0\u201d output tensor. Assuming the model returns a [ 3, 2 ] tensor of data type FP32 the following response would be returned. HTTP/1.1 200 OK Content-Type: application/json Content-Length: { \"id\" : \"42\", \"outputs\" : [ { \"name\" : \"output0\", \"shape\" : [ 3, 2 ], \"datatype\" : \"FP32\", \"data\" : [ 1.0, 1.1, 2.0, 2.1, 3.0, 3.1 ] } ] } gRPC \u00b6 The GRPC API closely follows the concepts defined in the HTTP/REST API. 
A compliant server must implement the health, metadata, and inference APIs described in this section. API rpc Endpoint Request Message Response Message Inference ModelInfer ModelInferRequest ModelInferResponse Model Ready ModelReady [ModelReadyRequest] ModelReadyResponse Model Metadata ModelMetadata ModelMetadataRequest ModelMetadataResponse Server Ready ServerReady ServerReadyRequest ServerReadyResponse Server Live ServerLive ServerLiveRequest ServerLiveResponse For more detailed information on each endpoint and its contents, see API Definitions and Message Contents . See also: The gRPC endpoints, request/response messages and contents are defined in grpc_predict_v2.proto API Definitions \u00b6 The GRPC definition of the service is: // // Inference Server GRPC endpoints. // service GRPCInferenceService { // Check liveness of the inference server. rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse) {} // Check readiness of the inference server. rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse) {} // Check readiness of a model in the inference server. rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse) {} // Get server metadata. rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse) {} // Get model metadata. rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse) {} // Perform inference using a specific model. rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse) {} } Message Contents \u00b6 Health \u00b6 A health request is made using the ServerLive, ServerReady, or ModelReady endpoint. For each of these endpoints errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. Server Live \u00b6 The ServerLive API indicates if the inference server is able to receive and respond to metadata and inference requests. 
The request and response messages for ServerLive are: message ServerLiveRequest {} message ServerLiveResponse { // True if the inference server is live, false if not live. bool live = 1; } Server Ready \u00b6 The ServerReady API indicates if the server is ready for inferencing. The request and response messages for ServerReady are: message ServerReadyRequest {} message ServerReadyResponse { // True if the inference server is ready, false if not ready. bool ready = 1; } Model Ready \u00b6 The ModelReady API indicates if a specific model is ready for inferencing. The request and response messages for ModelReady are: message ModelReadyRequest { // The name of the model to check for readiness. string name = 1; // The version of the model to check for readiness. If not given the // server will choose a version based on the model and internal policy. string version = 2; } message ModelReadyResponse { // True if the model is ready, false if not ready. bool ready = 1; } Metadata \u00b6 Server Metadata \u00b6 The ServerMetadata API provides information about the server. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ServerMetadata are: message ServerMetadataRequest {} message ServerMetadataResponse { // The server name. string name = 1; // The server version. string version = 2; // The extensions supported by the server. repeated string extensions = 3; } Model Metadata \u00b6 The per-model metadata API provides information about a model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelMetadata are: message ModelMetadataRequest { // The name of the model. string name = 1; // The version of the model to check for readiness. If not given the // server will choose a version based on the model and internal policy. 
string version = 2; } message ModelMetadataResponse { // Metadata for a tensor. message TensorMetadata { // The tensor name. string name = 1; // The tensor data type. string datatype = 2; // The tensor shape. A variable-size dimension is represented // by a -1 value. repeated int64 shape = 3; } // The model name. string name = 1; // The versions of the model available on the server. repeated string versions = 2; // The model's platform. See Platforms. string platform = 3; // The model's inputs. repeated TensorMetadata inputs = 4; // The model's outputs. repeated TensorMetadata outputs = 5; } Platforms \u00b6 A platform is a string indicating a DL/ML framework or backend. Platform is returned as part of the response to a Model Metadata request but is informational only. The proposed inference APIs are generic relative to the DL/ML framework used by a model and so a client does not need to know the platform of a given model to use the API. Platform names use the format \u201c _ \u201d. The following platform names are allowed: tensorrt_plan : A TensorRT model encoded as a serialized engine or \u201cplan\u201d. tensorflow_graphdef : A TensorFlow model encoded as a GraphDef. tensorflow_savedmodel : A TensorFlow model encoded as a SavedModel. onnx_onnxv1 : An ONNX model encoded for ONNX Runtime. pytorch_torchscript : A PyTorch model encoded as TorchScript. mxnet_mxnet : An MXNet model. caffe2_netdef : A Caffe2 model encoded as a NetDef. Inference \u00b6 The ModelInfer API performs inference using the specified model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelInfer are: message ModelInferRequest { // An input tensor for an inference request. message InferInputTensor { // The tensor name. string name = 1; // The tensor data type. string datatype = 2; // The tensor shape. repeated int64 shape = 3; // Optional inference input tensor parameters. 
map parameters = 4; // The tensor contents using a data-type format. This field must // not be specified if \"raw\" tensor contents are being used for // the inference request. InferTensorContents contents = 5; } // An output tensor requested for an inference request. message InferRequestedOutputTensor { // The tensor name. string name = 1; // Optional requested output tensor parameters. map parameters = 2; } // The name of the model to use for inferencing. string model_name = 1; // The version of the model to use for inference. If not given the // server will choose a version based on the model and internal policy. string model_version = 2; // Optional identifier for the request. If specified will be // returned in the response. string id = 3; // Optional inference parameters. map parameters = 4; // The input tensors for the inference. repeated InferInputTensor inputs = 5; // The requested output tensors for the inference. Optional, if not // specified all outputs produced by the model will be returned. repeated InferRequestedOutputTensor outputs = 6; // The data contained in an input tensor can be represented in \"raw\" // bytes form or in the repeated type that matches the tensor's data // type. To use the raw representation 'raw_input_contents' must be // initialized with data for each tensor in the same order as // 'inputs'. For each tensor, the size of this content must match // what is expected by the tensor's shape and data type. The raw // data must be the flattened, one-dimensional, row-major order of // the tensor elements without any stride or padding between the // elements. Note that the FP16 data type must be represented as raw // content as there is no specific data type for a 16-bit float // type. // // If this field is specified then InferInputTensor::contents must // not be specified for any input tensor. repeated bytes raw_input_contents = 7; } message ModelInferResponse { // An output tensor returned for an inference request. 
message InferOutputTensor { // The tensor name. string name = 1; // The tensor data type. string datatype = 2; // The tensor shape. repeated int64 shape = 3; // Optional output tensor parameters. map parameters = 4; // The tensor contents using a data-type format. This field must // not be specified if \"raw\" tensor contents are being used for // the inference response. InferTensorContents contents = 5; } // The name of the model used for inference. string model_name = 1; // The version of the model used for inference. string model_version = 2; // The id of the inference request if one was specified. string id = 3; // Optional inference response parameters. map parameters = 4; // The output tensors holding inference results. repeated InferOutputTensor outputs = 5; // The data contained in an output tensor can be represented in // \"raw\" bytes form or in the repeated type that matches the // tensor's data type. To use the raw representation 'raw_output_contents' // must be initialized with data for each tensor in the same order as // 'outputs'. For each tensor, the size of this content must match // what is expected by the tensor's shape and data type. The raw // data must be the flattened, one-dimensional, row-major order of // the tensor elements without any stride or padding between the // elements. Note that the FP16 data type must be represented as raw // content as there is no specific data type for a 16-bit float // type. // // If this field is specified then InferOutputTensor::contents must // not be specified for any output tensor. repeated bytes raw_output_contents = 6; } Parameters \u00b6 The Parameters message describes a \u201cname\u201d/\u201dvalue\u201d pair, where the \u201cname\u201d is the name of the parameter and the \u201cvalue\u201d is a boolean, integer, or string corresponding to the parameter. Currently, no parameters are defined. 
As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities. // // An inference parameter value. // message InferParameter { // The parameter value can be a string, an int64, a boolean // or a message specific to a predefined parameter. oneof parameter_choice { // A boolean parameter value. bool bool_param = 1; // An int64 parameter value. int64 int64_param = 2; // A string parameter value. string string_param = 3; } } Tensor Data \u00b6 In all representations tensor data must be flattened to a one-dimensional, row-major order of the tensor elements. Element values must be given in \"linear\" order without any stride or padding between elements. Using a \"raw\" representation of tensors with ModelInferRequest::raw_input_contents and ModelInferResponse::raw_output_contents will typically allow higher performance due to the way protobuf allocation and reuse interacts with GRPC. For example, see https://github.com/grpc/grpc/issues/23231. An alternative to the \"raw\" representation is to use InferTensorContents to represent the tensor data in a format that matches the tensor's data type. // // The data contained in a tensor represented by the repeated type // that matches the tensor's data type. Protobuf oneof is not used // because oneofs cannot contain repeated fields. // message InferTensorContents { // Representation for BOOL data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated bool bool_contents = 1; // Representation for INT8, INT16, and INT32 data types. The size // must match what is expected by the tensor's shape. The contents // must be the flattened, one-dimensional, row-major order of the // tensor elements. 
repeated int32 int_contents = 2; // Representation for INT64 data types. The size must match what // is expected by the tensor's shape. The contents must be the // flattened, one-dimensional, row-major order of the tensor elements. repeated int64 int64_contents = 3; // Representation for UINT8, UINT16, and UINT32 data types. The size // must match what is expected by the tensor's shape. The contents // must be the flattened, one-dimensional, row-major order of the // tensor elements. repeated uint32 uint_contents = 4; // Representation for UINT64 data types. The size must match what // is expected by the tensor's shape. The contents must be the // flattened, one-dimensional, row-major order of the tensor elements. repeated uint64 uint64_contents = 5; // Representation for FP32 data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated float fp32_contents = 6; // Representation for FP64 data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated double fp64_contents = 7; // Representation for BYTES data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated bytes bytes_contents = 8; } Tensor Data Types \u00b6 Tensor data types are shown in the following table along with the size of each type, in bytes. 
Data Type Size (bytes) BOOL 1 UINT8 1 UINT16 2 UINT32 4 UINT64 8 INT8 1 INT16 2 INT32 4 INT64 8 FP16 2 FP32 4 FP64 8 BYTES Variable (max 2 32 )","title":"Open Inference Protocol (V2 Inference Protocol)"},{"location":"modelserving/data_plane/v2_protocol/#open-inference-protocol-v2-inference-protocol","text":"For an inference server to be compliant with this protocol the server must implement the health, metadata, and inference V2 APIs . Optional features that are explicitly noted are not required. A compliant inference server may choose to implement the HTTP/REST API and/or the GRPC API . Check the model serving runtime table / the protocolVersion field in the runtime YAML to ensure V2 protocol is supported for model serving runtime that you are using. Note: For all API descriptions on this page, all strings in all contexts are case-sensitive. The V2 protocol supports an extension mechanism as a required part of the API, but this document does not propose any specific extensions. Any specific extensions will be proposed separately.","title":"Open Inference Protocol (V2 Inference Protocol)"},{"location":"modelserving/data_plane/v2_protocol/#note-on-changes-between-v1-v2","text":"V2 protocol does not currently support the explain endpoint like V1 protocol does. If this is a feature you wish to have in the V2 protocol, please submit a github issue .","title":"Note on changes between V1 & V2"},{"location":"modelserving/data_plane/v2_protocol/#httprest","text":"The HTTP/REST API uses JSON because it is widely supported and language independent. In all JSON schemas shown in this document $number, $string, $boolean, $object and $array refer to the fundamental JSON types. #optional indicates an optional JSON field. 
See also: The HTTP/REST endpoints are defined in rest_predict_v2.yaml. API Verb Path Request Payload Response Payload Inference POST v2/models/ [/versions/]/infer $inference_request $inference_response Model Metadata GET v2/models/[/versions/] $metadata_model_response Server Ready GET v2/health/ready $ready_server_response Server Live GET v2/health/live $live_server_response Server Metadata GET v2 $metadata_server_response Model Ready GET v2/models/[/versions/ ]/ready $ready_model_response ** path contents in [] are optional For more information regarding payload contents, see Payload Contents. The versions portion of the Path URLs (in []) is optional, to accommodate implementations that don\u2019t support versioning and cases where the user does not want to specify a model version (in which case the server will choose a version based on its own policies). For example, if a model does not implement a version, the Model Metadata request path could look like v2/models/my_model. If the model has been configured to implement a version, the request path could look something like v2/models/my_model/versions/v10, where the version of the model is v10.","title":"HTTP/REST"},{"location":"modelserving/data_plane/v2_protocol/#api-definitions","text":"API Definition Inference The /infer endpoint performs inference on a model. The response is the prediction result. Model Metadata The \"model metadata\" API is a per-model endpoint that returns details about the model passed in the path. Server Ready The \u201cserver ready\u201d health API indicates if all the models are ready for inferencing. The \u201cserver ready\u201d health API can be used directly to implement the Kubernetes readinessProbe. Server Live The \u201cserver live\u201d health API indicates if the inference server is able to receive and respond to metadata and inference requests. The \u201cserver live\u201d API can be used directly to implement the Kubernetes livenessProbe. 
Server Metadata The \"server metadata\" API returns details describing the server. Model Ready The \u201cmodel ready\u201d health API indicates if a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL.","title":"API Definitions"},{"location":"modelserving/data_plane/v2_protocol/#healthreadinessliveness-probes","text":"The Model Readiness probe answers the question \"Did the model download and is it able to serve requests?\" and responds with the available model name(s). The Server Readiness/Liveness probes answer the question \"Is my service and its infrastructure running, healthy, and able to receive and process requests?\" To read more about liveness and readiness probe concepts, visit the Configure Liveness, Readiness and Startup Probes Kubernetes documentation.","title":"Health/Readiness/Liveness Probes"},{"location":"modelserving/data_plane/v2_protocol/#payload-contents","text":"","title":"Payload Contents"},{"location":"modelserving/data_plane/v2_protocol/#model-ready","text":"The model ready endpoint returns the readiness probe response for the server along with the name of the model.","title":"Model Ready"},{"location":"modelserving/data_plane/v2_protocol/#model-ready-response-json-object","text":"$ready_model_response = { \"name\" : $string, \"ready\": $bool }","title":"Model Ready Response JSON Object"},{"location":"modelserving/data_plane/v2_protocol/#server-ready","text":"The server ready endpoint returns the readiness probe response for the server.","title":"Server Ready"},{"location":"modelserving/data_plane/v2_protocol/#server-ready-response-json-object","text":"$ready_server_response = { \"live\" : $bool, }","title":"Server Ready Response JSON Object"},{"location":"modelserving/data_plane/v2_protocol/#server-live","text":"The server live endpoint returns the liveness probe response for the server.","title":"Server 
Live"},{"location":"modelserving/data_plane/v2_protocol/#server-live-response-json-objet","text":"$live_server_response = { \"live\" : $bool, }","title":"Server Live Response JSON Object"},{"location":"modelserving/data_plane/v2_protocol/#server-metadata","text":"The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata endpoint. In the corresponding response the HTTP body contains the Server Metadata Response JSON Object or the Server Metadata Response JSON Error Object.","title":"Server Metadata"},{"location":"modelserving/data_plane/v2_protocol/#server-metadata-response-json-object","text":"A successful server metadata request is indicated by a 200 HTTP status code. The server metadata response object, identified as $metadata_server_response, is returned in the HTTP body. $metadata_server_response = { \"name\" : $string, \"version\" : $string, \"extensions\" : [ $string, ... ] } \u201cname\u201d : A descriptive name for the server. \"version\" : The server version. \u201cextensions\u201d : The extensions supported by the server. Currently, no standard extensions are defined. Individual inference servers may define and document their own extensions.","title":"Server Metadata Response JSON Object"},{"location":"modelserving/data_plane/v2_protocol/#server-metadata-response-json-error-object","text":"A failed server metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_server_error_response object. $metadata_server_error_response = { \"error\": $string } \u201cerror\u201d : The descriptive message for the error.","title":"Server Metadata Response JSON Error Object"},
{"location":"modelserving/data_plane/v2_protocol/#model-metadata","text":"The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. In the corresponding response the HTTP body contains the Model Metadata Response JSON Object or the Model Metadata Response JSON Error Object. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.","title":"Model Metadata"},{"location":"modelserving/data_plane/v2_protocol/#model-metadata-response-json-object","text":"A successful model metadata request is indicated by a 200 HTTP status code. The metadata response object, identified as $metadata_model_response, is returned in the HTTP body for every successful model metadata request. $metadata_model_response = { \"name\" : $string, \"versions\" : [ $string, ... ] #optional, \"platform\" : $string, \"inputs\" : [ $metadata_tensor, ... ], \"outputs\" : [ $metadata_tensor, ... ] } \u201cname\u201d : The name of the model. \"versions\" : The model versions that may be explicitly requested via the appropriate endpoint. Optional for servers that don\u2019t support versions. Optional for models that don\u2019t allow a version to be explicitly requested. \u201cplatform\u201d : The framework/backend for the model. See Platforms. \u201cinputs\u201d : The inputs required by the model. \u201coutputs\u201d : The outputs produced by the model. The metadata of each model input and output tensor is described with a $metadata_tensor object. $metadata_tensor = { \"name\" : $string, \"datatype\" : $string, \"shape\" : [ $number, ... 
] } \u201cname\u201d : The name of the tensor. \"datatype\" : The data-type of the tensor elements as defined in Tensor Data Types. \"shape\" : The shape of the tensor. Variable-size dimensions are specified as -1.","title":"Model Metadata Response JSON Object"},{"location":"modelserving/data_plane/v2_protocol/#model-metadata-response-json-error-object","text":"A failed model metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_model_error_response object. $metadata_model_error_response = { \"error\": $string } \u201cerror\u201d : The descriptive message for the error.","title":"Model Metadata Response JSON Error Object"},{"location":"modelserving/data_plane/v2_protocol/#inference","text":"An inference request is made with an HTTP POST to an inference endpoint. In the request the HTTP body contains the Inference Request JSON Object. In the corresponding response the HTTP body contains the Inference Response JSON Object or Inference Response JSON Error Object. See Inference Request Examples for some example HTTP/REST requests and responses.","title":"Inference"},{"location":"modelserving/data_plane/v2_protocol/#inference-request-json-object","text":"The inference request object, identified as $inference_request, is required in the HTTP body of the POST request. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error. $inference_request = { \"id\" : $string #optional, \"parameters\" : $parameters #optional, \"inputs\" : [ $request_input, ... ], \"outputs\" : [ $request_output, ... ] #optional } \"id\" : An identifier for this request. Optional, but if specified this identifier must be returned in the response. \"parameters\" : An object containing zero or more parameters for this inference request expressed as key/value pairs. See Parameters for more information. 
\"inputs\" : The input tensors. Each input is described using the $request_input schema defined in Request Input. \"outputs\" : The output tensors requested for this inference. Each requested output is described using the $request_output schema defined in Request Output. Optional; if not specified, all outputs produced by the model will be returned using default $request_output settings.","title":"Inference Request JSON Object"},{"location":"modelserving/data_plane/v2_protocol/#request-input","text":"The $request_input JSON describes an input to the model. If the input is batched, the shape and data must represent the full shape and contents of the entire batch. $request_input = { \"name\" : $string, \"shape\" : [ $number, ... ], \"datatype\" : $string, \"parameters\" : $parameters #optional, \"data\" : $tensor_data } \"name\" : The name of the input tensor. \"shape\" : The shape of the input tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value. \"datatype\" : The data-type of the input tensor elements as defined in Tensor Data Types. \"parameters\" : An object containing zero or more parameters for this input expressed as key/value pairs. See Parameters for more information. \u201cdata\u201d: The contents of the tensor. See Tensor Data for more information.","title":"Request Input"},{"location":"modelserving/data_plane/v2_protocol/#request-output","text":"The $request_output JSON is used to request which output tensors should be returned from the model. $request_output = { \"name\" : $string, \"parameters\" : $parameters #optional } \"name\" : The name of the output tensor. \"parameters\" : An object containing zero or more parameters for this output expressed as key/value pairs. 
See Parameters for more information.","title":"Request Output"},{"location":"modelserving/data_plane/v2_protocol/#inference-response-json-object","text":"A successful inference request is indicated by a 200 HTTP status code. The inference response object, identified as $inference_response, is returned in the HTTP body. $inference_response = { \"model_name\" : $string, \"model_version\" : $string #optional, \"id\" : $string, \"parameters\" : $parameters #optional, \"outputs\" : [ $response_output, ... ] } \"model_name\" : The name of the model used for inference. \"model_version\" : The specific model version used for inference. Inference servers that do not implement versioning should not provide this field in the response. \"id\" : The \"id\" identifier given in the request, if any. \"parameters\" : An object containing zero or more parameters for this response expressed as key/value pairs. See Parameters for more information. \"outputs\" : The output tensors. Each output is described using the $response_output schema defined in Response Output.","title":"Inference Response JSON Object"},{"location":"modelserving/data_plane/v2_protocol/#response-output","text":"The $response_output JSON describes an output from the model. If the output is batched, the shape and data represent the full shape and contents of the entire batch. $response_output = { \"name\" : $string, \"shape\" : [ $number, ... ], \"datatype\" : $string, \"parameters\" : $parameters #optional, \"data\" : $tensor_data } \"name\" : The name of the output tensor. \"shape\" : The shape of the output tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value. \"datatype\" : The data-type of the output tensor elements as defined in Tensor Data Types. \"parameters\" : An object containing zero or more parameters for this output expressed as key/value pairs. See Parameters for more information. \u201cdata\u201d: The contents of the tensor. 
See Tensor Data for more information.","title":"Response Output"},{"location":"modelserving/data_plane/v2_protocol/#inference-response-json-error-object","text":"A failed inference request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $inference_error_response object. $inference_error_response = { \"error\": $string } \u201cerror\u201d : The descriptive message for the error.","title":"Inference Response JSON Error Object"},{"location":"modelserving/data_plane/v2_protocol/#parameters","text":"The $parameters JSON describes zero or more \u201cname\u201d/\u201cvalue\u201d pairs, where the \u201cname\u201d is the name of the parameter and the \u201cvalue\u201d is a $string, $number, or $boolean. $parameters = { $parameter, ... } $parameter = $string : $string | $number | $boolean Currently no parameters are defined. As required, a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities.","title":"Parameters"},{"location":"modelserving/data_plane/v2_protocol/#tensor-data","text":"Tensor data must be presented in row-major order of the tensor elements. Element values must be given in \"linear\" order without any stride or padding between elements. Tensor elements may be presented in their natural multi-dimensional representation, or as a flattened one-dimensional representation. Tensor data given explicitly is provided in a JSON array. Each element of the array may be an integer, floating-point number, string or boolean value. The server can decide to coerce each element to the required type or return an error if an unexpected value is received. Note that fp16 and bf16 are problematic to communicate explicitly since there is not a standard fp16/bf16 representation across backends nor typically the programmatic support to create the fp16/bf16 representation for a JSON number. 
For example, the 2-dimensional matrix: [ 1 2 4 5 ] Can be represented in its natural format as: \"data\" : [ [ 1, 2 ], [ 4, 5 ] ] Or in a flattened one-dimensional representation: \"data\" : [ 1, 2, 4, 5 ]","title":"Tensor Data"},{"location":"modelserving/data_plane/v2_protocol/#tensor-data-types","text":"Tensor data types are shown in the following table along with the size of each type, in bytes. Data Type Size (bytes) BOOL 1 UINT8 1 UINT16 2 UINT32 4 UINT64 8 INT8 1 INT16 2 INT32 4 INT64 8 FP16 2 FP32 4 FP64 8 BYTES Variable (max 2^32)","title":"Tensor Data Types"},{"location":"modelserving/data_plane/v2_protocol/#inference-request-examples","text":"The following example shows an inference request to a model with two inputs and one output. The HTTP Content-Length header gives the size of the JSON object. POST /v2/models/mymodel/infer HTTP/1.1 Host: localhost:8000 Content-Type: application/json Content-Length: { \"id\" : \"42\", \"inputs\" : [ { \"name\" : \"input0\", \"shape\" : [ 2, 2 ], \"datatype\" : \"UINT32\", \"data\" : [ 1, 2, 3, 4 ] }, { \"name\" : \"input1\", \"shape\" : [ 3 ], \"datatype\" : \"BOOL\", \"data\" : [ true ] } ], \"outputs\" : [ { \"name\" : \"output0\" } ] } For the above request the inference server must return the \u201coutput0\u201d output tensor. Assuming the model returns a [ 3, 2 ] tensor of data type FP32, the following response would be returned. HTTP/1.1 200 OK Content-Type: application/json Content-Length: { \"id\" : \"42\", \"outputs\" : [ { \"name\" : \"output0\", \"shape\" : [ 3, 2 ], \"datatype\" : \"FP32\", \"data\" : [ 1.0, 1.1, 2.0, 2.1, 3.0, 3.1 ] } ] }","title":"Inference Request Examples"},{"location":"modelserving/data_plane/v2_protocol/#grpc","text":"The GRPC API closely follows the concepts defined in the HTTP/REST API. A compliant server must implement the health, metadata, and inference APIs described in this section. 
API rpc Endpoint Request Message Response Message Inference ModelInfer ModelInferRequest ModelInferResponse Model Ready ModelReady [ModelReadyRequest] ModelReadyResponse Model Metadata ModelMetadata ModelMetadataRequest ModelMetadataResponse Server Ready ServerReady ServerReadyRequest ServerReadyResponse Server Live ServerLive ServerLiveRequest ServerLiveResponse For more detailed information on each endpoint and its contents, see API Definitions and Message Contents . See also: The gRPC endpoints, request/response messages and contents are defined in grpc_predict_v2.proto","title":"gRPC"},{"location":"modelserving/data_plane/v2_protocol/#api-definitions_1","text":"The GRPC definition of the service is: // // Inference Server GRPC endpoints. // service GRPCInferenceService { // Check liveness of the inference server. rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse) {} // Check readiness of the inference server. rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse) {} // Check readiness of a model in the inference server. rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse) {} // Get server metadata. rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse) {} // Get model metadata. rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse) {} // Perform inference using a specific model. rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse) {} }","title":"API Definitions"},{"location":"modelserving/data_plane/v2_protocol/#message-contents","text":"","title":"Message Contents"},{"location":"modelserving/data_plane/v2_protocol/#health","text":"A health request is made using the ServerLive, ServerReady, or ModelReady endpoint. For each of these endpoints errors are indicated by the google.rpc.Status returned for the request. 
The OK code indicates success and other codes indicate failure.","title":"Health"},{"location":"modelserving/data_plane/v2_protocol/#server-live_1","text":"The ServerLive API indicates if the inference server is able to receive and respond to metadata and inference requests. The request and response messages for ServerLive are: message ServerLiveRequest {} message ServerLiveResponse { // True if the inference server is live, false if not live. bool live = 1; }","title":"Server Live"},{"location":"modelserving/data_plane/v2_protocol/#server-ready_1","text":"The ServerReady API indicates if the server is ready for inferencing. The request and response messages for ServerReady are: message ServerReadyRequest {} message ServerReadyResponse { // True if the inference server is ready, false if not ready. bool ready = 1; }","title":"Server Ready"},{"location":"modelserving/data_plane/v2_protocol/#model-ready_1","text":"The ModelReady API indicates if a specific model is ready for inferencing. The request and response messages for ModelReady are: message ModelReadyRequest { // The name of the model to check for readiness. string name = 1; // The version of the model to check for readiness. If not given the // server will choose a version based on the model and internal policy. string version = 2; } message ModelReadyResponse { // True if the model is ready, false if not ready. bool ready = 1; }","title":"Model Ready"},{"location":"modelserving/data_plane/v2_protocol/#metadata","text":"","title":"Metadata"},{"location":"modelserving/data_plane/v2_protocol/#server-metadata_1","text":"The ServerMetadata API provides information about the server. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ServerMetadata are: message ServerMetadataRequest {} message ServerMetadataResponse { // The server name. string name = 1; // The server version. 
string version = 2; // The extensions supported by the server. repeated string extensions = 3; }","title":"Server Metadata"},{"location":"modelserving/data_plane/v2_protocol/#model-metadata_1","text":"The per-model metadata API provides information about a model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelMetadata are: message ModelMetadataRequest { // The name of the model. string name = 1; // The version of the model to check for readiness. If not given the // server will choose a version based on the model and internal policy. string version = 2; } message ModelMetadataResponse { // Metadata for a tensor. message TensorMetadata { // The tensor name. string name = 1; // The tensor data type. string datatype = 2; // The tensor shape. A variable-size dimension is represented // by a -1 value. repeated int64 shape = 3; } // The model name. string name = 1; // The versions of the model available on the server. repeated string versions = 2; // The model's platform. See Platforms. string platform = 3; // The model's inputs. repeated TensorMetadata inputs = 4; // The model's outputs. repeated TensorMetadata outputs = 5; }","title":"Model Metadata"},{"location":"modelserving/data_plane/v2_protocol/#platforms","text":"A platform is a string indicating a DL/ML framework or backend. Platform is returned as part of the response to a Model Metadata request but is information only. The proposed inference APIs are generic relative to the DL/ML framework used by a model and so a client does not need to know the platform of a given model to use the API. Platform names use the format \u201c _ \u201d. The following platform names are allowed: tensorrt_plan : A TensorRT model encoded as a serialized engine or \u201cplan\u201d. tensorflow_graphdef : A TensorFlow model encoded as a GraphDef. 
tensorflow_savedmodel : A TensorFlow model encoded as a SavedModel. onnx_onnxv1 : An ONNX model encoded for ONNX Runtime. pytorch_torchscript : A PyTorch model encoded as TorchScript. mxnet_mxnet : An MXNet model. caffe2_netdef : A Caffe2 model encoded as a NetDef.","title":"Platforms"},{"location":"modelserving/data_plane/v2_protocol/#inference_1","text":"The ModelInfer API performs inference using the specified model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelInfer are: message ModelInferRequest { // An input tensor for an inference request. message InferInputTensor { // The tensor name. string name = 1; // The tensor data type. string datatype = 2; // The tensor shape. repeated int64 shape = 3; // Optional inference input tensor parameters. map<string, InferParameter> parameters = 4; // The tensor contents using a data-type format. This field must // not be specified if \"raw\" tensor contents are being used for // the inference request. InferTensorContents contents = 5; } // An output tensor requested for an inference request. message InferRequestedOutputTensor { // The tensor name. string name = 1; // Optional requested output tensor parameters. map<string, InferParameter> parameters = 2; } // The name of the model to use for inferencing. string model_name = 1; // The version of the model to use for inference. If not given the // server will choose a version based on the model and internal policy. string model_version = 2; // Optional identifier for the request. If specified will be // returned in the response. string id = 3; // Optional inference parameters. map<string, InferParameter> parameters = 4; // The input tensors for the inference. repeated InferInputTensor inputs = 5; // The requested output tensors for the inference. Optional, if not // specified all outputs produced by the model will be returned. 
repeated InferRequestedOutputTensor outputs = 6; // The data contained in an input tensor can be represented in \"raw\" // bytes form or in the repeated type that matches the tensor's data // type. To use the raw representation 'raw_input_contents' must be // initialized with data for each tensor in the same order as // 'inputs'. For each tensor, the size of this content must match // what is expected by the tensor's shape and data type. The raw // data must be the flattened, one-dimensional, row-major order of // the tensor elements without any stride or padding between the // elements. Note that the FP16 data type must be represented as raw // content as there is no specific data type for a 16-bit float // type. // // If this field is specified then InferInputTensor::contents must // not be specified for any input tensor. repeated bytes raw_input_contents = 7; } message ModelInferResponse { // An output tensor returned for an inference request. message InferOutputTensor { // The tensor name. string name = 1; // The tensor data type. string datatype = 2; // The tensor shape. repeated int64 shape = 3; // Optional output tensor parameters. map<string, InferParameter> parameters = 4; // The tensor contents using a data-type format. This field must // not be specified if \"raw\" tensor contents are being used for // the inference response. InferTensorContents contents = 5; } // The name of the model used for inference. string model_name = 1; // The version of the model used for inference. string model_version = 2; // The id of the inference request if one was specified. string id = 3; // Optional inference response parameters. map<string, InferParameter> parameters = 4; // The output tensors holding inference results. repeated InferOutputTensor outputs = 5; // The data contained in an output tensor can be represented in // \"raw\" bytes form or in the repeated type that matches the // tensor's data type. 
To use the raw representation 'raw_output_contents' // must be initialized with data for each tensor in the same order as // 'outputs'. For each tensor, the size of this content must match // what is expected by the tensor's shape and data type. The raw // data must be the flattened, one-dimensional, row-major order of // the tensor elements without any stride or padding between the // elements. Note that the FP16 data type must be represented as raw // content as there is no specific data type for a 16-bit float // type. // // If this field is specified then InferOutputTensor::contents must // not be specified for any output tensor. repeated bytes raw_output_contents = 6; }","title":"Inference"},{"location":"modelserving/data_plane/v2_protocol/#parameters_1","text":"The Parameters message describes a \u201cname\u201d/\u201dvalue\u201d pair, where the \u201cname\u201d is the name of the parameter and the \u201cvalue\u201d is a boolean, integer, or string corresponding to the parameter. Currently, no parameters are defined. As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities. // // An inference parameter value. // message InferParameter { // The parameter value can be a string, an int64, a boolean // or a message specific to a predefined parameter. oneof parameter_choice { // A boolean parameter value. bool bool_param = 1; // An int64 parameter value. int64 int64_param = 2; // A string parameter value. string string_param = 3; } }","title":"Parameters"},{"location":"modelserving/data_plane/v2_protocol/#tensor-data_1","text":"In all representations tensor data must be flattened to a one-dimensional, row-major order of the tensor elements. Element values must be given in \"linear\" order without any stride or padding between elements. 
Using a \"raw\" representation of tensors with ModelInferRequest::raw_input_contents and ModelInferResponse::raw_output_contents will typically allow higher performance due to the way protobuf allocation and reuse interacts with GRPC. For example, see https://github.com/grpc/grpc/issues/23231. An alternative to the \"raw\" representation is to use InferTensorContents to represent the tensor data in a format that matches the tensor's data type. // // The data contained in a tensor represented by the repeated type // that matches the tensor's data type. Protobuf oneof is not used // because oneofs cannot contain repeated fields. // message InferTensorContents { // Representation for BOOL data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated bool bool_contents = 1; // Representation for INT8, INT16, and INT32 data types. The size // must match what is expected by the tensor's shape. The contents // must be the flattened, one-dimensional, row-major order of the // tensor elements. repeated int32 int_contents = 2; // Representation for INT64 data types. The size must match what // is expected by the tensor's shape. The contents must be the // flattened, one-dimensional, row-major order of the tensor elements. repeated int64 int64_contents = 3; // Representation for UINT8, UINT16, and UINT32 data types. The size // must match what is expected by the tensor's shape. The contents // must be the flattened, one-dimensional, row-major order of the // tensor elements. repeated uint32 uint_contents = 4; // Representation for UINT64 data types. The size must match what // is expected by the tensor's shape. The contents must be the // flattened, one-dimensional, row-major order of the tensor elements. repeated uint64 uint64_contents = 5; // Representation for FP32 data type. The size must match what is // expected by the tensor's shape. 
The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated float fp32_contents = 6; // Representation for FP64 data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated double fp64_contents = 7; // Representation for BYTES data type. The size must match what is // expected by the tensor's shape. The contents must be the flattened, // one-dimensional, row-major order of the tensor elements. repeated bytes bytes_contents = 8; }","title":"Tensor Data"},{"location":"modelserving/data_plane/v2_protocol/#tensor-data-types_1","text":"Tensor data types are shown in the following table along with the size of each type, in bytes. Data Type Size (bytes) BOOL 1 UINT8 1 UINT16 2 UINT32 4 UINT64 8 INT8 1 INT16 2 INT32 4 INT64 8 FP16 2 FP32 4 FP64 8 BYTES Variable (max 2^32)","title":"Tensor Data Types"},{"location":"modelserving/detect/aif/germancredit/","text":"Bias detection on an InferenceService using AIF360 \u00b6 This is an example of how to get bias metrics using AI Fairness 360 (AIF360) on KServe. AI Fairness 360, an LF AI incubation project, is an extensible open source toolkit that can help users examine, report, and mitigate discrimination and bias in machine learning models throughout the AI application lifecycle. We will be using the German Credit dataset maintained by the UC Irvine Machine Learning Repository. The German Credit dataset records whether or not a creditor granted a loan to an applicant, along with data about the applicant. The data includes relevant data on an applicant's credit history, savings, and employment as well as some demographic data on the applicant such as age, sex, and marital status. 
Data like credit history, savings, and employment can be used by creditors to accurately predict the probability that an applicant will repay their loans; however, data such as age and sex should not be used to decide whether an applicant should be given a loan. We would like to be able to check if these \"protected classes\" are being used in a model's predictions. In this example we will send the model some prediction requests and calculate metrics based on the predictions the model makes. We will use KServe's payload logging capability to collect the metrics. These metrics will give insight as to whether or not the model is biased for or against any protected classes. In this example we will look at the bias our deployed model has on those of age > 25 vs. those of age <= 25 and see if creditors are treating either unfairly. Sample resources for deploying the example can be found here Create the InferenceService \u00b6 Apply the CRD kubectl kubectl apply -f bias.yaml Expected Output $ inferenceservice.serving.kserve.io/german-credit created Deploy the message dumper (sample backend receiver for payload logs) \u00b6 Apply the message-dumper CRD which will collect the logs that are created when running predictions on the inferenceservice. In a production setup, Kafka can be used instead of message-dumper to receive payload logs. kubectl kubectl apply -f message-dumper.yaml Expected Output service.serving.knative.dev/message-dumper created Run a prediction \u00b6 The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT MODEL_NAME = german-credit SERVICE_HOSTNAME = $( kubectl get inferenceservice ${ MODEL_NAME } -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) python simulate_predicts.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict ${ SERVICE_HOSTNAME } Process payload logs for metrics calculation \u00b6 Run json_from_logs.py which will craft a payload that AIF can interpret. 
First, the event logs are taken from the message-dumper and then those logs are parsed to match inputs with outputs. Then the input/output pairs are all combined into a list of inputs and a list of outputs for AIF to interpret. A data.json file should have been created in this folder which contains the json payload. python json_from_logs.py Run an explanation \u00b6 Finally, now that we have collected a number of our model's predictions and their corresponding inputs we will send these to the AIF server to calculate the bias metrics. python query_bias.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :explain ${ SERVICE_HOSTNAME } input.json Interpreting the results \u00b6 Now let's look at one of the metrics. In this example disparate impact represents the ratio between the probability of applicants of the unprivileged class (age <= 25) getting a loan and the probability of applicants of the privileged class (age > 25) getting a loan, P(Y=1|D=unprivileged)/P(Y=1|D=privileged). Since the disparate impact in the sample output below is less than 1, the probability that an applicant whose age is greater than 25 gets a loan is significantly higher than the probability that an applicant whose age is less than or equal to 25 gets a loan. This in and of itself is not proof that the model is biased, but does hint that there may be some bias and a deeper look may be needed. python query_bias.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :explain ${ SERVICE_HOSTNAME } input.json Expected Output Sending bias query... TIME TAKEN: 0.21137404441833496 base_rate : 0.9329608938547486 consistency : [ 0.982122905027933 ] disparate_impact : 0.52 num_instances : 179.0 num_negatives : 12.0 num_positives : 167.0 statistical_parity_difference : -0.48 Dataset \u00b6 The dataset used in this example is the German Credit dataset maintained by the UC Irvine Machine Learning Repository. Dua, D. and Graff, C. (2019). 
UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.","title":"AIF Bias Detector"},{"location":"modelserving/detect/aif/germancredit/#bias-detection-on-an-inferenceservice-using-aif360","text":"This is an example of how to get bias metrics using AI Fairness 360 (AIF360) on KServe. AI Fairness 360, an LF AI incubation project, is an extensible open source toolkit that can help users examine, report, and mitigate discrimination and bias in machine learning models throughout the AI application lifecycle. We will be using the German Credit dataset maintained by the UC Irvine Machine Learning Repository. The German Credit dataset records whether or not a creditor granted a loan to each applicant, along with data about the applicant. It includes relevant data on an applicant's credit history, savings, and employment, as well as demographic data such as age, sex, and marital status. Data like credit history, savings, and employment can be used by creditors to accurately predict the probability that an applicant will repay their loans; however, data such as age and sex should not be used to decide whether an applicant should be given a loan. We would like to be able to check if these \"protected classes\" are being used in a model's predictions. In this example we will send the model some prediction requests and calculate metrics based on the predictions the model makes. We will use KServe's payload logging capability to collect the metrics. These metrics will give insight as to whether or not the model is biased for or against any protected classes. In this example we will look at the bias our deployed model has on those of age > 25 vs. those of age <= 25 and see if creditors are treating either unfairly. 
Sample resources for deploying the example can be found here","title":"Bias detection on an InferenceService using AIF360"},{"location":"modelserving/detect/aif/germancredit/#create-the-inferenceservice","text":"Apply the CRD kubectl kubectl apply -f bias.yaml Expected Output $ inferenceservice.serving.kserve.io/german-credit created","title":"Create the InferenceService"},{"location":"modelserving/detect/aif/germancredit/#deploy-the-message-dumper-sample-backend-receiver-for-payload-logs","text":"Apply the message-dumper CRD which will collect the logs that are created when running predictions on the inferenceservice. In a production setup, Kafka can be used instead of message-dumper to receive payload logs. kubectl kubectl apply -f message-dumper.yaml Expected Output service.serving.knative.dev/message-dumper created","title":"Deploy the message dumper (sample backend receiver for payload logs)"},{"location":"modelserving/detect/aif/germancredit/#run-a-prediction","text":"The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT MODEL_NAME = german-credit SERVICE_HOSTNAME = $( kubectl get inferenceservice ${ MODEL_NAME } -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) python simulate_predicts.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :predict ${ SERVICE_HOSTNAME }","title":"Run a prediction"},{"location":"modelserving/detect/aif/germancredit/#process-payload-logs-for-metrics-calculation","text":"Run json_from_logs.py which will craft a payload that AIF can interpret. First, the event logs are taken from the message-dumper and then those logs are parsed to match inputs with outputs. Then the input/output pairs are all combined into a list of inputs and a list of outputs for AIF to interpret. A data.json file should have been created in this folder which contains the json payload. 
python json_from_logs.py","title":"Process payload logs for metrics calculation"},{"location":"modelserving/detect/aif/germancredit/#run-an-explanation","text":"Finally, now that we have collected a number of our model's predictions and their corresponding inputs we will send these to the AIF server to calculate the bias metrics. python query_bias.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :explain ${ SERVICE_HOSTNAME } input.json","title":"Run an explanation"},{"location":"modelserving/detect/aif/germancredit/#interpreting-the-results","text":"Now let's look at one of the metrics. In this example disparate impact represents the ratio between the probability of applicants of the unprivileged class (age <= 25) getting a loan and the probability of applicants of the privileged class (age > 25) getting a loan, P(Y=1|D=unprivileged)/P(Y=1|D=privileged). Since the disparate impact in the sample output below is less than 1, the probability that an applicant whose age is greater than 25 gets a loan is significantly higher than the probability that an applicant whose age is less than or equal to 25 gets a loan. This in and of itself is not proof that the model is biased, but does hint that there may be some bias and a deeper look may be needed. python query_bias.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ $MODEL_NAME :explain ${ SERVICE_HOSTNAME } input.json Expected Output Sending bias query... TIME TAKEN: 0.21137404441833496 base_rate : 0.9329608938547486 consistency : [ 0.982122905027933 ] disparate_impact : 0.52 num_instances : 179.0 num_negatives : 12.0 num_positives : 167.0 statistical_parity_difference : -0.48","title":"Interpreting the results"},{"location":"modelserving/detect/aif/germancredit/#dataset","text":"The dataset used in this example is the German Credit dataset maintained by the UC Irvine Machine Learning Repository. Dua, D. and Graff, C. (2019). 
UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.","title":"Dataset"},{"location":"modelserving/detect/aif/germancredit/server/","text":"Logistic Regression Model on the German Credit dataset \u00b6 Build a development docker image \u00b6 To build a development image first download these files and move them into the server/ folder - https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data - https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc First build your docker image by changing directory to kserve/python and replacing dockeruser with your docker username in the snippet below (running this will take some time). docker build -t dockeruser/aifserver:latest -f aiffairness.Dockerfile . Then push your docker image to your dockerhub repo (this will take some time) docker push dockeruser/aifserver:latest Once your docker image is pushed you can pull the image from dockeruser/aifserver:latest when deploying an inferenceservice by specifying the image in the yaml file.","title":"Logistic Regression Model on the German Credit dataset"},{"location":"modelserving/detect/aif/germancredit/server/#logistic-regression-model-on-the-german-credit-dataset","text":"","title":"Logistic Regression Model on the German Credit dataset"},{"location":"modelserving/detect/aif/germancredit/server/#build-a-development-docker-image","text":"To build a development image first download these files and move them into the server/ folder - https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data - https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc First build your docker image by changing directory to kserve/python and replacing dockeruser with your docker username in the snippet below (running this will take some time). 
docker build -t dockeruser/aifserver:latest -f aiffairness.Dockerfile . Then push your docker image to your dockerhub repo (this will take some time) docker push dockeruser/aifserver:latest Once your docker image is pushed you can pull the image from dockeruser/aifserver:latest when deploying an inferenceservice by specifying the image in the yaml file.","title":"Build a development docker image"},{"location":"modelserving/detect/alibi_detect/alibi_detect/","text":"Deploy InferenceService with Alibi Outlier/Drift Detector \u00b6 In order to trust and reliably act on model predictions, it is crucial to monitor the distribution of the incoming requests via various types of detectors. KServe integrates Alibi Detect with the following components: Drift detector checks when the distribution of incoming requests is diverging from a reference distribution such as that of the training data. Outlier detector flags single instances which do not follow the training distribution. The architecture used is shown below and links the payload logging available within KServe with asynchronous processing of those payloads in Knative to detect outliers. CIFAR10 Outlier Detector \u00b6 A CIFAR10 Outlier Detector. Run the notebook demo to test. The notebook requires Knative Eventing >= 0.18. CIFAR10 Drift Detector \u00b6 A CIFAR10 Drift Detector. Run the notebook demo to test. The notebook requires Knative Eventing >= 0.18.","title":"Alibi Detector"},{"location":"modelserving/detect/alibi_detect/alibi_detect/#deploy-inferenceservice-with-alibi-outlierdrift-detector","text":"In order to trust and reliably act on model predictions, it is crucial to monitor the distribution of the incoming requests via various types of detectors. KServe integrates Alibi Detect with the following components: Drift detector checks when the distribution of incoming requests is diverging from a reference distribution such as that of the training data. 
Outlier detector flags single instances which do not follow the training distribution. The architecture used is shown below and links the payload logging available within KServe with asynchronous processing of those payloads in Knative to detect outliers.","title":"Deploy InferenceService with Alibi Outlier/Drift Detector"},{"location":"modelserving/detect/alibi_detect/alibi_detect/#cifar10-outlier-detector","text":"A CIFAR10 Outlier Detector. Run the notebook demo to test. The notebook requires Knative Eventing >= 0.18.","title":"CIFAR10 Outlier Detector"},{"location":"modelserving/detect/alibi_detect/alibi_detect/#cifar10-drift-detector","text":"A CIFAR10 Drift Detector. Run the notebook demo to test. The notebook requires Knative Eventing >= 0.18.","title":"CIFAR10 Drift Detector"},{"location":"modelserving/detect/art/mnist/","text":"Using ART to get adversarial examples for MNIST classifications \u00b6 This example shows how adversarially modified inputs can trick models into predicting incorrectly, highlighting model vulnerability to adversarial attacks. It uses the Adversarial Robustness Toolbox (ART) on KServe. ART provides tools that enable developers to evaluate, defend, and verify ML models and applications against adversarial threats. Apart from giving capabilities to craft adversarial attacks, it also provides algorithms to defend against them. We will be using the MNIST dataset, which is a dataset of handwritten digits, and find adversarial examples which can make the model predict a classification incorrectly, thereby showing the vulnerability of the model against adversarial attacks. 
Sample resources for deploying the example can be found here To deploy the inferenceservice with the v1beta1 API kubectl apply -f art.yaml Then find the url kubectl get inferenceservice NAME URL READY DEFAULT TRAFFIC CANARY TRAFFIC AGE artserver http://artserver.somecluster/v1/models/artserver True 100 40m Explanation \u00b6 The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT MODEL_NAME = artserver SERVICE_HOSTNAME = $( kubectl get inferenceservice ${ MODEL_NAME } -o jsonpath = '{.status.url}' | cut -d \"/\" -f 3 ) python query_explain.py http:// ${ INGRESS_HOST } : ${ INGRESS_PORT } /v1/models/ ${ MODEL_NAME } :explain ${ SERVICE_HOSTNAME } After some time you should see a pop-up containing the explanation, similar to the image below. If a pop-up does not display and the message \"Unable to find an adversarial example.\" appears, then an adversarial example could not be found for the given image in a timely manner. If a pop-up does display, then the image on the left is the original image and the image on the right is the adversarial example. The labels above both images represent what classification the model made for each individual image. The Square Attack method used in this example creates a random update at each iteration and adds this update to the adversarial input if it makes a misclassification more likely (more specifically, if it improves the objective function). Once enough random updates are added together and the model misclassifies, the resulting adversarial input is returned and displayed. To try a different MNIST example, add an integer between 0 and 9,999 to the end of the query. The integer chosen will be the index of the image in the MNIST dataset. Alternatively, to try a file with custom data, add the file path to the end. Keep in mind that the data format must be {\"instances\": [,