v13 release blog (#363)
* parent 2257489
author Dan Sun <[email protected]> 1698039744 -0400
committer agriffith50 <[email protected]> 1716219052 -0400


Add TorchServe Huggingface accelerate example (#304)

* Add LLM example for huggingface accelerate

Signed-off-by: Dan Sun <[email protected]>

* Add inputs

Signed-off-by: Dan Sun <[email protected]>

* Update storage uri

Signed-off-by: Dan Sun <[email protected]>

* Add to LLM runtime to index

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>

0.11 release blog (#310)

* Add 0.11 release blog

Signed-off-by: Dan Sun <[email protected]>

* Update blog

Signed-off-by: Dan Sun <[email protected]>

* Add vllm runtime doc

Signed-off-by: Dan Sun <[email protected]>

* Add vllm example doc

Signed-off-by: Dan Sun <[email protected]>

* Update blog link

Signed-off-by: Dan Sun <[email protected]>

* Add vLLM intro

Signed-off-by: Dan Sun <[email protected]>

* add python runtime open inference protocol tutorials

Signed-off-by: Dan Sun <[email protected]>

* Fix warning

Signed-off-by: Dan Sun <[email protected]>

* Add warning

Signed-off-by: Dan Sun <[email protected]>

* Address comments

Signed-off-by: Dan Sun <[email protected]>

* Fix newline

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>

Fix torchserve llm example link

Signed-off-by: Dan Sun <[email protected]>

Fixed formatting in get_started (#319)

Signed-off-by: Helber Belmiro <[email protected]>

clarify prometheus annotation (#316)

Signed-off-by: JuHyung-Son <[email protected]>

Document servingruntime constraint introduced by kserve/kserve#3181 (#320)

* Document serving runtime constraint introduced by kserve/kserve#3181

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Set content type for predict/explainer curl requests

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Update docs/modelserving/servingruntimes.md

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

Add kubeflow summit 2023 Jooho's presentation link (#325)

add kubeflow summit 2023 Jooho's presentation link

Signed-off-by: jooho <[email protected]>

docs: Add one related presentations from Kubeflow Summit 2023 (#327)

* docs: Add two new related presentations from Kubeflow Summit 2023Update presentations.md

Signed-off-by: Yuan Tang <[email protected]>

* Update presentations.md

Signed-off-by: Yuan Tang <[email protected]>

---------

Signed-off-by: Yuan Tang <[email protected]>

Added example for torchserve grpc v1 and v2. (#307)

* Added example for torchserve grpc v1 and v2.

Signed-off-by: Andrews Arokiam <[email protected]>

* Schema order changed.

Signed-off-by: Andrews Arokiam <[email protected]>

* corrected v2 REST input.

Signed-off-by: Andrews Arokiam <[email protected]>

* Updated grpc-v2 protocolVersion.

Signed-off-by: Andrews Arokiam <[email protected]>

* Update README.md

* Update README.md

* Update README.md

---------

Signed-off-by: Andrews Arokiam <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

Add link to release process doc in developer.md (#330)

Signed-off-by: Yuan Tang <[email protected]>

Update tranformer collocation docs for specifying storage uri (#323)

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

Fix incorrect edit URL to docs (#329)

Signed-off-by: Yuan Tang <[email protected]>

Set resources for inferencegraph example (#322)

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

Fixes #331 - broken link to AMD Inference Server (#332)

Tested locally with mkdocs serve

Render KServe Python Runtime API doc with mkdoc (#333)

* Update KServe python sdk docs

Signed-off-by: Dan Sun <[email protected]>

* Update serving runtime doc

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>

Fix build: Install kserve for rendering the docstring (#334)

* Update KServe python sdk docs

Signed-off-by: Dan Sun <[email protected]>

* Install kserve sdk for mkdocstring

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>

Onnx docs update (#275)

* Updated Onnx example.

Signed-off-by: Andrews Arokiam <[email protected]>

* Reverting sklearn doc update as there is a separate PR

Signed-off-by: andyi2it <[email protected]>

* Added new schema in onnx example.

Signed-off-by: Andrews Arokiam <[email protected]>

* protocolVersion and old schema updated with onnx example.

Signed-off-by: Andrews Arokiam <[email protected]>

---------

Signed-off-by: Andrews Arokiam <[email protected]>
Signed-off-by: andyi2it <[email protected]>

Standardized schema order (#318)

* Standardized schema's order.

Signed-off-by: Andrews Arokiam <[email protected]>

* Fix v2 spec for torch serve

---------

Signed-off-by: Andrews Arokiam <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

Update link to Slack instructions

Signed-off-by: Yuan (Terry) Tang <[email protected]>

Update README.md (#344)

Fix incorrect storage uri prefix

Signed-off-by: zoramt <[email protected]>

Added steps to delete model-store-pod (#343)

Signed-off-by: murata.yu <[email protected]>

Update README.md

Signed-off-by: Dan Sun <[email protected]>

Add documentation for modelcars (#337)

* Add documentation for modelcars, introduced in 0.12 as experimental feature

Signed-off-by: Roland Huß <[email protected]>

* added some references to this feature

Signed-off-by: Roland Huß <[email protected]>

---------

Signed-off-by: Roland Huß <[email protected]>

add certificate doc (#326)

* add certificate doc

Signed-off-by: jooho <[email protected]>

* Update mkdocs.yml

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: jooho <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

docs: fix the emoji deprecation message and invalid file name (#348)

Signed-off-by: Peter Jausovec <[email protected]>

Add documentation for GCS (#351)

* Add documentation for GCS

Signed-off-by: tjandy98 <[email protected]>

* Update mkdocs to include GCS

Signed-off-by: tjandy98 <[email protected]>

* Fix formatting

Signed-off-by: tjandy98 <[email protected]>

---------

Signed-off-by: tjandy98 <[email protected]>

Add ModelRegistry custom storage intializer example (#346)

* Add ModelRegistry custom storage intializer example

Signed-off-by: Andrea Lamparelli <[email protected]>

* Update docs/modelserving/storage/storagecontainers.md

Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Andrea Lamparelli <[email protected]>

---------

Signed-off-by: Andrea Lamparelli <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

Updated docs for autoscaling on gpu. (#328)

Signed-off-by: Andrews Arokiam <[email protected]>

Update version matrix for 0.12 (#353)

* Update version matrix for 0.12

Signed-off-by: Dan Sun <[email protected]>

* Update kubernetes_deployment.md

Signed-off-by: Dan Sun <[email protected]>

* Update notes for gRPC issues

Signed-off-by: Dan Sun <[email protected]>

* Update kserve install

Signed-off-by: Dan Sun <[email protected]>

* Update kubernetes_deployment.md

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>

docs: update kserve resource yaml file (#356)

fix docs

Signed-off-by: Niels ten Boom <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Update serving runtime version for 0.12 release and add some notes (#354)

* Fix few bugs, add quick install failure note and update docs for release 0.12.0

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Add warning about control plane namespaces

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Resolve comments

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Add Helm installation commands in get started guide

Signed-off-by: Yuan Tang <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Revert "Add Helm installation commands in get started guide"

This reverts commit bc90c25.

Signed-off-by: agriffith50 <[email protected]>

Add Helm installation commands in get started guide (#358)

Signed-off-by: Yuan Tang <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Update README.md (#359)

Fix broken link to Ray doc on fractional GPU allocation.

Signed-off-by: zoramt <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Add Huggingface Serving Runtime example with Llama2 (#345)

* Add Huggingface Serving Runtime example with Llama2

Signed-off-by: Gavrish Prabhu <[email protected]>

* Fix examples

Signed-off-by: Gavrish Prabhu <[email protected]>

* Fix examples

Signed-off-by: Gavrish Prabhu <[email protected]>

* fix review comments

Signed-off-by: Gavrish Prabhu <[email protected]>

* add linking

Signed-off-by: Gavrish Prabhu <[email protected]>

* fix comments

Signed-off-by: Gavrish Prabhu <[email protected]>

* Update huggingface vllm runtime doc

Signed-off-by: Dan Sun <[email protected]>

* Update mkdocs.yml

Signed-off-by: Dan Sun <[email protected]>

* Update triton doc

Signed-off-by: Dan Sun <[email protected]>

* Fix Hugging Face

Signed-off-by: Dan Sun <[email protected]>

* fix newline

Signed-off-by: Dan Sun <[email protected]>

* fix newline

Signed-off-by: Dan Sun <[email protected]>

* fix newline

Signed-off-by: Dan Sun <[email protected]>

* fix Hugging Face

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Gavrish Prabhu <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Update adopters.md (#361)

Signed-off-by: agriffith50 <[email protected]>

Point users to vLLM production server (#362)

The vLLM teams states that the [`vllm.entrypoints.api_server`](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py#L2-L6) is just to demonstrates usage of their AsyncEngine, for production use they point users to `vllm.entrypoints.openai.api_server` instead.

So, I think this should be the entrypoint used in the kServe documentation too, to avoid confusing new comers.

Signed-off-by: Pierre Dulac <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

initial draft of kserve release blog

Signed-off-by: agriffith50 <[email protected]>

change title

Signed-off-by: agriffith50 <[email protected]>

resolving comments

Signed-off-by: agriffith50 <[email protected]>

Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

update comment

Signed-off-by: agriffith50 <[email protected]>

update for vllm comment

Signed-off-by: agriffith50 <[email protected]>

add more info about completions endpoints

Signed-off-by: agriffith50 <[email protected]>

add hf img

Signed-off-by: agriffith50 <[email protected]>

Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Sample requests update in HuggingFace runtime with vLLM support (#364)

Update Sample requests for HF runtime

Signed-off-by: Gavrish Prabhu <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

add new kserve img

Signed-off-by: agriffith50 <[email protected]>

Update future plan and other changes

Signed-off-by: agriffith50 <[email protected]>


Update huggingface triton yaml

Signed-off-by: Dan Sun <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Update blog link

Signed-off-by: agriffith50 <[email protected]>

Add triton huggingface reference

Signed-off-by: agriffith50 <[email protected]>

resolve merge

Signed-off-by: agriffith50 <[email protected]>

* fix merge
Signed-off-by: agriffith50 <[email protected]>

* fix more merge issue

Signed-off-by: agriffith50 <[email protected]>

* Move up the diagram

Signed-off-by: agriffith50 <[email protected]>

* fix flag naming

Signed-off-by: agriffith50 <[email protected]>

* update slack

Signed-off-by: agriffith50 <[email protected]>

* Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Co-authored-by: Yuan Tang <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

* Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Co-authored-by: Yuan Tang <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

* Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Co-authored-by: Yuan Tang <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

* fix Hugging Face

Signed-off-by: agriffith50 <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>
Co-authored-by: Dan Sun <[email protected]>
Co-authored-by: Yuan Tang <[email protected]>
3 people authored May 24, 2024
1 parent caf869d commit a458a43
Showing 5 changed files with 158 additions and 3 deletions.
123 changes: 123 additions & 0 deletions docs/blog/articles/2024-05-15-Kserve-0.13-release.md
@@ -0,0 +1,123 @@
# From Serverless Predictive Inference to Generative Inference: Introducing KServe v0.13

We are excited to unveil KServe v0.13, marking a significant leap forward in evolving cloud native model serving to meet the demands of Generative AI inference. This release is highlighted by three pivotal updates: enhanced Hugging Face runtime, robust vLLM backend support for Generative Models, and the integration of OpenAI protocol standards.

![kserve-components](../../images/kserve_new.png)

Below is a summary of the key changes.

## Enhanced Hugging Face Runtime Support


KServe v0.13 enriches its Hugging Face runtime and now supports running Hugging Face models out-of-the-box. KServe v0.13 implements a [KServe Hugging Face Serving Runtime](https://github.com/kserve/kserve/tree/master/python/huggingfaceserver), `kserve-huggingfaceserver`. With this implementation, KServe can now automatically infer a [task](https://huggingface.co/tasks) from the model architecture and select the optimized serving runtime. Currently supported tasks include sequence classification, token classification, fill mask, text generation, and text-to-text generation.

![kserve-huggingface](../../images/kserve-huggingface.png)

Here is an example of serving a BERT model by deploying an `InferenceService` with the Hugging Face runtime for a classification task.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-bert
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=bert
        - --model_id=bert-base-uncased
        - --tensor_input_names=input_ids
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: 100m
          memory: 2Gi
          nvidia.com/gpu: "1"
```
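Once the `InferenceService` is ready, the classifier can be queried over the KServe v1 inference protocol. The sketch below only assembles the request URL and JSON body that such a call would send; the hostname and input text are illustrative assumptions, not values from this release.

```python
import json


def build_v1_predict_request(host: str, model_name: str, instances: list) -> tuple[str, str]:
    """Assemble the URL and JSON body for a KServe v1 `:predict` call."""
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances})
    return url, body


# Hypothetical values for the BERT classifier deployed above.
url, body = build_v1_predict_request(
    "huggingface-bert.default.example.com",
    "bert",
    ["Hello, my dog is cute."],
)
print(url)   # http://huggingface-bert.default.example.com/v1/models/bert:predict
print(body)  # {"instances": ["Hello, my dog is cute."]}
```

The same body can be sent with `curl -d` against the service, as shown throughout the KServe docs.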
You can also deploy BERT on a more optimized inference runtime like Triton using the Hugging Face runtime for pre/post processing, see more details [here](https://kserve.github.io/website/master/modelserving/v1beta1/triton/huggingface/).

### vLLM support

Version 0.13 introduces dedicated runtime support for [vLLM](https://docs.vllm.ai/en/latest/) for enhanced transformer model serving. This support now includes auto-mapping vLLM as the backend for supported tasks, streamlining the deployment process and optimizing performance. If vLLM does not support a particular task, the runtime defaults to the Hugging Face backend. See the example below.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama2
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama2
        - --model_id=meta-llama/Llama-2-7b-chat-hf
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
```
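With the Llama2 service up, requests go to the OpenAI-compatible completions route. As a minimal sketch, the snippet below only builds the JSON payload such a request carries; the prompt and `max_tokens` value are illustrative, mirroring the curl examples in the Hugging Face runtime docs.

```python
import json

# Route exposed by the KServe Hugging Face runtime for completions.
OPENAI_COMPLETIONS_PATH = "/openai/v1/completions"


def completions_payload(model: str, prompt: str, max_tokens: int = 30, stream: bool = False) -> str:
    """JSON body for the OpenAI-compatible completions endpoint."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": stream,
        "max_tokens": max_tokens,
    })


body = completions_payload("llama2", "Where is Eiffel Tower?")
print(body)  # {"model": "llama2", "prompt": "Where is Eiffel Tower?", "stream": false, "max_tokens": 30}
```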
See more details in our updated docs on how to [Deploy the Llama2 model with Hugging Face LLM Serving Runtime](https://kserve.github.io/website/master/modelserving/v1beta1/llm/huggingface/).

Additionally, if the Hugging Face backend is preferred over vLLM, vLLM auto-mapping can be disabled with the `--backend=huggingface` arg.


### OpenAI Schema Integration

Embracing the OpenAI protocol, KServe v0.13 now supports three specific endpoints for generative transformer models:

* `/openai/v1/completions`
* `/openai/v1/chat/completions`
* `/openai/v1/models`

These endpoints are useful for generative transformer models, which take in messages and return a model-generated message output. The [chat completions endpoint](https://platform.openai.com/docs/guides/text-generation/chat-completions-api) is designed to easily handle multi-turn conversations, while still being useful for single-turn tasks. The [completions endpoint](https://platform.openai.com/docs/guides/text-generation/completions-api) is now a legacy endpoint; it differs from the chat completions endpoint in that its interface is a freeform text string called a `prompt`. Read more about the [chat completions](https://platform.openai.com/docs/api-reference/chat) and [completions](https://platform.openai.com/docs/api-reference/completions) endpoints in the OpenAI API docs.

This update fosters a standardized approach to transformer model serving, ensuring compatibility with a broader spectrum of models and tools, and enhances the platform's versatility. The API can be directly used with OpenAI's client libraries or third-party tools, like LangChain or LlamaIndex.
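Because the routes speak the OpenAI protocol, responses follow the standard OpenAI object shapes, so existing parsing code carries over. The sketch below pulls the assistant message out of a `chat.completion` body; the sample response is illustrative, modeled on the documented output of the Hugging Face runtime, not captured from a live service.

```python
import json


def extract_chat_reply(raw: str) -> tuple[str, int]:
    """Extract the assistant message and total token usage from a chat.completion JSON body."""
    resp = json.loads(raw)
    message = resp["choices"][0]["message"]["content"]
    total_tokens = resp["usage"]["total_tokens"]
    return message, total_tokens


# Illustrative response mirroring the sample output in the runtime docs.
sample = json.dumps({
    "id": "cmpl-123",
    "object": "chat.completion",
    "model": "llama2",
    "choices": [{"index": 0, "finish_reason": "length",
                 "message": {"role": "assistant", "content": "Paris"}}],
    "usage": {"completion_tokens": 30, "prompt_tokens": 3, "total_tokens": 33},
})
reply, tokens = extract_chat_reply(sample)
print(reply, tokens)  # Paris 33
```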

### Future Plan

* Support other tasks like text embeddings [#3572](https://github.com/kserve/kserve/issues/3572).
* Support more LLM backend options, such as TensorRT-LLM.
* Enrich text generation metrics for throughput (tokens/sec) and TTFT (time to first token) [#3461](https://github.com/kserve/kserve/issues/3461).
* KEDA integration for token-based LLM autoscaling [#3561](https://github.com/kserve/kserve/issues/3561).


## Other Changes

This release also includes several enhancements and changes:

### What's New?
* Async streaming support for v1 endpoints [#3402](https://github.com/kserve/kserve/issues/3402).
* Support for `.json` and `.ubj` model formats in the XGBoost server image [#3546](https://github.com/kserve/kserve/issues/3546).
* Enhanced flexibility in KServe by allowing the configuration of multiple domains for an inference service [#2747](https://github.com/kserve/kserve/issues/2747).
* Enhanced the manager setup to dynamically adapt based on available CRDs, improving operational flexibility and reliability across different deployment environments [#3470](https://github.com/kserve/kserve/issues/3470).

### What's Changed?
* Removed Seldon Alibi dependency [#3380](https://github.com/kserve/kserve/issues/3380).
* Removal of conversion webhook from manifests. [#3344](https://github.com/kserve/kserve/issues/3344).

For complete details on the new features and updates, visit our [official release notes](https://github.com/kserve/kserve/releases/tag/v0.13.0-rc0).


## Join the community

- Visit our [Website](https://kserve.github.io/website/) or [GitHub](https://github.com/kserve)
- Join the Slack ([#kserve](https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues))
- Attend our community meeting by subscribing to the [KServe calendar](https://wiki.lfaidata.foundation/display/kserve/calendars).
- View our [community GitHub repository](https://github.com/kserve/community) to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!

Thanks to all the contributors who have made commits to the 0.13 release!

The KServe Project
Binary file added docs/images/kserve-huggingface.png
Binary file added docs/images/kserve_new.png
36 changes: 34 additions & 2 deletions docs/modelserving/v1beta1/llm/huggingface/README.md
@@ -4,8 +4,11 @@ The Hugging Face LLM serving runtime implements a runtime that can serve Hugging
In this example, we deploy a Llama2 model from Hugging Face by running an `InferenceService` with [Hugging Face Serving runtime](https://github.com/kserve/kserve/tree/master/python/huggingfaceserver). Based on the performance requirement for large language models, KServe chooses to perform the inference using a more optimized inference engine like [vLLM](https://github.com/vllm-project/vllm) for text generation models.

### Serve the Hugging Face LLM model using vLLM
-KServe Hugging Face runtime by default uses vLLM to serve the LLM models for faster inference, higher throughput than Hugging Face API, implemented with paged attention, continous batching, optmized CUDA kernel.
-You can still use `--backend=huggingface` in the container args to fall back to perform the inference using Hugging Face API.
+KServe Hugging Face runtime by default uses vLLM to serve the LLM models for faster inference and higher throughput than the Hugging Face API, implemented with paged attention, continuous batching and an optimized CUDA kernel.
+
+You can still use the `--backend=huggingface` arg to fall back to perform the inference using the Hugging Face API.


=== "Yaml"

@@ -62,6 +65,35 @@ Sample OpenAI Completions request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions -d '{"model": "${MODEL_NAME}", "prompt": "<prompt>", "stream":false, "max_tokens": 30 }'
```

!!! success "Expected Output"

```{ .bash .no-copy }
{"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"<generated_text>"}],"created":1715353182,"model":"llama2","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
```

Sample OpenAI Chat request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions -d '{"model": "${MODEL_NAME}", "messages": [{"role": "user","content": "<message>"}], "stream":false }'
```

!!! success "Expected Output"

```{ .bash .no-copy }
{"id":"cmpl-87ee252062934e2f8f918dce011e8484","choices":[{"finish_reason":"length","index":0,"message":{"content":"<generated_response>","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1715353461,"model":"llama2","system_fingerprint":null,"object":"chat.completion","usage":{"completion_tokens":30,"prompt_tokens":3,"total_tokens":33}}
```

Sample KServe v1 inference request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d '{"instances": ["Where is Eiffel Tower?"] }'

```
!!! success "Expected Output"
2 changes: 1 addition & 1 deletion overrides/main.html
@@ -2,6 +2,6 @@

{% block announce %}
<h1>
-  <b>KServe v0.11 is Released</b>, <a href="/website/0.11/blog/articles/2023-10-08-KServe-0.11-release/">Read blog &gt;&gt;</a>
+  <b>KServe v0.13 is Released</b>, <a href="/website/0.13/blog/articles/2024-05-15-Kserve-0.13-release/">Read blog &gt;&gt;</a>
</h1>
{% endblock %}
