v13 release blog (#363)
* parent 2257489
author Dan Sun <[email protected]> 1698039744 -0400
committer agriffith50 <[email protected]> 1716219052 -0400


Add TorchServe Huggingface accelerate example (#304)

* Add LLM example for huggingface accelerate

Signed-off-by: Dan Sun <[email protected]>

* Add inputs

Signed-off-by: Dan Sun <[email protected]>

* Update storage uri

Signed-off-by: Dan Sun <[email protected]>

* Add to LLM runtime to index

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>

0.11 release blog (#310)

* Add 0.11 release blog

Signed-off-by: Dan Sun <[email protected]>

* Update blog

Signed-off-by: Dan Sun <[email protected]>

* Add vllm runtime doc

Signed-off-by: Dan Sun <[email protected]>

* Add vllm example doc

Signed-off-by: Dan Sun <[email protected]>

* Update blog link

Signed-off-by: Dan Sun <[email protected]>

* Add vLLM intro

Signed-off-by: Dan Sun <[email protected]>

* add python runtime open inference protocol tutorials

Signed-off-by: Dan Sun <[email protected]>

* Fix warning

Signed-off-by: Dan Sun <[email protected]>

* Add warning

Signed-off-by: Dan Sun <[email protected]>

* Address comments

Signed-off-by: Dan Sun <[email protected]>

* Fix newline

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>

Fix torchserve llm example link

Signed-off-by: Dan Sun <[email protected]>

Fixed formatting in get_started (#319)

Signed-off-by: Helber Belmiro <[email protected]>

clarify prometheus annotation (#316)

Signed-off-by: JuHyung-Son <[email protected]>

Document servingruntime constraint introduced by kserve/kserve#3181 (#320)

* Document serving runtime constraint introduced by kserve/kserve#3181

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Set content type for predict/explainer curl requests

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Update docs/modelserving/servingruntimes.md

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

Add kubeflow summit 2023 Jooho's presentation link (#325)

add kubeflow summit 2023 Jooho's presentation link

Signed-off-by: jooho <[email protected]>

docs: Add one related presentations from Kubeflow Summit 2023 (#327)

* docs: Add two new related presentations from Kubeflow Summit 2023Update presentations.md

Signed-off-by: Yuan Tang <[email protected]>

* Update presentations.md

Signed-off-by: Yuan Tang <[email protected]>

---------

Signed-off-by: Yuan Tang <[email protected]>

Added example for torchserve grpc v1 and v2. (#307)

* Added example for torchserve grpc v1 and v2.

Signed-off-by: Andrews Arokiam <[email protected]>

* Schema order changed.

Signed-off-by: Andrews Arokiam <[email protected]>

* corrected v2 REST input.

Signed-off-by: Andrews Arokiam <[email protected]>

* Updated grpc-v2 protocolVersion.

Signed-off-by: Andrews Arokiam <[email protected]>

* Update README.md

* Update README.md

* Update README.md

---------

Signed-off-by: Andrews Arokiam <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

Add link to release process doc in developer.md (#330)

Signed-off-by: Yuan Tang <[email protected]>

Update tranformer collocation docs for specifying storage uri (#323)

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

Fix incorrect edit URL to docs (#329)

Signed-off-by: Yuan Tang <[email protected]>

Set resources for inferencegraph example (#322)

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

Fixes #331 - broken link to AMD Inference Server (#332)

Tested locally with mkdocs serve

Render KServe Python Runtime API doc with mkdoc (#333)

* Update KServe python sdk docs

Signed-off-by: Dan Sun <[email protected]>

* Update serving runtime doc

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>

Fix build: Install kserve for rendering the docstring (#334)

* Update KServe python sdk docs

Signed-off-by: Dan Sun <[email protected]>

* Install kserve sdk for mkdocstring

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>

Onnx docs update (#275)

* Updated Onnx example.

Signed-off-by: Andrews Arokiam <[email protected]>

* Reverting sklearn doc update as there is a separate PR

Signed-off-by: andyi2it <[email protected]>

* Added new schema in onnx example.

Signed-off-by: Andrews Arokiam <[email protected]>

* protocolVersion and old schema updated with onnx example.

Signed-off-by: Andrews Arokiam <[email protected]>

---------

Signed-off-by: Andrews Arokiam <[email protected]>
Signed-off-by: andyi2it <[email protected]>

Standardized schema order (#318)

* Standardized schema's order.

Signed-off-by: Andrews Arokiam <[email protected]>

* Fix v2 spec for torch serve

---------

Signed-off-by: Andrews Arokiam <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

Update link to Slack instructions

Signed-off-by: Yuan (Terry) Tang <[email protected]>

Update README.md (#344)

Fix incorrect storage uri prefix

Signed-off-by: zoramt <[email protected]>

Added steps to delete model-store-pod (#343)

Signed-off-by: murata.yu <[email protected]>

Update README.md

Signed-off-by: Dan Sun <[email protected]>

Add documentation for modelcars (#337)

* Add documentation for modelcars, introduced in 0.12 as experimental feature

Signed-off-by: Roland Huß <[email protected]>

* added some references to this feature

Signed-off-by: Roland Huß <[email protected]>

---------

Signed-off-by: Roland Huß <[email protected]>

add certificate doc (#326)

* add certificate doc

Signed-off-by: jooho <[email protected]>

* Update mkdocs.yml

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: jooho <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

docs: fix the emoji deprecation message and invalid file name (#348)

Signed-off-by: Peter Jausovec <[email protected]>

Add documentation for GCS (#351)

* Add documentation for GCS

Signed-off-by: tjandy98 <[email protected]>

* Update mkdocs to include GCS

Signed-off-by: tjandy98 <[email protected]>

* Fix formatting

Signed-off-by: tjandy98 <[email protected]>

---------

Signed-off-by: tjandy98 <[email protected]>

Add ModelRegistry custom storage intializer example (#346)

* Add ModelRegistry custom storage intializer example

Signed-off-by: Andrea Lamparelli <[email protected]>

* Update docs/modelserving/storage/storagecontainers.md

Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Andrea Lamparelli <[email protected]>

---------

Signed-off-by: Andrea Lamparelli <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

Updated docs for autoscaling on gpu. (#328)

Signed-off-by: Andrews Arokiam <[email protected]>

Update version matrix for 0.12 (#353)

* Update version matrix for 0.12

Signed-off-by: Dan Sun <[email protected]>

* Update kubernetes_deployment.md

Signed-off-by: Dan Sun <[email protected]>

* Update notes for gRPC issues

Signed-off-by: Dan Sun <[email protected]>

* Update kserve install

Signed-off-by: Dan Sun <[email protected]>

* Update kubernetes_deployment.md

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>

docs: update kserve resource yaml file (#356)

fix docs

Signed-off-by: Niels ten Boom <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Update serving runtime version for 0.12 release and add some notes (#354)

* Fix few bugs, add quick install failure note and update docs for release 0.12.0

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Add warning about control plane namespaces

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Resolve comments

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Add Helm installation commands in get started guide

Signed-off-by: Yuan Tang <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Revert "Add Helm installation commands in get started guide"

This reverts commit bc90c25.

Signed-off-by: agriffith50 <[email protected]>

Add Helm installation commands in get started guide (#358)

Signed-off-by: Yuan Tang <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Update README.md (#359)

Fix broken link to Ray doc on fractional GPU allocation.

Signed-off-by: zoramt <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Add Huggingface Serving Runtime example with Llama2 (#345)

* Add Huggingface Serving Runtime example with Llama2

Signed-off-by: Gavrish Prabhu <[email protected]>

* Fix examples

Signed-off-by: Gavrish Prabhu <[email protected]>

* Fix examples

Signed-off-by: Gavrish Prabhu <[email protected]>

* fix review comments

Signed-off-by: Gavrish Prabhu <[email protected]>

* add linking

Signed-off-by: Gavrish Prabhu <[email protected]>

* fix comments

Signed-off-by: Gavrish Prabhu <[email protected]>

* Update huggingface vllm runtime doc

Signed-off-by: Dan Sun <[email protected]>

* Update mkdocs.yml

Signed-off-by: Dan Sun <[email protected]>

* Update triton doc

Signed-off-by: Dan Sun <[email protected]>

* Fix Hugging Face

Signed-off-by: Dan Sun <[email protected]>

* fix newline

Signed-off-by: Dan Sun <[email protected]>

* fix newline

Signed-off-by: Dan Sun <[email protected]>

* fix newline

Signed-off-by: Dan Sun <[email protected]>

* fix Hugging Face

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Gavrish Prabhu <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Update adopters.md (#361)

Signed-off-by: agriffith50 <[email protected]>

Point users to vLLM production server (#362)

The vLLM teams states that the [`vllm.entrypoints.api_server`](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py#L2-L6) is just to demonstrates usage of their AsyncEngine, for production use they point users to `vllm.entrypoints.openai.api_server` instead.

So, I think this should be the entrypoint used in the kServe documentation too, to avoid confusing new comers.

Signed-off-by: Pierre Dulac <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

initial draft of kserve release blog

Signed-off-by: agriffith50 <[email protected]>

change title

Signed-off-by: agriffith50 <[email protected]>

resolving comments

Signed-off-by: agriffith50 <[email protected]>

Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

update comment

Signed-off-by: agriffith50 <[email protected]>

update for vllm comment

Signed-off-by: agriffith50 <[email protected]>

add more info about completions endpoints

Signed-off-by: agriffith50 <[email protected]>

add hf img

Signed-off-by: agriffith50 <[email protected]>

Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Sample requests update in HuggingFace runtime with vLLM support (#364)

Update Sample requests for HF runtime

Signed-off-by: Gavrish Prabhu <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

add new kserve img

Signed-off-by: agriffith50 <[email protected]>

Update future plan and other changes

Signed-off-by: agriffith50 <[email protected]>


Update huggingface triton yaml

Signed-off-by: Dan Sun <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

Update blog link

Signed-off-by: agriffith50 <[email protected]>

Add triton huggingface reference

Signed-off-by: agriffith50 <[email protected]>

resolve merge

Signed-off-by: agriffith50 <[email protected]>

* fix merge
Signed-off-by: agriffith50 <[email protected]>

* fix more merge issue

Signed-off-by: agriffith50 <[email protected]>

* Move up the diagram

Signed-off-by: agriffith50 <[email protected]>

* fix flag naming

Signed-off-by: agriffith50 <[email protected]>

* update slack

Signed-off-by: agriffith50 <[email protected]>

* Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Co-authored-by: Yuan Tang <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

* Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Co-authored-by: Yuan Tang <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

* Update docs/blog/articles/2024-05-15-Kserve-0.13-release.md

Co-authored-by: Yuan Tang <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>

* fix Hugging Face

Signed-off-by: agriffith50 <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>
Signed-off-by: Alexa Griffith  <[email protected]>
Signed-off-by: agriffith50 <[email protected]>
Co-authored-by: Dan Sun <[email protected]>
Co-authored-by: Yuan Tang <[email protected]>
3 people authored May 24, 2024
1 parent caf869d commit a458a43
Showing 5 changed files with 158 additions and 3 deletions.
123 changes: 123 additions & 0 deletions docs/blog/articles/2024-05-15-Kserve-0.13-release.md
@@ -0,0 +1,123 @@
# From Serverless Predictive Inference to Generative Inference: Introducing KServe v0.13

We are excited to unveil KServe v0.13, marking a significant leap forward in evolving cloud native model serving to meet the demands of Generative AI inference. This release is highlighted by three pivotal updates: enhanced Hugging Face runtime, robust vLLM backend support for Generative Models, and the integration of OpenAI protocol standards.

![kserve-components](../../images/kserve_new.png)

Below is a summary of the key changes.

## Enhanced Hugging Face Runtime Support


KServe v0.13 enriches its Hugging Face runtime and now supports running Hugging Face models out-of-the-box. KServe v0.13 implements a [KServe Hugging Face Serving Runtime](https://github.com/kserve/kserve/tree/master/python/huggingfaceserver), `kserve-huggingfaceserver`. With this implementation, KServe can now automatically infer a [task](https://huggingface.co/tasks) from the model architecture and select the optimized serving runtime. Currently supported tasks include sequence classification, token classification, fill mask, text generation, and text-to-text generation.

![kserve-huggingface](../../images/kserve-huggingface.png)

Here is an example of serving a BERT model by deploying an `InferenceService` with the Hugging Face runtime for a classification task.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-bert
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=bert
        - --model_id=bert-base-uncased
        - --tensor_input_names=input_ids
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: 100m
          memory: 2Gi
          nvidia.com/gpu: "1"
```
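Once the `InferenceService` is ready, the classifier can be queried over the KServe v1 inference protocol. The sketch below only assembles the request URL and JSON body that such a call would send; the hostname and input text are illustrative assumptions, not values from this release.

```python
import json


def build_v1_predict_request(host: str, model_name: str, instances: list) -> tuple[str, str]:
    """Assemble the URL and JSON body for a KServe v1 `:predict` call."""
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances})
    return url, body


# Hypothetical values for the BERT classifier deployed above.
url, body = build_v1_predict_request(
    "huggingface-bert.default.example.com",
    "bert",
    ["Hello, my dog is cute."],
)
print(url)   # http://huggingface-bert.default.example.com/v1/models/bert:predict
print(body)  # {"instances": ["Hello, my dog is cute."]}
```

The same body can be sent with `curl -d` against the service, as shown throughout the KServe docs.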
You can also deploy BERT on a more optimized inference runtime like Triton using the Hugging Face runtime for pre/post processing, see more details [here](https://kserve.github.io/website/master/modelserving/v1beta1/triton/huggingface/).

### vLLM support

Version 0.13 introduces dedicated runtime support for [vLLM](https://docs.vllm.ai/en/latest/) for enhanced transformer model serving. This support now includes auto-mapping vLLM as the backend for supported tasks, streamlining the deployment process and optimizing performance. If vLLM does not support a particular task, the runtime defaults to the Hugging Face backend. See the example below.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama2
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama2
        - --model_id=meta-llama/Llama-2-7b-chat-hf
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
```
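With the Llama2 service up, requests go to the OpenAI-compatible completions route. As a minimal sketch, the snippet below only builds the JSON payload such a request carries; the prompt and `max_tokens` value are illustrative, mirroring the curl examples in the Hugging Face runtime docs.

```python
import json

# Route exposed by the KServe Hugging Face runtime for completions.
OPENAI_COMPLETIONS_PATH = "/openai/v1/completions"


def completions_payload(model: str, prompt: str, max_tokens: int = 30, stream: bool = False) -> str:
    """JSON body for the OpenAI-compatible completions endpoint."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": stream,
        "max_tokens": max_tokens,
    })


body = completions_payload("llama2", "Where is Eiffel Tower?")
print(body)  # {"model": "llama2", "prompt": "Where is Eiffel Tower?", "stream": false, "max_tokens": 30}
```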
See more details in our updated docs on how to [Deploy the Llama2 model with Hugging Face LLM Serving Runtime](https://kserve.github.io/website/master/modelserving/v1beta1/llm/huggingface/).

Additionally, if the Hugging Face backend is preferred over vLLM, vLLM auto-mapping can be disabled with the `--backend=huggingface` arg.


### OpenAI Schema Integration

Embracing the OpenAI protocol, KServe v0.13 now supports three specific endpoints for generative transformer models:

* `/openai/v1/completions`
* `/openai/v1/chat/completions`
* `/openai/v1/models`

These endpoints are useful for generative transformer models, which take in messages and return a model-generated message output. The [chat completions endpoint](https://platform.openai.com/docs/guides/text-generation/chat-completions-api) is designed to easily handle multi-turn conversations, while still being useful for single-turn tasks. The [completions endpoint](https://platform.openai.com/docs/guides/text-generation/completions-api) is now a legacy endpoint; it differs from the chat completions endpoint in that its interface is a freeform text string called a `prompt`. Read more about the [chat completions](https://platform.openai.com/docs/api-reference/chat) and [completions](https://platform.openai.com/docs/api-reference/completions) endpoints in the OpenAI API docs.

This update fosters a standardized approach to transformer model serving, ensuring compatibility with a broader spectrum of models and tools, and enhances the platform's versatility. The API can be directly used with OpenAI's client libraries or third-party tools, like LangChain or LlamaIndex.
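Because the routes speak the OpenAI protocol, responses follow the standard OpenAI object shapes, so existing parsing code carries over. The sketch below pulls the assistant message out of a `chat.completion` body; the sample response is illustrative, modeled on the documented output of the Hugging Face runtime, not captured from a live service.

```python
import json


def extract_chat_reply(raw: str) -> tuple[str, int]:
    """Extract the assistant message and total token usage from a chat.completion JSON body."""
    resp = json.loads(raw)
    message = resp["choices"][0]["message"]["content"]
    total_tokens = resp["usage"]["total_tokens"]
    return message, total_tokens


# Illustrative response mirroring the sample output in the runtime docs.
sample = json.dumps({
    "id": "cmpl-123",
    "object": "chat.completion",
    "model": "llama2",
    "choices": [{"index": 0, "finish_reason": "length",
                 "message": {"role": "assistant", "content": "Paris"}}],
    "usage": {"completion_tokens": 30, "prompt_tokens": 3, "total_tokens": 33},
})
reply, tokens = extract_chat_reply(sample)
print(reply, tokens)  # Paris 33
```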

### Future Plan

* Support other tasks like text embeddings [#3572](https://github.com/kserve/kserve/issues/3572).
* Support more LLM backend options, such as TensorRT-LLM.
* Enrich text generation metrics for throughput (tokens/sec) and TTFT (time to first token) [#3461](https://github.com/kserve/kserve/issues/3461).
* KEDA integration for token-based LLM autoscaling [#3561](https://github.com/kserve/kserve/issues/3561).


## Other Changes

This release also includes several enhancements and changes:

### What's New?
* Async streaming support for v1 endpoints [#3402](https://github.com/kserve/kserve/issues/3402).
* Support for `.json` and `.ubj` model formats in the XGBoost server image [#3546](https://github.com/kserve/kserve/issues/3546).
* Enhanced flexibility in KServe by allowing the configuration of multiple domains for an inference service [#2747](https://github.com/kserve/kserve/issues/2747).
* Enhanced the manager setup to dynamically adapt based on available CRDs, improving operational flexibility and reliability across different deployment environments [#3470](https://github.com/kserve/kserve/issues/3470).

### What's Changed?
* Removed Seldon Alibi dependency [#3380](https://github.com/kserve/kserve/issues/3380).
* Removal of conversion webhook from manifests. [#3344](https://github.com/kserve/kserve/issues/3344).

For complete details on the new features and updates, visit our [official release notes](https://github.com/kserve/kserve/releases/tag/v0.13.0-rc0).


## Join the community

- Visit our [Website](https://kserve.github.io/website/) or [GitHub](https://github.com/kserve)
- Join the Slack ([#kserve](https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues))
- Attend our community meeting by subscribing to the [KServe calendar](https://wiki.lfaidata.foundation/display/kserve/calendars).
- View our [community GitHub repository](https://github.com/kserve/community) to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!

Thanks to all the contributors who have made commits to the 0.13 release!

The KServe Project
Binary file added docs/images/kserve-huggingface.png
Binary file added docs/images/kserve_new.png
36 changes: 34 additions & 2 deletions docs/modelserving/v1beta1/llm/huggingface/README.md
@@ -4,8 +4,11 @@ The Hugging Face LLM serving runtime implements a runtime that can serve Hugging
In this example, we deploy a Llama2 model from Hugging Face by running an `InferenceService` with [Hugging Face Serving runtime](https://github.com/kserve/kserve/tree/master/python/huggingfaceserver). Based on the performance requirement for large language models, KServe chooses to perform the inference using a more optimized inference engine like [vLLM](https://github.com/vllm-project/vllm) for text generation models.

### Serve the Hugging Face LLM model using vLLM
-KServe Hugging Face runtime by default uses vLLM to serve the LLM models for faster inference, higher throughput than Hugging Face API, implemented with paged attention, continous batching, optmized CUDA kernel.
-You can still use `--backend=huggingface` in the container args to fall back to perform the inference using Hugging Face API.
+KServe Hugging Face runtime by default uses vLLM to serve the LLM models for faster inference and higher throughput than the Hugging Face API, implemented with paged attention, continuous batching and an optimized CUDA kernel.
+
+You can still use the `--backend=huggingface` arg to fall back to perform the inference using the Hugging Face API.


=== "Yaml"

@@ -62,6 +65,35 @@ Sample OpenAI Completions request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions -d '{"model": "${MODEL_NAME}", "prompt": "<prompt>", "stream":false, "max_tokens": 30 }'
```

!!! success "Expected Output"

```{ .bash .no-copy }
{"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"<generated_text>"}],"created":1715353182,"model":"llama2","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
```

Sample OpenAI Chat request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions -d '{"model": "${MODEL_NAME}", "messages": [{"role": "user","content": "<message>"}], "stream":false }'
```

!!! success "Expected Output"

```{ .bash .no-copy }
{"id":"cmpl-87ee252062934e2f8f918dce011e8484","choices":[{"finish_reason":"length","index":0,"message":{"content":"<generated_response>","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1715353461,"model":"llama2","system_fingerprint":null,"object":"chat.completion","usage":{"completion_tokens":30,"prompt_tokens":3,"total_tokens":33}}
```

Sample KServe v1 inference request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d '{"instances": ["Where is Eiffel Tower?"] }'

```
!!! success "Expected Output"
2 changes: 1 addition & 1 deletion overrides/main.html
@@ -2,6 +2,6 @@

{% block announce %}
<h1>
-  <b>KServe v0.11 is Released</b>, <a href="/website/0.11/blog/articles/2023-10-08-KServe-0.11-release/">Read blog &gt;&gt;</a>
+  <b>KServe v0.13 is Released</b>, <a href="/website/0.13/blog/articles/2024-05-15-Kserve-0.13-release/">Read blog &gt;&gt;</a>
</h1>
{% endblock %}
