Deployed daaa70a to master with MkDocs 1.5.3 and mike 2.0.0
github-actions[bot] committed Nov 18, 2023
1 parent 8d71116 commit c720679
Showing 6 changed files with 213 additions and 172 deletions.
18 changes: 13 additions & 5 deletions master/blog/articles/2023-10-08-KServe-0.11-release/index.html
@@ -1025,7 +1025,8 @@ <h1>
<a class="md-nav__link" href="#llm-runtimes">
LLM Runtimes
</a>
- </li>
+ <nav aria-label="LLM Runtimes" class="md-nav">
+ <ul class="md-nav__list">
<li class="md-nav__item">
<a class="md-nav__link" href="#torchserve-llm-runtime">
TorchServe LLM Runtime
@@ -1039,6 +1040,9 @@
</ul>
</nav>
</li>
+ </ul>
+ </nav>
+ </li>
<li class="md-nav__item">
<a class="md-nav__link" href="#modelmesh-updates">
ModelMesh Updates
@@ -1181,7 +1185,8 @@ <h1>
<a class="md-nav__link" href="#llm-runtimes">
LLM Runtimes
</a>
- </li>
+ <nav aria-label="LLM Runtimes" class="md-nav">
+ <ul class="md-nav__list">
<li class="md-nav__item">
<a class="md-nav__link" href="#torchserve-llm-runtime">
TorchServe LLM Runtime
@@ -1195,6 +1200,9 @@
</ul>
</nav>
</li>
+ </ul>
+ </nav>
+ </li>
<li class="md-nav__item">
<a class="md-nav__link" href="#modelmesh-updates">
ModelMesh Updates
@@ -1329,10 +1337,10 @@ <h3 id="kserve-python-runtimes-improvements">KServe Python Runtimes Improvements
</li>
</ul>
<h3 id="llm-runtimes">LLM Runtimes<a class="headerlink" href="#llm-runtimes" title="Permanent link"></a></h3>
<h3 id="torchserve-llm-runtime">TorchServe LLM Runtime<a class="headerlink" href="#torchserve-llm-runtime" title="Permanent link"></a></h3>
<h4 id="torchserve-llm-runtime">TorchServe LLM Runtime<a class="headerlink" href="#torchserve-llm-runtime" title="Permanent link"></a></h4>
<p>KServe now integrates with TorchServe 0.8, offering the support for <a href="https://pytorch.org/serve/large_model_inference.html">LLM models</a> that may not fit onto a single GPU.
- Huggingface Accelerate and Deepspeed are available options to split the model into multiple partitions over multiple GPUs. You can see the <a href="../../modelserving/v1beta1/llm/">detailed example</a> for how to serve the LLM on KServe with TorchServe runtime.</p>
- <h3 id="vllm-runtime">vLLM Runtime<a class="headerlink" href="#vllm-runtime" title="Permanent link"></a></h3>
+ Huggingface Accelerate and Deepspeed are available options to split the model into multiple partitions over multiple GPUs. You can see the <a href="../../../modelserving/v1beta1/llm/torchserve/accelerate/">detailed example</a> for how to serve the LLM on KServe with TorchServe runtime.</p>
+ <h4 id="vllm-runtime">vLLM Runtime<a class="headerlink" href="#vllm-runtime" title="Permanent link"></a></h4>
<p>Serving LLM models can be surprisingly slow even on high end GPUs, <a href="https://github.com/vllm-project/vllm">vLLM</a> is a fast and easy-to-use LLM inference engine. It can achieve 10x-20x higher throughput than Huggingface transformers.
It supports <a href="https://www.anyscale.com/blog/continuous-batching-llm-inference">continuous batching</a> for increased throughput and GPU utilization,
<a href="https://vllm.ai">paged attention</a> to address the memory bottleneck where in the autoregressive decoding process all the attention key value tensors(KV Cache) are kept in the GPU memory to generate next tokens.</p>
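For the vLLM paragraph, a small offline-inference sketch (again, not part of this commit) shows the engine that provides the continuous batching and paged attention described above; the model name and prompts are placeholders.

```python
# Self-contained vLLM sketch: the LLM class wraps the continuous-batching
# engine with paged attention. Model and prompts are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any Hugging Face-compatible causal LM
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "KServe is",
    "Paged attention reduces memory pressure because",
]
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)
```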
2 changes: 1 addition & 1 deletion master/reference/swagger-ui/index.html
@@ -1441,7 +1441,7 @@ <h1>
</a>
<h1 id="open-inference-protocol-api-specification">Open Inference Protocol API Specification<a class="headerlink" href="#open-inference-protocol-api-specification" title="Permanent link"></a></h1>
<h2 id="rest">REST<a class="headerlink" href="#rest" title="Permanent link"></a></h2>
<p><iframe class="swagger-ui-iframe" frameborder="0" id="c0dc9c26" src="swagger-c0dc9c26.html" style="overflow:hidden;width:100%;" width="100%"></iframe></p>
<p><iframe class="swagger-ui-iframe" frameborder="0" id="b5ed2c9c" src="swagger-b5ed2c9c.html" style="overflow:hidden;width:100%;" width="100%"></iframe></p>
<h2 id="grpc">GRPC<a class="headerlink" href="#grpc" title="Permanent link"></a></h2>
<h3 id="serverlive">ServerLive<a class="headerlink" href="#serverlive" title="Permanent link"></a></h3>
<p>The ServerLive API indicates if the inference server is able to receive
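The ServerLive description in this hunk is cut off by the collapsed diff, but the REST form of the check is simple enough to sketch; the host and port below are placeholders for wherever the inference server is exposed.

```python
# Hedged sketch, not from this commit: the REST liveness check of the
# Open Inference Protocol. Host and port are placeholders.
import requests

resp = requests.get("http://localhost:8080/v2/health/live", timeout=5)
resp.raise_for_status()
print(resp.json())  # expected shape: {"live": true}
```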
@@ -67,7 +67,7 @@
}

const resize_ob = new ResizeObserver(function(entries) {
parent.update_swagger_ui_iframe_height("c0dc9c26");
parent.update_swagger_ui_iframe_height("b5ed2c9c");
});

// start observing for resizing