Deployed daaa70a to master with MkDocs 1.5.3 and mike 2.0.0
github-actions[bot] committed Nov 18, 2023
1 parent 8d71116 commit c720679
Showing 6 changed files with 213 additions and 172 deletions.
18 changes: 13 additions & 5 deletions master/blog/articles/2023-10-08-KServe-0.11-release/index.html
@@ -1025,7 +1025,8 @@ <h1>
<a class="md-nav__link" href="#llm-runtimes">
LLM Runtimes
</a>
- </li>
+ <nav aria-label="LLM Runtimes" class="md-nav">
+ <ul class="md-nav__list">
<li class="md-nav__item">
<a class="md-nav__link" href="#torchserve-llm-runtime">
TorchServe LLM Runtime
@@ -1039,6 +1040,9 @@
</ul>
</nav>
</li>
+ </ul>
+ </nav>
+ </li>
<li class="md-nav__item">
<a class="md-nav__link" href="#modelmesh-updates">
ModelMesh Updates
@@ -1181,7 +1185,8 @@ <h1>
<a class="md-nav__link" href="#llm-runtimes">
LLM Runtimes
</a>
- </li>
+ <nav aria-label="LLM Runtimes" class="md-nav">
+ <ul class="md-nav__list">
<li class="md-nav__item">
<a class="md-nav__link" href="#torchserve-llm-runtime">
TorchServe LLM Runtime
@@ -1195,6 +1200,9 @@
</ul>
</nav>
</li>
+ </ul>
+ </nav>
+ </li>
<li class="md-nav__item">
<a class="md-nav__link" href="#modelmesh-updates">
ModelMesh Updates
@@ -1329,10 +1337,10 @@ <h3 id="kserve-python-runtimes-improvements">KServe Python Runtimes Improvements
</li>
</ul>
<h3 id="llm-runtimes">LLM Runtimes<a class="headerlink" href="#llm-runtimes" title="Permanent link"></a></h3>
<h3 id="torchserve-llm-runtime">TorchServe LLM Runtime<a class="headerlink" href="#torchserve-llm-runtime" title="Permanent link"></a></h3>
<h4 id="torchserve-llm-runtime">TorchServe LLM Runtime<a class="headerlink" href="#torchserve-llm-runtime" title="Permanent link"></a></h4>
<p>KServe now integrates with TorchServe 0.8, offering the support for <a href="https://pytorch.org/serve/large_model_inference.html">LLM models</a> that may not fit onto a single GPU.
- Huggingface Accelerate and Deepspeed are available options to split the model into multiple partitions over multiple GPUs. You can see the <a href="../../modelserving/v1beta1/llm/">detailed example</a> for how to serve the LLM on KServe with TorchServe runtime.</p>
- <h3 id="vllm-runtime">vLLM Runtime<a class="headerlink" href="#vllm-runtime" title="Permanent link"></a></h3>
+ Huggingface Accelerate and Deepspeed are available options to split the model into multiple partitions over multiple GPUs. You can see the <a href="../../../modelserving/v1beta1/llm/torchserve/accelerate/">detailed example</a> for how to serve the LLM on KServe with TorchServe runtime.</p>
+ <h4 id="vllm-runtime">vLLM Runtime<a class="headerlink" href="#vllm-runtime" title="Permanent link"></a></h4>
<p>Serving LLM models can be surprisingly slow even on high end GPUs, <a href="https://github.com/vllm-project/vllm">vLLM</a> is a fast and easy-to-use LLM inference engine. It can achieve 10x-20x higher throughput than Huggingface transformers.
It supports <a href="https://www.anyscale.com/blog/continuous-batching-llm-inference">continuous batching</a> for increased throughput and GPU utilization,
<a href="https://vllm.ai">paged attention</a> to address the memory bottleneck where in the autoregressive decoding process all the attention key value tensors(KV Cache) are kept in the GPU memory to generate next tokens.</p>
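For the vLLM paragraph, a small offline-inference sketch (again, not part of this commit) shows the engine that provides the continuous batching and paged attention described above; the model name and prompts are placeholders.

```python
# Self-contained vLLM sketch: the LLM class wraps the continuous-batching
# engine with paged attention. Model and prompts are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any Hugging Face-compatible causal LM
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "KServe is",
    "Paged attention reduces memory pressure because",
]
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)
```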
2 changes: 1 addition & 1 deletion master/reference/swagger-ui/index.html
@@ -1441,7 +1441,7 @@ <h1>
</a>
<h1 id="open-inference-protocol-api-specification">Open Inference Protocol API Specification<a class="headerlink" href="#open-inference-protocol-api-specification" title="Permanent link"></a></h1>
<h2 id="rest">REST<a class="headerlink" href="#rest" title="Permanent link"></a></h2>
<p><iframe class="swagger-ui-iframe" frameborder="0" id="c0dc9c26" src="swagger-c0dc9c26.html" style="overflow:hidden;width:100%;" width="100%"></iframe></p>
<p><iframe class="swagger-ui-iframe" frameborder="0" id="b5ed2c9c" src="swagger-b5ed2c9c.html" style="overflow:hidden;width:100%;" width="100%"></iframe></p>
<h2 id="grpc">GRPC<a class="headerlink" href="#grpc" title="Permanent link"></a></h2>
<h3 id="serverlive">ServerLive<a class="headerlink" href="#serverlive" title="Permanent link"></a></h3>
<p>The ServerLive API indicates if the inference server is able to receive
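The ServerLive description in this hunk is cut off by the collapsed diff, but the REST form of the check is simple enough to sketch; the host and port below are placeholders for wherever the inference server is exposed.

```python
# Hedged sketch, not from this commit: the REST liveness check of the
# Open Inference Protocol. Host and port are placeholders.
import requests

resp = requests.get("http://localhost:8080/v2/health/live", timeout=5)
resp.raise_for_status()
print(resp.json())  # expected shape: {"live": true}
```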
@@ -67,7 +67,7 @@
}

const resize_ob = new ResizeObserver(function(entries) {
parent.update_swagger_ui_iframe_height("c0dc9c26");
parent.update_swagger_ui_iframe_height("b5ed2c9c");
});

// start observing for resizing