Improve Huggingface docs (#369)
* Update hf docs

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Generate api ref docs

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Update README.md

Signed-off-by: Dan Sun <[email protected]>

* Update README.md

Signed-off-by: Dan Sun <[email protected]>

* Update huggingface arguments

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>
sivanantha321 and yuzisun authored Jun 9, 2024
1 parent e2d1a2a commit f41252e
Showing 2 changed files with 99 additions and 161 deletions.
121 changes: 70 additions & 51 deletions docs/modelserving/v1beta1/llm/huggingface/README.md
@@ -1,13 +1,15 @@
# Deploy the Llama2 model with Hugging Face LLM Serving Runtime
The Hugging Face LLM serving runtime implements a runtime that can serve Hugging Face LLM model out of the box.
# Deploy the Llama3 model with Hugging Face LLM Serving Runtime
The Hugging Face serving runtime can serve Hugging Face models out of the box.
The preprocess and post-process handlers are implemented per ML task, for example text classification,
token-classification, text-generation, text2text-generation, and fill-mask.

In this example, we deploy a Llama2 model from Hugging Face by running an `InferenceService` with [Hugging Face Serving runtime](https://github.com/kserve/kserve/tree/master/python/huggingfaceserver). Based on the performance requirement for large language models, KServe chooses to perform the inference using a more optimized inference engine like [vLLM](https://github.com/vllm-project/vllm) for text generation models.
Given the performance requirements of large language models (LLMs), KServe runs the optimized inference engine [vLLM](https://github.com/vllm-project/vllm) for text generation tasks by default, considering its ease of use and high performance.

### Serve the Hugging Face LLM model using vLLM
In this example, we deploy a Llama3 model from Hugging Face by creating an `InferenceService` with the [Hugging Face Serving runtime](https://github.com/kserve/kserve/tree/master/python/huggingfaceserver).

KServe Hugging Face runtime by default uses vLLM to serve the LLM models for faster inference and higher throughput than the Hugging Face API, implemented with paged attention, continuous batching and an optimized CUDA kernel.
### Serve the Hugging Face LLM model using vLLM backend

You can still use `--backend=huggingface` arg to fall back to perform the inference using Hugging Face API.
The KServe Hugging Face runtime uses vLLM by default to serve LLM models, delivering faster time-to-first-token (TTFT) and higher token generation throughput than the Hugging Face API. vLLM implements common inference optimization techniques such as paged attention, continuous batching, and optimized CUDA kernels. If the model is not supported by vLLM, KServe falls back to the Hugging Face backend as a failsafe.


=== "Yaml"
@@ -17,15 +19,15 @@ You can still use `--backend=huggingface` arg to fall back to perform the infere
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-llama2
name: huggingface-llama3
spec:
predictor:
model:
modelFormat:
name: huggingface
args:
- --model_name=llama2
- --model_id=meta-llama/Llama-2-7b-chat-hf
- --model_name=llama3
- --model_id=meta-llama/meta-llama-3-8b-instruct
resources:
limits:
cpu: "6"
@@ -37,79 +39,96 @@ You can still use `--backend=huggingface` arg to fall back to perform the infere
nvidia.com/gpu: "1"
EOF
```
!!! note
1. `SAFETENSORS_FAST_GPU` is set by default to improve model loading performance.
2. `HF_HUB_DISABLE_TELEMETRY` is set by default to disable telemetry.

### Perform Model Inference

The first step is to [determine the ingress IP and ports](../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.

```bash
MODEL_NAME=llama2
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
MODEL_NAME=llama3
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```
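
If you have not already set `INGRESS_HOST` and `INGRESS_PORT`, the following is a minimal sketch assuming the Istio ingress gateway is exposed as a LoadBalancer in the `istio-system` namespace (see the linked guide above for NodePort and other setups):

```bash
# Sketch: resolve the ingress IP and HTTP port from the Istio ingress gateway.
# Assumes a LoadBalancer-type istio-ingressgateway service in the istio-system namespace.
INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
```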

Perform inference with v1 REST Protocol

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d '{"instances": ["Where is Eiffel Tower?"] }'
```

!!! success "Expected Output"

```{ .bash .no-copy }
{"predictions":["Where is Eiffel Tower?\nEiffel Tower is located in Paris, France. It is one of the most iconic landmarks in the world and stands at 324 meters (1,063 feet) tall. The tower was built for the 1889 World's Fair in Paris and was designed by Gustave Eiffel. It is made of iron and has four pillars that support the tower. The Eiffel Tower is a popular tourist destination and offers stunning views of the city of Paris."]}
```

The KServe Hugging Face vLLM runtime supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for inference.

Sample OpenAI Completions request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions -d '{"model": "${MODEL_NAME}", "prompt": "<prompt>", "stream":false, "max_tokens": 30 }'
```

!!! success "Expected Output"

```{ .bash .no-copy }
{"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"<generated_text>"}],"created":1715353182,"model":"llama2","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
```

Sample OpenAI Chat request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions -d '{"model": "${MODEL_NAME}", "messages": [{"role": "user","content": "<message>"}], "stream":false }'
```

Sample OpenAI Completions request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions -d '{"model": "${MODEL_NAME}", "prompt": "<prompt>", "stream":false, "max_tokens": 30 }'
```
!!! success "Expected Output"

```{ .bash .no-copy }
{"id":"cmpl-87ee252062934e2f8f918dce011e8484","choices":[{"finish_reason":"length","index":0,"message":{"content":"<generated_response>","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1715353461,"model":"llama2","system_fingerprint":null,"object":"chat.completion","usage":{"completion_tokens":30,"prompt_tokens":3,"total_tokens":33}}
{"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"<generated_text>"}],"created":1715353182,"model":"llama3","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
```

Sample OpenAI Chat request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d '{"instances": ["Where is Eiffel Tower?"] }'
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions -d '{"model": "${MODEL_NAME}", "messages": [{"role": "user","content": "<message>"}], "stream":false }'

```
!!! success "Expected Output"

```{ .bash .no-copy }
{"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"<generated_text>"}],"created":1715353182,"model":"llama2","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
{"id":"cmpl-87ee252062934e2f8f918dce011e8484","choices":[{"finish_reason":"length","index":0,"message":{"content":"<generated_response>","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1715353461,"model":"llama3","system_fingerprint":null,"object":"chat.completion","usage":{"completion_tokens":30,"prompt_tokens":3,"total_tokens":33}}
```
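
The chat endpoint can also stream tokens back as they are generated. The request below is a sketch that assumes the runtime follows the OpenAI server-sent-events streaming convention; note that `${MODEL_NAME}` is placed outside the single quotes so the shell expands it into the request body:

```bash
# Sketch: streaming chat completion. Chunks arrive as "data: {...}" events,
# terminated by "data: [DONE]". The -N flag disables curl output buffering.
curl -N -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions \
  -d '{"model": "'"${MODEL_NAME}"'", "messages": [{"role": "user", "content": "Where is Eiffel Tower?"}], "stream": true, "max_tokens": 30}'
```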

Sample OpenAI Chat request:
### Serve the Hugging Face LLM model using Hugging Face backend
You can use the `--backend=huggingface` argument to perform the inference using the Hugging Face API. The KServe Hugging Face backend runtime also
supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for inference.

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions -d '{"model": "${MODEL_NAME}", "messages": [{"role": "user","content": "<message>"}], "stream":false }'
=== "Yaml"

```
!!! success "Expected Output"
```yaml
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-llama3
spec:
predictor:
model:
modelFormat:
name: huggingface
args:
- --model_name=llama3
- --model_id=meta-llama/meta-llama-3-8b-instruct
- --backend=huggingface
resources:
limits:
cpu: "6"
memory: 24Gi
nvidia.com/gpu: "1"
requests:
cpu: "6"
memory: 24Gi
nvidia.com/gpu: "1"
EOF
```

```{ .bash .no-copy }
{"id":"cmpl-87ee252062934e2f8f918dce011e8484","choices":[{"finish_reason":"length","index":0,"message":{"content":"<generated_response>","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1715353461,"model":"llama2","system_fingerprint":null,"object":"chat.completion","usage":{"completion_tokens":30,"prompt_tokens":3,"total_tokens":33}}
```
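
Whichever backend you choose, you can wait for the `InferenceService` to become ready before sending requests. A minimal sketch:

```bash
# Sketch: block until the InferenceService reports the Ready condition (up to 10 minutes),
# then print its status and URL.
kubectl wait --for=condition=Ready inferenceservice huggingface-llama3 --timeout=600s
kubectl get inferenceservice huggingface-llama3
```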
### Hugging Face Runtime Arguments

Below is an explanation of the command line arguments supported by the Hugging Face runtime. [vLLM backend engine arguments](https://docs.vllm.ai/en/latest/models/engine_args.html) can also be specified on the command line and are parsed by the Hugging Face runtime (see the sketch after this list).

- `--model_name`: The name of the model used on the endpoint path.
- `--model_dir`: The local path where the model is downloaded to. If `model_id` is provided, this argument is ignored.
- `--model_id`: The Hugging Face model id.
- `--model_revision`: The Hugging Face model revision.
- `--tokenizer_revision`: The Hugging Face tokenizer revision.
- `--dtype`: Data type to load the weights in. One of 'auto', 'float16', 'float32', 'bfloat16', 'float', 'half'.
Defaults to float16 for GPU and float32 for CPU systems. 'auto' uses float16 if a GPU is available and float32 otherwise, to ensure consistency between the vLLM and Hugging Face backends.
Encoder models default to 'float32'. 'float' is shorthand for 'float32' and 'half' for 'float16'. The rest are as their names read.
- `--task`: The ML task name. Can be one of 'text_generation', 'text2text_generation', 'fill_mask', 'token_classification', 'sequence_classification'.
- `--backend`: The backend to use to load the model. Can be one of 'auto', 'huggingface', 'vllm'.
- `--max_length`: Max sequence length for the tokenizer.
- `--disable_lower_case`: Disable lower case for the tokenizer.
- `--disable_special_tokens`: Do not encode the sequences with the special tokens relative to the model.
- `--trust_remote_code`: Allow loading of models and tokenizers with custom code.
- `--tensor_input_names`: The tensor input names passed to the model for the Triton Inference Server backend.
- `--return_token_type_ids`: Return token type ids.
- `--return_probabilities`: Return probabilities of predicted indexes. This is only applicable for tasks 'sequence_classification', 'token_classification' and 'fill_mask'.
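
As noted above, vLLM engine arguments can be appended alongside the runtime arguments when the vLLM backend is used. The snippet below is a sketch rather than a definitive configuration: the engine flags shown (`--max-model-len`, `--gpu-memory-utilization`) are taken from the vLLM engine arguments documentation and should be checked against the vLLM version bundled with your runtime image.

```bash
# Sketch: pass vLLM engine flags through the InferenceService args list.
# --max-model-len caps the context window; --gpu-memory-utilization bounds GPU memory used for the KV cache.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama3
        - --model_id=meta-llama/meta-llama-3-8b-instruct
        - --max-model-len=4096
        - --gpu-memory-utilization=0.9
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
EOF
```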
139 changes: 29 additions & 110 deletions docs/reference/api.md
@@ -2045,7 +2045,7 @@ http://<inferenceservice.metadata.name>/v1/models/<trainedmodel>.metadata.name</
<hr/>
<p><em>
Generated with <code>gen-crd-api-reference-docs</code>
on git commit <code>426fe21d</code>.
on git commit <code>1c51eeee</code>.
</em></p>
<h2 id="serving.kserve.io/v1beta1">serving.kserve.io/v1beta1</h2>
<div>
@@ -2118,86 +2118,6 @@ ExplainerExtensionSpec
<td></td>
</tr></tbody>
</table>
<h3 id="serving.kserve.io/v1beta1.AlibiExplainerSpec">AlibiExplainerSpec
</h3>
<p>
(<em>Appears on:</em><a href="#serving.kserve.io/v1beta1.ExplainerSpec">ExplainerSpec</a>)
</p>
<div>
<p>AlibiExplainerSpec defines the arguments for configuring an Alibi Explanation Server</p>
</div>
<table>
<thead>
<tr>
<th>Field</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<code>type</code><br/>
<em>
<a href="#serving.kserve.io/v1beta1.AlibiExplainerType">
AlibiExplainerType
</a>
</em>
</td>
<td>
<p>The type of Alibi explainer <br />
Valid values are: <br />
- &ldquo;AnchorTabular&rdquo;; <br />
- &ldquo;AnchorImages&rdquo;; <br />
- &ldquo;AnchorText&rdquo;; <br />
- &ldquo;Counterfactuals&rdquo;; <br />
- &ldquo;Contrastive&rdquo;; <br /></p>
</td>
</tr>
<tr>
<td>
<code>ExplainerExtensionSpec</code><br/>
<em>
<a href="#serving.kserve.io/v1beta1.ExplainerExtensionSpec">
ExplainerExtensionSpec
</a>
</em>
</td>
<td>
<p>
(Members of <code>ExplainerExtensionSpec</code> are embedded into this type.)
</p>
<p>Contains fields shared across all explainers</p>
</td>
</tr>
</tbody>
</table>
<h3 id="serving.kserve.io/v1beta1.AlibiExplainerType">AlibiExplainerType
(<code>string</code> alias)</h3>
<p>
(<em>Appears on:</em><a href="#serving.kserve.io/v1beta1.AlibiExplainerSpec">AlibiExplainerSpec</a>)
</p>
<div>
<p>AlibiExplainerType is the explanation method</p>
</div>
<table>
<thead>
<tr>
<th>Value</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr><td><p>&#34;AnchorImages&#34;</p></td>
<td></td>
</tr><tr><td><p>&#34;AnchorTabular&#34;</p></td>
<td></td>
</tr><tr><td><p>&#34;AnchorText&#34;</p></td>
<td></td>
</tr><tr><td><p>&#34;Contrastive&#34;</p></td>
<td></td>
</tr><tr><td><p>&#34;Counterfactuals&#34;</p></td>
<td></td>
</tr></tbody>
</table>
<h3 id="serving.kserve.io/v1beta1.Batcher">Batcher
</h3>
<p>
@@ -2418,6 +2338,20 @@ map[string]string
More info: <a href="http://kubernetes.io/docs/user-guide/annotations">http://kubernetes.io/docs/user-guide/annotations</a></p>
</td>
</tr>
<tr>
<td>
<code>deploymentStrategy</code><br/>
<em>
<a href="https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.25/#deploymentstrategy-v1-apps">
Kubernetes apps/v1.DeploymentStrategy
</a>
</em>
</td>
<td>
<em>(Optional)</em>
<p>The deployment strategy to use to replace existing pods with new ones. Only applicable for raw deployment mode.</p>
</td>
</tr>
</tbody>
</table>
<h3 id="serving.kserve.io/v1beta1.ComponentImplementation">ComponentImplementation
@@ -2739,7 +2673,7 @@ string
<h3 id="serving.kserve.io/v1beta1.ExplainerExtensionSpec">ExplainerExtensionSpec
</h3>
<p>
(<em>Appears on:</em><a href="#serving.kserve.io/v1beta1.ARTExplainerSpec">ARTExplainerSpec</a>, <a href="#serving.kserve.io/v1beta1.AlibiExplainerSpec">AlibiExplainerSpec</a>)
(<em>Appears on:</em><a href="#serving.kserve.io/v1beta1.ARTExplainerSpec">ARTExplainerSpec</a>)
</p>
<div>
<p>ExplainerExtensionSpec defines configuration shared across all explainer frameworks</p>
@@ -2838,19 +2772,6 @@ The following fields follow a &ldquo;1-of&rdquo; semantic. Users must specify ex
<tbody>
<tr>
<td>
<code>alibi</code><br/>
<em>
<a href="#serving.kserve.io/v1beta1.AlibiExplainerSpec">
AlibiExplainerSpec
</a>
</em>
</td>
<td>
<p>Spec for alibi explainer</p>
</td>
</tr>
<tr>
<td>
<code>art</code><br/>
<em>
<a href="#serving.kserve.io/v1beta1.ARTExplainerSpec">
@@ -2877,8 +2798,8 @@ PodSpec
</p>
<p>This spec is dual purpose.
1) Users may choose to provide a full PodSpec for their custom explainer.
The field PodSpec.Containers is mutually exclusive with other explainers (i.e. Alibi).
2) Users may choose to provide a Explainer (i.e. Alibi) and specify PodSpec
The field PodSpec.Containers is mutually exclusive with other explainers.
2) Users may choose to provide an Explainer and specify PodSpec
overrides in the PodSpec. They must not provide PodSpec.Containers in this case.</p>
</td>
</tr>
@@ -2917,18 +2838,6 @@ ComponentExtensionSpec
<tbody>
<tr>
<td>
<code>alibi</code><br/>
<em>
<a href="#serving.kserve.io/v1beta1.ExplainerConfig">
ExplainerConfig
</a>
</em>
</td>
<td>
</td>
</tr>
<tr>
<td>
<code>art</code><br/>
<em>
<a href="#serving.kserve.io/v1beta1.ExplainerConfig">
@@ -3460,6 +3369,16 @@ string
</tr>
<tr>
<td>
<code>additionalIngressDomains</code><br/>
<em>
[]string
</em>
</td>
<td>
</td>
</tr>
<tr>
<td>
<code>domainTemplate</code><br/>
<em>
string
@@ -5300,5 +5219,5 @@ PredictorExtensionSpec
<hr/>
<p><em>
Generated with <code>gen-crd-api-reference-docs</code>
on git commit <code>426fe21d</code>.
on git commit <code>1c51eeee</code>.
</em></p>
