diff --git a/docs/modelserving/v1beta1/llm/huggingface/README.md b/docs/modelserving/v1beta1/llm/huggingface/README.md
index 602d40507..d6334a44b 100644
--- a/docs/modelserving/v1beta1/llm/huggingface/README.md
+++ b/docs/modelserving/v1beta1/llm/huggingface/README.md
@@ -1,13 +1,15 @@
-# Deploy the Llama2 model with Hugging Face LLM Serving Runtime
-The Hugging Face LLM serving runtime implements a runtime that can serve Hugging Face LLM model out of the box.
+# Deploy the Llama3 model with Hugging Face LLM Serving Runtime
+The Hugging Face serving runtime can serve Hugging Face models out of the box.
+The preprocess and post-process handlers are implemented per ML task, for example text classification,
+token classification, text generation, text2text generation and fill mask.

-In this example, we deploy a Llama2 model from Hugging Face by running an `InferenceService` with [Hugging Face Serving runtime](https://github.com/kserve/kserve/tree/master/python/huggingfaceserver). Based on the performance requirement for large language models, KServe chooses to perform the inference using a more optimized inference engine like [vLLM](https://github.com/vllm-project/vllm) for text generation models.
+Given the performance requirements of large language models (LLMs), KServe runs the optimized inference engine [vLLM](https://github.com/vllm-project/vllm) for text generation tasks by default, considering its ease of use and high performance.

-### Serve the Hugging Face LLM model using vLLM
+In this example, we deploy a Llama3 model from Hugging Face by creating an `InferenceService` with the [Hugging Face Serving runtime](https://github.com/kserve/kserve/tree/master/python/huggingfaceserver).

-KServe Hugging Face runtime by default uses vLLM to serve the LLM models for faster inference and higher throughput than the Hugging Face API, implemented with paged attention, continuous batching and an optimized CUDA kernel.
+### Serve the Hugging Face LLM model using vLLM backend

-You can still use `--backend=huggingface` arg to fall back to perform the inference using Hugging Face API.
+The KServe Hugging Face runtime uses vLLM by default to serve LLM models, providing faster time-to-first-token (TTFT) and higher token generation throughput than the Hugging Face API. vLLM implements common inference optimizations such as paged attention, continuous batching and optimized CUDA kernels. If the model is not supported by vLLM, KServe falls back to the Hugging Face backend as a failsafe.

 === "Yaml"

     ```yaml
     kubectl apply -f - <<EOF
     apiVersion: serving.kserve.io/v1beta1
     kind: InferenceService
     metadata:
-      name: huggingface-llama2
+      name: huggingface-llama3
     spec:
       predictor:
         model:
           modelFormat:
             name: huggingface
           args:
-            - --model_name=llama2
-            - --model_id=meta-llama/Llama-2-7b-chat-hf
+            - --model_name=llama3
+            - --model_id=meta-llama/meta-llama-3-8b-instruct
           resources:
             limits:
               cpu: "6"
@@ -37,79 +39,96 @@ You can still use `--backend=huggingface` arg to fall back to perform the infere
               nvidia.com/gpu: "1"
     EOF
     ```
+!!! note
+    1. `SAFETENSORS_FAST_GPU` is set by default to improve model loading performance.
+    2. `HF_HUB_DISABLE_TELEMETRY` is set by default to disable telemetry.

 ### Perform Model Inference

 The first step is to [determine the ingress IP and ports](../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.
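+
+For example, if the cluster exposes an Istio ingress gateway through a LoadBalancer (a sketch; the `istio-system` namespace and `istio-ingressgateway` service name are assumptions, and other setups are covered in the linked guide), the two variables could be set as:
+
+```bash
+export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
+export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
+```
+
+Then set the model name and service hostname: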
 ```bash
-MODEL_NAME=llama2
-SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
+MODEL_NAME=llama3
+SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
 ```

-Perform inference with v1 REST Protocol
-
-```bash
-curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d '{"instances": ["Where is Eiffel Tower?"] }'
-```
-
-!!! success "Expected Output"
-
-    ```{ .bash .no-copy }
-    {"predictions":["Where is Eiffel Tower?\nEiffel Tower is located in Paris, France. It is one of the most iconic landmarks in the world and stands at 324 meters (1,063 feet) tall. The tower was built for the 1889 World's Fair in Paris and was designed by Gustave Eiffel. It is made of iron and has four pillars that support the tower. The Eiffel Tower is a popular tourist destination and offers stunning views of the city of Paris."]}
-    ```
-
 KServe Hugging Face vLLM runtime supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for inference

 Sample OpenAI Completions request:

 ```bash
 curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions -d '{"model": "${MODEL_NAME}", "prompt": "", "stream":false, "max_tokens": 30 }'
-```
-
-!!! success "Expected Output"
-
-    ```{ .bash .no-copy }
-    {"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":""}],"created":1715353182,"model":"llama2","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
-    ```
-
-Sample OpenAI Chat request:
-
-```bash
-curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions -d '{"model": "${MODEL_NAME}", "messages": [{"role": "user","content": ""}], "stream":false }'
-```
-Sample OpenAI Completions request:
-
-```bash
-curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions -d '{"model": "${MODEL_NAME}", "prompt": "", "stream":false, "max_tokens": 30 }'
 ```

 !!! success "Expected Output"

     ```{ .bash .no-copy }
-    {"id":"cmpl-87ee252062934e2f8f918dce011e8484","choices":[{"finish_reason":"length","index":0,"message":{"content":"","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1715353461,"model":"llama2","system_fingerprint":null,"object":"chat.completion","usage":{"completion_tokens":30,"prompt_tokens":3,"total_tokens":33}}
+    {"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":""}],"created":1715353182,"model":"llama3","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
     ```

 Sample OpenAI Chat request:

 ```bash
-curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d '{"instances": ["Where is Eiffel Tower?"] }'
+curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions -d '{"model": "${MODEL_NAME}", "messages": [{"role": "user","content": ""}], "stream":false }'
 ```
 !!! success "Expected Output"

     ```{ .bash .no-copy }
-    {"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":""}],"created":1715353182,"model":"llama2","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
+    {"id":"cmpl-87ee252062934e2f8f918dce011e8484","choices":[{"finish_reason":"length","index":0,"message":{"content":"","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1715353461,"model":"llama3","system_fingerprint":null,"object":"chat.completion","usage":{"completion_tokens":30,"prompt_tokens":3,"total_tokens":33}}
     ```
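+
+Both OpenAI endpoints also accept `"stream": true`, in which case tokens are returned as a stream of chunks rather than a single JSON body (a sketch; the exact chunk format depends on your KServe version):
+
+```bash
+curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions -d '{"model": "llama3", "messages": [{"role": "user","content": "Write a haiku about Paris"}], "stream": true, "max_tokens": 30 }'
+```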
success "Expected Output" ```{ .bash .no-copy } - {"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":""}],"created":1715353182,"model":"llama2","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}} + {"id":"cmpl-87ee252062934e2f8f918dce011e8484","choices":[{"finish_reason":"length","index":0,"message":{"content":"","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1715353461,"model":"llama3","system_fingerprint":null,"object":"chat.completion","usage":{"completion_tokens":30,"prompt_tokens":3,"total_tokens":33}} ``` -Sample OpenAI Chat request: +### Serve the Hugging Face LLM model using HuggingFace Backend +You can use `--backend=huggingface` arg to perform the inference using Hugging Face. KServe Hugging Face backend runtime also +supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for inference. -```bash -curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions -d '{"model": "${MODEL_NAME}", "messages": [{"role": "user","content": ""}], "stream":false }' +=== "Yaml" -``` -!!! success "Expected Output" + ```yaml + kubectl apply -f - <","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1715353461,"model":"llama2","system_fingerprint":null,"object":"chat.completion","usage":{"completion_tokens":30,"prompt_tokens":3,"total_tokens":33}} - ``` +### Hugging Face Runtime Arguments + +Below, you can find an explanation of command line arguments which are supported for Hugging Face runtime. [vLLM backend engine arguments](https://docs.vllm.ai/en/latest/models/engine_args.html) can also be specified on the command line argument which is parsed by the Hugging Face runtime. + +- `--model_name`: The name of the model used on the endpoint path. +- `--model_dir`: The local path where the model is downloaded to. If `model_id` is provided, this argument will be ignored. +- `--model_id`: Huggingface model id. +- `--model_revision`: Huggingface model revision. +- `--tokenizer_revision`: Huggingface tokenizer revision. +- `--dtype`: Data type to load the weights in. One of 'auto', 'float16', 'float32', 'bfloat16', 'float', 'half'. + Defaults to float16 for GPU and float32 for CPU systems. 'auto' uses float16 if GPU is available and uses float32 otherwise to ensure consistency between vLLM and HuggingFace backends. + Encoder models defaults to 'float32'. 'float' is shorthand for 'float32'. 'half' is 'float16'. The rest are as the name reads. +- `--task`: The ML task name. Can be one of 'text_generation', 'text2text_generation', 'fill_mask', 'token_classification', 'sequence_classification'. +- `--backend`: The backend to use to load the model. Can be one of 'auto', 'huggingface', 'vllm'. +- `--max_length`: Max sequence length for the tokenizer. +- `--disable_lower_case`: Disable lower case for the tokenizer. +- `--disable_special_tokens`: The sequences will not be encoded with the special tokens relative to the model. +- `--trust_remote_code`: Allow loading of models and tokenizers with custom code. +- `--tensor_input_names`: The tensor input names passed to the model for triton inference server backend. +- `--return_token_type_ids`: Return token type ids. +- `--return_probabilities`: Return probabilities of predicted indexes. 
diff --git a/docs/reference/api.md b/docs/reference/api.md
index 33d1802f6..62f2d573f 100644
--- a/docs/reference/api.md
+++ b/docs/reference/api.md
@@ -2045,7 +2045,7 @@ http:///v1/models/.metadata.name

 Generated with gen-crd-api-reference-docs
-on git commit 426fe21d.
+on git commit 1c51eeee.

serving.kserve.io/v1beta1

[The remaining hunks of docs/reference/api.md are generated HTML tables (gen-crd-api-reference-docs output) whose markup did not survive extraction. The recoverable changes are:]

- The `AlibiExplainerSpec` section ("AlibiExplainerSpec defines the arguments for configuring an Alibi Explanation Server", with a `type` field of `AlibiExplainerType` and the embedded `ExplainerExtensionSpec`) is removed, together with the `AlibiExplainerType` string alias and its values `"AnchorImages"`, `"AnchorTabular"`, `"AnchorText"`, `"Contrastive"` and `"Counterfactuals"`.
- An optional `deploymentStrategy` field of type Kubernetes `apps/v1.DeploymentStrategy` is added alongside `annotations`: "The deployment strategy to use to replace existing pods with new ones. Only applicable for raw deployment mode."
- `ExplainerExtensionSpec` now appears only on `ARTExplainerSpec`; the `alibi` field (`AlibiExplainerSpec`, "Spec for alibi explainer") is removed from `ExplainerSpec`, the `alibi` `ExplainerConfig` entry is removed from the explainer configurations, and the `ExplainerSpec` PodSpec description drops its "(i.e. Alibi)" references.
- An `additionalIngressDomains` field of type `[]string` is added ahead of `domainTemplate` in the ingress configuration.

@@ -5300,5 +5219,5 @@ PredictorExtensionSpec

 Generated with gen-crd-api-reference-docs
-on git commit 426fe21d.
+on git commit 1c51eeee.
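The new `deploymentStrategy` field summarized above maps onto the standard Kubernetes `apps/v1` deployment strategy. A sketch of how it might be set on an `InferenceService` predictor in raw deployment mode (the placement under `spec.predictor` and the annotation shown are assumptions based on the surrounding API, not taken from this diff):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    # Assumed location of the new field; uses the apps/v1 DeploymentStrategy schema.
    deploymentStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama3
        - --model_id=meta-llama/meta-llama-3-8b-instruct
```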