diff --git a/.gitignore b/.gitignore index 648fa3a462..70cb42a829 100644 --- a/.gitignore +++ b/.gitignore @@ -142,6 +142,9 @@ static/ # doc doc/source/savefig/ +# local env +local_env + asv/results .DS_Store diff --git a/doc/source/getting_started/installation.rst b/doc/source/getting_started/installation.rst index bbb6e89e57..c82f2698c8 100644 --- a/doc/source/getting_started/installation.rst +++ b/doc/source/getting_started/installation.rst @@ -39,7 +39,7 @@ Currently, supported models include: .. vllm_start -- ``llama-2``, ``llama-3``, ``llama-3.1``, ``llama-2-chat``, ``llama-3-instruct``, ``llama-3.1-instruct`` +- ``llama-2``, ``llama-3``, ``llama-3.1``, ``llama-3.2-vision``, ``llama-2-chat``, ``llama-3-instruct``, ``llama-3.1-instruct`` - ``mistral-v0.1``, ``mistral-instruct-v0.1``, ``mistral-instruct-v0.2``, ``mistral-instruct-v0.3``, ``mistral-nemo-instruct``, ``mistral-large-instruct`` - ``codestral-v0.1`` - ``Yi``, ``Yi-1.5``, ``Yi-chat``, ``Yi-1.5-chat``, ``Yi-1.5-chat-16k`` diff --git a/doc/source/models/builtin/audio/index.rst b/doc/source/models/builtin/audio/index.rst index b89eaf41f6..2028757c87 100644 --- a/doc/source/models/builtin/audio/index.rst +++ b/doc/source/models/builtin/audio/index.rst @@ -31,21 +31,41 @@ The following is a list of built-in audio models in Xinference: whisper-base + whisper-base-mlx + whisper-base.en + whisper-base.en-mlx + whisper-large-v3 + whisper-large-v3-mlx + whisper-large-v3-turbo + whisper-large-v3-turbo-mlx + whisper-medium + whisper-medium-mlx + whisper-medium.en + whisper-medium.en-mlx + whisper-small + whisper-small-mlx + whisper-small.en + whisper-small.en-mlx + whisper-tiny + whisper-tiny-mlx + whisper-tiny.en + + whisper-tiny.en-mlx \ No newline at end of file diff --git a/doc/source/models/builtin/audio/whisper-base-mlx.rst b/doc/source/models/builtin/audio/whisper-base-mlx.rst new file mode 100644 index 0000000000..ad0adb24d6 --- /dev/null +++ b/doc/source/models/builtin/audio/whisper-base-mlx.rst @@ -0,0 +1,19 @@ +.. _models_builtin_whisper-base-mlx: + +================ +whisper-base-mlx +================ + +- **Model Name:** whisper-base-mlx +- **Model Family:** whisper +- **Abilities:** audio-to-text +- **Multilingual:** True + +Specifications +^^^^^^^^^^^^^^ + +- **Model ID:** mlx-community/whisper-base-mlx + +Execute the following command to launch the model:: + + xinference launch --model-name whisper-base-mlx --model-type audio \ No newline at end of file diff --git a/doc/source/models/builtin/audio/whisper-base.en-mlx.rst b/doc/source/models/builtin/audio/whisper-base.en-mlx.rst new file mode 100644 index 0000000000..9e11de619c --- /dev/null +++ b/doc/source/models/builtin/audio/whisper-base.en-mlx.rst @@ -0,0 +1,19 @@ +.. _models_builtin_whisper-base.en-mlx: + +=================== +whisper-base.en-mlx +=================== + +- **Model Name:** whisper-base.en-mlx +- **Model Family:** whisper +- **Abilities:** audio-to-text +- **Multilingual:** False + +Specifications +^^^^^^^^^^^^^^ + +- **Model ID:** mlx-community/whisper-base.en-mlx + +Execute the following command to launch the model:: + + xinference launch --model-name whisper-base.en-mlx --model-type audio \ No newline at end of file diff --git a/doc/source/models/builtin/audio/whisper-large-v3-mlx.rst b/doc/source/models/builtin/audio/whisper-large-v3-mlx.rst new file mode 100644 index 0000000000..fe9777bb4f --- /dev/null +++ b/doc/source/models/builtin/audio/whisper-large-v3-mlx.rst @@ -0,0 +1,19 @@ +.. 
_models_builtin_whisper-large-v3-mlx: + +==================== +whisper-large-v3-mlx +==================== + +- **Model Name:** whisper-large-v3-mlx +- **Model Family:** whisper +- **Abilities:** audio-to-text +- **Multilingual:** True + +Specifications +^^^^^^^^^^^^^^ + +- **Model ID:** mlx-community/whisper-large-v3-mlx + +Execute the following command to launch the model:: + + xinference launch --model-name whisper-large-v3-mlx --model-type audio \ No newline at end of file diff --git a/doc/source/models/builtin/audio/whisper-large-v3-turbo-mlx.rst b/doc/source/models/builtin/audio/whisper-large-v3-turbo-mlx.rst new file mode 100644 index 0000000000..647b24a678 --- /dev/null +++ b/doc/source/models/builtin/audio/whisper-large-v3-turbo-mlx.rst @@ -0,0 +1,19 @@ +.. _models_builtin_whisper-large-v3-turbo-mlx: + +========================== +whisper-large-v3-turbo-mlx +========================== + +- **Model Name:** whisper-large-v3-turbo-mlx +- **Model Family:** whisper +- **Abilities:** audio-to-text +- **Multilingual:** True + +Specifications +^^^^^^^^^^^^^^ + +- **Model ID:** mlx-community/whisper-large-v3-turbo + +Execute the following command to launch the model:: + + xinference launch --model-name whisper-large-v3-turbo-mlx --model-type audio \ No newline at end of file diff --git a/doc/source/models/builtin/audio/whisper-medium-mlx.rst b/doc/source/models/builtin/audio/whisper-medium-mlx.rst new file mode 100644 index 0000000000..d06d5b0a77 --- /dev/null +++ b/doc/source/models/builtin/audio/whisper-medium-mlx.rst @@ -0,0 +1,19 @@ +.. _models_builtin_whisper-medium-mlx: + +================== +whisper-medium-mlx +================== + +- **Model Name:** whisper-medium-mlx +- **Model Family:** whisper +- **Abilities:** audio-to-text +- **Multilingual:** True + +Specifications +^^^^^^^^^^^^^^ + +- **Model ID:** mlx-community/whisper-medium-mlx + +Execute the following command to launch the model:: + + xinference launch --model-name whisper-medium-mlx --model-type audio \ No newline at end of file diff --git a/doc/source/models/builtin/audio/whisper-medium.en-mlx.rst b/doc/source/models/builtin/audio/whisper-medium.en-mlx.rst new file mode 100644 index 0000000000..da48274b95 --- /dev/null +++ b/doc/source/models/builtin/audio/whisper-medium.en-mlx.rst @@ -0,0 +1,19 @@ +.. _models_builtin_whisper-medium.en-mlx: + +===================== +whisper-medium.en-mlx +===================== + +- **Model Name:** whisper-medium.en-mlx +- **Model Family:** whisper +- **Abilities:** audio-to-text +- **Multilingual:** False + +Specifications +^^^^^^^^^^^^^^ + +- **Model ID:** mlx-community/whisper-medium.en-mlx + +Execute the following command to launch the model:: + + xinference launch --model-name whisper-medium.en-mlx --model-type audio \ No newline at end of file diff --git a/doc/source/models/builtin/audio/whisper-small-mlx.rst b/doc/source/models/builtin/audio/whisper-small-mlx.rst new file mode 100644 index 0000000000..fc7dc22ca4 --- /dev/null +++ b/doc/source/models/builtin/audio/whisper-small-mlx.rst @@ -0,0 +1,19 @@ +.. 
_models_builtin_whisper-small-mlx: + +================= +whisper-small-mlx +================= + +- **Model Name:** whisper-small-mlx +- **Model Family:** whisper +- **Abilities:** audio-to-text +- **Multilingual:** True + +Specifications +^^^^^^^^^^^^^^ + +- **Model ID:** mlx-community/whisper-small-mlx + +Execute the following command to launch the model:: + + xinference launch --model-name whisper-small-mlx --model-type audio \ No newline at end of file diff --git a/doc/source/models/builtin/audio/whisper-small.en-mlx.rst b/doc/source/models/builtin/audio/whisper-small.en-mlx.rst new file mode 100644 index 0000000000..5105c633e2 --- /dev/null +++ b/doc/source/models/builtin/audio/whisper-small.en-mlx.rst @@ -0,0 +1,19 @@ +.. _models_builtin_whisper-small.en-mlx: + +==================== +whisper-small.en-mlx +==================== + +- **Model Name:** whisper-small.en-mlx +- **Model Family:** whisper +- **Abilities:** audio-to-text +- **Multilingual:** False + +Specifications +^^^^^^^^^^^^^^ + +- **Model ID:** mlx-community/whisper-small.en-mlx + +Execute the following command to launch the model:: + + xinference launch --model-name whisper-small.en-mlx --model-type audio \ No newline at end of file diff --git a/doc/source/models/builtin/audio/whisper-tiny-mlx.rst b/doc/source/models/builtin/audio/whisper-tiny-mlx.rst new file mode 100644 index 0000000000..355dacbd3d --- /dev/null +++ b/doc/source/models/builtin/audio/whisper-tiny-mlx.rst @@ -0,0 +1,19 @@ +.. _models_builtin_whisper-tiny-mlx: + +================ +whisper-tiny-mlx +================ + +- **Model Name:** whisper-tiny-mlx +- **Model Family:** whisper +- **Abilities:** audio-to-text +- **Multilingual:** True + +Specifications +^^^^^^^^^^^^^^ + +- **Model ID:** mlx-community/whisper-tiny + +Execute the following command to launch the model:: + + xinference launch --model-name whisper-tiny-mlx --model-type audio \ No newline at end of file diff --git a/doc/source/models/builtin/audio/whisper-tiny.en-mlx.rst b/doc/source/models/builtin/audio/whisper-tiny.en-mlx.rst new file mode 100644 index 0000000000..c023969951 --- /dev/null +++ b/doc/source/models/builtin/audio/whisper-tiny.en-mlx.rst @@ -0,0 +1,19 @@ +.. 
_models_builtin_whisper-tiny.en-mlx: + +=================== +whisper-tiny.en-mlx +=================== + +- **Model Name:** whisper-tiny.en-mlx +- **Model Family:** whisper +- **Abilities:** audio-to-text +- **Multilingual:** False + +Specifications +^^^^^^^^^^^^^^ + +- **Model ID:** mlx-community/whisper-tiny.en-mlx + +Execute the following command to launch the model:: + + xinference launch --model-name whisper-tiny.en-mlx --model-type audio \ No newline at end of file diff --git a/doc/source/models/builtin/embedding/gte-qwen2.rst b/doc/source/models/builtin/embedding/gte-qwen2.rst index cec2de89b6..a88fdece9d 100644 --- a/doc/source/models/builtin/embedding/gte-qwen2.rst +++ b/doc/source/models/builtin/embedding/gte-qwen2.rst @@ -18,4 +18,4 @@ Specifications Execute the following command to launch the model:: - xinference launch --model-name gte-Qwen2 --model-type embedding + xinference launch --model-name gte-Qwen2 --model-type embedding \ No newline at end of file diff --git a/doc/source/models/builtin/llm/index.rst b/doc/source/models/builtin/llm/index.rst index bddb6052f5..34018b8ffe 100644 --- a/doc/source/models/builtin/llm/index.rst +++ b/doc/source/models/builtin/llm/index.rst @@ -240,16 +240,16 @@ The following is a list of built-in LLM in Xinference: - chat, tools - 131072 - The Llama 3.1 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks.. - + * - :ref:`llama-3.2-vision ` - generate, vision - 131072 - - The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out)... - + - The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image... + * - :ref:`llama-3.2-vision-instruct ` - chat, vision - 131072 - - The Llama 3.2-Vision-instruct instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks... + - Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image... * - :ref:`minicpm-2b-dpo-bf16 ` - chat @@ -641,6 +641,10 @@ The following is a list of built-in LLM in Xinference: llama-3.1-instruct + llama-3.2-vision + + llama-3.2-vision-instruct + minicpm-2b-dpo-bf16 minicpm-2b-dpo-fp16 diff --git a/doc/source/models/builtin/llm/llama-3.1-instruct.rst b/doc/source/models/builtin/llm/llama-3.1-instruct.rst index 350e333bb3..747c714856 100644 --- a/doc/source/models/builtin/llm/llama-3.1-instruct.rst +++ b/doc/source/models/builtin/llm/llama-3.1-instruct.rst @@ -7,7 +7,7 @@ llama-3.1-instruct - **Context Length:** 131072 - **Model Name:** llama-3.1-instruct - **Languages:** en, de, fr, it, pt, hi, es, th -- **Abilities:** chat +- **Abilities:** chat, tools - **Description:** The Llama 3.1 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks.. 
Specifications diff --git a/doc/source/models/builtin/llm/llama-3.2-vision-instruct.rst b/doc/source/models/builtin/llm/llama-3.2-vision-instruct.rst index f1a988d465..6994bcc8cf 100644 --- a/doc/source/models/builtin/llm/llama-3.2-vision-instruct.rst +++ b/doc/source/models/builtin/llm/llama-3.2-vision-instruct.rst @@ -8,11 +8,12 @@ llama-3.2-vision-instruct - **Model Name:** llama-3.2-vision-instruct - **Languages:** en, de, fr, it, pt, hi, es, th - **Abilities:** chat, vision -- **Description:** The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks... +- **Description:** Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image... Specifications ^^^^^^^^^^^^^^ + Model Spec 1 (pytorch, 11 Billion) ++++++++++++++++++++++++++++++++++++++++ @@ -26,8 +27,8 @@ Model Spec 1 (pytorch, 11 Billion) Execute the following command to launch the model, remember to replace ``${quantization}`` with your chosen quantization method from the options listed above:: - xinference launch --model-engine transformers --model-name llama-3.2-vision-instruct --size-in-billions 11 --model-format pytorch --quantization ${quantization} - xinference launch --model-engine vllm --enforce_eager --max_num_seqs 16 --model-name llama-3.2-vision-instruct --size-in-billions 11 --model-format pytorch + xinference launch --model-engine ${engine} --model-name llama-3.2-vision-instruct --size-in-billions 11 --model-format pytorch --quantization ${quantization} + Model Spec 2 (pytorch, 90 Billion) ++++++++++++++++++++++++++++++++++++++++ @@ -42,6 +43,5 @@ Model Spec 2 (pytorch, 90 Billion) Execute the following command to launch the model, remember to replace ``${quantization}`` with your chosen quantization method from the options listed above:: - xinference launch --model-engine transformers --model-name llama-3.2-vision-instruct --size-in-billions 90 --model-format pytorch --quantization ${quantization} - xinference launch --model-engine vllm --enforce_eager --max_num_seqs 16 --model-name llama-3.2-vision-instruct --size-in-billions 90 --model-format pytorch + xinference launch --model-engine ${engine} --model-name llama-3.2-vision-instruct --size-in-billions 90 --model-format pytorch --quantization ${quantization} diff --git a/doc/source/models/builtin/llm/llama-3.2-vision.rst b/doc/source/models/builtin/llm/llama-3.2-vision.rst index 1d47adb303..2e131d325b 100644 --- a/doc/source/models/builtin/llm/llama-3.2-vision.rst +++ b/doc/source/models/builtin/llm/llama-3.2-vision.rst @@ -1,18 +1,19 @@ .. _models_llm_llama-3.2-vision: -================ +======================================== llama-3.2-vision -================ +======================================== - **Context Length:** 131072 - **Model Name:** llama-3.2-vision - **Languages:** en, de, fr, it, pt, hi, es, th - **Abilities:** generate, vision -- **Description:** The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks... 
+- **Description:** The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image... Specifications ^^^^^^^^^^^^^^ + Model Spec 1 (pytorch, 11 Billion) ++++++++++++++++++++++++++++++++++++++++ @@ -21,13 +22,13 @@ Model Spec 1 (pytorch, 11 Billion) - **Quantizations:** none - **Engines**: vLLM, Transformers - **Model ID:** meta-llama/Meta-Llama-3.2-11B-Vision -- **Model Hubs**: `Hugging Face `__, `ModelScope `__ +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ Execute the following command to launch the model, remember to replace ``${quantization}`` with your chosen quantization method from the options listed above:: - xinference launch --model-engine transformers --model-name llama-3.2-vision --size-in-billions 11 --model-format pytorch --quantization ${quantization} - xinference launch --model-engine vllm --enforce_eager --max_num_seqs 16 --model-name llama-3.2-vision --size-in-billions 11 --model-format pytorch + xinference launch --model-engine ${engine} --model-name llama-3.2-vision --size-in-billions 11 --model-format pytorch --quantization ${quantization} + Model Spec 2 (pytorch, 90 Billion) ++++++++++++++++++++++++++++++++++++++++ @@ -37,11 +38,10 @@ Model Spec 2 (pytorch, 90 Billion) - **Quantizations:** none - **Engines**: vLLM, Transformers - **Model ID:** meta-llama/Meta-Llama-3.2-90B-Vision -- **Model Hubs**: `Hugging Face `__, `ModelScope `__ +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ Execute the following command to launch the model, remember to replace ``${quantization}`` with your chosen quantization method from the options listed above:: - xinference launch --model-engine transformers --model-name llama-3.2-vision --size-in-billions 90 --model-format pytorch --quantization ${quantization} - xinference launch --model-engine vllm --enforce_eager --max_num_seqs 16 --model-name llama-3.2-vision --size-in-billions 90 --model-format pytorch + xinference launch --model-engine ${engine} --model-name llama-3.2-vision --size-in-billions 90 --model-format pytorch --quantization ${quantization} diff --git a/doc/source/models/builtin/llm/qwen2-vl-instruct.rst b/doc/source/models/builtin/llm/qwen2-vl-instruct.rst index 0872ea0168..c4b4b9f730 100644 --- a/doc/source/models/builtin/llm/qwen2-vl-instruct.rst +++ b/doc/source/models/builtin/llm/qwen2-vl-instruct.rst @@ -20,7 +20,7 @@ Model Spec 1 (pytorch, 2 Billion) - **Model Format:** pytorch - **Model Size (in billions):** 2 - **Quantizations:** none -- **Engines**: Transformers +- **Engines**: vLLM, Transformers - **Model ID:** Qwen/Qwen2-VL-2B-Instruct - **Model Hubs**: `Hugging Face `__, `ModelScope `__ @@ -36,7 +36,7 @@ Model Spec 2 (gptq, 2 Billion) - **Model Format:** gptq - **Model Size (in billions):** 2 - **Quantizations:** Int8 -- **Engines**: Transformers +- **Engines**: vLLM, Transformers - **Model ID:** Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int8 - **Model Hubs**: `Hugging Face `__, `ModelScope `__ @@ -52,7 +52,7 @@ Model Spec 3 (gptq, 2 Billion) - **Model Format:** gptq - **Model Size (in billions):** 2 - **Quantizations:** Int4 -- **Engines**: Transformers +- **Engines**: vLLM, Transformers - **Model ID:** Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4 - **Model Hubs**: `Hugging Face `__, `ModelScope `__ @@ -68,7 +68,7 @@ Model Spec 4 (awq, 2 Billion) - **Model Format:** awq - **Model Size (in billions):** 2 - **Quantizations:** Int4 -- **Engines**: Transformers +- **Engines**: vLLM, Transformers - **Model ID:** 
Qwen/Qwen2-VL-2B-Instruct-AWQ - **Model Hubs**: `Hugging Face `__, `ModelScope `__ @@ -84,7 +84,7 @@ Model Spec 5 (pytorch, 7 Billion) - **Model Format:** pytorch - **Model Size (in billions):** 7 - **Quantizations:** none -- **Engines**: Transformers +- **Engines**: vLLM, Transformers - **Model ID:** Qwen/Qwen2-VL-7B-Instruct - **Model Hubs**: `Hugging Face `__, `ModelScope `__ @@ -100,7 +100,7 @@ Model Spec 6 (gptq, 7 Billion) - **Model Format:** gptq - **Model Size (in billions):** 7 - **Quantizations:** Int8 -- **Engines**: Transformers +- **Engines**: vLLM, Transformers - **Model ID:** Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8 - **Model Hubs**: `Hugging Face `__, `ModelScope `__ @@ -116,7 +116,7 @@ Model Spec 7 (gptq, 7 Billion) - **Model Format:** gptq - **Model Size (in billions):** 7 - **Quantizations:** Int4 -- **Engines**: Transformers +- **Engines**: vLLM, Transformers - **Model ID:** Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4 - **Model Hubs**: `Hugging Face `__, `ModelScope `__ @@ -132,7 +132,7 @@ Model Spec 8 (awq, 7 Billion) - **Model Format:** awq - **Model Size (in billions):** 7 - **Quantizations:** Int4 -- **Engines**: Transformers +- **Engines**: vLLM, Transformers - **Model ID:** Qwen/Qwen2-VL-7B-Instruct-AWQ - **Model Hubs**: `Hugging Face `__, `ModelScope `__ @@ -148,7 +148,7 @@ Model Spec 9 (pytorch, 72 Billion) - **Model Format:** pytorch - **Model Size (in billions):** 72 - **Quantizations:** none -- **Engines**: Transformers +- **Engines**: vLLM, Transformers - **Model ID:** Qwen/Qwen2-VL-72B-Instruct - **Model Hubs**: `Hugging Face `__, `ModelScope `__ @@ -164,7 +164,7 @@ Model Spec 10 (awq, 72 Billion) - **Model Format:** awq - **Model Size (in billions):** 72 - **Quantizations:** Int4 -- **Engines**: Transformers +- **Engines**: vLLM, Transformers - **Model ID:** Qwen/Qwen2-VL-72B-Instruct-AWQ - **Model Hubs**: `Hugging Face `__, `ModelScope `__ @@ -180,7 +180,7 @@ Model Spec 11 (gptq, 72 Billion) - **Model Format:** gptq - **Model Size (in billions):** 72 - **Quantizations:** Int4, Int8 -- **Engines**: Transformers +- **Engines**: vLLM, Transformers - **Model ID:** Qwen/Qwen2-VL-72B-Instruct-GPTQ-{quantization} - **Model Hubs**: `Hugging Face `__, `ModelScope `__ diff --git a/doc/source/models/builtin/llm/qwen2.5-coder-instruct.rst b/doc/source/models/builtin/llm/qwen2.5-coder-instruct.rst index 74614b4f0b..cdcb47b33c 100644 --- a/doc/source/models/builtin/llm/qwen2.5-coder-instruct.rst +++ b/doc/source/models/builtin/llm/qwen2.5-coder-instruct.rst @@ -14,7 +14,23 @@ Specifications ^^^^^^^^^^^^^^ -Model Spec 1 (pytorch, 1_5 Billion) +Model Spec 1 (pytorch, 0_5 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** pytorch +- **Model Size (in billions):** 0_5 +- **Quantizations:** 4-bit, 8-bit, none +- **Engines**: vLLM, Transformers, SGLang (vLLM and SGLang only available for quantization none) +- **Model ID:** Qwen/Qwen2.5-Coder-0.5B-Instruct +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 0_5 --model-format pytorch --quantization ${quantization} + + +Model Spec 2 (pytorch, 1_5 Billion) ++++++++++++++++++++++++++++++++++++++++ - **Model Format:** pytorch @@ -30,7 +46,23 @@ chosen quantization method from the options listed above:: xinference launch --model-engine ${engine} --model-name 
qwen2.5-coder-instruct --size-in-billions 1_5 --model-format pytorch --quantization ${quantization} -Model Spec 2 (pytorch, 7 Billion) +Model Spec 3 (pytorch, 3 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** pytorch +- **Model Size (in billions):** 3 +- **Quantizations:** 4-bit, 8-bit, none +- **Engines**: vLLM, Transformers, SGLang (vLLM and SGLang only available for quantization none) +- **Model ID:** Qwen/Qwen2.5-Coder-3B-Instruct +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 3 --model-format pytorch --quantization ${quantization} + + +Model Spec 4 (pytorch, 7 Billion) ++++++++++++++++++++++++++++++++++++++++ - **Model Format:** pytorch @@ -46,7 +78,231 @@ chosen quantization method from the options listed above:: xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 7 --model-format pytorch --quantization ${quantization} -Model Spec 3 (ggufv2, 1_5 Billion) +Model Spec 5 (pytorch, 14 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** pytorch +- **Model Size (in billions):** 14 +- **Quantizations:** 4-bit, 8-bit, none +- **Engines**: vLLM, Transformers, SGLang (vLLM and SGLang only available for quantization none) +- **Model ID:** Qwen/Qwen2.5-Coder-14B-Instruct +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 14 --model-format pytorch --quantization ${quantization} + + +Model Spec 6 (pytorch, 32 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** pytorch +- **Model Size (in billions):** 32 +- **Quantizations:** 4-bit, 8-bit, none +- **Engines**: vLLM, Transformers, SGLang (vLLM and SGLang only available for quantization none) +- **Model ID:** Qwen/Qwen2.5-Coder-32B-Instruct +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 32 --model-format pytorch --quantization ${quantization} + + +Model Spec 7 (gptq, 0_5 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** gptq +- **Model Size (in billions):** 0_5 +- **Quantizations:** Int4, Int8 +- **Engines**: vLLM, Transformers, SGLang +- **Model ID:** Qwen/Qwen2.5-Coder-0.5B-Instruct-GPTQ-{quantization} +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 0_5 --model-format gptq --quantization ${quantization} + + +Model Spec 8 (gptq, 1_5 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** gptq +- **Model Size (in billions):** 1_5 +- **Quantizations:** Int4, Int8 +- **Engines**: vLLM, Transformers, SGLang +- **Model ID:** 
Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-{quantization} +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 1_5 --model-format gptq --quantization ${quantization} + + +Model Spec 9 (gptq, 3 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** gptq +- **Model Size (in billions):** 3 +- **Quantizations:** Int4, Int8 +- **Engines**: vLLM, Transformers, SGLang +- **Model ID:** Qwen/Qwen2.5-Coder-3B-Instruct-GPTQ-{quantization} +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 3 --model-format gptq --quantization ${quantization} + + +Model Spec 10 (gptq, 7 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** gptq +- **Model Size (in billions):** 7 +- **Quantizations:** Int4, Int8 +- **Engines**: vLLM, Transformers, SGLang +- **Model ID:** Qwen/Qwen2.5-Coder-7B-Instruct-GPTQ-{quantization} +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 7 --model-format gptq --quantization ${quantization} + + +Model Spec 11 (gptq, 14 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** gptq +- **Model Size (in billions):** 14 +- **Quantizations:** Int4, Int8 +- **Engines**: vLLM, Transformers, SGLang +- **Model ID:** Qwen/Qwen2.5-Coder-14B-Instruct-GPTQ-{quantization} +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 14 --model-format gptq --quantization ${quantization} + + +Model Spec 12 (gptq, 32 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** gptq +- **Model Size (in billions):** 32 +- **Quantizations:** Int4, Int8 +- **Engines**: vLLM, Transformers, SGLang +- **Model ID:** Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-{quantization} +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 32 --model-format gptq --quantization ${quantization} + + +Model Spec 13 (awq, 0_5 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** awq +- **Model Size (in billions):** 0_5 +- **Quantizations:** Int4 +- **Engines**: vLLM, Transformers, SGLang +- **Model ID:** Qwen/Qwen2.5-Coder-0.5B-Instruct-AWQ +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed 
above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 0_5 --model-format awq --quantization ${quantization} + + +Model Spec 14 (awq, 1_5 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** awq +- **Model Size (in billions):** 1_5 +- **Quantizations:** Int4 +- **Engines**: vLLM, Transformers, SGLang +- **Model ID:** Qwen/Qwen2.5-Coder-1.5B-Instruct-AWQ +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 1_5 --model-format awq --quantization ${quantization} + + +Model Spec 15 (awq, 3 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** awq +- **Model Size (in billions):** 3 +- **Quantizations:** Int4 +- **Engines**: vLLM, Transformers, SGLang +- **Model ID:** Qwen/Qwen2.5-Coder-3B-Instruct-AWQ +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 3 --model-format awq --quantization ${quantization} + + +Model Spec 16 (awq, 7 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** awq +- **Model Size (in billions):** 7 +- **Quantizations:** Int4 +- **Engines**: vLLM, Transformers, SGLang +- **Model ID:** Qwen/Qwen2.5-Coder-7B-Instruct-AWQ +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 7 --model-format awq --quantization ${quantization} + + +Model Spec 17 (awq, 14 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** awq +- **Model Size (in billions):** 14 +- **Quantizations:** Int4 +- **Engines**: vLLM, Transformers, SGLang +- **Model ID:** Qwen/Qwen2.5-Coder-14B-Instruct-AWQ +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 14 --model-format awq --quantization ${quantization} + + +Model Spec 18 (awq, 32 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** awq +- **Model Size (in billions):** 32 +- **Quantizations:** Int4 +- **Engines**: vLLM, Transformers, SGLang +- **Model ID:** Qwen/Qwen2.5-Coder-32B-Instruct-AWQ +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 32 --model-format awq --quantization ${quantization} + + +Model Spec 19 (ggufv2, 1_5 Billion) ++++++++++++++++++++++++++++++++++++++++ - **Model Format:** ggufv2 @@ -62,7 +318,7 @@ chosen quantization method from the options listed above:: xinference launch 
--model-engine ${engine} --model-name qwen2.5-coder-instruct --size-in-billions 1_5 --model-format ggufv2 --quantization ${quantization} -Model Spec 4 (ggufv2, 7 Billion) +Model Spec 20 (ggufv2, 7 Billion) ++++++++++++++++++++++++++++++++++++++++ - **Model Format:** ggufv2 diff --git a/doc/source/models/builtin/llm/qwen2.5-coder.rst b/doc/source/models/builtin/llm/qwen2.5-coder.rst index 8ae4709930..a0c05cf500 100644 --- a/doc/source/models/builtin/llm/qwen2.5-coder.rst +++ b/doc/source/models/builtin/llm/qwen2.5-coder.rst @@ -14,7 +14,23 @@ Specifications ^^^^^^^^^^^^^^ -Model Spec 1 (pytorch, 1_5 Billion) +Model Spec 1 (pytorch, 0_5 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** pytorch +- **Model Size (in billions):** 0_5 +- **Quantizations:** 4-bit, 8-bit, none +- **Engines**: vLLM, Transformers, SGLang (vLLM and SGLang only available for quantization none) +- **Model ID:** Qwen/Qwen2.5-Coder-0.5B +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder --size-in-billions 0_5 --model-format pytorch --quantization ${quantization} + + +Model Spec 2 (pytorch, 1_5 Billion) ++++++++++++++++++++++++++++++++++++++++ - **Model Format:** pytorch @@ -30,7 +46,23 @@ chosen quantization method from the options listed above:: xinference launch --model-engine ${engine} --model-name qwen2.5-coder --size-in-billions 1_5 --model-format pytorch --quantization ${quantization} -Model Spec 2 (pytorch, 7 Billion) +Model Spec 3 (pytorch, 3 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** pytorch +- **Model Size (in billions):** 3 +- **Quantizations:** 4-bit, 8-bit, none +- **Engines**: vLLM, Transformers, SGLang (vLLM and SGLang only available for quantization none) +- **Model ID:** Qwen/Qwen2.5-Coder-3B +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder --size-in-billions 3 --model-format pytorch --quantization ${quantization} + + +Model Spec 4 (pytorch, 7 Billion) ++++++++++++++++++++++++++++++++++++++++ - **Model Format:** pytorch @@ -45,3 +77,35 @@ chosen quantization method from the options listed above:: xinference launch --model-engine ${engine} --model-name qwen2.5-coder --size-in-billions 7 --model-format pytorch --quantization ${quantization} + +Model Spec 5 (pytorch, 14 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** pytorch +- **Model Size (in billions):** 14 +- **Quantizations:** 4-bit, 8-bit, none +- **Engines**: vLLM, Transformers, SGLang (vLLM and SGLang only available for quantization none) +- **Model ID:** Qwen/Qwen2.5-Coder-14B +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder --size-in-billions 14 --model-format pytorch --quantization ${quantization} + + +Model Spec 6 (pytorch, 32 Billion) +++++++++++++++++++++++++++++++++++++++++ + +- **Model Format:** pytorch +- **Model Size (in billions):** 32 +- 
**Quantizations:** 4-bit, 8-bit, none +- **Engines**: vLLM, Transformers, SGLang (vLLM and SGLang only available for quantization none) +- **Model ID:** Qwen/Qwen2.5-Coder-32B +- **Model Hubs**: `Hugging Face `__, `ModelScope `__ + +Execute the following command to launch the model, remember to replace ``${quantization}`` with your +chosen quantization method from the options listed above:: + + xinference launch --model-engine ${engine} --model-name qwen2.5-coder --size-in-billions 32 --model-format pytorch --quantization ${quantization} + diff --git a/doc/source/user_guide/backends.rst b/doc/source/user_guide/backends.rst index d215c7c63b..b8a669e0fb 100644 --- a/doc/source/user_guide/backends.rst +++ b/doc/source/user_guide/backends.rst @@ -46,7 +46,7 @@ Currently, supported model includes: .. vllm_start -- ``llama-2``, ``llama-3``, ``llama-3.1``, ``llama-2-chat``, ``llama-3-instruct``, ``llama-3.1-instruct`` +- ``llama-2``, ``llama-3``, ``llama-3.1``, ``llama-3.2-vision``, ``llama-2-chat``, ``llama-3-instruct``, ``llama-3.1-instruct`` - ``mistral-v0.1``, ``mistral-instruct-v0.1``, ``mistral-instruct-v0.2``, ``mistral-instruct-v0.3``, ``mistral-nemo-instruct``, ``mistral-large-instruct`` - ``codestral-v0.1`` - ``Yi``, ``Yi-1.5``, ``Yi-chat``, ``Yi-1.5-chat``, ``Yi-1.5-chat-16k``
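
The launch commands in the spec pages above leave ``${engine}`` and ``${quantization}`` as placeholders. A minimal usage sketch of the substitution, assuming a local Xinference server is already running and that the CLI accepts the listed engine names in lowercase (``vllm``, ``transformers``, ``sglang``)::

    # qwen2.5-coder-instruct 7B (pytorch) on vLLM; per the spec tables, vLLM and SGLang are only available for quantization "none"
    xinference launch --model-engine vllm --model-name qwen2.5-coder-instruct --size-in-billions 7 --model-format pytorch --quantization none

    # the same model on Transformers with 4-bit quantization
    xinference launch --model-engine transformers --model-name qwen2.5-coder-instruct --size-in-billions 7 --model-format pytorch --quantization 4-bit

    # the new whisper MLX audio models are launched with --model-type audio, as shown in their pages above
    xinference launch --model-name whisper-base-mlx --model-type audio

After launch, the model is served under the returned model UID on the running endpoint, and the same command pattern applies to the other sizes and formats (gptq, awq, ggufv2) documented above.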