DOC: Update continuous batching and docker usage #1785

Merged · 2 commits · Jul 4, 2024
2 changes: 1 addition & 1 deletion doc/source/getting_started/using_docker_image.rst
@@ -11,7 +11,7 @@ Prerequisites
=============
* The image can only run in an environment with GPUs and CUDA installed, because Xinference in the image relies on Nvidia GPUs for acceleration.
* CUDA must be successfully installed on the host machine. This can be determined by whether you can successfully execute the ``nvidia-smi`` command.
* The CUDA version in the docker image is ``12.1``, and the CUDA version on the host machine should ideally be consistent with it. Be sure to keep the CUDA version on your host machine between ``11.8`` and ``12.2``, even if it is inconsistent.
* The CUDA version in the docker image is ``12.4``, and the CUDA version on the host machine should be ``12.4`` or above, and the NVIDIA driver version should be ``550`` or above.


Docker Image
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: Xinference \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2024-04-30 14:54+0800\n"
"POT-Creation-Date: 2024-07-04 15:14+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -17,15 +17,15 @@ msgstr ""
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.14.0\n"
"Generated-By: Babel 2.11.0\n"

#: ../../source/getting_started/using_docker_image.rst:5
msgid "Xinference Docker Image"
msgstr "Docker 镜像"

#: ../../source/getting_started/using_docker_image.rst:7
msgid "Xinference provides official images for use on Dockerhub."
msgstr "Xinference 在 Dockerhub 中上传了官方镜像。"
msgstr "Xinference 在 Dockerhub 和 阿里云容器镜像服务 中上传了官方镜像。"

#: ../../source/getting_started/using_docker_image.rst:11
msgid "Prerequisites"
@@ -48,13 +48,12 @@ msgstr "保证 CUDA 在机器上正确安装。可以使用 ``nvidia-smi`` 检

#: ../../source/getting_started/using_docker_image.rst:14
msgid ""
"The CUDA version in the docker image is ``12.1``, and the CUDA version on"
" the host machine should ideally be consistent with it. Be sure to keep "
"the CUDA version on your host machine between ``11.8`` and ``12.2``, even"
" if it is inconsistent."
"The CUDA version in the docker image is ``12.4``, and the CUDA version on"
" the host machine should be ``12.4`` or above, and the NVIDIA driver "
"version should be ``550`` or above."
msgstr ""
"镜像中的 CUDA 版本是 ``12.1`` ,推荐机器上的版本与之保持一致。如果不一致"
",需要保证CUDA 版本在 ``11.8`` ``12.2`` 之间。"
"镜像中的 CUDA 版本为 ``12.4`` 。为了不出现预期之外的问题,请将宿主机的 CUDA 版本和 NVIDIA Driver 版本分别"
"升级到 ``12.4`` ``550`` 以上。"

#: ../../source/getting_started/using_docker_image.rst:18
msgid "Docker Image"
@@ -65,7 +64,10 @@ msgid ""
"The official image of Xinference is available on DockerHub in the "
"repository ``xprobe/xinference``. Available tags include:"
msgstr ""
"Xinference 的官方镜像在 Dockerhub 的 ``xprobe/xinference`` 仓库里。目前可用版本包括:"
"当前,可以通过两个渠道拉取 Xinference 的官方镜像。"
"1. 在 Dockerhub 的 ``xprobe/xinference`` 仓库里。"
"2. Dockerhub 中的镜像会同步上传一份到阿里云公共镜像仓库中,供访问 Dockerhub 有困难的用户拉取。"
"拉取命令:``docker pull registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:<tag>`` 。目前可用的标签包括:"

#: ../../source/getting_started/using_docker_image.rst:22
msgid ""
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: Xinference \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2024-06-07 14:38+0800\n"
"POT-Creation-Date: 2024-07-04 16:08+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <[email protected]>\n"
@@ -28,8 +28,8 @@ msgid ""
" Xinference aims to provide this optimization capability when using the "
"transformers engine as well."
msgstr ""
"连续批处理是诸如 ``VLLM`` 这样的推理引擎中提升吞吐的重要技术。Xinference 旨在"
"通过这项技术提升 ``transformers`` 推理引擎的吞吐。"
"连续批处理是诸如 ``VLLM`` 这样的推理引擎中提升吞吐的重要技术。Xinference "
"旨在通过这项技术提升 ``transformers`` 推理引擎的吞吐。"

#: ../../source/user_guide/continuous_batching.rst:11
msgid "Usage"
@@ -45,49 +45,98 @@ msgid ""
"``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` to ``1`` when starting "
"xinference. For example:"
msgstr ""
"首先,启动 Xinference 时需要将环境变量 ``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` 置为 ``1`` 。"
"首先,启动 Xinference 时需要将环境变量 ``XINFERENCE_TRANSFORMERS_ENABLE_"
"BATCHING`` 置为 ``1`` 。"

#: ../../source/user_guide/continuous_batching.rst:21
msgid ""
"Then, ensure that the ``transformers`` engine is selected when launching "
"the model. For example:"
msgstr ""
"然后,启动 LLM 模型时选择 ``transformers`` 推理引擎。例如:"
msgstr "然后,启动 LLM 模型时选择 ``transformers`` 推理引擎。例如:"

#: ../../source/user_guide/continuous_batching.rst:57
msgid ""
"Once this feature is enabled, all ``chat`` requests will be managed by "
"Once this feature is enabled, all requests for LLMs will be managed by "
"continuous batching, and the average throughput of requests made to a "
"single model will increase. The usage of the ``chat`` interface remains "
"single model will increase. The usage of the LLM interface remains "
"exactly the same as before, with no differences."
msgstr ""
"一旦此功能开启,``chat`` 接口将被此功能接管,别的接口不受影响。``chat`` 接口的使用方式没有任何变化。"
"一旦此功能开启,LLM 模型的所有接口将被此功能接管。所有接口的使用方式没有"
"任何变化。"

#: ../../source/user_guide/continuous_batching.rst:63
msgid "Abort your request"
msgstr "中止请求"

#: ../../source/user_guide/continuous_batching.rst:64
msgid "In this mode, you can abort requests that are in the process of inference."
msgstr ""
"此功能中,你可以优雅地中止正在推理中的请求。"

#: ../../source/user_guide/continuous_batching.rst:66
msgid "First, add ``request_id`` option in ``generate_config``. For example:"
msgstr ""
"首先,在推理请求的 ``generate_config`` 中指定 ``request_id`` 选项。例如:"

#: ../../source/user_guide/continuous_batching.rst:75
msgid ""
"Then, abort the request using the ``request_id`` you have set. For "
"example:"
msgstr ""
"接着,带着你指定的 ``request_id`` 去中止该请求。例如:"

#: ../../source/user_guide/continuous_batching.rst:62
#: ../../source/user_guide/continuous_batching.rst:83
msgid ""
"Note that if your request has already finished, aborting the request will"
" be a no-op."
msgstr ""
"注意,如果你的请求已经结束,那么此操作将什么都不做。"

#: ../../source/user_guide/continuous_batching.rst:86
msgid "Note"
msgstr "注意事项"

#: ../../source/user_guide/continuous_batching.rst:64
#: ../../source/user_guide/continuous_batching.rst:88
msgid ""
"Currently, this feature only supports the ``chat`` interface for ``LLM`` "
"models."
msgstr "当前,此功能仅支持 LLM 模型的 ``chat`` 功能。"
"Currently, this feature only supports the ``generate``, ``chat`` and "
"``vision`` tasks for ``LLM`` models. The ``tool call`` tasks are not "
"supported."
msgstr ""
"当前,此功能仅支持 LLM 模型的 ``generate``, ``chat`` 和 ``vision`` (多"
"模态) 功能。``tool call`` (工具调用)暂时不支持。"

#: ../../source/user_guide/continuous_batching.rst:66
#: ../../source/user_guide/continuous_batching.rst:90
msgid ""
"For ``vision`` tasks, currently only ``qwen-vl-chat``, ``cogvlm2``, and "
"``glm-4v`` models are supported. More models will be supported in the "
"future. Please let us know your requirements."
msgstr ""
"对于多模态任务,当前支持 ``qwen-vl-chat`` ,``cogvlm2`` 和 ``glm-4v`` "
"模型。未来将加入更多模型,敬请期待。"

#: ../../source/user_guide/continuous_batching.rst:92
msgid ""
"If using GPU inference, this method will consume more GPU memory. Please "
"be cautious when increasing the number of concurrent requests to the same"
" model. The ``launch_model`` interface provides the ``max_num_seqs`` "
"parameter to adjust the concurrency level, with a default value of "
"``16``."
msgstr ""
"如果使用 GPU 推理,此功能对显存要求较高。因此请谨慎提高对同一个模型的并发请求量。"
"``launch_model`` 接口提供可选参数 ``max_num_seqs`` 用于调整并发度,默认值为 ``16`` 。"
"如果使用 GPU 推理,此功能对显存要求较高。因此请谨慎提高对同一个模型的并发"
"请求量。``launch_model`` 接口提供可选参数 ``max_num_seqs`` 用于调整并发度"
",默认值为 ``16`` 。"

#: ../../source/user_guide/continuous_batching.rst:69
#: ../../source/user_guide/continuous_batching.rst:95
msgid ""
"This feature is still in the experimental stage, and we welcome your "
"active feedback on any issues."
msgstr "此功能仍处于实验阶段,欢迎反馈任何问题。"

#: ../../source/user_guide/continuous_batching.rst:97
msgid ""
"After a period of testing, this method will remain enabled by default, "
"and the original inference method will be deprecated."
msgstr ""
"此功能仍处于实验阶段,欢迎反馈任何问题。"
"一段时间的测试之后,此功能将代替原来的 transformers 推理逻辑成为默认行为"
"。原来的推理逻辑将被摒弃。"

34 changes: 31 additions & 3 deletions doc/source/user_guide/continuous_batching.rst
@@ -54,16 +54,44 @@ Currently, this feature can be enabled under the following conditions:
print('Model uid: ' + model_uid)


Once this feature is enabled, all ``chat`` requests will be managed by continuous batching,
Once this feature is enabled, all requests for LLMs will be managed by continuous batching,
and the average throughput of requests made to a single model will increase.
The usage of the ``chat`` interface remains exactly the same as before, with no differences.
The usage of the LLM interface remains exactly the same as before, with no differences.
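
A minimal sketch of using this feature under concurrency, assuming Xinference was started with ``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING=1`` and the model was launched with the ``transformers`` engine; the endpoint, model UID, and prompts are placeholders:

.. code-block:: python

   from concurrent.futures import ThreadPoolExecutor

   from xinference.client import Client

   client = Client("http://127.0.0.1:9997")
   model = client.get_model("<model_uid>")

   prompts = ["<prompt_1>", "<prompt_2>", "<prompt_3>", "<prompt_4>"]

   # With continuous batching enabled, concurrent requests to the same model
   # are batched together on the server instead of being handled one by one,
   # which is where the throughput improvement comes from.
   with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
       results = list(pool.map(model.chat, prompts))

   for result in results:
       print(result)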


Abort your request
==================
In this mode, you can abort requests that are in the process of inference.

#. First, add ``request_id`` option in ``generate_config``. For example:

.. code-block:: python

from xinference.client import Client
client = Client("http://127.0.0.1:9997")
model = client.get_model("<model_uid>")
model.chat("<prompt>", generate_config={"request_id": "<your_unique_request_id>"})

#. Then, abort the request using the ``request_id`` you have set. For example:

.. code-block:: python

from xinference.client import Client
client = Client("http://127.0.0.1:9997")
client.abort_request("<model_uid>", "<your_unique_request_id>")

Note that if your request has already finished, aborting the request will be a no-op.
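
Putting the two steps together, here is a minimal sketch of aborting an in-flight request from another thread; the model UID, prompt, request ID, and sleep interval are placeholders:

.. code-block:: python

   import threading
   import time

   from xinference.client import Client

   client = Client("http://127.0.0.1:9997")
   model = client.get_model("<model_uid>")

   request_id = "<your_unique_request_id>"

   # Run the chat request in the background, tagged with the chosen request_id.
   worker = threading.Thread(
       target=model.chat,
       args=("<prompt>",),
       kwargs={"generate_config": {"request_id": request_id}},
   )
   worker.start()

   # Give the request a moment to start inference, then abort it by its request_id.
   time.sleep(1)
   client.abort_request("<model_uid>", request_id)
   worker.join()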

Note
====

* Currently, this feature only supports the ``chat`` interface for ``LLM`` models.
* Currently, this feature only supports the ``generate``, ``chat`` and ``vision`` tasks for ``LLM`` models. The ``tool call`` tasks are not supported.

* For ``vision`` tasks, currently only ``qwen-vl-chat``, ``cogvlm2``, and ``glm-4v`` models are supported. More models will be supported in the future. Please let us know your requirements.

* If using GPU inference, this method will consume more GPU memory. Please be cautious when increasing the number of concurrent requests to the same model.
The ``launch_model`` interface provides the ``max_num_seqs`` parameter to adjust the concurrency level, with a default value of ``16``.

* This feature is still in the experimental stage, and we welcome your active feedback on any issues.

* After a period of testing, this method will remain enabled by default, and the original inference method will be deprecated.
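
As a sketch of tuning the concurrency level mentioned above, ``max_num_seqs`` can be passed when launching the model through the client; the model name is a placeholder and other launch parameters are omitted:

.. code-block:: python

   from xinference.client import Client

   client = Client("http://127.0.0.1:9997")

   # Lower max_num_seqs from its default of 16 to reduce peak GPU memory usage,
   # at the cost of batching fewer concurrent requests together.
   model_uid = client.launch_model(
       model_name="<model_name>",
       model_engine="transformers",
       max_num_seqs=8,
   )
   print('Model uid: ' + model_uid)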