diff --git a/doc/source/getting_started/using_docker_image.rst b/doc/source/getting_started/using_docker_image.rst
index 896014bbb8..7f335e3b31 100644
--- a/doc/source/getting_started/using_docker_image.rst
+++ b/doc/source/getting_started/using_docker_image.rst
@@ -11,7 +11,7 @@ Prerequisites
 =============
 * The image can only run in an environment with GPUs and CUDA installed, because Xinference in the image relies on Nvidia GPUs for acceleration.
 * CUDA must be successfully installed on the host machine. This can be determined by whether you can successfully execute the ``nvidia-smi`` command.
-* The CUDA version in the docker image is ``12.1``, and the CUDA version on the host machine should ideally be consistent with it. Be sure to keep the CUDA version on your host machine between ``11.8`` and ``12.2``, even if it is inconsistent.
+* The CUDA version in the docker image is ``12.4``. The CUDA version on the host machine should be ``12.4`` or above, and the NVIDIA driver version should be ``550`` or above.
 
 
 Docker Image
diff --git a/doc/source/locale/zh_CN/LC_MESSAGES/getting_started/using_docker_image.po b/doc/source/locale/zh_CN/LC_MESSAGES/getting_started/using_docker_image.po
index f1b381d763..ae41dc7823 100644
--- a/doc/source/locale/zh_CN/LC_MESSAGES/getting_started/using_docker_image.po
+++ b/doc/source/locale/zh_CN/LC_MESSAGES/getting_started/using_docker_image.po
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: Xinference \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2024-04-30 14:54+0800\n"
+"POT-Creation-Date: 2024-07-04 15:14+0800\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language: zh_CN\n"
@@ -17,7 +17,7 @@ msgstr ""
 "MIME-Version: 1.0\n"
 "Content-Type: text/plain; charset=utf-8\n"
 "Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.14.0\n"
+"Generated-By: Babel 2.11.0\n"
 
 #: ../../source/getting_started/using_docker_image.rst:5
 msgid "Xinference Docker Image"
@@ -25,7 +25,7 @@ msgstr "Docker 镜像"
 
 #: ../../source/getting_started/using_docker_image.rst:7
 msgid "Xinference provides official images for use on Dockerhub."
-msgstr "Xinference 在 Dockerhub 中上传了官方镜像。"
+msgstr "Xinference 在 Dockerhub 和 阿里云容器镜像服务 中上传了官方镜像。"
 
 #: ../../source/getting_started/using_docker_image.rst:11
 msgid "Prerequisites"
@@ -48,13 +48,12 @@ msgstr "保证 CUDA 在机器上正确安装。可以使用 ``nvidia-smi`` 检
 
 #: ../../source/getting_started/using_docker_image.rst:14
 msgid ""
-"The CUDA version in the docker image is ``12.1``, and the CUDA version on"
-" the host machine should ideally be consistent with it. Be sure to keep "
-"the CUDA version on your host machine between ``11.8`` and ``12.2``, even"
-" if it is inconsistent."
+"The CUDA version in the docker image is ``12.4``. The CUDA version on the"
+" host machine should be ``12.4`` or above, and the NVIDIA driver version"
+" should be ``550`` or above."
 msgstr ""
-"镜像中的 CUDA 版本是 ``12.1`` ,推荐机器上的版本与之保持一致。如果不一致"
-",需要保证CUDA 版本在 ``11.8`` 与 ``12.2`` 之间。"
+"镜像中的 CUDA 版本为 ``12.4`` 。为了不出现预期之外的问题,请将宿主机的 CUDA 版本和 NVIDIA Driver 版本分别"
+"升级到 ``12.4`` 和 ``550`` 以上。"
 
 #: ../../source/getting_started/using_docker_image.rst:18
 msgid "Docker Image"
@@ -65,7 +64,10 @@ msgid ""
 "The official image of Xinference is available on DockerHub in the "
 "repository ``xprobe/xinference``. Available tags include:"
 msgstr ""
-"Xinference 的官方镜像在 Dockerhub 的 ``xprobe/xinference`` 仓库里。目前可用版本包括:"
+"当前,可以通过两个渠道拉取 Xinference 的官方镜像。"
+"1. 在 Dockerhub 的 ``xprobe/xinference`` 仓库里。"
+"2. Dockerhub 中的镜像会同步上传一份到阿里云公共镜像仓库中,供访问 Dockerhub 有困难的用户拉取。"
+"拉取命令:``docker pull registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:<tag>`` 。目前可用的标签包括:"
 
 #: ../../source/getting_started/using_docker_image.rst:22
 msgid ""
diff --git a/doc/source/locale/zh_CN/LC_MESSAGES/user_guide/continuous_batching.po b/doc/source/locale/zh_CN/LC_MESSAGES/user_guide/continuous_batching.po
index b192ebc6df..427e855a09 100644
--- a/doc/source/locale/zh_CN/LC_MESSAGES/user_guide/continuous_batching.po
+++ b/doc/source/locale/zh_CN/LC_MESSAGES/user_guide/continuous_batching.po
@@ -8,7 +8,7 @@ msgid ""
 msgstr ""
 "Project-Id-Version: Xinference \n"
 "Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2024-06-07 14:38+0800\n"
+"POT-Creation-Date: 2024-07-04 16:08+0800\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
 "Language-Team: LANGUAGE <LL@li.org>\n"
@@ -28,8 +28,8 @@ msgid ""
 " Xinference aims to provide this optimization capability when using the "
 "transformers engine as well."
 msgstr ""
-"连续批处理是诸如 ``VLLM`` 这样的推理引擎中提升吞吐的重要技术。Xinference 旨在"
-"通过这项技术提升 ``transformers`` 推理引擎的吞吐。"
+"连续批处理是诸如 ``VLLM`` 这样的推理引擎中提升吞吐的重要技术。Xinference "
+"旨在通过这项技术提升 ``transformers`` 推理引擎的吞吐。"
 
 #: ../../source/user_guide/continuous_batching.rst:11
 msgid "Usage"
@@ -45,35 +45,76 @@ msgid ""
 "``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` to ``1`` when starting "
 "xinference. For example:"
 msgstr ""
-"首先,启动 Xinference 时需要将环境变量 ``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` 置为 ``1`` 。"
+"首先,启动 Xinference 时需要将环境变量 ``XINFERENCE_TRANSFORMERS_ENABLE_"
+"BATCHING`` 置为 ``1`` 。"
 
 #: ../../source/user_guide/continuous_batching.rst:21
 msgid ""
 "Then, ensure that the ``transformers`` engine is selected when launching "
 "the model. For example:"
-msgstr ""
-"然后,启动 LLM 模型时选择 ``transformers`` 推理引擎。例如:"
+msgstr "然后,启动 LLM 模型时选择 ``transformers`` 推理引擎。例如:"
 
 #: ../../source/user_guide/continuous_batching.rst:57
 msgid ""
-"Once this feature is enabled, all ``chat`` requests will be managed by "
+"Once this feature is enabled, all requests for LLMs will be managed by "
 "continuous batching, and the average throughput of requests made to a "
-"single model will increase. The usage of the ``chat`` interface remains "
+"single model will increase. The usage of the LLM interface remains "
 "exactly the same as before, with no differences."
 msgstr ""
-"一旦此功能开启,``chat`` 接口将被此功能接管,别的接口不受影响。``chat`` 接口的使用方式没有任何变化。"
+"一旦此功能开启,LLM 模型的所有接口将被此功能接管。所有接口的使用方式没有"
+"任何变化。"
+
+#: ../../source/user_guide/continuous_batching.rst:63
+msgid "Abort your request"
+msgstr "中止请求"
+
+#: ../../source/user_guide/continuous_batching.rst:64
+msgid "In this mode, you can abort requests that are in the process of inference."
+msgstr ""
+"此功能中,你可以优雅地中止正在推理中的请求。"
+
+#: ../../source/user_guide/continuous_batching.rst:66
+msgid "First, add the ``request_id`` option in ``generate_config``. For example:"
+msgstr ""
+"首先,在推理请求的 ``generate_config`` 中指定 ``request_id`` 选项。例如:"
+
+#: ../../source/user_guide/continuous_batching.rst:75
+msgid ""
+"Then, abort the request using the ``request_id`` you have set. For "
+"example:"
+msgstr ""
+"接着,带着你指定的 ``request_id`` 去中止该请求。例如:"
 
-#: ../../source/user_guide/continuous_batching.rst:62
+#: ../../source/user_guide/continuous_batching.rst:83
+msgid ""
+"Note that if your request has already finished, aborting the request will"
+" be a no-op."
+msgstr ""
+"注意,如果你的请求已经结束,那么此操作将什么都不做。"
+
+#: ../../source/user_guide/continuous_batching.rst:86
 msgid "Note"
 msgstr "注意事项"
 
-#: ../../source/user_guide/continuous_batching.rst:64
+#: ../../source/user_guide/continuous_batching.rst:88
 msgid ""
-"Currently, this feature only supports the ``chat`` interface for ``LLM`` "
-"models."
-msgstr "当前,此功能仅支持 LLM 模型的 ``chat`` 功能。"
+"Currently, this feature only supports the ``generate``, ``chat`` and "
+"``vision`` tasks for ``LLM`` models. The ``tool call`` tasks are not "
+"supported."
+msgstr ""
+"当前,此功能仅支持 LLM 模型的 ``generate``, ``chat`` 和 ``vision`` (多"
+"模态) 功能。``tool call`` (工具调用)暂时不支持。"
 
-#: ../../source/user_guide/continuous_batching.rst:66
+#: ../../source/user_guide/continuous_batching.rst:90
+msgid ""
+"For ``vision`` tasks, currently only ``qwen-vl-chat``, ``cogvlm2``, and "
+"``glm-4v`` models are supported. More models will be supported in the "
+"future. Please let us know your requirements."
+msgstr ""
+"对于多模态任务,当前支持 ``qwen-vl-chat`` ,``cogvlm2`` 和 ``glm-4v`` "
+"模型。未来将加入更多模型,敬请期待。"
+
+#: ../../source/user_guide/continuous_batching.rst:92
 msgid ""
 "If using GPU inference, this method will consume more GPU memory. Please "
 "be cautious when increasing the number of concurrent requests to the same"
@@ -81,13 +122,21 @@ msgid ""
 "parameter to adjust the concurrency level, with a default value of "
 "``16``."
 msgstr ""
-"如果使用 GPU 推理,此功能对显存要求较高。因此请谨慎提高对同一个模型的并发请求量。"
-"``launch_model`` 接口提供可选参数 ``max_num_seqs`` 用于调整并发度,默认值为 ``16`` 。"
+"如果使用 GPU 推理,此功能对显存要求较高。因此请谨慎提高对同一个模型的并发"
+"请求量。``launch_model`` 接口提供可选参数 ``max_num_seqs`` 用于调整并发度"
+",默认值为 ``16`` 。"
 
-#: ../../source/user_guide/continuous_batching.rst:69
+#: ../../source/user_guide/continuous_batching.rst:95
 msgid ""
 "This feature is still in the experimental stage, and we welcome your "
 "active feedback on any issues."
+msgstr "此功能仍处于实验阶段,欢迎反馈任何问题。"
+
+#: ../../source/user_guide/continuous_batching.rst:97
+msgid ""
+"After a period of testing, this method will become the default, and the "
+"original inference method will be deprecated."
 msgstr ""
-"此功能仍处于实验阶段,欢迎反馈任何问题。"
+"一段时间的测试之后,此功能将代替原来的 transformers 推理逻辑成为默认行为"
+"。原来的推理逻辑将被摒弃。"
 
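The strings translated above describe the same flow as the English guide: start the server with ``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING=1``, launch the model with the ``transformers`` engine, and tune concurrency with ``max_num_seqs``. As a reference for reviewers, a minimal client-side sketch of that flow is shown below; the model name, size, and format are illustrative placeholders rather than part of this patch, and it assumes the server was started with the environment variable set.

.. code-block:: python

    from xinference.client import Client

    client = Client("http://127.0.0.1:9997")

    # Continuous batching only takes effect for the transformers engine;
    # max_num_seqs caps how many requests are batched concurrently (default 16).
    model_uid = client.launch_model(
        model_name="qwen2-instruct",     # illustrative model name
        model_engine="transformers",
        model_size_in_billions=7,        # illustrative size
        model_format="pytorch",
        max_num_seqs=16,
    )
    print("Model uid: " + model_uid)
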
diff --git a/doc/source/user_guide/continuous_batching.rst b/doc/source/user_guide/continuous_batching.rst
index 7c3a468099..47269fbd0a 100644
--- a/doc/source/user_guide/continuous_batching.rst
+++ b/doc/source/user_guide/continuous_batching.rst
@@ -54,16 +54,44 @@ Currently, this feature can be enabled under the following conditions:
 
     print('Model uid: ' + model_uid)
 
-Once this feature is enabled, all ``chat`` requests will be managed by continuous batching,
+Once this feature is enabled, all requests for LLMs will be managed by continuous batching,
 and the average throughput of requests made to a single model will increase.
-The usage of the ``chat`` interface remains exactly the same as before, with no differences.
+The usage of the LLM interface remains exactly the same as before, with no differences.
+
+
+Abort your request
+==================
+In this mode, you can abort requests that are in the process of inference.
+
+#. First, add the ``request_id`` option in ``generate_config``. For example:
+
+.. code-block:: python
+
+    from xinference.client import Client
+    client = Client("http://127.0.0.1:9997")
+    model = client.get_model("<model_uid>")
+    model.chat("<prompt>", generate_config={"request_id": "<unique_request_id>"})
+
+#. Then, abort the request using the ``request_id`` you have set. For example:
+
+.. code-block:: python
+
+    from xinference.client import Client
+    client = Client("http://127.0.0.1:9997")
+    client.abort_request("<model_uid>", "<unique_request_id>")
+
+Note that if your request has already finished, aborting the request will be a no-op.
 
 
 Note
 ====
 
-* Currently, this feature only supports the ``chat`` interface for ``LLM`` models.
+* Currently, this feature only supports the ``generate``, ``chat`` and ``vision`` tasks for ``LLM`` models. The ``tool call`` tasks are not supported.
+
+* For ``vision`` tasks, currently only ``qwen-vl-chat``, ``cogvlm2``, and ``glm-4v`` models are supported. More models will be supported in the future. Please let us know your requirements.
 
 * If using GPU inference, this method will consume more GPU memory. Please be cautious when increasing the number of concurrent requests to the same model. The ``launch_model`` interface provides the ``max_num_seqs`` parameter to adjust the concurrency level, with a default value of ``16``.
 
 * This feature is still in the experimental stage, and we welcome your active feedback on any issues.
+
+* After a period of testing, this method will become the default, and the original inference method will be deprecated.
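Putting the pieces of the new section together, the abort flow can be exercised end to end roughly as sketched below. The model uid is a placeholder for an LLM already launched with the ``transformers`` engine, and the UUID and background thread are illustrative choices rather than requirements of the API.

.. code-block:: python

    import threading
    import time
    import uuid

    from xinference.client import Client

    client = Client("http://127.0.0.1:9997")
    model = client.get_model("<model_uid>")     # placeholder for a launched LLM

    request_id = str(uuid.uuid4())              # unique id used later to abort

    def long_chat():
        # The batching scheduler tracks this request under request_id.
        model.chat("Write a very long story.", generate_config={"request_id": request_id})

    worker = threading.Thread(target=long_chat)
    worker.start()

    time.sleep(1)  # give the request a moment to be scheduled
    # Abort the in-flight request; if it has already finished this is a no-op.
    client.abort_request("<model_uid>", request_id)
    worker.join()

Because the ``request_id`` is chosen by the caller, using a value that is unique per request (such as a UUID) keeps an abort from affecting unrelated requests.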