DOC: Update continuous batching and docker usage #1785

Merged · 2 commits · Jul 4, 2024
2 changes: 1 addition & 1 deletion doc/source/getting_started/using_docker_image.rst
@@ -11,7 +11,7 @@ Prerequisites
=============
* The image can only run in an environment with GPUs and CUDA installed, because Xinference in the image relies on Nvidia GPUs for acceleration.
* CUDA must be successfully installed on the host machine. This can be determined by whether you can successfully execute the ``nvidia-smi`` command.
* The CUDA version in the docker image is ``12.1``, and the CUDA version on the host machine should ideally be consistent with it. Be sure to keep the CUDA version on your host machine between ``11.8`` and ``12.2``, even if it is inconsistent.
* The CUDA version in the docker image is ``12.4``, and the CUDA version on the host machine should be ``12.4`` or above, and the NVIDIA driver version should be ``550`` or above.


Docker Image
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: Xinference \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2024-04-30 14:54+0800\n"
"POT-Creation-Date: 2024-07-04 15:14+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -17,15 +17,15 @@ msgstr ""
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.14.0\n"
"Generated-By: Babel 2.11.0\n"

#: ../../source/getting_started/using_docker_image.rst:5
msgid "Xinference Docker Image"
msgstr "Docker 镜像"

#: ../../source/getting_started/using_docker_image.rst:7
msgid "Xinference provides official images for use on Dockerhub."
msgstr "Xinference 在 Dockerhub 中上传了官方镜像。"
msgstr "Xinference 在 Dockerhub 和 阿里云容器镜像服务 中上传了官方镜像。"

#: ../../source/getting_started/using_docker_image.rst:11
msgid "Prerequisites"
@@ -48,13 +48,12 @@ msgstr "保证 CUDA 在机器上正确安装。可以使用 ``nvidia-smi`` 检

#: ../../source/getting_started/using_docker_image.rst:14
msgid ""
"The CUDA version in the docker image is ``12.1``, and the CUDA version on"
" the host machine should ideally be consistent with it. Be sure to keep "
"the CUDA version on your host machine between ``11.8`` and ``12.2``, even"
" if it is inconsistent."
"The CUDA version in the docker image is ``12.4``, and the CUDA version on"
" the host machine should be ``12.4`` or above, and the NVIDIA driver "
"version should be ``550`` or above."
msgstr ""
"镜像中的 CUDA 版本是 ``12.1`` ,推荐机器上的版本与之保持一致。如果不一致"
",需要保证CUDA 版本在 ``11.8`` ``12.2`` 之间。"
"镜像中的 CUDA 版本为 ``12.4`` 。为了不出现预期之外的问题,请将宿主机的 CUDA 版本和 NVIDIA Driver 版本分别"
"升级到 ``12.4`` ``550`` 以上。"

#: ../../source/getting_started/using_docker_image.rst:18
msgid "Docker Image"
@@ -65,7 +64,10 @@ msgid ""
"The official image of Xinference is available on DockerHub in the "
"repository ``xprobe/xinference``. Available tags include:"
msgstr ""
"Xinference 的官方镜像在 Dockerhub 的 ``xprobe/xinference`` 仓库里。目前可用版本包括:"
"当前,可以通过两个渠道拉取 Xinference 的官方镜像。"
"1. 在 Dockerhub 的 ``xprobe/xinference`` 仓库里。"
"2. Dockerhub 中的镜像会同步上传一份到阿里云公共镜像仓库中,供访问 Dockerhub 有困难的用户拉取。"
"拉取命令:``docker pull registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:<tag>`` 。目前可用的标签包括:"

#: ../../source/getting_started/using_docker_image.rst:22
msgid ""
@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: Xinference \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2024-06-07 14:38+0800\n"
"POT-Creation-Date: 2024-07-04 16:08+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <[email protected]>\n"
@@ -28,8 +28,8 @@ msgid ""
" Xinference aims to provide this optimization capability when using the "
"transformers engine as well."
msgstr ""
"连续批处理是诸如 ``VLLM`` 这样的推理引擎中提升吞吐的重要技术。Xinference 旨在"
"通过这项技术提升 ``transformers`` 推理引擎的吞吐。"
"连续批处理是诸如 ``VLLM`` 这样的推理引擎中提升吞吐的重要技术。Xinference "
"旨在通过这项技术提升 ``transformers`` 推理引擎的吞吐。"

#: ../../source/user_guide/continuous_batching.rst:11
msgid "Usage"
@@ -45,49 +45,98 @@ msgid ""
"``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` to ``1`` when starting "
"xinference. For example:"
msgstr ""
"首先,启动 Xinference 时需要将环境变量 ``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING`` 置为 ``1`` 。"
"首先,启动 Xinference 时需要将环境变量 ``XINFERENCE_TRANSFORMERS_ENABLE_"
"BATCHING`` 置为 ``1`` 。"

#: ../../source/user_guide/continuous_batching.rst:21
msgid ""
"Then, ensure that the ``transformers`` engine is selected when launching "
"the model. For example:"
msgstr ""
"然后,启动 LLM 模型时选择 ``transformers`` 推理引擎。例如:"
msgstr "然后,启动 LLM 模型时选择 ``transformers`` 推理引擎。例如:"

#: ../../source/user_guide/continuous_batching.rst:57
msgid ""
"Once this feature is enabled, all ``chat`` requests will be managed by "
"Once this feature is enabled, all requests for LLMs will be managed by "
"continuous batching, and the average throughput of requests made to a "
"single model will increase. The usage of the ``chat`` interface remains "
"single model will increase. The usage of the LLM interface remains "
"exactly the same as before, with no differences."
msgstr ""
"一旦此功能开启,``chat`` 接口将被此功能接管,别的接口不受影响。``chat`` 接口的使用方式没有任何变化。"
"一旦此功能开启,LLM 模型的所有接口将被此功能接管。所有接口的使用方式没有"
"任何变化。"

#: ../../source/user_guide/continuous_batching.rst:63
msgid "Abort your request"
msgstr "中止请求"

#: ../../source/user_guide/continuous_batching.rst:64
msgid "In this mode, you can abort requests that are in the process of inference."
msgstr ""
"此功能中,你可以优雅地中止正在推理中的请求。"

#: ../../source/user_guide/continuous_batching.rst:66
msgid "First, add ``request_id`` option in ``generate_config``. For example:"
msgstr ""
"首先,在推理请求的 ``generate_config`` 中指定 ``request_id`` 选项。例如:"

#: ../../source/user_guide/continuous_batching.rst:75
msgid ""
"Then, abort the request using the ``request_id`` you have set. For "
"example:"
msgstr ""
"接着,带着你指定的 ``request_id`` 去中止该请求。例如:"

#: ../../source/user_guide/continuous_batching.rst:62
#: ../../source/user_guide/continuous_batching.rst:83
msgid ""
"Note that if your request has already finished, aborting the request will"
" be a no-op."
msgstr ""
"注意,如果你的请求已经结束,那么此操作将什么都不做。"

#: ../../source/user_guide/continuous_batching.rst:86
msgid "Note"
msgstr "注意事项"

#: ../../source/user_guide/continuous_batching.rst:64
#: ../../source/user_guide/continuous_batching.rst:88
msgid ""
"Currently, this feature only supports the ``chat`` interface for ``LLM`` "
"models."
msgstr "当前,此功能仅支持 LLM 模型的 ``chat`` 功能。"
"Currently, this feature only supports the ``generate``, ``chat`` and "
"``vision`` tasks for ``LLM`` models. The ``tool call`` tasks are not "
"supported."
msgstr ""
"当前,此功能仅支持 LLM 模型的 ``generate``, ``chat`` 和 ``vision`` (多"
"模态) 功能。``tool call`` (工具调用)暂时不支持。"

#: ../../source/user_guide/continuous_batching.rst:66
#: ../../source/user_guide/continuous_batching.rst:90
msgid ""
"For ``vision`` tasks, currently only ``qwen-vl-chat``, ``cogvlm2``, and "
"``glm-4v`` models are supported. More models will be supported in the "
"future. Please let us know your requirements."
msgstr ""
"对于多模态任务,当前支持 ``qwen-vl-chat`` ,``cogvlm2`` 和 ``glm-4v`` "
"模型。未来将加入更多模型,敬请期待。"

#: ../../source/user_guide/continuous_batching.rst:92
msgid ""
"If using GPU inference, this method will consume more GPU memory. Please "
"be cautious when increasing the number of concurrent requests to the same"
" model. The ``launch_model`` interface provides the ``max_num_seqs`` "
"parameter to adjust the concurrency level, with a default value of "
"``16``."
msgstr ""
"如果使用 GPU 推理,此功能对显存要求较高。因此请谨慎提高对同一个模型的并发请求量。"
"``launch_model`` 接口提供可选参数 ``max_num_seqs`` 用于调整并发度,默认值为 ``16`` 。"
"如果使用 GPU 推理,此功能对显存要求较高。因此请谨慎提高对同一个模型的并发"
"请求量。``launch_model`` 接口提供可选参数 ``max_num_seqs`` 用于调整并发度"
",默认值为 ``16`` 。"

#: ../../source/user_guide/continuous_batching.rst:69
#: ../../source/user_guide/continuous_batching.rst:95
msgid ""
"This feature is still in the experimental stage, and we welcome your "
"active feedback on any issues."
msgstr "此功能仍处于实验阶段,欢迎反馈任何问题。"

#: ../../source/user_guide/continuous_batching.rst:97
msgid ""
"After a period of testing, this method will remain enabled by default, "
"and the original inference method will be deprecated."
msgstr ""
"此功能仍处于实验阶段,欢迎反馈任何问题。"
"一段时间的测试之后,此功能将代替原来的 transformers 推理逻辑成为默认行为"
"。原来的推理逻辑将被摒弃。"

34 changes: 31 additions & 3 deletions doc/source/user_guide/continuous_batching.rst
@@ -54,16 +54,44 @@ Currently, this feature can be enabled under the following conditions:
print('Model uid: ' + model_uid)


Once this feature is enabled, all ``chat`` requests will be managed by continuous batching,
Once this feature is enabled, all requests for LLMs will be managed by continuous batching,
and the average throughput of requests made to a single model will increase.
The usage of the ``chat`` interface remains exactly the same as before, with no differences.
The usage of the LLM interface remains exactly the same as before, with no differences.
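
A minimal sketch of using this feature under concurrency, assuming Xinference was started with ``XINFERENCE_TRANSFORMERS_ENABLE_BATCHING=1`` and the model was launched with the ``transformers`` engine; the endpoint, model UID, and prompts are placeholders:

.. code-block:: python

   from concurrent.futures import ThreadPoolExecutor

   from xinference.client import Client

   client = Client("http://127.0.0.1:9997")
   model = client.get_model("<model_uid>")

   prompts = ["<prompt_1>", "<prompt_2>", "<prompt_3>", "<prompt_4>"]

   # With continuous batching enabled, concurrent requests to the same model
   # are batched together on the server instead of being handled one by one,
   # which is where the throughput improvement comes from.
   with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
       results = list(pool.map(model.chat, prompts))

   for result in results:
       print(result)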


Abort your request
==================
In this mode, you can abort requests that are in the process of inference.

#. First, add ``request_id`` option in ``generate_config``. For example:

.. code-block:: python

from xinference.client import Client
client = Client("http://127.0.0.1:9997")
model = client.get_model("<model_uid>")
model.chat("<prompt>", generate_config={"request_id": "<your_unique_request_id>"})

#. Then, abort the request using the ``request_id`` you have set. For example:

.. code-block:: python

from xinference.client import Client
client = Client("http://127.0.0.1:9997")
client.abort_request("<model_uid>", "<your_unique_request_id>")

Note that if your request has already finished, aborting the request will be a no-op.
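
Putting the two steps together, here is a minimal sketch of aborting an in-flight request from another thread; the model UID, prompt, request ID, and sleep interval are placeholders:

.. code-block:: python

   import threading
   import time

   from xinference.client import Client

   client = Client("http://127.0.0.1:9997")
   model = client.get_model("<model_uid>")

   request_id = "<your_unique_request_id>"

   # Run the chat request in the background, tagged with the chosen request_id.
   worker = threading.Thread(
       target=model.chat,
       args=("<prompt>",),
       kwargs={"generate_config": {"request_id": request_id}},
   )
   worker.start()

   # Give the request a moment to start inference, then abort it by its request_id.
   time.sleep(1)
   client.abort_request("<model_uid>", request_id)
   worker.join()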

Note
====

* Currently, this feature only supports the ``chat`` interface for ``LLM`` models.
* Currently, this feature only supports the ``generate``, ``chat`` and ``vision`` tasks for ``LLM`` models. The ``tool call`` tasks are not supported.

* For ``vision`` tasks, currently only ``qwen-vl-chat``, ``cogvlm2``, and ``glm-4v`` models are supported. More models will be supported in the future. Please let us know your requirements.

* If using GPU inference, this method will consume more GPU memory. Please be cautious when increasing the number of concurrent requests to the same model.
The ``launch_model`` interface provides the ``max_num_seqs`` parameter to adjust the concurrency level, with a default value of ``16``.

* This feature is still in the experimental stage, and we welcome your active feedback on any issues.

* After a period of testing, this method will remain enabled by default, and the original inference method will be deprecated.
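
As a sketch of tuning the concurrency level mentioned above, ``max_num_seqs`` can be passed when launching the model through the client; the model name is a placeholder and other launch parameters are omitted:

.. code-block:: python

   from xinference.client import Client

   client = Client("http://127.0.0.1:9997")

   # Lower max_num_seqs from its default of 16 to reduce peak GPU memory usage,
   # at the cost of batching fewer concurrent requests together.
   model_uid = client.launch_model(
       model_name="<model_name>",
       model_engine="transformers",
       max_num_seqs=8,
   )
   print('Model uid: ' + model_uid)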