Error when running xinference-local --host 0.0.0.0 --port 9997 after installation #1835

Open
1 of 3 tasks
pan-common opened this issue Jul 10, 2024 · 7 comments
@pan-common

System Info

Ubuntu 20.04
NVIDIA-SMI 535.104.05
Driver Version: 535.104.05
CUDA Version: 12.2

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

Name: xinference
Version: 0.13.0
Summary: Model Serving Made Easy
Home-page: https://github.com/xorbitsai/inference
Author: Qin Xuye
Author-email: [email protected]
License: Apache License 2.0
Location: /root/anaconda3/envs/py311/lib/python3.11/site-packages
Requires: aioprometheus, async-timeout, click, fastapi, fsspec, gradio, huggingface-hub, modelscope, openai, opencv-contrib-python, passlib, peft, pillow, pydantic, pynvml, python-jose, requests, s3fs, sse-starlette, tabulate, timm, torch, tqdm, typer, typing-extensions, uvicorn, xoscar
Required-by:

The command used to start Xinference

xinference-local --host 0.0.0.0 --port 9997

Reproduction

(py311) root@b721c068038e:/opt/xinference# xinference-local --host 0.0.0.0 --port 9997
2024-07-10 12:28:08,395 xinference.core.supervisor 83095 INFO Xinference supervisor 0.0.0.0:44062 started
2024-07-10 12:28:08,425 xinference.core.worker 83095 INFO Starting metrics export server at 0.0.0.0:None
2024-07-10 12:28:08,431 xinference.core.worker 83095 INFO Checking metrics export server...
2024-07-10 12:28:09,600 xinference.core.worker 83095 INFO Metrics server is started at: http://0.0.0.0:41815
2024-07-10 12:28:09,601 xinference.core.worker 83095 INFO Xinference worker 0.0.0.0:44062 started
2024-07-10 12:28:09,602 xinference.core.worker 83095 INFO Purge cache directory: /root/.xinference/cache
2024-07-10 12:28:11,604 xinference.core.worker 83095 ERROR Report status got error.
Traceback (most recent call last):
File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/xinference/core/worker.py", line 800, in report_status
status = await asyncio.to_thread(gather_node_info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/py311/lib/python3.11/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/xinference/core/worker.py", line 799, in report_status
async with timeout(2):
File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/async_timeout/__init__.py", line 141, in __aexit__
self._do_exit(exc_type)
File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/async_timeout/__init__.py", line 228, in _do_exit
raise asyncio.TimeoutError
TimeoutError
2024-07-10 12:28:14,296 xinference.api.restful_api 82961 INFO Starting Xinference at endpoint: http://0.0.0.0:9997
2024-07-10 12:28:14,648 uvicorn.error 82961 INFO Uvicorn running on http://0.0.0.0:9997 (Press CTRL+C to quit)
2024-07-10 12:28:18,618 xinference.core.worker 83095 ERROR Report status got error.
Traceback (most recent call last):
File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/xinference/core/worker.py", line 800, in report_status
status = await asyncio.to_thread(gather_node_info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/py311/lib/python3.11/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/xinference/core/worker.py", line 799, in report_status
async with timeout(2):
File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/async_timeout/__init__.py", line 141, in __aexit__
self._do_exit(exc_type)
File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/async_timeout/__init__.py", line 228, in _do_exit
raise asyncio.TimeoutError
TimeoutError
2024-07-10 12:28:25,628 xinference.core.worker 83095 ERROR Report status got error.
Traceback (most recent call last):
File "/root/anaconda3/envs/py311/lib/python3.11/site-packages/xinference/core/worker.py", line 800, in report_status
status = await asyncio.to_thread(gather_node_info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/py311/lib/python3.11/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Expected behavior

Xinference should start and run normally using the GPU.

@XprobeBot XprobeBot added the gpu label Jul 10, 2024
@XprobeBot XprobeBot added this to the v0.13.1 milestone Jul 10, 2024
@ChengjieLi28
Contributor

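@pan-common The error occurs when the worker reports its status to the supervisor.
First, try enabling debug logging and see whether a more specific error appears. Also, the error you posted is incomplete: please paste the full output, including everything after "During handling of the above exception, another exception occurred:".
Then you can bypass the status-reporting process as follows and see whether it starts: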

XINFERENCE_DISABLE_HEALTH_CHECK=1 xinference-local --host 0.0.0.0 --port 9997
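
For anyone launching Xinference from a script rather than an interactive shell, the same workaround could be sketched in Python as below. The environment variable name and the command line are taken from the comment above; the helper name `build_launch` and its parameters are hypothetical and exist only for illustration.

```python
import os


def build_launch(host="0.0.0.0", port=9997, base_env=None):
    """Build the command and environment for the workaround above.

    `build_launch` is a hypothetical helper, not part of Xinference.
    """
    # Copy the environment so the caller's own environment is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    # Same variable as in the shell one-liner above.
    env["XINFERENCE_DISABLE_HEALTH_CHECK"] = "1"
    cmd = ["xinference-local", "--host", host, "--port", str(port)]
    return cmd, env
```

The returned pair can be handed to `subprocess.run(cmd, env=env)`. Note that this only skips the status report; it does not address whatever makes the report slow in the first place.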

@XprobeBot XprobeBot modified the milestones: v0.13.1, v0.13.2 Jul 12, 2024

This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Jul 19, 2024
@XprobeBot XprobeBot modified the milestones: v0.13.2, v0.13.4 Jul 26, 2024
@github-actions github-actions bot removed the stale label Jul 27, 2024

github-actions bot commented Aug 6, 2024

This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Aug 6, 2024
@gs80140

gs80140 commented Sep 26, 2024


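@pan-common The error occurs when the worker reports its status to the supervisor. First, try enabling debug logging and see whether a more specific error appears. Also, the error you posted is incomplete: please paste the full output, including everything after "During handling of the above exception, another exception occurred:". Then you can bypass the status-reporting process as follows and see whether it starts: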

XINFERENCE_DISABLE_HEALTH_CHECK=1 xinference-local --host 0.0.0.0 --port 9997

I ran into the same problem. After adding XINFERENCE_DISABLE_HEALTH_CHECK=1 as you suggested, it starts. The full error output is below:

`
WARNING 09-26 17:01:26 _custom_ops.py:18] Failed to import from vllm._C with ImportError('/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/vllm/_C.abi3.so: undefined symbol: cuTensorMapEncodeTiled')
2024-09-26 17:01:32,290 xinference.core.supervisor 667146 INFO Xinference supervisor 127.0.0.1:22599 started
/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
2024-09-26 17:01:32,316 xinference.core.worker 667146 INFO Starting metrics export server at 127.0.0.1:None
2024-09-26 17:01:32,322 xinference.core.worker 667146 INFO Checking metrics export server...
2024-09-26 17:01:34,445 xinference.core.worker 667146 INFO Metrics server is started at: http://127.0.0.1:34503
2024-09-26 17:01:34,446 xinference.core.worker 667146 INFO Purge cache directory: /home/hum/.xinference/cache
2024-09-26 17:01:34,449 xinference.core.supervisor 667146 DEBUG [request ee1ead84-7be5-11ef-9d4d-208810cdd0e8] Enter add_worker, args: <xinference.core.supervisor.SupervisorActor object at 0x7f7fa559aff0>,127.0.0.1:22599, kwargs:
2024-09-26 17:01:34,449 xinference.core.supervisor 667146 DEBUG Worker 127.0.0.1:22599 has been added successfully
2024-09-26 17:01:34,449 xinference.core.supervisor 667146 DEBUG [request ee1ead84-7be5-11ef-9d4d-208810cdd0e8] Leave add_worker, elapsed time: 0 s
2024-09-26 17:01:34,449 xinference.core.worker 667146 INFO Connected to supervisor as a fresh worker
2024-09-26 17:01:34,463 xinference.core.worker 667146 INFO Xinference worker 127.0.0.1:22599 started
2024-09-26 17:01:36,466 xinference.core.worker 667146 ERROR Report status got error.
Traceback (most recent call last):
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/xinference/core/worker.py", line 1026, in report_status
status = await asyncio.to_thread(gather_node_info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/xinference/core/worker.py", line 1025, in report_status
async with timeout(2):
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/async_timeout/__init__.py", line 141, in __aexit__
self._do_exit(exc_type)
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/async_timeout/__init__.py", line 228, in _do_exit
raise asyncio.TimeoutError
TimeoutError
2024-09-26 17:01:36,477 xinference.core.supervisor 667146 DEBUG Worker 127.0.0.1:22599 resources: {}
2024-09-26 17:01:37,274 xinference.core.supervisor 667146 DEBUG Enter get_status, args: <xinference.core.supervisor.SupervisorActor object at 0x7f7fa559aff0>, kwargs:
2024-09-26 17:01:37,275 xinference.core.supervisor 667146 DEBUG Leave get_status, elapsed time: 0 s
2024-09-26 17:01:39,377 xinference.api.restful_api 666994 INFO Starting Xinference at endpoint: http://127.0.0.1:9997
2024-09-26 17:01:39,543 uvicorn.error 666994 INFO Uvicorn running on http://127.0.0.1:9997 (Press CTRL+C to quit)
2024-09-26 17:01:43,485 xinference.core.worker 667146 ERROR Report status got error.
Traceback (most recent call last):
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/xinference/core/worker.py", line 1026, in report_status
status = await asyncio.to_thread(gather_node_info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/xinference/core/worker.py", line 1025, in report_status
async with timeout(2):
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/async_timeout/__init__.py", line 141, in __aexit__
self._do_exit(exc_type)
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/async_timeout/__init__.py", line 228, in _do_exit
raise asyncio.TimeoutError
TimeoutError
2024-09-26 17:01:50,493 xinference.core.worker 667146 ERROR Report status got error.
Traceback (most recent call last):
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/xinference/core/worker.py", line 1026, in report_status
status = await asyncio.to_thread(gather_node_info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/xinference/core/worker.py", line 1025, in report_status
async with timeout(2):
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/async_timeout/__init__.py", line 141, in __aexit__
self._do_exit(exc_type)
File "/home/hum/anaconda3/envs/xinf/lib/python3.11/site-packages/async_timeout/__init__.py", line 228, in _do_exit
raise asyncio.TimeoutError
TimeoutError

`

@gs80140

gs80140 commented Sep 26, 2024

There is a CUDA version requirement, isn't there?
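
On the CUDA question: the PyTorch warning in the log above reports "found version 11040". CUDA encodes versions as 1000 × major + 10 × minor, so that value corresponds to CUDA 11.4. A minimal sketch of the decoding follows; the `(12, 0)` requirement used in the comparison is an assumption for illustration (the installed PyTorch and vLLM builds determine the real minimum), and both function names are hypothetical.

```python
def decode_cuda_version(raw: int) -> tuple:
    """Decode a CUDA version integer (1000 * major + 10 * minor)."""
    return raw // 1000, (raw % 1000) // 10


def driver_supports(driver_raw: int, required: tuple) -> bool:
    """Return True if the driver's CUDA version is at least `required`.

    `required` is an assumed (major, minor) pair supplied by the caller.
    """
    return decode_cuda_version(driver_raw) >= required


# Value taken from the warning in the log above.
print(decode_cuda_version(11040))
# Assumed requirement of CUDA 12.0, for illustration only.
print(driver_supports(11040, (12, 0)))
```

If the installed wheels were built against CUDA 12, a driver that reports 11.4 would explain both the PyTorch warning and the undefined `cuTensorMapEncodeTiled` symbol from vLLM in that log. This applies to the second reporter's log only; the original report lists CUDA 12.2.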

@jasinliu

jasinliu commented Nov 24, 2024

On the latest version I get the same error, and startup is very slow. I don't know the cause.

@jasinliu

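@pan-common The error occurs when the worker reports its status to the supervisor. First, try enabling debug logging and see whether a more specific error appears. Also, the error you posted is incomplete: please paste the full output, including everything after "During handling of the above exception, another exception occurred:". Then you can bypass the status-reporting process as follows and see whether it starts: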

XINFERENCE_DISABLE_HEALTH_CHECK=1 xinference-local --host 0.0.0.0 --port 9997

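The subsequent errors are just the same TimeoutError repeating, so it appears to be retrying over and over. After bypassing the status-reporting process, it starts up quickly.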
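
The pattern in the tracebacks (a blocking call moved to a thread with `asyncio.to_thread`, wrapped in a 2-second timeout, then attempted again a few seconds later) can be reproduced in isolation. The sketch below is not Xinference's code: `gather_node_info_stub` is a hypothetical stand-in for the real `gather_node_info`, and `asyncio.wait_for` is used in place of the `async_timeout` context manager seen in the traceback. It only illustrates why a slow status probe would yield the same TimeoutError on every attempt.

```python
import asyncio
import time


def gather_node_info_stub(delay: float) -> dict:
    # Hypothetical stand-in for a blocking status probe (for example a slow GPU query).
    time.sleep(delay)
    return {"status": "ok"}


async def report_once(delay: float, limit: float) -> str:
    try:
        # Run the blocking call in a worker thread and stop waiting after `limit` seconds.
        await asyncio.wait_for(asyncio.to_thread(gather_node_info_stub, delay), timeout=limit)
        return "reported"
    except asyncio.TimeoutError:
        # The awaiting task is cancelled, but the worker thread itself keeps running.
        return "timeout"


async def report_loop(rounds: int, delay: float, limit: float) -> list:
    # Each round repeats the same probe, so a consistently slow probe fails every time.
    return [await report_once(delay, limit) for _ in range(rounds)]


def run_demo() -> list:
    # A probe slower than the limit times out on every round.
    return asyncio.run(report_loop(rounds=3, delay=0.2, limit=0.05))
```

Under these assumptions, the CancelledError followed by TimeoutError in the logs would be a symptom of the probe taking longer than the limit, and the underlying cause would have to be found in whatever the probe is waiting on.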
