Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Idle timeout #113

Closed
wants to merge 9 commits into from
Closed

Conversation

frostyplanet
Copy link
Contributor

缘由

  1. xoscar原本是多路复用长链接通信, ActorRef和底层的 connection 无关. ActorRef 被drop也不会关闭底层链接. 假如一个client链接随机很多不同端点的话, 需要一个链接回收机制, 否则会带来非常多 fd占用.
  2. 通常租用的gpu服务器因为被放在 四层设备内通过 ip 映射来提供外网服务, 但有一些供应商的设备(可能是 iptables) 会话维持实现不好, 在虚拟ip上维护的链接到超过一定量,或者过了一段时间按之后,发生会话丢失的行为. 导致链接被 hang 死 。

第二点问题可以通过开 tcp keepalive 来规避, 但考虑到同时要解决第一点的问题, 设想先在客户端实现一个 idle时间后关闭链接的功能, 来达到一石二鸟。

目前实现思路

  1. 既然要定时关闭链接, 那肯定不能够在一个链接通信过程中关闭, 只能在 idle 情况下关闭. 由于链接是多路复用, 所以实现一个引用计数, 在 msg request发出前加1, msg response 收到后减1. 当引用计数为0 的时候,在保证原子性的前提下可以安全地关闭链接.

  2. 既然要后台定时工作的线程,考虑到 python asyncio loop 的跨线程和操作大部分时候是不安全的. 所以需要在线程隔离的基础上, 每个线程上下文中有一个 asyncio task来检查链接 idle 情况.

  3. Client 原本是抽象了 connection, 所以抽象了一个 CallerClient 来封装底层的Client, 并且在 send, listen(recv) 中操作原子计数 (add_ref, de_ref)

  4. Router 原本是负责address mapping, 是跨线程单例的设计, 里面cache有 thread local的 cache,并且同时维护了所有线程中的 client, 从线程间隔离的原因出发, 就需要把 Router 的链接管理功能拆分开, 只保留原本的 address mapping 功能. .
    改为在 ActorCaller 中管理 多个 Client(现在是 CallerClient), ActorCaller.get_client 中根据address 查找现在有没有现成能用的 client, 假如要开新链接, Router.get_client() 就返回一个新链接Client, 然后包装成 CallerClient 维护起来.

  5. ActorCaller.periodic_check 负责定时检查和回收链接: a. 链接是否已经 close (如果是就摘除), b. 链接是否达到一定 idle 时间没有信息发送, 并且引用计数是0 (如果是就摘除并关闭)

  6. 对于 RDMA 的 场景, 见 get_copy_to_client(), call_send_buffer 等相关调用, Client(现在换成了 CallerClient) 会被用于连续多次使用再关闭, 所以在 get_client 之后也额外地 add_ref, 彻底用完归还的时候 de_ref。

  7. 对于多线程场景,(比如 xinference worker _periodical_report_status ) 实现了一个 ActorCallerThreaded (作为 context/pool 到 ActorCaller 的proxy类, 用 ThreadLocal存放本线程的 ActorCaller, 保证不同线程相互隔离. 在第一个commit 中 ActorCallerThreaded 索引了所有线程的 ActorCaller, 在第二个 commit 简化代码中把多线程索引去掉了. (因为似乎没有频繁开新线程的场景 , python也没有线程退出的的 hook 机制, 无法知道哪些线程是不用了. 假如有这种场景, 就只能指望gc起作用)

新增参数

idle timeout 可以通过环境变量 XOSCAR_IDLE_TIMEOUT 来配, 默认 30s。一个长链接达到 idle_timeout不用就会被客户端回收.

服务端暂时没有对链接进行回收.

system default is too long. we add CONNECT_TIMEOUT default to 8 sec
Add XOSCAR_IDLE_TIMEOUT env and default to 30 (seconds)

Because client is multiplexed by couroutines, send and recv is async,
add CallerClient to wrap callback and do reference count.

Move connection(client) cache from router into ActorCaller.
In ActorCaller.periodic_check(),
* Check and close idle clients
* Rmove closed clients

context._get_copy_to_client moved to ActorCaller, and protect its usage
in context with reference count.
context._call() and _call_with_client mark deprecated and not used in
tests
Otherwise calling from a different thread (like worker.periodicial_report_status in isolation),
will block on wait_response
Router has no cache any more, just get_client() will get new connection.
Rewrite with mock
@XprobeBot XprobeBot added the gpu label Nov 20, 2024
@XprobeBot XprobeBot added this to the epic milestone Nov 20, 2024
Copy link

codecov bot commented Nov 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.32%. Comparing base (d6465c9) to head (11c0150).
Report is 18 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (d6465c9) and HEAD (11c0150). Click for more details.

HEAD has 2 uploads less than BASE
Flag BASE (d6465c9) HEAD (11c0150)
unittests 9 7
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #113      +/-   ##
==========================================
- Coverage   88.97%   83.32%   -5.66%     
==========================================
  Files          48       46       -2     
  Lines        4038     4096      +58     
  Branches      770      373     -397     
==========================================
- Hits         3593     3413     -180     
- Misses        358      595     +237     
- Partials       87       88       +1     
Flag Coverage Δ
unittests 83.22% <ø> (-5.58%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants