Idle timeout #113
Conversation
The system default is too long, so we add a CONNECT_TIMEOUT default of 8 seconds.
Add a XOSCAR_IDLE_TIMEOUT env variable, defaulting to 30 (seconds). Because the client is multiplexed by coroutines and send and recv are async, this PR:

- Adds CallerClient to wrap the client's callbacks and do reference counting.
- Moves the connection (client) cache from Router into ActorCaller.
- In ActorCaller.periodic_check():
  - checks and closes idle clients;
  - removes closed clients.
- Moves context._get_copy_to_client to ActorCaller, and protects its usage in context with reference counting.
- Marks context._call() and _call_with_client as deprecated; they are no longer used in tests.
Otherwise, calling from a different thread (like worker._periodical_report_status running in isolation) will block on wait_response.
Router no longer has a cache; get_client() simply opens a new connection. The tests are rewritten with mock.
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage diff:

| | main | #113 | +/- |
|---|---:|---:|---:|
| Coverage | 88.97% | 83.32% | -5.66% |
| Files | 48 | 46 | -2 |
| Lines | 4038 | 4096 | +58 |
| Branches | 770 | 373 | -397 |
| Hits | 3593 | 3413 | -180 |
| Misses | 358 | 595 | +237 |
| Partials | 87 | 88 | +1 |
Motivation
The second problem could be worked around by enabling TCP keepalive, but since the first problem needs to be solved at the same time, the idea is to implement, on the client side, closing a connection after an idle period, killing two birds with one stone.
Current implementation approach
Since connections are to be closed on a timer, a connection obviously cannot be closed in the middle of a communication, only while idle. Because connections are multiplexed, a reference count is used: increment before a message request is sent, decrement after the message response is received. When the reference count is 0, the connection can be closed safely, provided the check-and-close step is atomic (see the CallerClient sketch below).
Since a periodic background job is needed, and most cross-thread operations on a Python asyncio loop are unsafe, thread isolation is required: each thread's context gets its own asyncio task that checks connections for idleness.
Client already abstracts the connection, so a CallerClient is introduced to wrap the underlying Client and drive the atomic counting (add_ref, de_ref) in send and listen (recv).
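A hedged sketch of what such a wrapper could look like. CallerClient, add_ref, and de_ref are names from this PR; the underlying client API (send, recv, close, closed) and the idle bookkeeping are assumptions made for illustration:

```python
import time

class CallerClient:
    """Sketch of a reference-counted wrapper around a low-level Client."""

    def __init__(self, client) -> None:
        self._client = client              # underlying multiplexed Client (assumed API)
        self._refs = 0                     # number of in-flight requests
        self._last_used = time.monotonic()

    def add_ref(self) -> None:
        self._refs += 1
        self._last_used = time.monotonic()

    def de_ref(self) -> None:
        self._refs -= 1
        self._last_used = time.monotonic()

    async def send(self, message) -> None:
        # Count the request before it leaves, so the connection cannot be
        # reclaimed while a response is still expected.
        self.add_ref()
        try:
            await self._client.send(message)
        except Exception:
            self.de_ref()                  # no response will follow a failed send
            raise

    async def recv(self):
        # The matching de_ref happens once the response has arrived.
        result = await self._client.recv()
        self.de_ref()
        return result

    def is_idle(self, idle_timeout: float) -> bool:
        # Safe to reclaim only when nothing is in flight and the connection
        # has been unused for longer than idle_timeout. Both checks run on
        # the owning event loop, which gives the atomicity described above.
        return self._refs == 0 and time.monotonic() - self._last_used > idle_timeout

    @property
    def closed(self) -> bool:
        return self._client.closed         # assumed attribute on the raw Client

    async def close(self) -> None:
        await self._client.close()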
Router was originally responsible for address mapping and is designed as a cross-thread singleton; it held a thread-local cache internally and at the same time maintained the clients of all threads. For the sake of thread isolation, Router's connection-management duties are split out, leaving it with only the original address-mapping role.
Instead, ActorCaller manages the multiple Clients (now CallerClients). ActorCaller.get_client looks up an existing usable client for the address; when a new connection is needed, Router.get_client() returns a new Client, which is wrapped into a CallerClient and tracked (sketched after the next paragraph).
ActorCaller.periodic_check periodically inspects and reclaims connections: a. has the connection been closed (if so, remove it); b. has the connection been idle for a certain time with no messages sent, with a reference count of 0 (if so, remove and close it).
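A hedged sketch of this management layer, continuing the CallerClient sketch above. ActorCaller, get_client, and periodic_check are names from this PR; the Router call, the check interval, and the dictionary bookkeeping are illustrative assumptions:

```python
import asyncio

class ActorCaller:
    """Sketch of per-event-loop client management (not the PR's actual code)."""

    def __init__(self, idle_timeout: float = 30.0, check_interval: float = 5.0):
        self._clients = {}                 # address -> CallerClient
        self._idle_timeout = idle_timeout
        self._check_interval = check_interval
        self._check_task = None            # lazily started asyncio.Task

    async def get_client(self, router, address: str) -> "CallerClient":
        # Start the checker on the calling thread's own loop the first time
        # it is needed; cross-thread asyncio operations are unsafe.
        if self._check_task is None:
            self._check_task = asyncio.ensure_future(self.periodic_check())
        client = self._clients.get(address)
        if client is None or client.closed:
            # Router now only maps addresses and always returns a fresh
            # connection, which gets wrapped and cached here.
            raw = await router.get_client(address)
            client = self._clients[address] = CallerClient(raw)
        return client

    async def periodic_check(self) -> None:
        while True:
            await asyncio.sleep(self._check_interval)
            for address, client in list(self._clients.items()):
                if client.closed:
                    del self._clients[address]        # (a) drop closed clients
                elif client.is_idle(self._idle_timeout):
                    del self._clients[address]        # (b) reclaim idle ones
                    await client.close()
```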
For the RDMA scenario (see get_copy_to_client(), call_send_buffer, and related calls), the Client (now a CallerClient) is used several times in a row before being closed, so an extra add_ref is taken after get_client, with a matching de_ref when it is finally handed back.
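Continuing the sketches above, the multi-use pattern might look like the following; call_send_buffer is a name from this PR, but its signature and the surrounding control flow are assumptions:

```python
async def send_buffers(caller, router, address: str, buffers) -> None:
    # Illustrative only: hold an extra reference across several transfers,
    # so periodic_check cannot reclaim the connection between calls.
    client = await caller.get_client(router, address)
    client.add_ref()                       # pin the connection for repeated use
    try:
        for buf in buffers:
            await call_send_buffer(client, buf)   # assumed transfer helper
    finally:
        client.de_ref()                    # hand it back; idle reclaim applies again
```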
For multi-threaded scenarios (such as the xinference worker's _periodical_report_status), an ActorCallerThreaded is implemented as a proxy class between context/pool and ActorCaller; it keeps the current thread's ActorCaller in a ThreadLocal so that threads stay isolated from one another. In the first commit ActorCallerThreaded indexed the ActorCallers of all threads; the second (code-simplification) commit removed that cross-thread index, because there seems to be no scenario that frequently spawns new threads, and Python has no thread-exit hook, so there is no way to know which threads are gone. If such a scenario ever appears, we can only rely on GC.
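A minimal sketch of the thread-local proxy, again building on the ActorCaller sketch above; ActorCallerThreaded is the PR's name, the delegation details are illustrative:

```python
import threading

class ActorCallerThreaded:
    """Sketch: one ActorCaller per thread, never shared across threads."""

    def __init__(self) -> None:
        self._local = threading.local()

    def _get_caller(self) -> "ActorCaller":
        # Each thread lazily creates its own ActorCaller, so its clients and
        # its periodic_check task stay on that thread's event loop. When a
        # thread dies, its ActorCaller simply becomes garbage for the GC.
        caller = getattr(self._local, "caller", None)
        if caller is None:
            caller = self._local.caller = ActorCaller()
        return caller

    async def get_client(self, router, address: str):
        return await self._get_caller().get_client(router, address)
```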
New parameters
The idle timeout can be configured via the XOSCAR_IDLE_TIMEOUT environment variable, defaulting to 30s. A long-lived connection that has been idle for idle_timeout will be reclaimed by the client.
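For reference, reading the new setting could look like this; XOSCAR_IDLE_TIMEOUT is the variable added by this PR, while the parsing below is only a sketch:

```python
import os

# Sketch: idle timeout in seconds, overridable via the environment,
# e.g. `XOSCAR_IDLE_TIMEOUT=60 python app.py`.
XOSCAR_IDLE_TIMEOUT = float(os.environ.get("XOSCAR_IDLE_TIMEOUT", 30))
```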
The server side does not reclaim connections yet.