feat: introduce MQAphroditeEngine #1056
Merged
This PR introduces the MQAphroditeEngine class, which aims to completely replace the previous AsyncAphrodite. Huge thanks to the Neural Magic team for coming up with this design pattern.
Motivation
Why? Because the async class was extremely slow. See the stats below for a comparison:
Synchronous Engine (AphroditeEngine):
Asynchronous Engine (AsyncAphrodite):
Huge overhead! Over the past few months, we've made many attempts to reduce this overhead by:
None of these worked well enough to mitigate the issue.
What's new?
The new MQAphroditeEngine replaces the previous async/await pattern with a multiprocessing architecture using ZeroMQ for inter-process communication. This change seems to significantly improve performance under high concurrency by eliminating the asyncio event loop overhead and providing better isolation between the API server and model execution. For comparison:
Not completely up to par with the AphroditeEngine class, but still a roughly 2x perf improvement.
Architecture
The system is split into two main processes: the Engine process running the core AphroditeEngine, and the API server process handling HTTP requests and client comms.
The MQAphroditeEngine wraps the core AphroditeEngine (similar to AsyncAphrodite). It runs a background loop to process requests and manage comms through four distinct ZeroMQ socket types:
On the client side, MQAphroditeEngineClient provides the interface conforming to the EngineClient protocol. It handles request submission, output streaming, health checks, and error handling.
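To make the two-process split concrete, here is a minimal sketch, not the actual implementation. It assumes stdlib multiprocessing queues as a stand-in for the ZeroMQ sockets, and every name in it (engine_loop, fake_engine_step, run_demo, STOP) is hypothetical. The engine process runs a background loop that pulls requests and streams outputs back, while the parent plays the role of the API server process.

```python
import multiprocessing as mp

STOP = None  # hypothetical sentinel telling the engine loop to exit


def fake_engine_step(prompt):
    """Stand-in for the core engine: 'generates' one token per word."""
    return prompt.upper().split()


def engine_loop(requests, outputs):
    """Background loop of the engine process: pull a request, stream results back."""
    while True:
        msg = requests.get()
        if msg is STOP:
            break
        request_id, prompt = msg
        tokens = fake_engine_step(prompt)
        for i, token in enumerate(tokens):
            finished = i == len(tokens) - 1
            outputs.put((request_id, token, finished))


def run_demo(prompt="hello from the api server"):
    """Plays the API server side: submit one request and collect the stream."""
    # Assumption: "fork" (POSIX only) keeps the sketch runnable without a
    # __main__ guard; queues replace the ZeroMQ sockets used by the PR.
    ctx = mp.get_context("fork")
    requests, outputs = ctx.Queue(), ctx.Queue()
    engine = ctx.Process(target=engine_loop, args=(requests, outputs), daemon=True)
    engine.start()

    requests.put(("req-0", prompt))
    received = []
    while True:
        _request_id, token, finished = outputs.get(timeout=10)
        received.append(token)
        if finished:
            break

    requests.put(STOP)
    engine.join(timeout=10)
    return received
```

Because the engine loop lives in its own process, a slow generation step does not block the process that would be serving HTTP requests, which is the isolation this section describes.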
When a client submits a request, it goes through MQAphroditeEngineClient, which serializes and sends it via ZeroMQ to the engine process. The engine processes the request through AphroditeEngine and streams outputs back to the client. The client then deserializes and yields the results to the end user.
A few of the cool optimization tricks: optional async socket handling allows overlapping I/O with CPU computation through an engine callback mechanism. With ZeroMQ's very efficient (and lightweight) message passing, we can reduce comms overhead with minimal serialization/deserialization costs. Process isolation also prevents GIL contention and seems to provide better CPU resource utilization.
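The request path described above (serialize, send, stream back, deserialize, yield) could look roughly like the sketch below. It is an illustration under assumptions: pickle over plain dicts is assumed as the wire format, recv is any callable returning one raw frame (in practice a ZeroMQ socket receive), and GenerateRequest, StepOutput, and stream_outputs are hypothetical names rather than the PR's API.

```python
import pickle
from dataclasses import asdict, dataclass
from typing import Callable, Iterator


@dataclass
class GenerateRequest:  # hypothetical request message
    request_id: str
    prompt: str


@dataclass
class StepOutput:  # hypothetical streamed output message
    request_id: str
    text: str
    finished: bool


def encode(message) -> bytes:
    """Serialize a message dataclass into one frame (assumed format: pickled dict)."""
    return pickle.dumps(asdict(message), protocol=pickle.HIGHEST_PROTOCOL)


def decode_request(frame: bytes) -> GenerateRequest:
    return GenerateRequest(**pickle.loads(frame))


def decode_output(frame: bytes) -> StepOutput:
    return StepOutput(**pickle.loads(frame))


def stream_outputs(request_id: str, recv: Callable[[], bytes]) -> Iterator[StepOutput]:
    """Yield outputs for one request until the engine marks it finished."""
    while True:
        output = decode_output(recv())
        if output.request_id != request_id:
            continue  # frame belongs to another in-flight request
        yield output
        if output.finished:
            return
```

A caller would send encode(GenerateRequest(...)) to the engine process and then iterate over stream_outputs(...), handing each partial result to the end user as it arrives.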
The error handling is a lot more robust too. See the code for more details.
Why does it work better?
The key seems to be true parallelism through multiprocessing, as opposed to asyncio's cooperative multitasking. Process isolation prevents interference between API handling and model execution, while ZeroMQ provides a high-performance, low-overhead inter-process comms system.
Migration Guide
(will add this to the docs later)
Key API Changes
Initialization
Request Generation
The core generation API remains largely the same, but error handling has changed:
Health Checks
Request Abortion
Stats Logging
Important Changes to Note
ENGINE_DEAD_ERROR and MQClientClosedError.
TODO: