
feat: introduce MQAphroditeEngine #1056

Merged
merged 9 commits into from
Dec 31, 2024
Conversation

AlpinDale (Member) commented Dec 27, 2024

This PR introduces the MQAphroditeEngine class, which aims to completely replace the previous AsyncAphrodite.

Huge thanks to the Neural Magic team for coming up with this design pattern.

Motivation

Why? Because the async class was extremely slow. See the stats below for a comparison:

Synchronous Engine (AphroditeEngine):

python tests/benchmarks/engine/throughput.py --model EleutherAI/pythia-70m-deduped --num-prompts 2000 --max-model-len 512 --input-len 128 --output-len 128

Throughput: 66.53 requests/s, 17031.75 tokens/s

Asynchronous Engine (AsyncAphrodite):

python tests/benchmarks/engine/throughput.py --model EleutherAI/pythia-70m-deduped --num-prompts 2000 --max-model-len 512 --input-len 128 --output-len 128 --async-engine

Throughput: 25.30 requests/s, 6475.93 tokens/s

Huge overhead! Over the past few months, we've made several attempts to reduce it by:

  • decoupling the API frontend using a multiprocessing server
  • improving OpenAI server's performance
  • etc

None of these fully mitigated the issue.

What's new?

The new MQAphroditeEngine replaces the previous async/await pattern with a multiprocessing architecture using ZeroMQ for inter-process communication. This change seems to significantly improve performance under high concurrency by eliminating the asyncio event loop overhead and providing better isolation between the API server and model execution. For comparison:

python tests/benchmarks/engine/throughput.py --model EleutherAI/pythia-70m-deduped --num-prompts 2000 --max-model-len 512 --input-len 128 --output-len 128 --async-engine

Throughput: 53.30 requests/s, 13645.92 tokens/s

Not completely up to par with the synchronous AphroditeEngine class, but still a roughly 2x performance improvement over the old async engine.

Architecture

(Architecture diagram)

The system is split into two main processes: the Engine process running the core AphroditeEngine, and the API server process handling HTTP requests and client comms.

The MQAphroditeEngine wraps the core AphroditeEngine (similar to AsyncAphrodite). It runs a background loop to process requests and manage comms through four distinct ZeroMQ socket types:

  • Input Socket: Receives generation requests
  • Output Socket: Streams model outputs
  • Health Socket: Monitors engine health
  • Data Socket: Handles startup and config
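The four channels above can be modeled with a minimal stdlib sketch, using multiprocessing queues as stand-ins for the ZeroMQ sockets (all names here are illustrative, not the PR's actual API):

```python
from multiprocessing import Queue

class EngineChannels:
    """Illustrative stand-ins for the four ZeroMQ sockets; the real
    engine uses zmq sockets over an IPC path, not queues."""

    def __init__(self):
        self.input_queue = Queue()   # generation requests from the client
        self.output_queue = Queue()  # streamed model outputs back to the client
        self.health_queue = Queue()  # periodic liveness pings
        self.data_queue = Queue()    # startup handshake / config exchange

    def submit(self, request):
        # Client side: enqueue a generation request.
        self.input_queue.put(request)

    def next_request(self):
        # Engine side: pull the next request off the input channel.
        return self.input_queue.get()

channels = EngineChannels()
channels.submit({"request_id": "r1", "prompt": "Hello"})
print(channels.next_request()["request_id"])  # prints "r1"
```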

On the client side, MQAphroditeEngineClient provides an interface conforming to the EngineClient protocol. It handles request submission, output streaming, health checks, and error handling.
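Based on the operations described in this PR (generate, health checks, abort, stats logging), the EngineClient protocol plausibly looks something like the sketch below; the exact signatures are assumptions, not the PR's real definitions:

```python
from typing import AsyncIterator, Protocol, runtime_checkable

@runtime_checkable
class EngineClient(Protocol):
    """Hypothetical shape of the protocol the client conforms to."""

    def generate(self, prompt: str, sampling_params: object,
                 request_id: str) -> AsyncIterator[object]:
        """Submit a request and stream outputs back."""
        ...

    async def check_health(self) -> None:
        """Raise if the engine process is dead."""
        ...

    async def abort(self, request_id: str) -> None:
        """Cancel an in-flight request."""
        ...

    async def do_log_stats(self) -> None:
        """Ask the engine to log throughput statistics."""
        ...
```

With `@runtime_checkable`, any class implementing these four methods passes an `isinstance` check against the protocol, which is how a structural protocol like this is typically consumed.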

More...

(Request flow diagram)

When a client submits a request, it goes through MQAphroditeEngineClient which serializes and sends it via ZeroMQ to the engine process. The engine processes the request through AphroditeEngine and streams outputs back to the client. The client then deserializes and yields the results to the end user.
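The round-trip above can be sketched as a serialize/deserialize cycle; ZeroMQ carries raw bytes, and pickle stands in here for whatever wire format the PR actually uses (the request/output classes below are hypothetical):

```python
import pickle
from dataclasses import dataclass

# Hypothetical request/output shapes; the real classes live in Aphrodite.
@dataclass
class GenerateRequest:
    request_id: str
    prompt: str

@dataclass
class RequestOutput:
    request_id: str
    text: str

# Client side: serialize the request into bytes for the input socket.
wire_bytes = pickle.dumps(GenerateRequest("req-1", "Hello"))

# Engine side: deserialize, "process", and serialize an output frame.
request = pickle.loads(wire_bytes)
output_bytes = pickle.dumps(RequestOutput(request.request_id, "Hello, world"))

# Client side again: deserialize the streamed output and yield it onward.
output = pickle.loads(output_bytes)
print(output.request_id, output.text)  # prints "req-1 Hello, world"
```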

A few of the optimization tricks: optional async socket handling overlaps I/O with CPU computation through an engine callback mechanism. ZeroMQ's lightweight, efficient message passing reduces comms overhead with minimal serialization/deserialization costs. Process isolation also prevents GIL contention and seems to provide better CPU utilization.

The error handling is a lot more robust too. See the code for more details.

Why does it work better?

The key seems to be true parallelism through multiprocessing, as opposed to asyncio's cooperative multitasking. Process isolation prevents interference between API handling and model execution, while ZeroMQ provides high-performance, low-overhead inter-process communication.

Migration Guide

(will add this to the docs later)

Key API Changes

Initialization

# Before - AsyncAphrodite
engine = AsyncAphrodite.from_engine_args(engine_args)

# After - MQAphroditeEngine
# NOTE: You'll work with the client, not the engine directly
engine_client = MQAphroditeEngineClient(ipc_path, engine_config)

Request Generation
The core generation API remains largely the same, but error handling has changed:

# Before
try:
    async for output in engine.generate(prompt, sampling_params, request_id):
        yield output
except AsyncEngineDeadError:
    ...  # Handle engine death

# After
try:
    async for output in engine_client.generate(prompt, sampling_params, request_id):
        yield output
except ENGINE_DEAD_ERROR:
    ...  # Handle engine death

Health Checks

# Before
await engine.check_health()

# After
await engine_client.check_health()
# NOTE: Health checks are now performed automatically in the background
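The automatic background health check can be sketched with a thread that periodically pings the engine and records the last result; the class, interval, and names below are illustrative, not the client's real internals:

```python
import threading
import time

class HealthMonitor:
    """Pings a health-check callable on a background thread and records
    the last result, mimicking the client's automatic health loop."""

    def __init__(self, check, interval=0.05):
        self.check = check          # callable returning True if engine is alive
        self.interval = interval
        self.healthy = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.healthy = self.check()
            self._stop.wait(self.interval)  # sleep, but wake early on stop()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

monitor = HealthMonitor(check=lambda: True)
monitor.start()
time.sleep(0.1)   # let at least one ping run
monitor.stop()
print(monitor.healthy)  # prints "True"
```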

Request Abortion

# Before
await engine.abort(request_id)

# After
await engine_client.abort(request_id)
# NOTE: More resilient to engine failures

Stats Logging

# Before
await engine.do_log_stats()

# After
await engine_client.do_log_stats()
# NOTE: Stats logging is now handled automatically by the engine process

Important Changes to Note

  1. Process Management: The engine now runs in a separate process, so you need to ensure proper process lifecycle management.
  2. Error Handling: Errors are now propagated through ZeroMQ and may be wrapped. Check for ENGINE_DEAD_ERROR and MQClientClosedError.
  3. Configuration: Additional config options for ZMQ communication:
  • IPC path config
  • Socket timeouts
  • Async socket processing options
  4. Resource Cleanup: The client needs proper cleanup when shutting down:
# Ensure cleanup of the ZMQ context
engine_client.context.term()
  5. Embeddings: Currently not supported in multiprocessing mode.
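One way to guarantee the cleanup in point 4 runs even when request handling raises is a small context manager; `FakeClient` and `close()` below are illustrative stand-ins for the real client and its ZMQ context teardown:

```python
from contextlib import contextmanager

class FakeClient:
    """Stand-in for MQAphroditeEngineClient; only models the cleanup path."""

    def __init__(self):
        self.closed = False

    def close(self):
        # Real client: engine_client.context.term() plus socket closes.
        self.closed = True

@contextmanager
def engine_client_session():
    client = FakeClient()
    try:
        yield client
    finally:
        client.close()  # runs even if the request loop raises

with engine_client_session() as client:
    pass  # submit requests here

print(client.closed)  # prints "True"
```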

TODO:

  • Fix model unloading/loading endpoint (need to implement shutdown in the MQAphroditeEngine)

@AlpinDale AlpinDale merged commit 9a7d551 into main Dec 31, 2024
5 checks passed