
feat: introduce MQAphroditeEngine #1056

Merged
merged 9 commits into from
Dec 31, 2024
Conversation

AlpinDale (Member) commented Dec 27, 2024

This PR introduces the MQAphroditeEngine class, which aims to completely replace the previous AsyncAphrodite.

Huge thanks to the Neural Magic team for coming up with this design pattern.

Motivation

Why? Because the async class was extremely slow. See the stats below for a comparison:

Synchronous Engine (AphroditeEngine):

python tests/benchmarks/engine/throughput.py --model EleutherAI/pythia-70m-deduped --num-prompts 2000 --max-model-len 512 --input-len 128 --output-len 128

Throughput: 66.53 requests/s, 17031.75 tokens/s

Asynchronous Engine (AsyncAphrodite):

python tests/benchmarks/engine/throughput.py --model EleutherAI/pythia-70m-deduped --num-prompts 2000 --max-model-len 512 --input-len 128 --output-len 128 --async-engine

Throughput: 25.30 requests/s, 6475.93 tokens/s

Huge overhead! Over the past few months, we've made several attempts to reduce it by:

  • decoupling the API frontend using a multiprocessing server
  • improving OpenAI server's performance
  • etc

None of these fully mitigated the issue.

What's new?

The new MQAphroditeEngine replaces the previous async/await pattern with a multiprocessing architecture using ZeroMQ for inter-process communication. This change seems to significantly improve performance under high concurrency by eliminating the asyncio event loop overhead and providing better isolation between the API server and model execution. For comparison:

python tests/benchmarks/engine/throughput.py --model EleutherAI/pythia-70m-deduped --num-prompts 2000 --max-model-len 512 --input-len 128 --output-len 128 --async-engine

Throughput: 53.30 requests/s, 13645.92 tokens/s

Not completely up to par with the synchronous AphroditeEngine class, but still a roughly 2x performance improvement over the old async engine.

Architecture

(Architecture diagram)

The system is split into two main processes: the Engine process running the core AphroditeEngine, and the API server process handling HTTP requests and client comms.

The MQAphroditeEngine wraps the core AphroditeEngine (similar to AsyncAphrodite). It runs a background loop to process requests and manage comms through four distinct ZeroMQ socket types:

  • Input Socket: Receives generation requests
  • Output Socket: Streams model outputs
  • Health Socket: Monitors engine health
  • Data Socket: Handles startup and config
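The four channels above can be modeled with a minimal stdlib sketch, using multiprocessing queues as stand-ins for the ZeroMQ sockets (all names here are illustrative, not the PR's actual API):

```python
from multiprocessing import Queue

class EngineChannels:
    """Illustrative stand-ins for the four ZeroMQ sockets; the real
    engine uses zmq sockets over an IPC path, not queues."""

    def __init__(self):
        self.input_queue = Queue()   # generation requests from the client
        self.output_queue = Queue()  # streamed model outputs back to the client
        self.health_queue = Queue()  # periodic liveness pings
        self.data_queue = Queue()    # startup handshake / config exchange

    def submit(self, request):
        # Client side: enqueue a generation request.
        self.input_queue.put(request)

    def next_request(self):
        # Engine side: pull the next request off the input channel.
        return self.input_queue.get()

channels = EngineChannels()
channels.submit({"request_id": "r1", "prompt": "Hello"})
print(channels.next_request()["request_id"])  # prints "r1"
```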

On the client side, MQAphroditeEngineClient provides an interface conforming to the EngineClient protocol. It handles request submission, output streaming, health checks, and error handling.
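Based on the operations described in this PR (generate, health checks, abort, stats logging), the EngineClient protocol plausibly looks something like the sketch below; the exact signatures are assumptions, not the PR's real definitions:

```python
from typing import AsyncIterator, Protocol, runtime_checkable

@runtime_checkable
class EngineClient(Protocol):
    """Hypothetical shape of the protocol the client conforms to."""

    def generate(self, prompt: str, sampling_params: object,
                 request_id: str) -> AsyncIterator[object]:
        """Submit a request and stream outputs back."""
        ...

    async def check_health(self) -> None:
        """Raise if the engine process is dead."""
        ...

    async def abort(self, request_id: str) -> None:
        """Cancel an in-flight request."""
        ...

    async def do_log_stats(self) -> None:
        """Ask the engine to log throughput statistics."""
        ...
```

With `@runtime_checkable`, any class implementing these four methods passes an `isinstance` check against the protocol, which is how a structural protocol like this is typically consumed.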

More...

(Request flow diagram)

When a client submits a request, it goes through MQAphroditeEngineClient which serializes and sends it via ZeroMQ to the engine process. The engine processes the request through AphroditeEngine and streams outputs back to the client. The client then deserializes and yields the results to the end user.
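The round-trip above can be sketched as a serialize/deserialize cycle; ZeroMQ carries raw bytes, and pickle stands in here for whatever wire format the PR actually uses (the request/output classes below are hypothetical):

```python
import pickle
from dataclasses import dataclass

# Hypothetical request/output shapes; the real classes live in Aphrodite.
@dataclass
class GenerateRequest:
    request_id: str
    prompt: str

@dataclass
class RequestOutput:
    request_id: str
    text: str

# Client side: serialize the request into bytes for the input socket.
wire_bytes = pickle.dumps(GenerateRequest("req-1", "Hello"))

# Engine side: deserialize, "process", and serialize an output frame.
request = pickle.loads(wire_bytes)
output_bytes = pickle.dumps(RequestOutput(request.request_id, "Hello, world"))

# Client side again: deserialize the streamed output and yield it onward.
output = pickle.loads(output_bytes)
print(output.request_id, output.text)  # prints "req-1 Hello, world"
```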

A few of the optimization tricks: optional async socket handling overlaps I/O with CPU computation through an engine callback mechanism. ZeroMQ's lightweight, efficient message passing reduces comms overhead with minimal serialization/deserialization costs. Process isolation also prevents GIL contention and seems to provide better CPU utilization.

The error handling is a lot more robust too. See the code for more details.

Why does it work better?

The key seems to be true parallelism through multiprocessing, as opposed to asyncio's cooperative multitasking. Process isolation prevents interference between API handling and model execution, while ZeroMQ provides high-performance, low-overhead inter-process communication.

Migration Guide

(will add this to the docs later)

Key API Changes

Initialization

# Before - AsyncAphrodite
engine = AsyncAphrodite.from_engine_args(engine_args)

# After - MQAphroditeEngine
# NOTE: You'll work with the client, not the engine directly
engine_client = MQAphroditeEngineClient(ipc_path, engine_config)

Request Generation
The core generation API remains largely the same, but error handling has changed:

# Before
try:
    async for output in engine.generate(prompt, sampling_params, request_id):
        yield output
except AsyncEngineDeadError:
    ...  # Handle engine death

# After
try:
    async for output in engine_client.generate(prompt, sampling_params, request_id):
        yield output
except ENGINE_DEAD_ERROR:
    ...  # Handle engine death

Health Checks

# Before
await engine.check_health()

# After
await engine_client.check_health()
# NOTE: Health checks are now performed automatically in the background
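The automatic background health check can be sketched with a thread that periodically pings the engine and records the last result; the class, interval, and names below are illustrative, not the client's real internals:

```python
import threading
import time

class HealthMonitor:
    """Pings a health-check callable on a background thread and records
    the last result, mimicking the client's automatic health loop."""

    def __init__(self, check, interval=0.05):
        self.check = check          # callable returning True if engine is alive
        self.interval = interval
        self.healthy = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.healthy = self.check()
            self._stop.wait(self.interval)  # sleep, but wake early on stop()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

monitor = HealthMonitor(check=lambda: True)
monitor.start()
time.sleep(0.1)   # let at least one ping run
monitor.stop()
print(monitor.healthy)  # prints "True"
```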

Request Abortion

# Before
await engine.abort(request_id)

# After
await engine_client.abort(request_id)
# NOTE: More resilient to engine failures

Stats Logging

# Before
await engine.do_log_stats()

# After
await engine_client.do_log_stats()
# NOTE: Stats logging is now handled automatically by the engine process

Important Changes to Note

  1. Process Management: The engine now runs in a separate process, so you need to ensure proper process lifecycle management.
  2. Error Handling: Errors are now propagated through ZeroMQ and may be wrapped. Check for ENGINE_DEAD_ERROR and MQClientClosedError.
  3. Configuration: Additional config options for ZMQ communication:
  • IPC path config
  • Socket timeouts
  • Async socket processing options
  4. Resource Cleanup: The client needs proper cleanup when shutting down:
# Ensure cleanup of the ZMQ context
engine_client.context.term()
  5. Embeddings: Currently not supported in multiprocessing mode.
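One way to guarantee the cleanup in point 4 runs even when request handling raises is a small context manager; `FakeClient` and `close()` below are illustrative stand-ins for the real client and its ZMQ context teardown:

```python
from contextlib import contextmanager

class FakeClient:
    """Stand-in for MQAphroditeEngineClient; only models the cleanup path."""

    def __init__(self):
        self.closed = False

    def close(self):
        # Real client: engine_client.context.term() plus socket closes.
        self.closed = True

@contextmanager
def engine_client_session():
    client = FakeClient()
    try:
        yield client
    finally:
        client.close()  # runs even if the request loop raises

with engine_client_session() as client:
    pass  # submit requests here

print(client.closed)  # prints "True"
```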

TODO:

  • Fix model unloading/loading endpoint (need to implement shutdown in the MQAphroditeEngine)

@AlpinDale AlpinDale merged commit 9a7d551 into main Dec 31, 2024
5 checks passed