```python
from vllm import LLM, SamplingParams
from minference import MInference

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
prompts = prompts * 100

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=10,
)

model_name = "Qwen/Qwen2-7B-Instruct/"
llm = LLM(
    model_name,
    max_num_seqs=1,
    enforce_eager=True,
)

# Patch MInference Module
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

outputs = llm.generate(prompts, sampling_params)

import time
t1 = time.time()
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
t2 = time.time()
print('minference_time:', t2 - t1)
print('=============================================================================')
```
```python
from vllm import LLM, SamplingParams
from minference import MInference

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
prompts = prompts * 100

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=10,
)

model_name = "Qwen/Qwen2-7B-Instruct/"
llm = LLM(
    model_name,
    max_num_seqs=1,
    enforce_eager=True,
)

# Patch MInference Module
# minference_patch = MInference("vllm", model_name)
# llm = minference_patch(llm)

outputs = llm.generate(prompts, sampling_params)

import time
t1 = time.time()
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
t2 = time.time()
print('vllm_time:', t2 - t1)
```
result:
minference_time: 0.0003895759582519531 s
vllm_time: 0.0002791881561279297 s
Why is minference_time greater than vllm_time?
Hi @zjjznw123, thanks for your interest in MInference.
MInference performs some additional approximations that introduce overhead, so it is slower than dense attention at short context sizes. Latency breaks even at around 30k tokens, as shown in our End-to-End Benchmark. Latency can be reduced further by adjusting the sparsity rate of the sparse attention, although more aggressive sparsity at smaller context windows has not been extensively tested.
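For reference, a minimal sketch of how to observe the break-even point: time the `llm.generate` call itself on a single long prompt (~30k tokens), once with and once without the patch. It only reuses the `LLM`, `SamplingParams`, and `MInference("vllm", ...)` calls from your snippets above; the filler prompt and its repeat count are illustrative assumptions, not part of our benchmark script.

```python
import time

from vllm import LLM, SamplingParams
from minference import MInference

model_name = "Qwen/Qwen2-7B-Instruct"

# One long prompt in the ~30k-token range, where sparse attention starts to pay off.
# The filler sentence and repeat count are illustrative; adjust them to stay within
# the model's context window.
long_prompt = "The quick brown fox jumps over the lazy dog. " * 3000

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=10)

llm = LLM(model_name, max_num_seqs=1, enforce_eager=True)

# Patch with MInference; remove these two lines to get the dense-attention baseline.
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

# Time the generate call itself, since prefill dominates at long context.
t1 = time.time()
outputs = llm.generate([long_prompt], sampling_params)
t2 = time.time()
print("generation time:", t2 - t1, "s")
```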
We tested a 128k context with Qwen2-7B using HF Transformers, and its speedup ratio matches the results for LLaMA-3-8B with triton==2.1.0.
```
➜ MInference git:(main) ✗ python3.9 experiments/benchmarks/benchmark_e2e.py --attn_type minference --context_window 100000 --model_name Qwen/Qwen2-7B-Instruct
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████| 4/4 [00:02<00:00,  1.44it/s]
Patched model for minference..
100000 13.43383584022522
➜ MInference git:(main) ✗ python3.9 experiments/benchmarks/benchmark_e2e.py --attn_type minference_with_dense --context_window 100000 --model_name Qwen/Qwen2-7B-Instruct
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████| 4/4 [00:02<00:00,  1.45it/s]
Patched model for minference..
100000 19.306786155700684
```
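For completeness, the HF path exercised by `benchmark_e2e.py` amounts to patching a transformers model in the same way as the vLLM example. The sketch below is only a rough approximation: it assumes the string passed to `MInference(...)` mirrors the `--attn_type minference` flag above, and the surrounding tokenization and timing code is illustrative rather than taken from the benchmark script.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "Qwen/Qwen2-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Patch the HF model; "minference" is assumed to mirror --attn_type minference above.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

# Illustrative long-context timing, analogous to the benchmark output above.
inputs = tokenizer(
    "The quick brown fox jumps over the lazy dog. " * 3000,
    return_tensors="pt",
).to(model.device)

t1 = time.time()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=10)
t2 = time.time()
print("hf generation time:", t2 - t1, "s")
```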