[Question]: Why is running MInference/examples/run_vllm.py not as fast as running vllm alone? #43

Open
zjjznw123 opened this issue Jul 16, 2024 · 1 comment

zjjznw123 commented Jul 16, 2024

Describe the issue

from vllm import LLM, SamplingParams

from minference import MInference

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

prompts = prompts * 100

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=10,
)
model_name = "Qwen/Qwen2-7B-Instruct/"

llm = LLM(
    model_name,
    max_num_seqs=1,
    enforce_eager=True,
)

# Patch MInference Module
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

outputs = llm.generate(prompts, sampling_params)


import time

t1 = time.time()

# Note: this loop only iterates over the already-generated outputs;
# llm.generate() above has finished before the timer starts.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text

t2 = time.time()

print('minference_time:', t2 - t1)


print('=============================================================================')


from vllm import LLM, SamplingParams

from minference import MInference

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

prompts = prompts * 100


sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=10,
)
model_name = "Qwen/Qwen2-7B-Instruct/"


llm = LLM(
    model_name,
    max_num_seqs=1,
    enforce_eager=True,
)

# MInference patch disabled here to get the plain vLLM baseline
#minference_patch = MInference("vllm", model_name)
#llm = minference_patch(llm)

outputs = llm.generate(prompts, sampling_params)


import time

t1 = time.time()

# Same as above: only the iteration over the outputs is timed,
# not the llm.generate() call itself.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text

t2 = time.time()

print('vllm_time:', t2 - t1)

Result:
minference_time: 0.0003895759582519531 s
vllm_time: 0.0002791881561279297 s
Why is minference_time greater than vllm_time?

zjjznw123 added the question (Further information is requested) label Jul 16, 2024
iofu728 self-assigned this Jul 16, 2024
iofu728 (Contributor) commented Jul 16, 2024

Hi @zjjznw123, thanks for your support of MInference.

MInference involves some additional approximations that introduce overhead, so it is slower than dense attention at short context sizes. Latency breaks even at around 30k tokens, as shown in our end-to-end benchmark: End-to-End Benchmark. Latency can be reduced further by adjusting the sparsity rate of the sparse attention, although more aggressive sparsity at smaller context windows has not been extensively tested.
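
For a fairer comparison, the timer should wrap the llm.generate() call itself (the script above only times the loop that reads the already-generated outputs), and the prompt needs to be long enough to reach the ~30k-token break-even point. Below is a minimal sketch along those lines; the long prompt and the max_model_len value are illustrative assumptions, not tested settings.

import time

from vllm import LLM, SamplingParams
from minference import MInference

model_name = "Qwen/Qwen2-7B-Instruct"

# Illustrative long prompt (tens of thousands of tokens); any real long-context input works.
long_prompt = "Summarize the following text.\n" + "lorem ipsum " * 20000

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=10)

# max_model_len is an assumption here; set it to whatever your deployment supports.
llm = LLM(model_name, max_num_seqs=1, enforce_eager=True, max_model_len=131072)

# Patch the vLLM engine with MInference (same call as in the script above).
llm = MInference("vllm", model_name)(llm)

t0 = time.time()
outputs = llm.generate([long_prompt], sampling_params)  # the timer wraps generation itself
print("generation latency:", time.time() - t0, "s")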

We tested a 128k context with Qwen2-7B using the HF backend, and its speedup ratio matches the LLaMA-3-8B results with triton==2.1.0.

➜  MInference git:(main) ✗ python3.9 experiments/benchmarks/benchmark_e2e.py --attn_type minference --context_window 100000 --model_name Qwen/Qwen2-7B-Instruct
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████| 4/4 [00:02<00:00,  1.44it/s]
Patched model for minference..
100000 13.43383584022522
➜  MInference git:(main) ✗ python3.9 experiments/benchmarks/benchmark_e2e.py --attn_type minference_with_dense --context_window 100000 --model_name Qwen/Qwen2-7B-Instruct
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████| 4/4 [00:02<00:00,  1.45it/s]
Patched model for minference..
100000 19.306786155700684
