This repo draws inspiration from, and references ideas and code in, https://github.com/huggingface/optimum-benchmark, with practical adjustments so that it can be used directly and easily from VS Code. Using microsoft/Phi-3-mini-4k-instruct as a case study, the repo independently verifies the performance of PyTorch inference and vLLM inference.
For the tests I used an Azure A100 GPU VM as the experimental environment; the generated test reports are stored on the VM's local file system. The GPU environment, as reported by nvidia-smi:
Tue Jun 11 00:46:29 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                    0 |
| N/A   34C    P0              43W / 300W |      9MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
You can run the code directly in VS Code, provided you connect VS Code to the Azure GPU VM via a Jupyter server; alternatively, you can run it directly with Python. Below I compare the analysis results of PyTorch and vLLM inference for Phi-3-mini-4k-instruct on the A100 VM.
The files generated by running the code:
root@david1a100:~/result# ls
benchmark.log benchmark_reportvllm.Phi-3-mini-4k-instruct.csv
benchmark_pytorch_Phi-3-mini-4k-instruct.csv benchmark_reportvllm.Phi-3-mini-4k-instruct.json
benchmark_pytorch_Phi-3-mini-4k-instruct.json benchmarkvllm.Phi-3-mini-4k-instruct.csv
benchmark_report_pytorch_Phi-3-mini-4k-instruct.csv benchmarkvllm.Phi-3-mini-4k-instruct.json
benchmark_report_pytorch_Phi-3-mini-4k-instruct.json
For the comparison, it is enough to look at benchmark_pytorch_Phi-3-mini-4k-instruct.json and benchmarkvllm.Phi-3-mini-4k-instruct.json, which contain the complete information.
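Because the exact nesting of the report JSON can differ between optimum-benchmark versions, a small schema-agnostic helper is handy for a side-by-side look at the two files. The sketch below is illustrative (the flatten helper is mine, not part of the repo) and only assumes the reports are nested JSON objects:

import json

def flatten(d, prefix=""):
    # Recursively flatten a nested dict into dotted keys -> leaf values.
    items = {}
    for k, v in d.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            items.update(flatten(v, key))
        else:
            items[key] = v
    return items

with open("benchmark_pytorch_Phi-3-mini-4k-instruct.json") as f:
    pytorch_report = flatten(json.load(f))
with open("benchmarkvllm.Phi-3-mini-4k-instruct.json") as f:
    vllm_report = flatten(json.load(f))

# Print every metric that exists in both reports, side by side.
for key in sorted(pytorch_report.keys() & vllm_report.keys()):
    print(f"{key:60s} pytorch={pytorch_report[key]}  vllm={vllm_report[key]}")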
Summary of Test Results (PyTorch Backend)
Configuration Details
Model: microsoft/Phi-3-mini-4k-instruct
Framework: PyTorch 2.3.0
Task: Text Generation
Hardware:
CPU: AMD EPYC 7V13 64-Core Processor
GPU: NVIDIA A100 80GB PCIe
RAM: ~232GB
Platform: Linux
Python Version: 3.10.14
Settings:
Iterations: 10
Warmup Runs: 5
Batch Size: 1
Sequence Length: 512
Generative Parameters: max_new_tokens=100, min_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9, do_sample=true
Metrics: Latency and Memory (see the measurement sketch below)
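To make the prefill/decode split in the report below concrete, here is a minimal sketch of how such numbers can be measured with plain PyTorch and transformers under the settings above (batch size 1, a 512-token prompt, 100 new tokens, 5 warmup runs). It is an illustrative stand-in, not the repo's actual benchmarking code (which follows optimum-benchmark); the float16 dtype, the synthetic random prompt, and the timing helper are my assumptions:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda").eval()

# Batch size 1, sequence length 512: a synthetic prompt of 512 random tokens.
input_ids = torch.randint(0, tokenizer.vocab_size, (1, 512), device="cuda")

def timed(fn):
    # Wall-clock time of fn() with the GPU fully synchronized before and after.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn()
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

with torch.no_grad():
    for _ in range(5):                      # warmup runs, as in the report
        model(input_ids)

    # Prefill: a single forward pass over the full 512-token prompt.
    _, prefill_s = timed(lambda: model(input_ids))

    # Prefill + decode: generate 100 new tokens with the report's sampling settings.
    gen_kwargs = dict(max_new_tokens=100, min_new_tokens=100, do_sample=True,
                      temperature=1.0, top_k=50, top_p=0.9)
    _, total_s = timed(lambda: model.generate(input_ids, **gen_kwargs))

# Decode is everything after the prompt has been processed; 99 decode steps,
# since the first new token comes out of the prefill pass.
decode_s = total_s - prefill_s
print(f"prefill: {prefill_s:.3f}s ({512 / prefill_s:.1f} tok/s), "
      f"decode: {decode_s:.3f}s ({99 / decode_s:.1f} tok/s)")

The repo's reports additionally include memory tracking (RAM, VRAM, reserved/allocated), which this sketch omits.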
Performance Metrics
Prefill Phase
Memory Usage:
Max RAM: 1277.6 MB
Max VRAM (Global): 18146.5 MB
Max VRAM (Process): 17217.6 MB
Max Reserved: 16693.3 MB
Max Allocated: 16198.3 MB
Latency:
Total: 2.56 seconds
Mean: 0.256 seconds
Standard Deviation: 0.0007 seconds
p50 (Median): 0.256 seconds
p90: 0.257 seconds
p95: 0.257 seconds
p99: 0.257 seconds
Throughput: 1998.8 tokens/second
Decode Phase
Memory Usage:
Max RAM: 1277.8 MB
Max VRAM (Global): 18146.5 MB
Max VRAM (Process): 17217.6 MB
Max Reserved: 16693.3 MB
Max Allocated: 16568.3 MB
Latency:
Total: 24.15 seconds
Mean: 2.415 seconds
Standard Deviation: 0.0184 seconds
p50 (Median): 2.411 seconds
p90: 2.429 seconds
p95: 2.446 seconds
p99: 2.460 seconds
Throughput: 41.0 tokens/second
Per Token Phase
Latency:
Total: 26.43 seconds (for 989 tokens)
Mean: 0.027 seconds
Standard Deviation: 0.0245 seconds
p50 (Median): 0.0242 seconds
p90: 0.0256 seconds
p95: 0.0264 seconds
p99: 0.0274 seconds
Throughput: 37.4 tokens/second
Key Observations
Memory Usage: Memory usage is stable across the prefill and decode phases (about 16.7 GB reserved and at most 16.6 GB allocated VRAM), with no growth between phases.
Latency:
Prefill phase latency is low and stable (mean ~0.256 seconds).
Decode phase has higher latency (mean ~2.415 seconds) but remains within a small range, indicating consistency.
Throughput:
Prefill throughput is high (1998.8 tokens/second) because the entire 512-token prompt is processed in parallel in a single forward pass.
Decode throughput is much lower (41.0 tokens/second) because new tokens are generated autoregressively, one forward pass per token.
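As a quick sanity check, the reported throughputs follow directly from the mean latencies and token counts above (512 prompt tokens for prefill; 99 decode steps, which is consistent with the first of the 100 new tokens being produced by the prefill pass):

# PyTorch backend: derive the reported throughputs from the mean latencies.
prefill_throughput = 512 / 0.256   # ~2000 tokens/second (reported: 1998.8)
decode_throughput = 99 / 2.415     # ~41.0 tokens/second (reported: 41.0)
print(prefill_throughput, decode_throughput)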
Summary of Test Results (vLLM Backend)
Configuration Details
Model: microsoft/Phi-3-mini-4k-instruct
Framework: vLLM 0.4.3
Task: Text Generation
Hardware:
CPU: AMD EPYC 7V13 64-Core Processor
GPU: NVIDIA A100 80GB PCIe
RAM: ~232GB
Platform: Linux
Python Version: 3.10.14
Settings:
Iterations: 10
Warmup Runs: 5
Batch Size: 1
Sequence Length: 512
Generative Parameters: max_new_tokens=50, min_new_tokens=50, temperature=1.0, top_k=50, top_p=0.9, do_sample=true
Metrics: Latency and Memory
Performance Metrics
Prefill Phase
Memory Usage:
Max RAM: 5928.0 MB
Max VRAM (Global): 78605.3 MB
Max VRAM (Process): 77668.0 MB
Max Reserved: Not reported
Max Allocated: Not reported
Latency:
Total: 9.9281 seconds
Mean: 0.0269 seconds
Standard Deviation: 0.0004 seconds
p50 (Median): 0.0269 seconds
p90: 0.0275 seconds
p95: 0.0276 seconds
p99: 0.0278 seconds
Throughput: 19029.6 tokens/second
Decode Phase
Memory Usage:
Max RAM: 5928.0 MB
Max VRAM (Global): 78605.3 MB
Max VRAM (Process): 77668.0 MB
Max Reserved: Not reported
Max Allocated: Not reported
Latency:
Total: 9.7619 seconds
Mean: 0.4244 seconds
Standard Deviation: 0.0010 seconds
p50 (Median): 0.4243 seconds
p90: 0.4254 seconds
p95: 0.4256 seconds
p99: 0.4273 seconds
Throughput: 115.4 tokens/second
Per Token Phase
Per-token memory, latency, throughput, energy, and efficiency metrics are not reported for the vLLM backend.
Key Observations
Memory Usage:
Memory usage is much higher than with the PyTorch backend: max RAM usage is 5928 MB, and max VRAM usage is around 78,605 MB (global) and 77,668 MB (process).
Latency:
Prefill phase latency is low and stable, with a mean of approximately 0.0269 seconds.
Decode phase has a higher latency with a mean of approximately 0.4244 seconds, but the standard deviation is low, indicating consistent performance.
Throughput:
Prefill throughput is very high at 19029.6 tokens/second, again because the whole 512-token prompt is processed in one pass.
Decode throughput is much lower at 115.4 tokens/second, because new tokens are generated autoregressively, one step at a time.
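The same arithmetic reproduces the vLLM numbers (512 prompt tokens; 49 decode steps out of the 50 new tokens):

# vLLM backend: derive the reported throughputs from the mean latencies.
prefill_throughput = 512 / 0.0269  # ~19033 tokens/second (reported: 19029.6)
decode_throughput = 49 / 0.4244    # ~115.5 tokens/second (reported: 115.4)
print(prefill_throughput, decode_throughput)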
From the results above, we can see that vLLM delivers roughly three times the decode throughput of PyTorch (115.4 vs. 41.0 tokens/second). Note that vLLM pre-allocates a large amount of GPU memory for its KV cache, sized according to the maximum sequence length it is configured for. This means that even though Phi-3 also comes in a 128K-context variant, running vLLM inference configured for a 128K sequence length can cause OOM. You can limit this by setting gpu_memory_utilization when creating the vLLM engine; for example, 20% is sufficient for this model. With gpu_memory_utilization lowered, the performance test results are unchanged, so there is no performance degradation from this setting.
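For reference, here is a minimal sketch of how the engine can be constrained using vLLM's offline LLM API; the 0.2 utilization value and the 4096 max_model_len are examples matching the discussion above, not the repo's exact configuration:

from vllm import LLM, SamplingParams

# Cap vLLM's KV-cache pre-allocation at roughly 20% of GPU memory and limit
# the maximum sequence length to the 4k context actually used in this test.
llm = LLM(
    model="microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True,
    gpu_memory_utilization=0.2,
    max_model_len=4096,
)

# Sampling settings matching the vLLM test above (50 new tokens).
sampling_params = SamplingParams(temperature=1.0, top_k=50, top_p=0.9, max_tokens=50)

outputs = llm.generate(["Explain what a KV cache is."], sampling_params)
print(outputs[0].outputs[0].text)

With these settings, vLLM reserves only about 20% of the A100's 80 GB instead of the ~78 GB seen in the report above.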