# OLM-benchmark-evaluation

This repo draws inspiration from, and references the ideas and code of, https://github.com/huggingface/optimum-benchmark, with practical adjustments for convenience so that it can be run directly in VS Code. Using microsoft/Phi-3-mini-4k-instruct as a case study, it independently verifies the inference performance of the PyTorch and vLLM backends.

For the tests, I used an Azure A100 VM as the experimental environment. The generated test reports are stored in the local file system of the Azure GPU VM.

```
Tue Jun 11 00:46:29 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                    0 |
| N/A   34C    P0              43W / 300W |      9MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

You can run the code directly in VS Code, provided you connect VS Code to the Azure GPU VM via a Jupyter server; alternatively, you can run it directly with Python. Below, I compare the PyTorch and vLLM inference results for Phi-3-mini-4k-instruct on the A100 VM.

The files generated by running the code:

```
root@david1a100:~/result# ls
benchmark.log                                         benchmark_reportvllm.Phi-3-mini-4k-instruct.csv
benchmark_pytorch_Phi-3-mini-4k-instruct.csv          benchmark_reportvllm.Phi-3-mini-4k-instruct.json
benchmark_pytorch_Phi-3-mini-4k-instruct.json         benchmarkvllm.Phi-3-mini-4k-instruct.csv
benchmark_report_pytorch_Phi-3-mini-4k-instruct.csv   benchmarkvllm.Phi-3-mini-4k-instruct.json
benchmark_report_pytorch_Phi-3-mini-4k-instruct.json
```

To compare the two backends, look at benchmark_pytorch_Phi-3-mini-4k-instruct.json and benchmarkvllm.Phi-3-mini-4k-instruct.json, which contain the complete results.
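
For a quick side-by-side look, here is a minimal sketch that loads both reports. The key layout inside the JSON files is not documented here, so the snippet only prints the top-level structure; drill into the real key names from there:

```python
import json

def load_report(path):
    with open(path) as f:
        return json.load(f)

# The two report files produced on the A100 VM (file names as listed above).
pytorch_report = load_report("benchmark_pytorch_Phi-3-mini-4k-instruct.json")
vllm_report = load_report("benchmarkvllm.Phi-3-mini-4k-instruct.json")

# Print the top-level keys first to see where latency/throughput live,
# then index into the sections you want to compare.
print("PyTorch report sections:", list(pytorch_report.keys()))
print("vLLM report sections:   ", list(vllm_report.keys()))
```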

## Summary of Test Results (PyTorch)

### Configuration Details
- Model: microsoft/Phi-3-mini-4k-instruct
- Framework: PyTorch 2.3.0
- Task: Text Generation
- Hardware:
  - CPU: AMD EPYC 7V13 64-Core Processor
  - GPU: NVIDIA A100 80GB PCIe
  - RAM: ~232 GB
- Platform: Linux
- Python Version: 3.10.14
- Settings:
  - Iterations: 10
  - Warmup Runs: 5
  - Batch Size: 1
  - Sequence Length: 512
  - Generative Parameters: max_new_tokens=100, min_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9, do_sample=true (see the sketch after this list)
- Metrics: Latency and Memory
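
For reference, a minimal sketch of how these generation settings map onto the Hugging Face transformers API. The prompt and device placement are illustrative assumptions, not taken from this repo's benchmark scripts (which feed a fixed 512-token input):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # needed on older transformers releases
).to("cuda")

# Illustrative prompt; the benchmark itself uses a fixed 512-token input.
inputs = tokenizer("Explain KV caching in one paragraph.", return_tensors="pt").to("cuda")

# Same generation settings as the configuration above.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    min_new_tokens=100,
    do_sample=True,
    temperature=1.0,
    top_k=50,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
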
### Performance Metrics

#### Prefill Phase
- Memory Usage:
  - Max RAM: 1277.6 MB
  - Max VRAM (Global): 18146.5 MB
  - Max VRAM (Process): 17217.6 MB
  - Max Reserved: 16693.3 MB
  - Max Allocated: 16198.3 MB
- Latency:
  - Total: 2.56 seconds
  - Mean: 0.256 seconds
  - Standard Deviation: 0.0007 seconds
  - p50 (Median): 0.256 seconds
  - p90: 0.257 seconds
  - p95: 0.257 seconds
  - p99: 0.257 seconds
- Throughput: 1998.8 tokens/second

#### Decode Phase
- Memory Usage:
  - Max RAM: 1277.8 MB
  - Max VRAM (Global): 18146.5 MB
  - Max VRAM (Process): 17217.6 MB
  - Max Reserved: 16693.3 MB
  - Max Allocated: 16568.3 MB
- Latency:
  - Total: 24.15 seconds
  - Mean: 2.415 seconds
  - Standard Deviation: 0.0184 seconds
  - p50 (Median): 2.411 seconds
  - p90: 2.429 seconds
  - p95: 2.446 seconds
  - p99: 2.460 seconds
- Throughput: 41.0 tokens/second

#### Per Token Phase
- Latency:
  - Total: 26.43 seconds (for 989 tokens)
  - Mean: 0.027 seconds
  - Standard Deviation: 0.0245 seconds
  - p50 (Median): 0.0242 seconds
  - p90: 0.0256 seconds
  - p95: 0.0264 seconds
  - p99: 0.0274 seconds
- Throughput: 37.4 tokens/second
### Key Observations
- Memory Usage: consistent and well managed across both the prefill and decode phases.
- Latency:
  - Prefill latency is low and stable (mean ~0.256 seconds).
  - Decode latency is higher (mean ~2.415 seconds) but stays within a narrow range, indicating consistent performance.
- Throughput:
  - Prefill throughput is high (1998.8 tokens/second) because the prompt tokens are processed in parallel.
  - Decode throughput is much lower (41.0 tokens/second), reflecting the sequential, token-by-token nature of generation (a rough way to separate the two timings is sketched below).
## Summary of Test Results (vLLM)

### Configuration Details
- Model: microsoft/Phi-3-mini-4k-instruct
- Framework: vLLM 0.4.3
- Task: Text Generation
- Hardware:
  - CPU: AMD EPYC 7V13 64-Core Processor
  - GPU: NVIDIA A100 80GB PCIe
  - RAM: ~232 GB
- Platform: Linux
- Python Version: 3.10.14
- Settings:
  - Iterations: 10
  - Warmup Runs: 5
  - Batch Size: 1
  - Sequence Length: 512
  - Generative Parameters: max_new_tokens=50, min_new_tokens=50, temperature=1.0, top_k=50, top_p=0.9, do_sample=true (see the sketch after this list)
- Metrics: Latency and Memory
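
As before, a minimal sketch of how these settings map onto vLLM's API. The engine construction and prompt are illustrative, and `min_tokens` assumes a vLLM 0.4.x or newer release:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-3-mini-4k-instruct")

# Mirror the benchmark settings: exactly 50 new tokens, sampled with
# temperature 1.0, top_k 50, top_p 0.9.
params = SamplingParams(
    temperature=1.0,
    top_k=50,
    top_p=0.9,
    max_tokens=50,
    min_tokens=50,
)

outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```
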
### Performance Metrics

#### Prefill Phase
- Memory Usage:
  - Max RAM: 5928.0 MB
  - Max VRAM (Global): 78605.3 MB
  - Max VRAM (Process): 77668.0 MB
  - Max Reserved: not reported
  - Max Allocated: not reported
- Latency:
  - Total: 9.9281 seconds
  - Mean: 0.0269 seconds
  - Standard Deviation: 0.0004 seconds
  - p50 (Median): 0.0269 seconds
  - p90: 0.0275 seconds
  - p95: 0.0276 seconds
  - p99: 0.0278 seconds
- Throughput: 19029.6 tokens/second

#### Decode Phase
- Memory Usage:
  - Max RAM: 5928.0 MB
  - Max VRAM (Global): 78605.3 MB
  - Max VRAM (Process): 77668.0 MB
  - Max Reserved: not reported
  - Max Allocated: not reported
- Latency:
  - Total: 9.7619 seconds
  - Mean: 0.4244 seconds
  - Standard Deviation: 0.0010 seconds
  - p50 (Median): 0.4243 seconds
  - p90: 0.4254 seconds
  - p95: 0.4256 seconds
  - p99: 0.4273 seconds
- Throughput: 115.4 tokens/second

#### Per Token Phase
- Not reported (no per-token memory, latency, throughput, energy, or efficiency metrics).
### Key Observations
- Memory Usage: higher than with the PyTorch backend, with max RAM at 5928 MB and max VRAM at around 78,605 MB globally and 77,668 MB for the process (vLLM reserves most of the GPU memory up front; see the note below).
- Latency:
  - Prefill latency is low and stable, with a mean of approximately 0.0269 seconds.
  - Decode latency is higher, with a mean of approximately 0.4244 seconds, but the standard deviation is low, indicating consistent performance.
- Throughput:
  - Prefill throughput is very high at 19029.6 tokens/second.
  - Decode throughput is significantly lower at 115.4 tokens/second, reflecting the sequential cost of generating new tokens.

From the results above, vLLM delivers roughly three times the decode throughput of PyTorch (115.4 vs. 41.0 tokens/second). Note that vLLM reserves a large block of GPU memory up front, sized according to the maximum sequence length it is configured for. This means that even though Phi-3 also offers a 128K-context variant, actually running vLLM inference with a 128K maximum sequence length on this setup can cause out-of-memory errors. You can limit the reservation with gpu_memory_utilization; setting it to, for example, 0.2 (20%) is sufficient for this model. With gpu_memory_utilization lowered, the performance results are the same, so there is no performance degradation from this setting.
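
A minimal sketch of where gpu_memory_utilization is set when constructing the vLLM engine; the max_model_len=4096 cap is an illustrative choice matching the 4K model, not a value taken from this repo's scripts:

```python
from vllm import LLM

# Reserve only ~20% of GPU memory for weights + KV cache instead of the
# default 0.9, and cap the context at the model's 4K length so the
# KV-cache reservation stays small.
llm = LLM(
    model="microsoft/Phi-3-mini-4k-instruct",
    gpu_memory_utilization=0.2,
    max_model_len=4096,
)
```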
