# Nvidia Triton w/ TensorRT-LLM Backend

Use the [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main) backend with the [Nvidia Triton Inference Server](https://github.com/triton-inference-server/server).

The clearest end-to-end instructions I found were in [this official blog post](https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/).

## Build TensorRT-LLM container

Follow [these instructions](https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/docs/source/installation.md) to build the Docker container used to compile the model.

When you are done, this will have created a Docker image called `tensorrt_llm/release:latest` locally.

> Note: I had to fight nvidia-docker for this to work; I ended up having to uninstall Docker and anything related to the NVIDIA Container Toolkit and re-install everything from scratch.

## Pull the model from HuggingFace

Make a directory called `model_input` and clone the Hugging Face model into it.

```bash
mkdir model_input
# Make sure you have git-lfs installed (https://git-lfs.com)
cd model_input
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
```
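
Note that the quantize and build loops further down reference the *chat* checkpoints (`Llama-2-${sz}b-chat-hf` for 7B, 13B, and 70B). If you plan to run those loops as written, clone the chat variants as well; a sketch (the `meta-llama` repos are gated, so your Hugging Face account needs access):

```bash
# Clone the chat checkpoints referenced by the quantize/build loops below
for sz in 7 13 70; do
  git clone https://huggingface.co/meta-llama/Llama-2-${sz}b-chat-hf
done
```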

## Compile the model

To compile the model, mount the model you just pulled from HuggingFace and the `model_output` directory into the container and run the compile script. First, shell into the container like this:

```bash
# Make an output directory to store the compiled model assets
mkdir model_output

sudo docker run --gpus all --ulimit memlock=-1 --ipc=host --ulimit stack=67108864 -it -v ${PWD}/model_input:/model_input -v ${PWD}/model_output:/model_output tensorrt_llm/release:latest bash
```

Install the quantization toolkit per [these instructions](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0/examples/quantization#tensorrt-llm-quantization-toolkit-installation-guide):

```bash
cd /app/tensorrt_llm/examples/quantization
python -m pip install --upgrade pip
# Obtain the cuda version from the system. Assuming nvcc is available in path.
cuda_version=$(nvcc --version | grep 'release' | awk '{print $6}' | awk -F'[V.]' '{print $2$3}')
# Obtain the python version from the system.
python_version=$(python3 --version 2>&1 | awk '{print $2}' | awk -F. '{print $1$2}')
# Download and install the AMMO package from the DevZone.
wget https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.3.0.tar.gz
tar -xzf nvidia_ammo-0.3.0.tar.gz
pip install nvidia_ammo-0.3.0/nvidia_ammo-0.3.0+cu$cuda_version-cp$python_version-cp$python_version-linux_x86_64.whl
# Install the additional requirements
pip install -r requirements.txt
```
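
As a quick sanity check (this assumes the wheel installs a package importable as `ammo`, which is how the 0.3.0 release names it), make sure the toolkit imports before you start quantizing:

```bash
# Should exit cleanly if the AMMO wheel installed correctly
python -c "import ammo" && echo "AMMO import OK"
```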

Then quantize the model; this took < 10 minutes on my RTX 6000 Ada (so be patient):

```bash
# Quantize the HF LLaMA chat checkpoints into INT4 AWQ format
cd /app/tensorrt_llm/examples/llama
for sz in 7 13 70; do
  python quantize.py --model_dir /model_input/Llama-2-${sz}b-chat-hf/ \
                     --dtype float16 \
                     --qformat int4_awq \
                     --export_path ./llama-${sz}b-4bit-gs128-awq.pt \
                     --calib_size 32
done
```
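
Before building, it is worth confirming the quantized checkpoints landed where the build step expects them (a sanity check, not part of the official workflow):

```bash
# One .pt file per model size, as exported by quantize.py above
ls -lh /app/tensorrt_llm/examples/llama/llama-*b-4bit-gs128-awq.pt
```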

Then, run the compile script. Make sure your GPU memory is free when you do this:

```bash
cd /app/tensorrt_llm/examples/llama
# Compile the LLaMA models to TensorRT format
for sz in 7 13 70; do
  python build.py --model_dir /model_input/Llama-2-${sz}b-chat-hf/ \
                  --quant_ckpt_path ./llama-${sz}b-4bit-gs128-awq.pt \
                  --dtype float16 \
                  --use_gpt_attention_plugin float16 \
                  --use_gemm_plugin float16 \
                  --remove_input_padding \
                  --use_inflight_batching \
                  --paged_kv_cache \
                  --use_weight_only \
                  --weight_only_precision int4_awq \
                  --max_batch_size 256 \
                  --per_group \
                  --output_dir /model_output/${sz}b \
                  --world_size 4 \
                  --tp_size 4
done
```

When you are done, exit the docker container. The compiled assets will be located in `model_output/` (one subdirectory per model size). You will see three files:

- `llama_float16_tp1_rank0.engine`: The main output of the build script, containing the executable graph of operations with the model weights embedded.
- `config.json`: Includes detailed information about the model, like its general structure and precision, as well as information about which plug-ins were incorporated into the engine.
- `model.cache`: Caches some of the timing and optimization information from model compilation, making successive builds quicker.
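
A quick way to sanity-check the build output from the host (shown here for the 7B engine directory; adjust the path for other sizes):

```bash
# Expect the engine file(s), config.json, and model.cache
ls -lh model_output/7b
```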

## Prepare the model repository

The Triton Inference Server works with model repositories, which are specific directory structures containing config files and other assets. You can read about model repositories [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_repository.html). The model repository for this example is quite complicated and involves setting up an ensemble of preprocessing, model, and postprocessing components, along with lots of boilerplate code.
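
For orientation, once you have cloned the example repo in the next step, its model repository looks roughly like this (a sketch based on the `release/0.5.0` layout; exact contents may differ slightly):

```bash
ls all_models/inflight_batcher_llm
# ensemble/        - config.pbtxt wiring the three components into one pipeline
# preprocessing/   - config.pbtxt + 1/model.py that tokenizes text_input
# tensorrt_llm/    - config.pbtxt + 1/ where the compiled engine files go
# postprocessing/  - config.pbtxt + 1/model.py that detokenizes the output ids
```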

The easiest way to get started is to clone the example repo and modify it to suit your needs. First clone the repo:

```bash
git clone -b release/0.5.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
```

Copy the compiled model assets from `./model_output` into the example model repository (the build step wrote each size into its own subdirectory; copy the one you want to serve, here the 7B engine):

```bash
cp model_output/7b/* tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/
```

Then use their tools to modify the configuration files of all three components of the ensemble. Make sure you run these commands in the `tensorrtllm_backend` directory:

```bash
cd tensorrtllm_backend
# modify config for the model
python3 tools/fill_template.py --in_place \
    all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    decoupled_mode:true,engine_dir:/all_models/inflight_batcher_llm/tensorrt_llm/1,\
max_tokens_in_paged_kv_cache:,batch_scheduler_policy:guaranteed_completion,kv_cache_free_gpu_mem_fraction:0.2,\
max_num_sequences:4
```

Next, modify the config for the preprocessing component. Change the `tokenizer_dir` to point to the model on the HuggingFace Hub that you used. I am using `NousResearch/Llama-2-7b-hf`, which is a replica of `meta-llama/Llama-2-7b-hf`, so we don't have to worry about the fiddly permissions on the original model.

```bash
# modify config for the preprocessing component
python tools/fill_template.py --in_place \
    all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
    tokenizer_type:llama,tokenizer_dir:NousResearch/Llama-2-7b-hf

# modify config for the postprocessing component
python tools/fill_template.py --in_place \
    all_models/inflight_batcher_llm/postprocessing/config.pbtxt \
    tokenizer_type:llama,tokenizer_dir:NousResearch/Llama-2-7b-hf
```
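
A quick way to confirm the substitutions took effect (and to avoid the tokenizer error mentioned later) is to grep the rendered configs; this is just a sanity check, not part of the official workflow:

```bash
# The ${tokenizer_type}/${tokenizer_dir} placeholders should be gone after fill_template.py
grep -A 3 -E "tokenizer_(type|dir)" \
    all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
    all_models/inflight_batcher_llm/postprocessing/config.pbtxt
```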

## Prepare The Triton Server

Next, we have to mount the model repository we just created into the Triton server and do some additional work interactively before it is ready. Make sure you are in the `tensorrtllm_backend` directory when running the following commands because we also need to mount the `scripts` directory into the container.

```bash
sudo docker run -it --rm --gpus all --network host --shm-size=1g \
    -v $(pwd)/all_models:/all_models \
    -v $(pwd)/scripts:/opt/scripts \
    nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash
```

Next, in the Docker container, log in to the HuggingFace Hub:

```bash
huggingface-cli login --token <YOUR_TOKEN>
```

Then, install the Python dependencies:

```bash
# Install python dependencies
pip install sentencepiece protobuf
```

Finally, start the Triton server:

```bash
# Launch Server
python /opt/scripts/launch_triton_server.py --world_size 1 --model_repo /all_models/inflight_batcher_llm
```

> Note: if you get an error `Unexpected tokenizer type: ${tokenizer_type}`, it means you didn't run the `fill_template.py` script on the preprocessing and postprocessing config files correctly.

You will get output that looks like this:

```bash
I1101 14:59:56.742506 113 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1101 14:59:56.742703 113 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1101 14:59:56.828990 113 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
```
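
As an optional check from the host (assuming the default HTTP port of 8000), you can hit Triton's readiness endpoint:

```bash
# A 200 response means the server and loaded models are ready for requests
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
```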

### Test the server

You can make a request with `curl` like this:

```bash
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{"text_input": "How do I count to nine in French?",
"parameters": {"max_tokens": 100, "bad_words":[""],"stop_words":[""]}}'
```
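
If you have `jq` installed, you can pull just the generated text out of the JSON response; in the 0.5.0 ensemble the output field is named `text_output` (adjust if your config names it differently):

```bash
# Same request, keeping only the generated text
curl -s -X POST localhost:8000/v2/models/ensemble/generate -d \
'{"text_input": "How do I count to nine in French?",
"parameters": {"max_tokens": 100, "bad_words":[""],"stop_words":[""]}}' | jq -r '.text_output'
```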

The commit also adds a small asynchronous load-testing script built on `aiohttp`:

```python
# Async load test for the Triton ensemble endpoint: fires requests at a fixed
# rate for a set duration and reports latency statistics.
import asyncio
import time
import aiohttp
import statistics

# Shared concurrency counter
current_concurrency = 0


async def send_request(session, url, data, request_number, response_record):
    global current_concurrency
    print(f"Starting request #{request_number}")
    current_concurrency += 1  # Increment concurrency when request starts
    start_time = time.perf_counter()

    async with session.post(url, json=data) as response:
        await response.read()

    end_time = time.perf_counter()
    latency = end_time - start_time
    response_record.append((current_concurrency, latency))
    print(f"Finished request #{request_number}")
    current_concurrency -= 1  # Decrement concurrency when request ends


async def main(duration, requests_per_second, output_seq_len):
    url = 'http://localhost:8000/v2/models/ensemble/generate'
    data = {
        "text_input": "How do I count to ten in French?",
        "parameters": {
            "max_tokens": output_seq_len,
            "min_length": output_seq_len,
            "bad_words": [""],
            "stop_words": ["</s>"],
            # "stream": True
        }
    }

    tasks = []
    response_record = []
    request_counter = 0

    async with aiohttp.ClientSession() as session:
        # Fire requests at a fixed rate for `duration` seconds
        start_time = time.perf_counter()
        while time.perf_counter() - start_time < duration:
            request_counter += 1
            task = asyncio.create_task(send_request(session, url, data, request_counter, response_record))
            tasks.append(task)
            await asyncio.sleep(1 / requests_per_second)
            print(f"Current concurrency: {current_concurrency}")

        # Wait for all in-flight requests to finish before closing the session
        await asyncio.gather(*tasks)

    # Statistics
    latencies = [item[1] for item in response_record]
    average_latency = statistics.mean(latencies)
    max_latency = max(latencies)
    min_latency = min(latencies)
    std_dev_latency = statistics.stdev(latencies)

    print(f"Average Latency: {average_latency:.4f} seconds")
    print(f"Max Latency: {max_latency:.4f} seconds")
    print(f"Min Latency: {min_latency:.4f} seconds")
    print(f"Standard Deviation of Latency: {std_dev_latency:.4f} seconds")


if __name__ == "__main__":
    duration = 60  # Duration in seconds
    requests_per_second = .3  # Requests per second
    output_seq_len = 300
    asyncio.run(main(duration, requests_per_second, output_seq_len))
```

The commit also adds a setup script that builds the TensorRT-LLM release container and pulls a model checkpoint:

```bash
#!/bin/bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
sudo apt-get update && sudo apt-get -y install git git-lfs

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs install
git lfs pull

# See https://developer.nvidia.com/cuda-gpus#compute to find out which version
# I'm using an A100 for this particular setup so that is `80-real`
make -C docker release_build CUDA_ARCHS="80-real"

cd ..
mkdir model_input
# Make sure you have git-lfs installed (https://git-lfs.com)
cd model_input
git clone https://huggingface.co/NousResearch/Llama-2-70b-chat-hf
```