Commit e97035d: save bench
hamelsmu committed Nov 27, 2023 (parent: cdfb61b)
Showing 4 changed files with 444 additions and 0 deletions.
195 changes: 195 additions & 0 deletions trt-bench/README.md
@@ -0,0 +1,195 @@
# Nvidia Triton w/ TensorRT-LLM Backend

Use the [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main) backend with the [Nvidia Triton Inference Server](https://github.com/triton-inference-server/server).

The clearest end-to-end instructions I found were in [this official blog post](https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/).

## Build TensorRT-LLM container

Follow [these instructions](https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/docs/source/installation.md) to build the docker container to compile the model.

When the build finishes, you will have a Docker image called `tensorrt_llm/release:latest` available locally.

> Note: I had to fight nvidia-docker to get this working; I ended up uninstalling Docker and everything related to the NVIDIA Container Toolkit, then reinstalling everything from scratch.
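
If you hit similar problems, a quick sanity check (my own habit, not part of the official instructions) is to confirm that the toolkit can expose your GPUs to Docker at all; the CUDA base image tag below is just an example:

```bash
# A plain CUDA container should be able to see the GPUs if the toolkit is healthy
sudo docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```
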
## Pull the model from HuggingFace

Make a directory called `model_input` and clone the Hugging Face model into it.

```bash
mkdir model_input
# Make sure you have git-lfs installed (https://git-lfs.com)
cd model_input
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
```
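
Note that `meta-llama/Llama-2-7b-hf` is gated, so the clone only works once your Hugging Face account has been granted access. If you would rather skip that, the ungated `NousResearch` replica (which this guide also uses later for the tokenizer) works as well, for example:

```bash
# Ungated replica of the same weights (handy if you don't have access to meta-llama)
git clone https://huggingface.co/NousResearch/Llama-2-7b-hf
```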

## Compile the model

To compile the model, mount the model you just pulled from Hugging Face and a `model_output` directory into the container, then run the compile script. First, shell into the container like this:

```bash
# Make an output directory to store the compiled model assets
mkdir model_output

sudo docker run --gpus all --ulimit memlock=-1 --ipc=host --ulimit stack=67108864 -it -v ${PWD}/model_input:/model_input -v ${PWD}/model_output:/model_output tensorrt_llm/release:latest bash
```
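
Once you have a shell inside the container, it is worth confirming (my own check, not part of the official instructions) that the GPUs and both bind mounts are visible:

```bash
# Run these inside the container: list the GPUs and confirm the two mounts exist
nvidia-smi -L
ls /model_input /model_output
```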

Install the quantization toolkit per [these instructions](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0/examples/quantization#tensorrt-llm-quantization-toolkit-installation-guide):

```bash
cd /app/tensorrt_llm/examples/quantization
python -m pip install --upgrade pip
# Obtain the cuda version from the system. Assuming nvcc is available in path.
cuda_version=$(nvcc --version | grep 'release' | awk '{print $6}' | awk -F'[V.]' '{print $2$3}')
# Obtain the python version from the system.
python_version=$(python3 --version 2>&1 | awk '{print $2}' | awk -F. '{print $1$2}')
# Download and install the AMMO package from the DevZone.
wget https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.3.0.tar.gz
tar -xzf nvidia_ammo-0.3.0.tar.gz
pip install nvidia_ammo-0.3.0/nvidia_ammo-0.3.0+cu$cuda_version-cp$python_version-cp$python_version-linux_x86_64.whl
# Install the additional requirements
pip install -r requirements.txt
```
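
To confirm the toolkit installed cleanly against your CUDA and Python versions, a quick import check helps (I am assuming here that the wheel exposes a module named `ammo`; adjust if yours differs):

```bash
# Should print the confirmation only if the AMMO quantization toolkit imports cleanly
python -c "import ammo" && echo "AMMO import OK"
```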

Then quantize the model; this took under 10 minutes on my RTX 6000 Ada, so be patient:

```bash
# Quantize the Llama 2 checkpoints into INT4 AWQ format
cd /app/tensorrt_llm/examples/llama
for sz in 7 13 70; do
  python quantize.py --model_dir /model_input/Llama-2-${sz}b-chat-hf/ \
                     --dtype float16 \
                     --qformat int4_awq \
                     --export_path ./llama-${sz}b-4bit-gs128-awq.pt \
                     --calib_size 32
done
```

Then, run the compile script. Make sure your GPU memory is free when you do this:

```bash
cd /app/tensorrt_llm/examples/llama
# Compile each Llama 2 checkpoint into a TensorRT engine
for sz in 7 13 70; do
  python build.py --model_dir /model_input/Llama-2-${sz}b-chat-hf/ \
                  --quant_ckpt_path ./llama-${sz}b-4bit-gs128-awq.pt \
                  --dtype float16 \
                  --use_gpt_attention_plugin float16 \
                  --use_gemm_plugin float16 \
                  --remove_input_padding \
                  --use_inflight_batching \
                  --paged_kv_cache \
                  --use_weight_only \
                  --weight_only_precision int4_awq \
                  --max_batch_size 256 \
                  --per_group \
                  --output_dir /model_output/${sz}b \
                  --world_size 4 \
                  --tp_size 4
done
```
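
The loop above builds 4-way tensor-parallel engines (`--world_size 4 --tp_size 4`), while the Triton launch later in this guide uses `--world_size 1`. If you only have a single GPU, a single-GPU variant of the 7B build might look like the sketch below (same flags, just without tensor parallelism; this is my adaptation, not from the blog post):

```bash
# Single-GPU build of the 7B engine (no tensor parallelism); illustrative only
cd /app/tensorrt_llm/examples/llama
python build.py --model_dir /model_input/Llama-2-7b-chat-hf/ \
                --quant_ckpt_path ./llama-7b-4bit-gs128-awq.pt \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --use_gemm_plugin float16 \
                --remove_input_padding \
                --use_inflight_batching \
                --paged_kv_cache \
                --use_weight_only \
                --weight_only_precision int4_awq \
                --max_batch_size 256 \
                --per_group \
                --output_dir /model_output/7b \
                --world_size 1 \
                --tp_size 1
```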


When you are done, exit the Docker container. The compiled assets will be located in `model_output/`, in one subdirectory per model size. For each engine you will see three kinds of files:

- `llama_float16_tp*_rank*.engine`: The main output of the build script, containing the executable graph of operations with the model weights embedded (one engine per tensor-parallel rank).
- `config.json`: Includes detailed information about the model, like its general structure and precision, as well as information about which plugins were incorporated into the engine.
- `model.cache`: Caches some of the timing and optimization information from model compilation, making successive builds quicker.



## Prepare the model repository

The Triton Inference Server works with model repositories: directory structures laid out in a specific way, containing config files and other assets. You can read about model repositories [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_repository.html). The model repository for this example is fairly involved: it consists of an ensemble of preprocessing, model, and postprocessing components, along with a lot of boilerplate code.

The easiest way to get started is to clone the example repo and modify it to suit your needs. First, clone the repo:

```bash
git clone -b release/0.5.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
```
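
Once cloned, the example repository you will be editing looks roughly like this (my summary of the 0.5.0 layout; verify against your checkout):

```bash
ls tensorrtllm_backend/all_models/inflight_batcher_llm/
# ensemble/  postprocessing/  preprocessing/  tensorrt_llm/
# Each component has its own config.pbtxt; the compiled engine files go under
# tensorrt_llm/1/, where "1" is the model version directory Triton expects.
```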

Copy the compiled model assets from `./model_output` into the example model repository. The build above writes each engine into a per-size subdirectory, so copy the one you want to serve (the 7B build here):

```bash
cp model_output/7b/* tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/
```

Then use the repo's `fill_template.py` tool to modify the configuration files of all three components of the ensemble. Make sure you run these commands in the `tensorrtllm_backend` directory:

```bash
cd tensorrtllm_backend
# modify config for the model
python3 tools/fill_template.py --in_place \
all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
decoupled_mode:true,engine_dir:/all_models/inflight_batcher_llm/tensorrt_llm/1,\
max_tokens_in_paged_kv_cache:,batch_scheduler_policy:guaranteed_completion,kv_cache_free_gpu_mem_fraction:0.2,\
max_num_sequences:4
```

Next, modify the config for the preprocessing component. Set `tokenizer_dir` to the Hugging Face Hub model you used; I am using `NousResearch/Llama-2-7b-hf`, which is a replica of `meta-llama/Llama-2-7b-hf`, so we don't have to worry about the fiddly permissions on the original model.

```bash
# modify config for the preprocessing component
python tools/fill_template.py --in_place \
all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
tokenizer_type:llama,tokenizer_dir:NousResearch/Llama-2-7b-hf

# modify config for the postprocessing component
python tools/fill_template.py --in_place \
all_models/inflight_batcher_llm/postprocessing/config.pbtxt \
tokenizer_type:llama,tokenizer_dir:NousResearch/Llama-2-7b-hf
```
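
A quick way to confirm the templates were actually filled (my own check; see also the note about the `${tokenizer_type}` error further down) is to grep for any leftover tokenizer placeholders:

```bash
# If this prints anything, the placeholders were not substituted and the server
# will later fail with "Unexpected tokenizer type: ${tokenizer_type}"
grep -n '\${tokenizer' all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
                       all_models/inflight_batcher_llm/postprocessing/config.pbtxt
```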

## Prepare the Triton server

Next, we have to mount the model repository we just created into the Triton server container and do some additional setup interactively before it is ready. Make sure you are in the `tensorrtllm_backend` directory when running the following commands, because we also need to mount the `scripts` directory into the container.

```bash
sudo docker run -it --rm --gpus all --network host --shm-size=1g \
-v $(pwd)/all_models:/all_models \
-v $(pwd)/scripts:/opt/scripts \
nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash
```

Next, inside the Docker container, log in to the Hugging Face Hub:


```bash
huggingface-cli login --token <YOUR_TOKEN>
```

Then, install the python dependencies:

```bash
# Install python dependencies
pip install sentencepiece protobuf
```

Finally, start the Triton server:

```bash
# Launch Server
python /opt/scripts/launch_triton_server.py --world_size 1 --model_repo /all_models/inflight_batcher_llm
```

> Note: if you get an error like `Unexpected tokenizer type: ${tokenizer_type}`, it means you didn't run the `fill_template.py` script correctly on the preprocessing and postprocessing config files.

You will get output that looks like this:

```bash
I1101 14:59:56.742506 113 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1101 14:59:56.742703 113 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1101 14:59:56.828990 113 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
```
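
Once those lines appear, you can confirm the server is actually healthy; Triton exposes a standard readiness endpoint on the HTTP port:

```bash
# Prints 200 once the server and its loaded models are ready to accept requests
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
```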

### Test the server

You can make a request with `curl` like this:

```bash
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{"text_input": "How do I count to nine in French?",
"parameters": {"max_tokens": 100, "bad_words":[""],"stop_words":[""]}}'
```
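
To get a rough feel for how the server behaves under concurrent load before running the benchmark script below, you can fire a handful of identical requests in parallel from the shell (a quick-and-dirty smoke test, not a real benchmark):

```bash
# Fire 8 identical requests in parallel, discard the responses, and wait for all of them
for i in $(seq 8); do
  curl -s -X POST localhost:8000/v2/models/ensemble/generate -d \
    '{"text_input": "How do I count to nine in French?",
      "parameters": {"max_tokens": 100, "bad_words":[""],"stop_words":[""]}}' > /dev/null &
done
wait
```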
70 changes: 70 additions & 0 deletions trt-bench/requests_bench.py
@@ -0,0 +1,70 @@
import asyncio
import time
import aiohttp
import statistics

# Shared concurrency counter
current_concurrency = 0

async def send_request(session, url, data, request_number, response_record):
    global current_concurrency
    print(f"Starting request #{request_number}")
    current_concurrency += 1  # Increment concurrency when request starts
    start_time = time.perf_counter()

    async with session.post(url, json=data) as response:
        await response.read()

    end_time = time.perf_counter()
    latency = end_time - start_time
    response_record.append((current_concurrency, latency))
    print(f"Finished request #{request_number}")
    current_concurrency -= 1  # Decrement concurrency when request ends

async def main(duration, requests_per_second, output_seq_len):
    url = 'http://localhost:8000/v2/models/ensemble/generate'
    data = {
        "text_input": "How do I count to ten in French?",
        "parameters": {
            "max_tokens": output_seq_len,
            "min_length": output_seq_len,
            "bad_words": [""],
            "stop_words": ["</s>"],
            # "stream": True
        }
    }

    tasks = []
    response_record = []
    request_counter = 0

    async with aiohttp.ClientSession() as session:
        start_time = time.perf_counter()
        while time.perf_counter() - start_time < duration:
            request_counter += 1
            task = asyncio.create_task(send_request(session, url, data, request_counter, response_record))
            tasks.append(task)
            await asyncio.sleep(1 / requests_per_second)
            print(f"Current concurrency: {current_concurrency}")

        await asyncio.gather(*tasks)

    # Statistics
    latencies = [item[1] for item in response_record]
    average_latency = statistics.mean(latencies)
    max_latency = max(latencies)
    min_latency = min(latencies)
    std_dev_latency = statistics.stdev(latencies)

    print(f"Average Latency: {average_latency:.4f} seconds")
    print(f"Max Latency: {max_latency:.4f} seconds")
    print(f"Min Latency: {min_latency:.4f} seconds")
    print(f"Standard Deviation of Latency: {std_dev_latency:.4f} seconds")


if __name__ == "__main__":
    duration = 60  # Duration in seconds
    requests_per_second = 0.3  # Requests per second
    output_seq_len = 300
    asyncio.run(main(duration, requests_per_second, output_seq_len))
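
To run this load generator against the server started above (assuming `aiohttp` is installed in your environment), something like the following should work; adjust `duration`, `requests_per_second`, and `output_seq_len` at the bottom of the script to shape the load:

```bash
pip install aiohttp
python trt-bench/requests_bench.py
```

Because the script issues requests at a fixed arrival rate rather than holding concurrency constant, the concurrency recorded alongside each latency shows how queueing builds up when the server cannot keep pace.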
20 changes: 20 additions & 0 deletions trt-bench/setup.sh
@@ -0,0 +1,20 @@
#!/bin/bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
sudo apt-get update && sudo apt-get -y install git git-lfs

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs install
git lfs pull


# See https://developer.nvidia.com/cuda-gpus#compute to find the compute capability for your GPU.
# I'm using an A100 for this particular setup, so that is `80-real`.
make -C docker release_build CUDA_ARCHS="80-real"

cd ..
mkdir model_input
# Make sure you have git-lfs installed (https://git-lfs.com)
cd model_input
git clone https://huggingface.co/NousResearch/Llama-2-70b-chat-hf