TensorRT-LLM 0.11.0 Release #1970
kaiyux announced in Announcements
Hi,
We are very pleased to announce the 0.11.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- LLaMA: see examples/llama/README.md.
- Qwen: see examples/qwen/README.md.
- Phi: see examples/phi/README.md.
- GPT: see examples/gpt/README.md.
- Added an option to convert and run distil-whisper/distil-large-v3, thanks to the contribution from @IbrahimAmin1 in [feat]: Add Option to convert and run distil-whisper large-v3 #1337.
- Added `numQueuedRequests` to the iteration stats log of the executor API.
- Added `iterLatencyMilliSec` to the iteration stats log of the executor API (a log-skimming sketch follows this list).
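For illustration, here is a minimal sketch of how the two new iteration-stats fields could be skimmed from a stats log. It is not part of the release; it assumes each stats record is emitted as one JSON object per line containing keys named exactly `numQueuedRequests` and `iterLatencyMilliSec`, so please treat it as a reading aid rather than a supported tool.

```python
# Hypothetical helper: summarize the two new executor iteration-stats fields.
# Assumption: the stats log contains one JSON object per line with the keys
# "numQueuedRequests" and "iterLatencyMilliSec".
import json


def summarize_iteration_stats(path: str) -> None:
    latencies, queued = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line.startswith("{"):
                continue  # skip non-JSON log lines
            try:
                stats = json.loads(line)
            except json.JSONDecodeError:
                continue
            if "iterLatencyMilliSec" in stats:
                latencies.append(stats["iterLatencyMilliSec"])
            if "numQueuedRequests" in stats:
                queued.append(stats["numQueuedRequests"])
    if latencies:
        print(f"mean iteration latency: {sum(latencies) / len(latencies):.2f} ms")
    if queued:
        print(f"max queued requests: {max(queued)}")
```

Pointing this at the iteration-stats log produced by your deployment gives a quick latency and queueing summary.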
API Changes
- `trtllm-build` command:
  - Migrated Whisper to the unified workflow (`trtllm-build` command), see documents: examples/whisper/README.md.
  - `max_batch_size` in the `trtllm-build` command is switched to 256 by default.
  - `max_num_tokens` in the `trtllm-build` command is switched to 8192 by default.
  - Deprecated `max_output_len` and added `max_seq_len`.
  - Removed the `--weight_only_precision` argument from the `trtllm-build` command.
  - Removed the `attention_qk_half_accumulation` argument from the `trtllm-build` command.
  - Removed the `use_context_fmha_for_generation` argument from the `trtllm-build` command.
  - Removed the `strongly_typed` argument from the `trtllm-build` command.
  - `max_seq_len` reads from the HuggingFace model config now.
- Renamed `free_gpu_memory_fraction` in `ModelRunnerCpp` to `kv_cache_free_gpu_memory_fraction`.
- `GptManager` API:
  - Moved `maxBeamWidth` into `TrtGptModelOptionalParams`.
  - Moved `schedulerConfig` into `TrtGptModelOptionalParams`.
- Added more options to `ModelRunnerCpp`, including `max_tokens_in_paged_kv_cache`, `kv_cache_enable_block_reuse` and `enable_chunked_context`.
- Removed the `ModelConfig` class, and all the options are moved to the `LLM` class.
- Refactored the `LLM` class (a usage sketch follows this list), please refer to examples/high-level-api/README.md:
  - `model` accepts either a HuggingFace model name or a local HuggingFace model/TensorRT-LLM checkpoint/TensorRT-LLM engine.
  - Added a build cache, enabled by setting `TLLM_HLAPI_BUILD_CACHE=1` or by passing `enable_build_cache=True` to the `LLM` class.
  - Exposed `BuildConfig`, `SchedulerConfig` and so on in the kwargs; ideally you should be able to configure details about the build and runtime phases.
- Refactored the `LLM.generate()` and `LLM.generate_async()` API:
  - Removed `SamplingConfig`.
  - Added `SamplingParams` with more extensive parameters, see tensorrt_llm/hlapi/utils.py.
    - The new `SamplingParams` contains and manages fields from the Python bindings of `SamplingConfig`, `OutputConfig`, and so on.
  - Refactored the `LLM.generate()` output as `RequestOutput`, see tensorrt_llm/hlapi/llm.py.
- Updated the `apps` examples, especially by rewriting both `chat.py` and `fastapi_server.py` using the `LLM` APIs, please refer to examples/apps/README.md for details.
  - Updated `chat.py` to support multi-turn conversation, allowing users to chat with a model in the terminal.
  - Fixed `fastapi_server.py` and eliminated the need for `mpirun` in multi-GPU scenarios.
- Added `SpeculativeDecodingMode.h` to choose between different speculative decoding techniques.
- Added the `SpeculativeDecodingModule.h` base class for speculative decoding techniques.
- Removed `decodingMode.h`.
- `gptManagerBenchmark`:
  - `api` in the `gptManagerBenchmark` command is `executor` by default now.
  - Added a runtime `max_batch_size`.
  - Added a runtime `max_num_tokens`.
- Added a `bias` argument to the `LayerNorm` module, and supports non-bias layer normalization.
- Removed the `GptSession` Python bindings.
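For illustration, here is a minimal sketch of the refactored high-level API flow described above. It is not part of the release: the `tensorrt_llm.hlapi` import path is inferred from the tensorrt_llm/hlapi/llm.py and tensorrt_llm/hlapi/utils.py references, and the model name, sampling parameters, and `RequestOutput` field access are assumptions, so please treat examples/high-level-api/README.md as the authoritative reference.

```python
# Minimal sketch of the refactored LLM API (assumptions noted above).
import os

# Optional: reuse previously built engines via the new build cache
# (equivalent to passing enable_build_cache=True to the LLM class).
os.environ["TLLM_HLAPI_BUILD_CACHE"] = "1"

from tensorrt_llm.hlapi import LLM, SamplingParams  # import path is an assumption

# `model` accepts a HuggingFace model name, a local HuggingFace model,
# a TensorRT-LLM checkpoint, or a TensorRT-LLM engine directory.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# SamplingParams replaces the removed SamplingConfig.
params = SamplingParams(temperature=0.8, top_p=0.95)

# generate() now returns RequestOutput objects (see tensorrt_llm/hlapi/llm.py);
# the .outputs[0].text access below is an assumption about their layout.
for request_output in llm.generate(["Hello, my name is"], params):
    print(request_output.outputs[0].text)
```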
Model Updates
- Supported Jais, see examples/jais/README.md.
- Supported DiT, see examples/dit/README.md.
- Supported Video NeVA, see the Video NeVA section in examples/multimodal/README.md.
- Supported Grok-1, see examples/grok/README.md.
- Phi updates: see examples/phi/README.md.
Fixed Issues
- Fixed the `top_k` type in executor.py, thanks to the contribution from @vonjackustc in Fix top_k type (float => int32) executor.py #1329.
- Fixed the `qkv_bias` shape issue for Qwen1.5-32B (when converting the Qwen 110B GPTQ checkpoint, the qkv_bias shape is not divisible by 3 #1589), thanks to the contribution from @Tlntin in fix up qkv.bias error when use qwen1.5-32b-gptq-int4 #1637.
- Fixed the error of Ada traits for `fpA_intB`, thanks to the contribution from @JamesTheZ in Fix the error of Ada traits for fpA_intB. #1583.
- Updated examples/qwenvl/requirements.txt, thanks to the contribution from @ngoanpv in Update requirements.txt #1248.
- Fixed rslora scaling in `lora_manager`, thanks to the contribution from @TheCodeWrangler in Fixed rslora scaling in lora_manager #1669.
- Fixed the `convert_hf_mpt_legacy` call failure when the function is called outside the global scope, thanks to the contribution from @bloodeagle40234 in Define hf_config explisitly for convert_hf_mpt_legacy #1534.
- Fixed `use_fp8_context_fmha` broken outputs (use_fp8_context_fmha broken outputs #1539).
- Fixed quantize.py failing to export important data to config.json, thanks to the contribution from @janpetrov: quantize.py fails to export important data to config.json (eg rotary scaling) #1676.
- Fixed `shared_embedding_table` not being set when loading Gemma ([GEMMA] from_hugging_face not setting share_embedding_table to True leading to incapacity to load Gemma #1799), thanks to the contribution from @mfuntowicz.
- Fixed the stop and bad words list contiguous offsets in `ModelRunner` ([ModelRunner] Fix stop and bad words list contiguous for offsets #1815), thanks to the contribution from @Marks101.
- Added the missing `FAST_BUILD` comment at `#endif`, thanks to the support from @lkm2835 in Add FAST_BUILD comment at #endif #1851.
- Updated benchmarks/cpp/README.md for gptManagerBenchmark seems to go into a dead loop with GPU usage 0% #1562 and Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: LoRA task 0 not found in cache. Please send LoRA weights with request (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:182) #1552.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.05-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.05-py3.
Known Issues
- You may encounter `OSError: exception: access violation reading 0x0000000000000000`. This issue is under investigation.

Currently, there are two key branches in the project: the `main` branch and the `rel` branch. We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequency will depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team
This discussion was created from the release TensorRT-LLM 0.11.0 Release.