TensorRT-LLM 0.11.0 Release #1970
kaiyux announced in Announcements
Hi,
We are very pleased to announce the 0.11.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- LLaMA: see examples/llama/README.md.
- Qwen: see examples/qwen/README.md.
- Phi: see examples/phi/README.md.
- GPT: see examples/gpt/README.md.
- Added an option to convert and run distil-whisper/distil-large-v3, thanks to the contribution from @IbrahimAmin1 in [feat]: Add Option to convert and run distil-whisper large-v3 #1337.
- Added `numQueuedRequests` to the iteration stats log of the executor API.
- Added `iterLatencyMilliSec` to the iteration stats log of the executor API (a log-skimming sketch follows this list).
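For illustration, here is a minimal sketch of how the two new iteration-stats fields could be skimmed from a stats log. It is not part of the release; it assumes each stats record is emitted as one JSON object per line containing keys named exactly `numQueuedRequests` and `iterLatencyMilliSec`, so please treat it as a reading aid rather than a supported tool.

```python
# Hypothetical helper: summarize the two new executor iteration-stats fields.
# Assumption: the stats log contains one JSON object per line with the keys
# "numQueuedRequests" and "iterLatencyMilliSec".
import json


def summarize_iteration_stats(path: str) -> None:
    latencies, queued = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line.startswith("{"):
                continue  # skip non-JSON log lines
            try:
                stats = json.loads(line)
            except json.JSONDecodeError:
                continue
            if "iterLatencyMilliSec" in stats:
                latencies.append(stats["iterLatencyMilliSec"])
            if "numQueuedRequests" in stats:
                queued.append(stats["numQueuedRequests"])
    if latencies:
        print(f"mean iteration latency: {sum(latencies) / len(latencies):.2f} ms")
    if queued:
        print(f"max queued requests: {max(queued)}")
```

Pointing this at the iteration-stats log produced by your deployment gives a quick latency and queueing summary.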
API Changes
- `trtllm-build` command:
  - Migrated Whisper to the unified workflow (`trtllm-build` command), see documents: examples/whisper/README.md.
  - `max_batch_size` in the `trtllm-build` command is switched to 256 by default.
  - `max_num_tokens` in the `trtllm-build` command is switched to 8192 by default.
  - Deprecated `max_output_len` and added `max_seq_len`.
  - Removed the `--weight_only_precision` argument from the `trtllm-build` command.
  - Removed the `attention_qk_half_accumulation` argument from the `trtllm-build` command.
  - Removed the `use_context_fmha_for_generation` argument from the `trtllm-build` command.
  - Removed the `strongly_typed` argument from the `trtllm-build` command.
  - `max_seq_len` reads from the HuggingFace model config now.
- Renamed `free_gpu_memory_fraction` in `ModelRunnerCpp` to `kv_cache_free_gpu_memory_fraction`.
- `GptManager` API:
  - Moved `maxBeamWidth` into `TrtGptModelOptionalParams`.
  - Moved `schedulerConfig` into `TrtGptModelOptionalParams`.
- Added more options to `ModelRunnerCpp`, including `max_tokens_in_paged_kv_cache`, `kv_cache_enable_block_reuse` and `enable_chunked_context`.
- Removed the `ModelConfig` class, and all the options are moved to the `LLM` class.
- Refactored the `LLM` class (a usage sketch follows this list), please refer to examples/high-level-api/README.md:
  - `model` accepts either a HuggingFace model name or a local HuggingFace model/TensorRT-LLM checkpoint/TensorRT-LLM engine.
  - Added a build cache, enabled by setting `TLLM_HLAPI_BUILD_CACHE=1` or by passing `enable_build_cache=True` to the `LLM` class.
  - Exposed `BuildConfig`, `SchedulerConfig` and so on in the kwargs; ideally you should be able to configure details about the build and runtime phases.
- Refactored the `LLM.generate()` and `LLM.generate_async()` API:
  - Removed `SamplingConfig`.
  - Added `SamplingParams` with more extensive parameters, see tensorrt_llm/hlapi/utils.py.
    - The new `SamplingParams` contains and manages fields from the Python bindings of `SamplingConfig`, `OutputConfig`, and so on.
  - Refactored the `LLM.generate()` output as `RequestOutput`, see tensorrt_llm/hlapi/llm.py.
- Updated the `apps` examples, especially by rewriting both `chat.py` and `fastapi_server.py` using the `LLM` APIs, please refer to examples/apps/README.md for details.
  - Updated `chat.py` to support multi-turn conversation, allowing users to chat with a model in the terminal.
  - Fixed `fastapi_server.py` and eliminated the need for `mpirun` in multi-GPU scenarios.
- Added `SpeculativeDecodingMode.h` to choose between different speculative decoding techniques.
- Added the `SpeculativeDecodingModule.h` base class for speculative decoding techniques.
- Removed `decodingMode.h`.
- `gptManagerBenchmark`:
  - `api` in the `gptManagerBenchmark` command is `executor` by default now.
  - Added a runtime `max_batch_size`.
  - Added a runtime `max_num_tokens`.
- Added a `bias` argument to the `LayerNorm` module, and supports non-bias layer normalization.
- Removed the `GptSession` Python bindings.
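For illustration, here is a minimal sketch of the refactored high-level API flow described above. It is not part of the release: the `tensorrt_llm.hlapi` import path is inferred from the tensorrt_llm/hlapi/llm.py and tensorrt_llm/hlapi/utils.py references, and the model name, sampling parameters, and `RequestOutput` field access are assumptions, so please treat examples/high-level-api/README.md as the authoritative reference.

```python
# Minimal sketch of the refactored LLM API (assumptions noted above).
import os

# Optional: reuse previously built engines via the new build cache
# (equivalent to passing enable_build_cache=True to the LLM class).
os.environ["TLLM_HLAPI_BUILD_CACHE"] = "1"

from tensorrt_llm.hlapi import LLM, SamplingParams  # import path is an assumption

# `model` accepts a HuggingFace model name, a local HuggingFace model,
# a TensorRT-LLM checkpoint, or a TensorRT-LLM engine directory.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# SamplingParams replaces the removed SamplingConfig.
params = SamplingParams(temperature=0.8, top_p=0.95)

# generate() now returns RequestOutput objects (see tensorrt_llm/hlapi/llm.py);
# the .outputs[0].text access below is an assumption about their layout.
for request_output in llm.generate(["Hello, my name is"], params):
    print(request_output.outputs[0].text)
```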
Model Updates
- Supported Jais, see examples/jais/README.md.
- Supported DiT, see examples/dit/README.md.
- Supported Video NeVA, see the Video NeVA section in examples/multimodal/README.md.
- Supported Grok-1, see examples/grok/README.md.
- Phi updates: see examples/phi/README.md.
Fixed Issues
- Fixed the `top_k` type in executor.py, thanks to the contribution from @vonjackustc in Fix top_k type (float => int32) executor.py #1329.
- Fixed the `qkv_bias` shape issue for Qwen1.5-32B (when converting the Qwen 110B GPTQ checkpoint, the qkv_bias shape is not divisible by 3 #1589), thanks to the contribution from @Tlntin in fix up qkv.bias error when use qwen1.5-32b-gptq-int4 #1637.
- Fixed the error of Ada traits for `fpA_intB`, thanks to the contribution from @JamesTheZ in Fix the error of Ada traits for fpA_intB. #1583.
- Updated examples/qwenvl/requirements.txt, thanks to the contribution from @ngoanpv in Update requirements.txt #1248.
- Fixed rslora scaling in `lora_manager`, thanks to the contribution from @TheCodeWrangler in Fixed rslora scaling in lora_manager #1669.
- Fixed the `convert_hf_mpt_legacy` call failure when the function is called outside the global scope, thanks to the contribution from @bloodeagle40234 in Define hf_config explisitly for convert_hf_mpt_legacy #1534.
- Fixed `use_fp8_context_fmha` broken outputs (use_fp8_context_fmha broken outputs #1539).
- Fixed quantize.py failing to export important data to config.json, thanks to the contribution from @janpetrov: quantize.py fails to export important data to config.json (eg rotary scaling) #1676.
- Fixed `shared_embedding_table` not being set when loading Gemma ([GEMMA] from_hugging_face not setting share_embedding_table to True leading to incapacity to load Gemma #1799), thanks to the contribution from @mfuntowicz.
- Fixed the stop and bad words list contiguous offsets in `ModelRunner` ([ModelRunner] Fix stop and bad words list contiguous for offsets #1815), thanks to the contribution from @Marks101.
- Added the missing `FAST_BUILD` comment at `#endif`, thanks to the support from @lkm2835 in Add FAST_BUILD comment at #endif #1851.
- Updated benchmarks/cpp/README.md for gptManagerBenchmark seems to go into a dead loop with GPU usage 0% #1562 and Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: LoRA task 0 not found in cache. Please send LoRA weights with request (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:182) #1552.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.05-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.05-py3.
Known Issues
- You may encounter `OSError: exception: access violation reading 0x0000000000000000`. This issue is under investigation.

Currently, there are two key branches in the project: the `main` branch and the `rel` branch. We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequency will depend on your feedback.

Thanks,
The TensorRT-LLM Engineering Team
This discussion was created from the release TensorRT-LLM 0.11.0 Release.