Releases: huggingface/text-generation-inference
v2.4.1
Notable changes
- Choose input/total tokens automatically based on available VRAM
- Support Qwen2 VL (see the example after this list)
- Decrease latency of very large batches (> 128)
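For the new Qwen2 VL support, below is a minimal sketch of sending an image through TGI's OpenAI-compatible chat endpoint. It assumes a TGI server serving a Qwen2 VL checkpoint is already running on localhost:8080; the server URL, image URL, and prompt are placeholders.

```python
# Minimal sketch: query a running TGI server (serving a Qwen2 VL model) via the
# OpenAI-compatible /v1/chat/completions endpoint. URLs are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

response = client.chat.completions.create(
    model="tgi",  # a single-model TGI server does not route on the model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```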
What's Changed
- feat: add triton kernels to decrease latency of large batches by @OlivierDehaene in #2687
- Avoiding timeout for bloom tests. by @Narsil in #2693
- Green main by @Narsil in #2697
- Choosing input/total tokens automatically based on available VRAM? by @Narsil in #2673
- We can have a tokenizer anywhere. by @Narsil in #2527
- Update poetry lock. by @Narsil in #2698
- Fixing auto bloom test. by @Narsil in #2699
- More timeout on docker start ? by @Narsil in #2701
- Monkey patching as a desperate measure. by @Narsil in #2704
- add xpu triton in dockerfile, or will show "Could not import Flash At… by @sywangyi in #2702
- Support qwen2 vl by @drbh in #2689
- fix cuda graphs for qwen2-vl by @drbh in #2708
- fix: create position ids for text only input by @drbh in #2714
- fix: add chat_tokenize endpoint to api docs by @drbh in #2710
- Hotfixing auto length (warmup max_s was wrong). by @Narsil in #2716
- Fix prefix caching + speculative decoding by @tgaddair in #2711
- Fixing linting on main. by @Narsil in #2719
- nix: move to tgi-nix main by @danieldk in #2718
- fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… by @sywangyi in #2717
- add trust_remote_code in tokenizer to fix baichuan issue by @sywangyi in #2725
- Add initial support for compressed-tensors checkpoints by @danieldk in #2732
- nix: update nixpkgs by @danieldk in #2746
- benchmark: fix prefill throughput by @danieldk in #2741
- Fix: Change model_type from ssm to mamba by @mokeddembillel in #2740
- Fix: Change embeddings to embedding by @mokeddembillel in #2738
- fix response type of document for Text Generation Inference by @jitokim in #2743
- Upgrade outlines to 0.1.1 by @aW3st in #2742
- Upgrading our deps. by @Narsil in #2750
- feat: return streaming errors as an event formatted for openai's client by @drbh in #2668
- Remove vLLM dependency for CUDA by @danieldk in #2751
- fix: improve find_segments via numpy diff by @drbh in #2686
- add ipex moe implementation to support Mixtral and PhiMoe by @sywangyi in #2707
- Add support for compressed-tensors w8a8 int checkpoints by @danieldk in #2745
- feat: support flash attention 2 in qwen2 vl vision blocks by @drbh in #2721
- Simplify two ipex conditions by @danieldk in #2755
- Update to moe-kernels 0.7.0 by @danieldk in #2720
- PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme by @drbh in #2645
- fix: adjust llama MLP name from dense to mlp to correctly apply lora by @drbh in #2760
- nix: update for outlines 0.1.4 by @danieldk in #2764
- Add support for wNa16 int 2:4 compressed-tensors checkpoints by @danieldk in #2758
- nix: build and cache impure devshells by @danieldk in #2765
- fix: set outlines version to 0.1.3 to avoid caching serialization issue by @drbh in #2766
- nix: downgrade to outlines 0.1.3 by @danieldk in #2768
- fix: incomplete generations w/ single tokens generations and models that did not support chunking by @OlivierDehaene in #2770
- fix: tweak grammar test response by @drbh in #2769
- Add a README section about using Nix by @danieldk in #2767
- Remove guideline from API by @Wauplin in #2762
- feat: Add automatic nightly benchmarks by @Hugoch in #2591
- feat: add payload limit by @OlivierDehaene in #2726
- Update to marlin-kernels 0.3.6 by @danieldk in #2771
- chore: prepare 2.4.1 release by @OlivierDehaene in #2773
New Contributors
- @tgaddair made their first contribution in #2711
- @mokeddembillel made their first contribution in #2740
- @jitokim made their first contribution in #2743
Full Changelog: v2.3.0...v2.4.1
v2.4.0
Notable changes
- Experimental prefill chunking (PREFILL_CHUNKING=1; see the launch sketch after this list)
- Experimental FP8 KV cache support
- Greatly decrease latency for large batches (> 128 requests)
- Faster MoE kernels and support for GPTQ-quantized MoE
- Faster implementation of MLLama
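A minimal sketch of opting into the experimental prefill chunking flag above when launching the official Docker image from Python. The image tag, volume name, and model id are placeholders, and Docker with NVIDIA GPU support is assumed.

```python
# Launch the TGI Docker image with the experimental PREFILL_CHUNKING flag set.
# Tag, volume, and model id below are illustrative placeholders.
import subprocess

subprocess.run(
    [
        "docker", "run", "--gpus", "all", "--shm-size", "1g",
        "-p", "8080:80",
        "-v", "tgi-data:/data",          # named volume for the model cache
        "-e", "PREFILL_CHUNKING=1",      # experimental prefill chunking (this release)
        "ghcr.io/huggingface/text-generation-inference:2.4.0",
        "--model-id", "Qwen/Qwen2.5-7B-Instruct",
    ],
    check=True,
)
```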
What's Changed
- nix: remove unused _server.nix file by @danieldk in #2538
- chore: Add old V2 backend by @OlivierDehaene in #2551
- Remove duplicated RUN in Dockerfile by @alvarobartt in #2547
- Micro cleanup. by @Narsil in #2555
- Hotfixing main by @Narsil in #2556
- Add support for scalar FP8 weight scales by @danieldk in #2550
- Add DenseMoELayer and wire it up in Mixtral/Deepseek V2 by @danieldk in #2537
- Update the link to the Ratatui organization by @orhun in #2546
- Simplify crossterm imports by @orhun in #2545
- Adding note for private models in quick-tour document by @ariG23498 in #2548
- Hotfixing main. by @Narsil in #2562
- Cleanup Vertex + Chat by @Narsil in #2553
- More tensor cores. by @Narsil in #2558
- remove LORA_ADAPTERS_PATH by @nbroad1881 in #2563
- Add LoRA adapters support for Gemma2 by @alvarobartt in #2567
- Fix build with --features google by @alvarobartt in #2566
- Improve support for GPUs with capability < 8 by @danieldk in #2575
- flashinfer: pass window size and dtype by @danieldk in #2574
- Remove compute capability lazy cell by @danieldk in #2580
- Update architecture.md by @ulhaqi12 in #2577
- Update ROCM libs and improvements by @mht-sharma in #2579
- Add support for GPTQ-quantized MoE models using MoE Marlin by @danieldk in #2557
- feat: support phi3.5 moe by @drbh in #2479
- Move flake back to tgi-nix main by @danieldk in #2586
- MoE Marlin: support desc_act for groupsize != -1 by @danieldk in #2590
- nix: experimental support for building a Docker container by @danieldk in #2470
- Mllama flash version by @Narsil in #2585
- Max token capacity metric by @Narsil in #2595
- CI (2592): Allow LoRA adapter revision in server launcher by @drbh in #2602
- Unroll notify error into generate response by @drbh in #2597
- New release 2.3.1 by @Narsil in #2604
- Revert "Unroll notify error into generate response" by @drbh in #2605
- nix: example of local package overrides during development by @danieldk in #2607
- Add basic FP8 KV cache support by @danieldk in #2603
- Fp8 Cache condition by @flozi00 in #2611
- enable mllama in intel platform by @sywangyi in #2610
- Upgrade minor rust version (Fixes rust build compilation cache) by @Narsil in #2617
- Add support for fused MoE Marlin for AWQ by @danieldk in #2616
- nix: move back to the tgi-nix main branch by @danieldk in #2620
- CI (2599): Update ToolType input schema by @drbh in #2601
- nix: add black and isort to the closure by @danieldk in #2619
- AMD CI by @Narsil in #2589
- feat: allow tool calling to respond without a tool by @drbh in #2614
- Update documentation to most recent stable version of TGI. by @Vaibhavs10 in #2625
- Intel ci by @Narsil in #2630
- Fixing intel Supports windowing. by @Narsil in #2637
- Small fixes for supported models by @osanseviero in #2471
- Cpu perf by @Narsil in #2596
- Clarify gated description and quicktour by @osanseviero in #2631
- update ipex to fix incorrect output of mllama in cpu by @sywangyi in #2640
- feat: enable pytorch xpu support for non-attention models by @dvrogozh in #2561
- Fixing linters. by @Narsil in #2650
- Rollback to ChatRequest for Vertex AI Chat instead of VertexChat by @alvarobartt in #2651
- Fp8 e4m3_fnuz support for rocm by @mht-sharma in #2588
- feat: prefill chunking by @OlivierDehaene in #2600
- Support e4m3fn KV cache by @danieldk in #2655
- Simplify the attention function by @danieldk in #2609
- fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process by @oOraph in #2663
- fix: prefer inplace softmax to avoid copy by @drbh in #2661
- Break cycle between the attention implementations and KV cache by @danieldk in #2627
- CI job. Gpt awq 4 by @Narsil in #2665
- Make handling of FP8 scales more consisent by @danieldk in #2666
- Test Marlin MoE with desc_act=true by @danieldk in #2622
- break when there's nothing to read by @sywangyi in #2582
- Add impureWithCuda dev shell by @danieldk in #2677
- Make moe-kernels and marlin-kernels mandatory in CUDA installs by @danieldk in #2632
- feat: natively support Granite models by @OlivierDehaene in #2682
- feat: allow any supported payload on /invocations by @OlivierDehaene in #2683
- flashinfer: reminder to remove contiguous call in the future by @danieldk in #2685
- Fix Phi 3.5 MoE tests by @danieldk in #2684
- Add support for FP8 KV cache scales by @danieldk in #2628
- Fixing "deadlock" when python prompts for trust_remote_code by always by @Narsil in #2664
- [TENSORRT-LLM] - Implement new looper thread based backend by @mfuntowicz in #2357
- Fixing rocm gptq by using triton code too (renamed cuda into triton). by @Narsil in #2691
- Fixing mt0 test. by @Narsil in #2692
- Add support for stop words in TRTLLM by @mfuntowicz in #2678
- Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels by @danieldk in #2688
New Contributors
- @alvarobartt made their first contribution in https://github.com/huggingface/...
v2.3.1
Important changes
- Added support for Mllama (3.2, vision models). Flashed, unpadded.
- FP8 performance improvements
- MoE performance improvements
- BREAKING CHANGE - When using tools, models could previously answer with a notify_error tool call carrying the error as its content; they will now output a regular generation instead (see the example after this list).
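Because of the breaking change above, client code that assumed every tool-enabled response contains a tool call should now also handle plain text. Below is a minimal sketch against TGI's OpenAI-compatible endpoint; the server URL and tool schema are placeholders.

```python
# With tools enabled, a response may now contain plain text instead of a
# notify_error tool call, so check both fields. URL and tool are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    tools=tools,
    tool_choice="auto",
)

msg = resp.choices[0].message
if msg.tool_calls:          # the model chose to call a tool
    print(msg.tool_calls[0].function)
else:                       # new in 2.3.1: regular text instead of notify_error
    print(msg.content)
```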
What's Changed
- nix: remove unused _server.nix file by @danieldk in #2538
- chore: Add old V2 backend by @OlivierDehaene in #2551
- Remove duplicated RUN in Dockerfile by @alvarobartt in #2547
- Micro cleanup. by @Narsil in #2555
- Hotfixing main by @Narsil in #2556
- Add support for scalar FP8 weight scales by @danieldk in #2550
- Add DenseMoELayer and wire it up in Mixtral/Deepseek V2 by @danieldk in #2537
- Update the link to the Ratatui organization by @orhun in #2546
- Simplify crossterm imports by @orhun in #2545
- Adding note for private models in quick-tour document by @ariG23498 in #2548
- Hotfixing main. by @Narsil in #2562
- Cleanup Vertex + Chat by @Narsil in #2553
- More tensor cores. by @Narsil in #2558
- remove LORA_ADAPTERS_PATH by @nbroad1881 in #2563
- Add LoRA adapters support for Gemma2 by @alvarobartt in #2567
- Fix build with --features google by @alvarobartt in #2566
- Improve support for GPUs with capability < 8 by @danieldk in #2575
- flashinfer: pass window size and dtype by @danieldk in #2574
- Remove compute capability lazy cell by @danieldk in #2580
- Update architecture.md by @ulhaqi12 in #2577
- Update ROCM libs and improvements by @mht-sharma in #2579
- Add support for GPTQ-quantized MoE models using MoE Marlin by @danieldk in #2557
- feat: support phi3.5 moe by @drbh in #2479
- Move flake back to tgi-nix main by @danieldk in #2586
- MoE Marlin: support desc_act for groupsize != -1 by @danieldk in #2590
- nix: experimental support for building a Docker container by @danieldk in #2470
- Mllama flash version by @Narsil in #2585
- Max token capacity metric by @Narsil in #2595
- CI (2592): Allow LoRA adapter revision in server launcher by @drbh in #2602
- Unroll notify error into generate response by @drbh in #2597
- New release 2.3.1 by @Narsil in #2604
New Contributors
- @alvarobartt made their first contribution in #2547
- @orhun made their first contribution in #2546
- @ariG23498 made their first contribution in #2548
- @ulhaqi12 made their first contribution in #2577
- @mht-sharma made their first contribution in #2579
Full Changelog: v2.3.0...v2.3.1
v2.3.0
Important changes
- Renamed HUGGINGFACE_HUB_CACHE to HF_HOME to harmonize environment variables across the HF ecosystem. As a result, data locations inside the Docker image moved from /data/models-... to /data/hub/models-... (see the path sketch after this list).
- Prefix caching by default! To help with long-running queries, TGI will use prefix caching and reuse pre-existing queries in the KV cache in order to speed up TTFT. This should be totally transparent for most users; however, it required an intense rewrite of internals, so bugs can potentially exist. We also changed kernels from paged_attention to flashinfer (with flashdecoding as a fallback for some specific models that aren't supported by flashinfer).
- Lots of performance improvements with Marlin and quantization.
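To illustrate the HF_HOME change above, here is a small sketch of where cached weights end up for a volume mounted at /data, assuming the standard Hub cache directory naming (models--org--name); the model id is a placeholder.

```python
# Illustration of the Docker cache layout change for a volume mounted at /data.
from pathlib import Path

model_id = "Qwen/Qwen2.5-7B-Instruct"                 # placeholder model id
cache_dir = "models--" + model_id.replace("/", "--")  # standard Hub cache naming

old_location = Path("/data") / cache_dir              # before 2.3.0 (HUGGINGFACE_HUB_CACHE=/data)
new_location = Path("/data") / "hub" / cache_dir      # from 2.3.0 on (HF_HOME=/data)

print(old_location)
print(new_location)
```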
What's Changed
- chore: update to torch 2.4 by @OlivierDehaene in #2259
- fix crash in multi-modal by @sywangyi in #2245
- fix of use of unquantized weights in cohere GQA loading, also enable … by @sywangyi in #2291
- Split up layers.marlin into several files by @danieldk in #2292
- fix: refactor adapter weight loading and mapping by @drbh in #2193
- Using g6 instead of g5. by @Narsil in #2281
- Some small fixes for the Torch 2.4.0 update by @danieldk in #2304
- Fixing idefics on g6 tests. by @Narsil in #2306
- Fix registry name by @XciD in #2307
- Support tied embeddings in 0.5B and 1.5B Qwen2 models by @danieldk in #2313
- feat: add ruff and resolve issue by @drbh in #2262
- Run ci api key by @ErikKaum in #2315
- Install Marlin from standalone package by @danieldk in #2320
- fix: reject grammars without properties by @drbh in #2309
- patch-error-on-invalid-grammar by @ErikKaum in #2282
- fix: adjust test snapshots and small refactors by @drbh in #2323
- server quantize: store quantizer config in standard format by @danieldk in #2299
- Rebase TRT-llm by @Narsil in #2331
- Handle GPTQ-Marlin loading in GPTQMarlinWeightLoader by @danieldk in #2300
- Pr 2290 ci run by @drbh in #2329
- refactor usage stats by @ErikKaum in #2339
- enable HuggingFaceM4/idefics-9b in intel gpu by @sywangyi in #2338
- Fix cache block size for flash decoding by @danieldk in #2351
- Unify attention output handling by @danieldk in #2343
- fix: attempt forward on flash attn2 to check hardware support by @drbh in #2335
- feat: include local lora adapter loading docs by @drbh in #2359
- fix: return the out tensor rather then the functions return value by @drbh in #2361
- feat: implement a templated endpoint for visibility into chat requests by @drbh in #2333
- feat: prefer stop over eos_token to align with openai finish_reason by @drbh in #2344
- feat: return the generated text when parsing fails by @drbh in #2353
- fix: default num_ln_in_parallel_attn to one if not supplied by @drbh in #2364
- fix: prefer original layernorm names for 180B by @drbh in #2365
- fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig by @almersawi in #2350
- add gptj modeling in TGI #2366 (CI RUN) by @drbh in #2372
- Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) by @drbh in #2371
- Pr 2374 ci branch by @drbh in #2378
- fix EleutherAI/gpt-neox-20b does not work in tgi by @sywangyi in #2346
- Pr 2337 ci branch by @drbh in #2379
- fix: prefer hidden_activation over hidden_act in gemma2 by @drbh in #2381
- Update Quantization docs and minor doc fix. by @Vaibhavs10 in #2368
- Pr 2352 ci branch by @drbh in #2382
- Add FlashInfer support by @danieldk in #2354
- Add experimental flake by @danieldk in #2384
- Using HF_HOME instead of CACHE to get token read in addition to models. by @Narsil in #2288
- flake: add fmt and clippy by @danieldk in #2389
- Update documentation for Supported models by @Vaibhavs10 in #2386
- flake: use rust-overlay by @danieldk in #2390
- Using an enum for flash backens (paged/flashdecoding/flashinfer) by @Narsil in #2385
- feat: add guideline to chat request and template by @drbh in #2391
- Update flake for 9.0a capability in Torch by @danieldk in #2394
- nix: add router to the devshell by @danieldk in #2396
- Upgrade fbgemm by @Narsil in #2398
- Adding launcher to build. by @Narsil in #2397
- Fixing import exl2 by @Narsil in #2399
- Cpu dockerimage by @sywangyi in #2367
- Add support for prefix caching to the v3 router by @danieldk in #2392
- Keeping the benchmark somewhere by @Narsil in #2401
- feat: validate template variables before apply and improve sliding wi… by @drbh in #2403
- fix: allocate tmp based on sgmv kernel if available by @drbh in #2345
- fix: improve completions to send a final chunk with usage details by @drbh in #2336
- Updating the flake. by @Narsil in #2404
- Pr 2395 ci run by @drbh in #2406
- fix: include create_exllama_buffers and set_device for exllama by @drbh in #2407
- nix: incremental build of the launcher by @danieldk in #2410
- Adding more kernels to flake. by @Narsil in #2411
- add numa to improve cpu inference perf by @sywangyi in #2330
- fix: adds causal to attention params by @drbh in #2408
- nix: partial incremental build of the router by @danieldk in #2416
- Upgrading exl2. by @Narsil in #2415
- More fixes trtllm by @mfuntowicz in #2342
- nix: build router incrementally by @danieldk in #2422
- Fixing exl2 and other quanize tests again. by @Narsil in #2419
- Upgrading the tests to match the current workings. by @Narsil in #2423
- nix: try to reduce the number of Rust rebuilds by @danieldk in https://github.com/huggingface/text-generation-inference/pull/...
v2.2.0
Notable changes
- Llama 3.1 support (including 405B), with FP8 support in a lot of mixed configurations (FP8, AWQ, GPTQ, FP8+FP16).
- Gemma2 softcap support
- Deepseek v2 support.
- Lots of internal reworks/cleanup (allowing for cool features)
- Lots of AWQ/GPTQ work with marlin kernels (everything should be faster by default)
- Flash decoding support (set the FLASH_DECODING=1 environment variable, which will probably enable some nice improvements in the future); see the launch sketch below
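A minimal sketch mirroring the earlier launch example, with the experimental flash decoding flag enabled; the image tag and model id are placeholders.

```python
# Launch the TGI Docker image with the experimental FLASH_DECODING flag set.
# Tag and model id are illustrative placeholders.
import subprocess

subprocess.run(
    [
        "docker", "run", "--gpus", "all", "--shm-size", "1g",
        "-p", "8080:80", "-v", "tgi-data:/data",
        "-e", "FLASH_DECODING=1",  # opt-in flash decoding (may speed up long queries)
        "ghcr.io/huggingface/text-generation-inference:2.2.0",
        "--model-id", "Qwen/Qwen2-7B-Instruct",
    ],
    check=True,
)
```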
What's Changed
- Preparing patch release. by @Narsil in #2186
- Adding "longrope" for Phi-3 (#2172) by @amihalik in #2179
- Refactor dead code - Removing all flash_xxx.py files. by @Narsil in #2166
- Fix Starcoder2 after refactor by @danieldk in #2189
- GPTQ CI improvements by @danieldk in #2151
- Consistently take prefix in model constructors by @danieldk in #2191
- fix dbrx & opt model prefix bug by @icyxp in #2201
- hotfix: Fix number of KV heads by @danieldk in #2202
- Fix incorrect cache allocation with multi-query by @danieldk in #2203
- Falcon/DBRX: get correct number of key-value heads by @danieldk in #2205
- add doc for intel gpus by @sywangyi in #2181
- fix: python deserialization by @jaluma in #2178
- update to metrics 0.23.0 or could work with metrics-exporter-promethe… by @sywangyi in #2190
- feat: use model name as adapter id in chat endpoints by @drbh in #2128
- Fix nccl regression on PyTorch 2.3 upgrade by @fxmarty in #2099
- Fix buildx cache + change runner type by @glegendre01 in #2176
- Fixed README ToC by @vinkamath in #2196
- Updating the self check by @Narsil in #2209
- Move quantized weight handling out of the Weights class by @danieldk in #2194
- Add support for FP8 on compute capability >=8.0, <8.9 by @danieldk in #2213
- fix: append DONE message to chat stream by @drbh in #2221
- [fix] Modifying base in yarn embedding by @SeongBeomLEE in #2212
- Use symmetric quantization in the quantize subcommand by @danieldk in #2120
- feat: simple mistral lora integration tests by @drbh in #2180
- fix custom cache dir by @ErikKaum in #2226
- fix: Remove bitsandbytes installation when running cpu-only install by @Hugoch in #2216
- Add support for AWQ-quantized Idefics2 by @danieldk in #2233
- server quantize: expose groupsize option by @danieldk in #2225
- Remove stray quantize argument in get_weights_col_packed_qkv by @danieldk in #2237
- fix(server): fix cohere by @OlivierDehaene in #2249
- Improve the handling of quantized weights by @danieldk in #2250
- Hotfix: fix of use of unquantized weights in Gemma GQA loading by @danieldk in #2255
- Hotfix: various GPT-based model fixes by @danieldk in #2256
- Hotfix: fix MPT after recent refactor by @danieldk in #2257
- Hotfix: pass through model revision in VlmCausalLM by @danieldk in #2258
- usage stats and crash reports by @ErikKaum in #2220
- add usage stats to toctree by @ErikKaum in #2260
- fix: adjust default tool choice by @drbh in #2244
- Add support for Deepseek V2 by @danieldk in #2224
- re-push to internal registry by @XciD in #2242
- Add FP8 release test by @danieldk in #2261
- feat(fp8): use fbgemm kernels and load fp8 weights directly by @OlivierDehaene in #2248
- fix(server): fix deepseekv2 loading by @OlivierDehaene in #2266
- Hotfix: fix of use of unquantized weights in Mixtral GQA loading by @icyxp in #2269
- legacy warning on text_generation client by @ErikKaum in #2271
- fix(ci): test new instances by @XciD in #2272
- fix(server): fix fp8 weight loading by @OlivierDehaene in #2268
- Softcapping for gemma2. by @Narsil in #2273
- use proper name for ci by @XciD in #2274
- Fixing mistral nemo. by @Narsil in #2276
- fix(l4): fix fp8 logic on l4 by @OlivierDehaene in #2277
- Add support for repacking AWQ weights for GPTQ-Marlin by @danieldk in #2278
- [WIP] Add support for Mistral-Nemo by supporting head_dim through config by @shaltielshmid in #2254
- Preparing for release. by @Narsil in #2285
- Add support for Llama 3 rotary embeddings by @danieldk in #2286
- hotfix: pin numpy by @danieldk in #2289
New Contributors
- @jaluma made their first contribution in #2178
- @vinkamath made their first contribution in #2196
- @ErikKaum made their first contribution in #2226
- @Hugoch made their first contribution in #2216
- @XciD made their first contribution in #2242
- @shaltielshmid made their first contribution in #2254
Full Changelog: v2.1.1...v2.2.0
v2.1.1
Main changes
- Bugfixes
- Added FlashDecoding support (beta): set FLASH_DECODING=1 to use TGI with flash decoding (large speedups on long queries). #1940
- Use Marlin over GPTQ kernels for faster GPTQ inference #2111
What's Changed
- Fixing the CI to also run in release when it's a tag ? by @Narsil in #2138
- fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… by @sywangyi in #2148
- Fixing clippy. by @Narsil in #2149
- fix: use weights from base_layer by @drbh in #2141
- feat: download lora adapter weights from launcher by @drbh in #2140
- Use GPTQ-Marlin for supported GPTQ configurations by @danieldk in #2111
- fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' by @icyxp in #2123
- refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform by @sywangyi in #2132
- fix: prefer serde structs over custom functions by @drbh in #2127
- Fixing test. by @Narsil in #2152
- GH router. by @Narsil in #2153
- Fixing baichuan override. by @Narsil in #2158
- [Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. by @Narsil in #1940
- Fixing graph capture for flash decoding. by @Narsil in #2163
- fix FlashDecoding change's regression in intel platform by @sywangyi in #2161
- fix: use the base layers weight in mistral rocm by @drbh in #2155
- Fixing rocm. by @Narsil in #2164
- Ci test by @glegendre01 in #2124
- Hotfixing qwen2 and starcoder2 (which also get clamping). by @Narsil in #2167
- feat: improve update_docs for openapi schema by @drbh in #2169
- Fixing the dockerfile warnings. by @Narsil in #2173
- Fixing missing object field for regular completions. by @Narsil in #2175
New Contributors
Full Changelog: v2.1.0...v2.1.1
v2.1.0
Notable changes
- New models: gemma2
- Multi-LoRA adapters. You can now run multiple LoRAs on the same TGI deployment #2010 (see the request sketch after this list).
- Faster GPTQ inference and Marlin support (up to 2x speedup).
- Reworked the entire scheduling logic (better block allocations, allowing further speedups in new releases).
- Lots of ROCm support and bugfixes.
- Lots of new contributors! Thanks a lot for these contributions.
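Below is a sketch of selecting one of several preloaded LoRA adapters per request, as enabled by the multi-LoRA work above (#2010). It assumes a TGI server on localhost:8080 that was launched with the named adapter preloaded and that the per-request adapter_id generation parameter is available; the adapter id is a placeholder.

```python
# Pick a preloaded LoRA adapter per request via the /generate endpoint.
# Server URL and adapter id are placeholders.
import requests

payload = {
    "inputs": "Hello, who are you?",
    "parameters": {
        "max_new_tokens": 40,
        "adapter_id": "my-org/my-lora-adapter",  # must match an adapter loaded at launch
    },
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```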
What's Changed
- OpenAI function calling compatible support by @phangiabao98 in #1888
- Fixing types. by @Narsil in #1906
- Types. by @Narsil in #1909
- Fixing signals. by @Narsil in #1910
- Removing some unused code. by @Narsil in #1915
- MI300 compatibility by @fxmarty in #1764
- Add TGI monitoring guide through Grafana and Prometheus by @fxmarty in #1908
- Update grafana template by @fxmarty in #1918
- Fix TunableOp bug by @fxmarty in #1920
- Fix TGI issues with ROCm by @fxmarty in #1921
- Fixing the download strategy for ibm-fms by @Narsil in #1917
- ROCm: make CK FA2 default instead of Triton by @fxmarty in #1924
- docs: Fix grafana dashboard url by @edwardzjl in #1925
- feat: include token in client test like server tests by @drbh in #1932
- Creating doc automatically for supported models. by @Narsil in #1929
- fix: use path inside of speculator config by @drbh in #1935
- feat: add train medusa head tutorial by @drbh in #1934
- reenable xpu for tgi by @sywangyi in #1939
- Fixing some legacy behavior (big swapout of serverless on legacy stuff). by @Narsil in #1937
- Add completion route to client and add stop parameter where it's missing by @thomas-schillaci in #1869
- Improving the logging system. by @Narsil in #1938
- Fixing codellama loads by using purely AutoTokenizer. by @Narsil in #1947
- Fix seeded output. by @Narsil in #1949
- Fix (flash) Gemma prefix and enable tests by @danieldk in #1950
- Fix GPTQ for models which do not have float16 at the default dtype (simpler) by @danieldk in #1953
- Processor config chat template by @drbh in #1954
- fix small typo and broken link by @MoritzLaurer in #1958
- Upgrade to Axum 0.7 and Hyper 1.0 (Breaking change: disabled ngrok tunneling). by @Narsil in #1959
- Fix (non-container) pytest stdout buffering-related lock-up by @danieldk in #1963
- Fixing the text part from tokenizer endpoint. by @Narsil in #1967
- feat: adjust attn weight loading logic by @drbh in #1975
- Add support for exl2-quantized models by @danieldk in #1965
- Update documentation version to 2.0.4 by @fxmarty in #1980
- Purely refactors paged/attention into layers/attention and make hardware differences more obvious with 1 file per hardware. by @Narsil in #1986
- Fixing exl2 scratch buffer. by @Narsil in #1990
- single char ` addition for docs by @nbroad1881 in #1989
- Fixing GPTQ imports. by @Narsil in #1994
- reable xpu, broken by gptq and setuptool upgrade by @sywangyi in #1988
- router: send the input as chunks to the backend by @danieldk in #1981
- Fix Phi-2 with tp>1 by @danieldk in #2003
- fix: update triton implementation reference by @emmanuel-ferdman in #2002
- feat: add SchedulerV3 by @OlivierDehaene in #1996
- Support GPTQ models with column-packed up/gate tensor by @danieldk in #2006
- Making make install work better by default. by @Narsil in #2004
- Hotfixing make install. by @Narsil in #2008
- Do not initialize scratch space when there are no ExLlamaV2 layers by @danieldk in #2015
- feat: move allocation logic to rust by @OlivierDehaene in #1835
- Fixing rocm. by @Narsil in #2021
- Fix GPTQWeight import by @danieldk in #2020
- Update version on init.py to 0.7.0 by @andimarafioti in #2017
- Add support for Marlin-quantized models by @danieldk in #2014
- marlin: support tp>1 when group_size==-1 by @danieldk in #2032
- marlin: improve build by @danieldk in #2031
- Internal runner ? by @Narsil in #2023
- Xpu gqa by @sywangyi in #2013
- server: use chunked inputs by @danieldk in #1985
- ROCm and sliding windows fixes by @fxmarty in #2033
- Add Phi-3 medium support by @danieldk in #2039
- feat(ci): add trufflehog secrets detection by @McPatate in #2038
- fix(ci): remove unnecessary permissions by @McPatate in #2045
- Update LLMM1 bound by @fxmarty in #2050
- Support chat response format by @drbh in #2046
- fix(server): fix OPT implementation by @OlivierDehaene in #2061
- fix(layers): fix SuRotaryEmbedding by @OlivierDehaene in #2060
- PR #2049 CI run by @drbh in #2054
- implement Open Inference Protocol endpoints by @drbh in #1942
- Add support for GPTQ Marlin by @danieldk in #2052
- Update the link for qwen2 by @xianbaoqian in #2068
- Adding architecture document by @tengomucho in #2044
- Support different image sizes in prefill in VLMs by @danieldk in #2065
- Contributing guide & Code of Conduct by @LysandreJik in #2074
- fix build.rs watch files by @zirconium-n in #2072
- Set maximum grpc message receive size to 2GiB by @danieldk in #2075
- CI: Tailscale improvements by @glegendre01 in #2079
- CI: pass pre-commit hooks again by @danieldk in #2084
- feat: rotate tests ci token by @drbh in #2091
- Support exl2-quantized Qwen2 models by @danieldk in #2085
- Factor out sharding of packed tensors by @...
v2.0.4
Main changes
What's Changed
- OpenAI function calling compatible support by @phangiabao98 in #1888
- Fixing types. by @Narsil in #1906
- Types. by @Narsil in #1909
- Fixing signals. by @Narsil in #1910
- Removing some unused code. by @Narsil in #1915
- MI300 compatibility by @fxmarty in #1764
- Add TGI monitoring guide through Grafana and Prometheus by @fxmarty in #1908
- Update grafana template by @fxmarty in #1918
- Fix TunableOp bug by @fxmarty in #1920
- Fix TGI issues with ROCm by @fxmarty in #1921
- Fixing the download strategy for ibm-fms by @Narsil in #1917
- ROCm: make CK FA2 default instead of Triton by @fxmarty in #1924
- docs: Fix grafana dashboard url by @edwardzjl in #1925
- feat: include token in client test like server tests by @drbh in #1932
- Creating doc automatically for supported models. by @Narsil in #1929
- fix: use path inside of speculator config by @drbh in #1935
- feat: add train medusa head tutorial by @drbh in #1934
- reenable xpu for tgi by @sywangyi in #1939
- Fixing some legacy behavior (big swapout of serverless on legacy stuff). by @Narsil in #1937
- Add completion route to client and add stop parameter where it's missing by @thomas-schillaci in #1869
- Improving the logging system. by @Narsil in #1938
- Fixing codellama loads by using purely AutoTokenizer. by @Narsil in #1947
New Contributors
- @phangiabao98 made their first contribution in #1888
- @edwardzjl made their first contribution in #1925
- @thomas-schillaci made their first contribution in #1869
Full Changelog: v2.0.3...v2.0.4
v2.0.3
Important changes
- Add: Support for the Falcon2 by @Nilabhra in #1886
- New speculation method MLPSpeculator. by @JRosenkranz in #1865
- Pali gemma modeling by @drbh in #1895
What's Changed
- Fix: "Fixing" double BOS for mistral too. by @Narsil in #1843
- Adding scripts to prepare load data. by @Narsil in #1841
- Remove misleading warning (not that important nowadays anyway). by @Narsil in #1848
- feat: prefer huggingface_hub in docs and show image api by @drbh in #1844
- Updating Phi3 (long context). by @Narsil in #1849
- Add router name to /info endpoint by @Wauplin in #1854
- Upgrading to rust 1.78. by @Narsil in #1851
- update xpu docker image and use public ipex whel by @sywangyi in #1860
- Refactor layers. by @Narsil in #1866
- Granite support? by @Narsil in #1882
- Add: Support for the Falcon2 11B architecture by @Nilabhra in #1886
- MLPSpeculator. by @JRosenkranz in #1865
- Fixing truncation. by @Narsil in #1890
- Correct 'using guidance' link by @brandon-lockaby in #1892
- Add GPT-2 with flash attention by @danieldk in #1889
- Removing accepted ids in the regular info logs, downgrade to debug. by @Narsil in #1898
- feat: add deprecation warning to clients by @drbh in #1855
- [Bug Fix] Update torch import reference in bnb quantization by @DhruvSrikanth in #1902
- Pali gemma modeling by @drbh in #1895
New Contributors
- @Nilabhra made their first contribution in #1886
- @brandon-lockaby made their first contribution in #1892
- @danieldk made their first contribution in #1889
- @DhruvSrikanth made their first contribution in #1902
Full Changelog: v2.0.2...v2.0.3
v2.0.2
Tl;dr
- New models (idefics2, phi3)
- Cleaner VLM support in the openai layer
- Upgraded to pytorch 2.3.0
What's Changed
- Make --cuda-graphs 0 work as expected (bis) by @fxmarty in #1768
- fix typos in docs and add small clarifications by @MoritzLaurer in #1790
- Add attribute descriptions for GenerateParameters by @Wauplin in #1798
- feat: allow null eos and bos tokens in config by @drbh in #1791
- Phi3 support by @Narsil in #1797
- Idefics2. by @Narsil in #1756
- fix: avoid frequency and repetition penalty on padding tokens by @drbh in #1765
- Adding support for HF_HUB_OFFLINE support in the router. by @Narsil in #1789
- feat: improve temperature logic in chat by @drbh in #1749
- Updating the benchmarks so everyone uses openai compat layer. by @Narsil in #1800
- Update guidance docs to reflect grammar support in API by @dr3s in #1775
- Use the generation config. by @Narsil in #1808
- 2nd round of benchmark modifications (tiny adjustements to avoid overloading the host). by @Narsil in #1816
- Adding new env variables for TPU backends. by @Narsil in #1755
- add intel xpu support for TGI by @sywangyi in #1475
- Blunder by @Narsil in #1815
- Fixing qwen2. by @Narsil in #1818
- Dummy CI run. by @Narsil in #1817
- Changing the waiting_served_ratio default (stack more aggressively by default). by @Narsil in #1820
- Better graceful shutdown. by @Narsil in #1827
- Add the missing tool_prompt parameter to Python client by @maziyarpanahi in #1825
- Small CI cleanup. by @Narsil in #1801
- Add reference to TPU support by @brandonroyal in #1760
- fix: use get_speculate to the number of layers by @OlivierDehaene in #1737
- feat: add how it works section by @drbh in #1773
- Fixing frequency penalty by @martinigoyanes in #1811
- feat: add vlm docs and simple examples by @drbh in #1812
- Handle images in chat api by @drbh in #1828
- chore: update torch by @OlivierDehaene in #1730
- (chore): torch 2.3.0 by @Narsil in #1833
New Contributors
- @MoritzLaurer made their first contribution in #1790
- @dr3s made their first contribution in #1775
- @maziyarpanahi made their first contribution in #1825
- @brandonroyal made their first contribution in #1760
- @martinigoyanes made their first contribution in #1811
Full Changelog: v2.0.1...v2.0.2