Release v2.1.1 · huggingface/text-generation-inference

Main changes

Bugfixes
Added FlashDecoding support (beta): set FLASH_DECODING=1 to run TGI with flash decoding (large speedups on long queries). #1940
Use Marlin over GPTQ kernels for faster GPTQ inference #2111
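
As a rough illustration of the FLASH_DECODING=1 opt-in mentioned above, the sketch below builds an environment and a launch command without starting anything. It assumes TGI is started through the `text-generation-launcher` binary with `--model-id`. The helper names and the model id are hypothetical and not part of this release.

```python
import os
import shlex


def flash_decoding_env(base_env=None):
    """Return a copy of the environment with FLASH_DECODING=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["FLASH_DECODING"] = "1"  # opt-in variable named in these release notes
    return env


def launcher_command(model_id):
    """Build the launcher argv (assumption: TGI is started via text-generation-launcher)."""
    return ["text-generation-launcher", "--model-id", model_id]


# Hypothetical model id, used only for illustration.
env = flash_decoding_env({})
cmd = launcher_command("my-org/my-model")

# Print the equivalent shell line instead of launching the server.
print("FLASH_DECODING=" + env["FLASH_DECODING"], shlex.join(cmd))
```

To actually start the server, the same `cmd` and `env` could be passed to `subprocess.run(cmd, env=env)`. Since the feature is marked beta, leaving the variable unset keeps the previous behaviour.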

What's Changed

Fixing the CI to also run in release when it's a tag ? by @Narsil in #2138
fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… by @sywangyi in #2148
Fixing clippy. by @Narsil in #2149
fix: use weights from base_layer by @drbh in #2141
feat: download lora adapter weights from launcher by @drbh in #2140
Use GPTQ-Marlin for supported GPTQ configurations by @danieldk in #2111
fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' by @icyxp in #2123
refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform by @sywangyi in #2132
fix: prefer serde structs over custom functions by @drbh in #2127
Fixing test. by @Narsil in #2152
GH router. by @Narsil in #2153
Fixing baichuan override. by @Narsil in #2158
[Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. by @Narsil in #1940
Fixing graph capture for flash decoding. by @Narsil in #2163
fix FlashDecoding change's regression in intel platform by @sywangyi in #2161
fix: use the base layers weight in mistral rocm by @drbh in #2155
Fixing rocm. by @Narsil in #2164
Ci test by @glegendre01 in #2124
Hotfixing qwen2 and starcoder2 (which also get clamping). by @Narsil in #2167
feat: improve update_docs for openapi schema by @drbh in #2169
Fixing the dockerfile warnings. by @Narsil in #2173
Fixing missing object field for regular completions. by @Narsil in #2175

Full Changelog: v2.1.0...v2.1.1