v2.1.1
Main changes
- Bugfixes
- Added FlashDecoding support (Beta): set FLASH_DECODING=1 to run TGI with flash decoding, which gives large speedups on long queries. #1940
- Use Marlin over GPTQ kernels for faster GPTQ inference #2111
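
As a rough illustration of the FLASH_DECODING=1 opt-in mentioned above, the sketch below builds the environment and command line for starting the server from Python. It assumes the `text-generation-launcher` binary is on the PATH and uses a placeholder model id; `build_launch` is a hypothetical helper for this example, not part of TGI.

```python
import os
import shlex


def build_launch(model_id, base_env=None):
    """Return (argv, env) for starting TGI with flash decoding enabled.

    Hypothetical helper for illustration only; not part of TGI.
    """
    # Copy the environment so the caller's mapping is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    # Opt-in switch named in these release notes (Beta).
    env["FLASH_DECODING"] = "1"
    # Assumes the launcher binary is installed and on the PATH.
    argv = ["text-generation-launcher", "--model-id", model_id]
    return argv, env


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder, not a real model id.
    argv, env = build_launch("my-org/my-model")
    print("FLASH_DECODING=" + env["FLASH_DECODING"], shlex.join(argv))
    # To actually start the server, something like:
    #   subprocess.run(argv, env=env, check=True)
```

Since the feature is marked Beta, leaving the variable unset keeps the default behaviour; the helper only sets it in a copy of the environment, so nothing leaks into the calling process.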
What's Changed
- Fixing the CI to also run in release when it's a tag ? by @Narsil in #2138
- fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… by @sywangyi in #2148
- Fixing clippy. by @Narsil in #2149
- fix: use weights from base_layer by @drbh in #2141
- feat: download lora adapter weights from launcher by @drbh in #2140
- Use GPTQ-Marlin for supported GPTQ configurations by @danieldk in #2111
- fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' by @icyxp in #2123
- refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform by @sywangyi in #2132
- fix: prefer serde structs over custom functions by @drbh in #2127
- Fixing test. by @Narsil in #2152
- GH router. by @Narsil in #2153
- Fixing baichuan override. by @Narsil in #2158
- [Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. by @Narsil in #1940
- Fixing graph capture for flash decoding. by @Narsil in #2163
- fix FlashDecoding change's regression in intel platform by @sywangyi in #2161
- fix: use the base layers weight in mistral rocm by @drbh in #2155
- Fixing rocm. by @Narsil in #2164
- Ci test by @glegendre01 in #2124
- Hotfixing qwen2 and starcoder2 (which also get clamping). by @Narsil in #2167
- feat: improve update_docs for openapi schema by @drbh in #2169
- Fixing the dockerfile warnings. by @Narsil in #2173
- Fixing missing `object` field for regular completions. by @Narsil in #2175
New Contributors
Full Changelog: v2.1.0...v2.1.1