Releases · huggingface/text-generation-inference
v2.0.1
v2.0.0
TGI is back to Apache 2.0!
Highlights
- License was reverted to Apache 2.0
- CUDA graphs are now used by default. They improve latency substantially on high-end nodes.
- Llava-Next was added. It is the second multimodal model available on TGI, after Idefics.
- Cohere Command R+ support. TGI is the fastest open-source backend for Command R+.
- FP8 support.
- The vocabulary is now shared across all Medusa heads, greatly improving latency and memory use.
Try out Command R+ with Medusa heads on 4xA100s with:

```shell
model=text-generation-inference/commandrplus-medusa
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id $model --speculate 3 --num-shard 4
```
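Once the server is up, you can query it through the standard `/generate` route (a minimal sketch; the prompt and parameters are illustrative):

```shell
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'
```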
What's Changed
- Add cuda graphs sizes and make it default. by @Narsil in #1703
- Pickle conversion now requires `--trust-remote-code`. by @Narsil in #1704
- Push users to streaming in the readme. by @Narsil in #1698
- Fixing cohere tokenizer. by @Narsil in #1697
- Force weights_only (before fully breaking pickle files anyway). by @Narsil in #1710
- Regenerate ld.so.cache by @oOraph in #1708
- Revert license to Apache 2.0 by @OlivierDehaene in #1714
- Automatic quantization config. by @Narsil in #1719
- Adding Llava-Next (Llava 1.6) with full support. by @Narsil in #1709 (see the image-input sketch after this list)
- fix: fix CohereForAI/c4ai-command-r-plus by @OlivierDehaene in #1707
- Update libraries by @abhishekkrthakur in #1713
- Dev/mask ldconfig output v2 by @oOraph in #1716
- Fp8 Support by @Narsil in #1726
- Upgrade EETQ (Fixes the cuda graphs). by @Narsil in #1729
- fix(router): fix a possible deadlock in next_batch by @OlivierDehaene in #1731
- chore(cargo-toml): apply lto fat and codegen-units of one by @somehowchris in #1651
- Improve the defaults for the launcher by @Narsil in #1727
- feat: medusa shared by @OlivierDehaene in #1734
- Fix typo in guidance.md by @eltociear in #1735
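For the Llava-Next support added in #1709 (building on inline images from #1666 in v1.4.4), a minimal sketch of an image-grounded request. The markdown-style inline image, the URL, and the prompt are illustrative assumptions, not the exact documented syntax:

```shell
# Assumption: the image is passed inline in the prompt using markdown image
# syntax; the URL is a placeholder.
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "![](https://example.com/cat.png)What is shown in this image?", "parameters": {"max_new_tokens": 64}}'
```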
New Contributors
- @somehowchris made their first contribution in #1651
Full Changelog: v1.4.5...v2.0.0
v1.4.5
What's Changed
- fix: adjust logprob response logic by @drbh in #1682
- fix: handle batches with and without grammars by @drbh in #1676
- feat: Add dbrx support by @OlivierDehaene in #1685
Full Changelog: v1.4.4...v1.4.5
v1.4.4
Highlights
- CohereForAI/c4ai-command-r-v01 model support
What's Changed
- Handle concurrent grammar requests by @drbh in #1610
- Fix idefics default. by @Narsil in #1614
- Fix async client timeout by @hugoabonizio in #1617
- accept legacy request format and response by @drbh in #1527
- add missing stop parameter for chat request by @drbh in #1619
- correctly index into mask when applying grammar by @drbh in #1618
- Use a better model for the quick tour by @lewtun in #1639
- Upgrade nix version from 0.27.1 to 0.28.0 by @yuanwu2017 in #1638
- Update peft + transformers + accelerate + bnb + safetensors by @abhishekkrthakur in #1646
- Fix index in ChatCompletionChunk by @Wauplin in #1648
- Fixing minor typo in documentation: supported hardware section by @SachinVarghese in #1632
- bump minijinja and add test for core templates by @drbh in #1626
- support force downcast after FastRMSNorm multiply for Gemma by @drbh in #1658
- prefer spaces url over temp url by @drbh in #1662
- improve tool type, bump pydantic and outlines by @drbh in #1650
- Remove unnecessary cuda graph. by @Narsil in #1664
- Repair idefics integration tests. by @Narsil in #1663
- fix: LlamaTokenizerFast to AutoTokenizer at flash_mistral.py by @SeongBeomLEE in #1637
- Inline images for multimodal models. by @Narsil in #1666
New Contributors
- @hugoabonizio made their first contribution in #1617
- @yuanwu2017 made their first contribution in #1638
- @abhishekkrthakur made their first contribution in #1646
- @Wauplin made their first contribution in #1648
- @SachinVarghese made their first contribution in #1632
- @SeongBeomLEE made their first contribution in #1637
Full Changelog: v1.4.3...v1.4.4
v1.4.3
Highlights
- Add support for Starcoder 2
- Add support for Qwen2
What's Changed
- fix openapi schema by @OlivierDehaene in #1586
- avoid default message by @drbh in #1579
- Revamp medusa implementation so that every model can benefit. by @Narsil in #1588
- Support tools by @drbh in #1587 (see the sketch after this list)
- Fixing x-compute-time. by @Narsil in #1606
- Fixing guidance docs. by @Narsil in #1607
- starcoder2 by @OlivierDehaene in #1605
- Qwen2 by @Jason-CKY in #1608
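For the tools support added in #1587, a minimal sketch against the OpenAI-compatible route; the tool name and JSON schema are illustrative:

```shell
curl 127.0.0.1:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "tgi",
      "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
      "tools": [{
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
          }
        }
      }],
      "tool_choice": "auto"
    }'
```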
Full Changelog: v1.4.2...v1.4.3
v1.4.2
Highlights
- Add support for Google Gemma models
What's Changed
- Fix mistral with length > window_size for long prefills (rotary doesn't create long enough cos, sin). by @Narsil in #1571
- improve endpoint support by @drbh in #1577
- refactor syntax to correctly include structs by @drbh in #1580
- fix openapi and add jsonschema validation by @OlivierDehaene in #1578
- add support for Gemma by @OlivierDehaene in #1583
Full Changelog: v1.4.1...v1.4.2
v1.4.1
Highlights
- Mamba support by @drbh in #1480 and by @Narsil in #1552
- Experimental support for CUDA graphs by @OlivierDehaene in #1428
- Outlines guided generation by @drbh in #1539 (see the sketch after this list)
- Added `name` field to OpenAI-compatible API Messages by @amihalik in #1563
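For the Outlines-based guided generation from #1539, a minimal sketch constraining the output to a JSON schema via the `grammar` parameter; the schema itself is illustrative:

```shell
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
      "inputs": "Give me basic information about Paris.",
      "parameters": {
        "max_new_tokens": 128,
        "grammar": {
          "type": "json",
          "value": {
            "type": "object",
            "properties": {"city": {"type": "string"}, "country": {"type": "string"}},
            "required": ["city", "country"]
          }
        }
      }
    }'
```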
What's Changed
- Fixing top_n_tokens. by @Narsil in #1497
- Sending compute type from the environment instead of hardcoded string by @Narsil in #1504
- Create the compute type at launch time (if not provided in the env). by @Narsil in #1505
- Modify default for max_new_tokens in python client by @freitng in #1336
- feat: eetq gemv optimization when batch_size <= 4 by @dtlzhuangz in #1502
- fix: improve messages api docs content and formatting by @drbh in #1506
- GPTNeoX: Use static rotary embedding by @dwyatte in #1498
- Hotfix the `/health` route. by @Narsil in #1515
- fix: tokenizer config should use local model path when possible by @drbh in #1518
- Updating tokenizers. by @Narsil in #1517
- [docs] Fix link to Install CLI by @pcuenca in #1526
- feat: add ie update to message docs by @drbh in #1523
- feat: use existing add_generation_prompt variable from config in temp… by @drbh in #1533
- Update to peft 0.8.2 by @Stillerman in #1537
- feat(server): add frequency penalty by @OlivierDehaene in #1541
- chore: bump ci rust version by @drbh in #1543
- ROCm AWQ support by @IlyasMoutawwakil in #1514
- feat(router): add max_batch_size by @OlivierDehaene in #1542 (see the sketch after this list)
- feat: add deserialize_with that handles strings or objects with content by @drbh in #1550
- Fixing glibc version in the runtime. by @Narsil in #1556
- Upgrade intermediary layer for nvidia too. by @Narsil in #1557
- Improving mamba runtime by using updates by @Narsil in #1552
- Small cleanup. by @Narsil in #1560
- Bugfix: eos and bos tokens positions are inconsistent by @amihalik in #1567
- chore: add pre-commit by @OlivierDehaene in #1569
- feat: add chat template struct to avoid tuple ordering errors by @OlivierDehaene in #1570
- v1.4.1 by @OlivierDehaene in #1568
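For the `max_batch_size` option from #1542, a minimal launcher sketch; the model and the size cap are placeholders:

```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:1.4.1 \
    --model-id mistralai/Mistral-7B-Instruct-v0.2 \
    --max-batch-size 8 # hard cap on the number of requests batched together
```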
New Contributors
- @freitng made their first contribution in #1336
- @dtlzhuangz made their first contribution in #1502
- @dwyatte made their first contribution in #1498
- @pcuenca made their first contribution in #1526
- @Stillerman made their first contribution in #1537
- @IlyasMoutawwakil made their first contribution in #1514
- @amihalik made their first contribution in #1563
Full Changelog: v1.4.0...v1.4.1
v1.4.0
Highlights
- OpenAI compatible API #1427 (see the sketch after this list)
- exllama v2 Tensor Parallel #1490
- GPTQ support for AMD GPUs #1489
- Phi support #1442
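A minimal sketch of the new OpenAI-compatible Messages API (#1427); the messages are illustrative:

```shell
curl 127.0.0.1:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "tgi",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"}
      ],
      "stream": false,
      "max_tokens": 64
    }'
```

Because the route follows the OpenAI schema, existing OpenAI client libraries can also be pointed at the server by overriding their base URL.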
What's Changed
- fix: fix local loading for .bin models by @OlivierDehaene in #1419
- Fix missing make target platform for local install: 'install-flash-attention-v2' by @deepily in #1414
- fix: follow base model for tokenizer in router by @OlivierDehaene in #1424
- Fix local load for Medusa by @PYNing in #1420
- Return prompt vs generated tokens. by @Narsil in #1436
- feat: supports openai chat completions API by @drbh in #1427
- feat: support raise_exception, bos and eos tokens by @drbh in #1450
- chore: bump rust version and annotate/fix all clippy warnings by @drbh in #1455
- feat: conditionally toggle chat on invocations route by @drbh in #1454
- Disable `decoder_input_details` on OpenAI-compatible chat streaming, pass temp and top-k from API by @EndlessReform in #1470
- Fixing non divisible embeddings. by @Narsil in #1476
- Add messages api compatibility docs by @drbh in #1478
- Add a new `/tokenize` route to get the tokenized input by @Narsil in #1471 (see the sketch after this list)
- feat: adds phi model by @drbh in #1442
- fix: read stderr in download by @OlivierDehaene in #1486
- fix: show warning with tokenizer config parsing error by @drbh in #1488
- fix: launcher doc typos by @Narsil in #1473
- Reinstate exl2 with tp by @Narsil in #1490
- Add sealion mpt support by @Narsil in #1477
- Trying to fix that flaky test. by @Narsil in #1491
- fix: launcher doc typos by @thelinuxkid in #1462
- Update the docs to include newer models. by @Narsil in #1492
- GPTQ support on ROCm by @fxmarty in #1489
- feat: add tokenizer-config-path to launcher args by @drbh in #1495
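A minimal sketch of the new `/tokenize` route from #1471, assuming it accepts the same `inputs` payload as `/generate`:

```shell
curl 127.0.0.1:8080/tokenize \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?"}'
```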
New Contributors
- @deepily made their first contribution in #1414
- @PYNing made their first contribution in #1420
- @drbh made their first contribution in #1427
- @EndlessReform made their first contribution in #1470
- @thelinuxkid made their first contribution in #1462
Full Changelog: v1.3.4...v1.4.0
v1.3.4
What's Changed
- feat: relax mistral requirements by @OlivierDehaene in #1351
- fix: fix logic if sliding window key is not present in config by @OlivierDehaene in #1352
- fix: fix offline (#1341) by @OlivierDehaene in #1347
- fix: fix gpt-q with groupsize = -1 by @OlivierDehaene in #1358
- Peft safetensors. by @Narsil in #1364
- Change URL for Habana Gaudi support in doc by @regisss in #1343
- feat: update exllamav2 kernels by @OlivierDehaene in #1370
- Fix local load for peft by @Narsil in #1373
Full Changelog: v1.3.3...v1.3.4
v1.3.3
What's Changed
- fix gptq params loading
- improve decode latency for long sequences twofold
- feat: add more latency metrics in forward by @OlivierDehaene in #1346 (see the sketch after this list)
- fix: max_past default value must be -1, not 0 by @OlivierDehaene in #1348
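The new latency metrics from #1346 land on the router's Prometheus endpoint; a minimal sketch for inspecting them (the grep pattern is illustrative):

```shell
# List forward/decode latency metrics exposed by the running server.
curl -s 127.0.0.1:8080/metrics | grep duration
```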
Full Changelog: v1.3.2...v1.3.3