Releases: huggingface/text-generation-inference

v0.5.0

11 Apr 18:32
6f0f1d7

Features

  • server: add flash-attention based version of Llama
  • server: add flash-attention based version of Santacoder
  • server: support OPT models
  • router: make router input validation optional
  • docker: improve layer caching

Fix

  • server: improve token streaming decoding
  • server: fix escape characters in stop sequences
  • router: fix NCCL desync issues
  • router: use buckets for metrics histograms
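
The histogram-bucket fix above can be illustrated with a minimal sketch of Prometheus-style bucketed observation (the bucket boundaries here are illustrative, not the router's actual values):

```python
import bisect

def observe(counts, buckets, value):
    """Increment the first bucket whose upper bound is >= value.

    `buckets` are sorted upper bounds, Prometheus-style; values beyond the
    last bound land in a final +Inf bucket (the last slot in `counts`).
    """
    i = bisect.bisect_left(buckets, value)
    counts[i] += 1

# Illustrative latency bounds in seconds (not the router's real boundaries).
buckets = [0.001, 0.01, 0.1, 1.0, 10.0]
counts = [0] * (len(buckets) + 1)  # +1 for the +Inf bucket
for latency in [0.0005, 0.02, 0.2, 0.2, 5.0, 42.0]:
    observe(counts, buckets, latency)
print(counts)  # → [1, 0, 1, 2, 1, 1]
```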

v0.4.3

30 Mar 15:29
fef1a1c

Fix

  • router: fix OTLP distributed tracing initialization

v0.4.2

30 Mar 15:10
84722f3

Features

  • benchmark: TUI-based benchmarking tool
  • router: clear cache on error
  • server: add mypy-protobuf
  • server: fuse MLP and attention into a single op for flash NeoX
  • image: AWS SageMaker-compatible image

Fix

  • server: avoid try/except to determine the kind of AutoModel
  • server: fix flash neox rotary embedding

v0.4.1

26 Mar 14:38
ab5fd8c

Features

  • server: New faster GPTNeoX implementation based on flash attention

Fix

  • server: fix input-length discrepancy between Rust and Python tokenizers

v0.4.0

09 Mar 15:10
411d624

Features

  • router: support best_of sampling
  • router: support left truncation
  • server: support typical sampling
  • launcher: allow local models
  • clients: add text-generation Python client
  • launcher: allow parsing num_shard from CUDA_VISIBLE_DEVICES
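
The `num_shard` inference above can be sketched as follows; `infer_num_shard` is a hypothetical helper mirroring the launcher behavior (counting visible GPUs), not the launcher's actual Rust code:

```python
import os

def infer_num_shard(env=os.environ):
    """Infer the shard count from CUDA_VISIBLE_DEVICES.

    Hypothetical sketch: when the variable lists GPU ids, the shard count
    defaults to the number of visible devices.
    """
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if not visible:
        return 1  # assumption: fall back to a single shard
    return len([d for d in visible.split(",") if d.strip()])

print(infer_num_shard({"CUDA_VISIBLE_DEVICES": "0,1,2,3"}))  # → 4
```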

Fix

  • server: do not warp prefill logits
  • server: fix formatting issues in generate_stream tokens
  • server: fix galactica batch
  • server: fix index out of range issue with watermarking

v0.3.2

03 Mar 17:42
1c19b09

Features

  • router: add support for huggingface api-inference
  • server: add logits watermarking following "A Watermark for Large Language Models"
  • server: use a fixed transformers commit
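
The watermarking idea from the cited paper can be sketched in a few lines: seed an RNG with the previous token, pick a "green" subset of the vocabulary, and bias those logits. This is a simplified illustration only; the server's real implementation operates on logits tensors with a keyed hash, not Python's `random` module, and the parameter values here are illustrative:

```python
import random

def greenlist(prev_token_id, vocab_size, gamma=0.5):
    """Seed an RNG with the previous token and pick a green vocab subset."""
    rng = random.Random(prev_token_id)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def watermark_logits(logits, prev_token_id, delta=2.0):
    """Add a bias `delta` to every green-listed token's logit."""
    green = greenlist(prev_token_id, len(logits))
    return [x + delta if i in green else x for i, x in enumerate(logits)]
```

At detection time, the same seeding reproduces the green list, so a watermarked text shows a statistically high fraction of green tokens.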

Fix

  • launcher: add missing parameters to launcher
  • server: update to hf_transfer==0.1.2 to fix corrupted files issue

v0.3.1

24 Feb 12:27
4b1c972

Features

  • server: allocate full attention mask to decrease latency
  • server: enable hf-transfer for insane download speeds
  • router: add CORS options

Fix

  • server: remove position_ids from galactica forward

v0.3.0

16 Feb 16:33
c720555

Features

  • server: support t5 models
  • router: add max_total_tokens and empty_input validation
  • launcher: add the possibility to disable custom CUDA kernels
  • server: add automatic safetensors conversion
  • router: add prometheus scrape endpoint
  • server, router: add distributed tracing
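
The `max_total_tokens` and empty-input checks can be sketched as below. This is a hypothetical Python rendering of the idea; the real validation runs in the Rust router and enforces additional limits:

```python
def validate_request(inputs: str, input_len: int, max_new_tokens: int,
                     max_total_tokens: int = 2048):
    """Reject empty inputs and requests whose total token budget is too large.

    `input_len` is the tokenized prompt length; the default budget of 2048
    is illustrative, not the router's fixed value.
    """
    if not inputs.strip():
        raise ValueError("`inputs` cannot be empty")
    total = input_len + max_new_tokens
    if total > max_total_tokens:
        raise ValueError(
            f"prompt tokens + `max_new_tokens` must be <= {max_total_tokens}, "
            f"got {total}"
        )

validate_request("Hello world", input_len=2, max_new_tokens=100)  # passes
```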

Fix

  • launcher: copy current env vars to subprocesses
  • docker: add note around shared memory

v0.2.1

07 Feb 14:41
2fe5e1b

Fix

  • server: fix bug with repetition penalty when using GPUs and inference mode
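
For context, the usual (CTRL-style) repetition penalty divides positive logits of already-generated tokens by the penalty and multiplies negative ones, so repeats become less likely either way. A minimal sketch, not the server's tensor implementation:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """Penalize tokens that already appear in the generated sequence.

    Positive logits shrink (divide by penalty); negative logits grow more
    negative (multiply by penalty). `penalty=1.3` is an illustrative value.
    """
    out = list(logits)
    for tid in set(generated_ids):
        out[tid] = out[tid] / penalty if out[tid] > 0 else out[tid] * penalty
    return out

print(apply_repetition_penalty([2.6, -1.0, 0.5], [0, 1]))  # → [2.0, -1.3, 0.5]
```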

v0.2.0

03 Feb 11:56
20c3c59

Features

  • router: support token streaming using Server-Sent Events (SSE)
  • router: support seeding
  • server: support gpt-neox
  • server: support santacoder
  • server: support repetition penalty
  • server: allow the server to use a local weight cache
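
On the client side, the SSE stream is a sequence of `data: {...}` lines separated by blank lines. A minimal sketch of consuming it (the payload fields shown are illustrative, not a guarantee of the streaming response schema):

```python
import json

def parse_sse(stream_lines):
    """Decode Server-Sent Events lines into JSON payloads.

    Minimal sketch: keep only `data:` lines and parse their JSON bodies,
    ignoring comments and blank event separators.
    """
    events = []
    for line in stream_lines:
        if line.startswith("data:"):
            events.append(json.loads(line[len("data:"):].strip()))
    return events

# Illustrative event payloads, streamed one token at a time.
raw = [
    'data: {"token": {"text": "Hello"}}',
    '',
    'data: {"token": {"text": " world"}}',
    '',
]
tokens = [e["token"]["text"] for e in parse_sse(raw)]
print("".join(tokens))  # → Hello world
```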

Breaking changes

  • router: refactor Token API
  • router: modify /generate API to only return generated text

Misc

  • router: use background task to manage request queue
  • ci: docker build/push on update