Releases: huggingface/text-generation-inference

v0.5.0

11 Apr 18:32

OlivierDehaene

6f0f1d7

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Features

server: add flash-attention-based version of Llama
server: add flash-attention-based version of Santacoder
server: support OPT models
router: make router input validation optional
docker: improve layer caching

Fix

server: improve token streaming decoding
server: fix escape characters in stop sequences
router: fix NCCL desync issues
router: use buckets for metrics histograms
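
For readers unfamiliar with bucketed histograms: instead of storing every observation, a Prometheus-style histogram counts observations into fixed upper-bound buckets and exposes cumulative counts per bound. The router is written in Rust, so the Python class below only illustrates the idea; the class name and the bucket boundaries are made up for the example.

```python
import bisect


class BucketHistogram:
    """Minimal Prometheus-style histogram: counts per upper bound, exposed cumulatively."""

    def __init__(self, upper_bounds):
        # Hypothetical bucket boundaries (e.g. latencies in seconds), kept sorted.
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot for observations above the largest bound (the "+Inf" bucket).
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # bisect_left returns the first bound >= value, i.e. the "le" bucket it belongs to.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        self.total += value

    def cumulative(self):
        # Each exposed bucket is the number of observations <= its upper bound.
        labels = [str(b) for b in self.upper_bounds] + ["+Inf"]
        result, running = [], 0
        for label, count in zip(labels, self.counts):
            running += count
            result.append((label, running))
        return result


hist = BucketHistogram([0.1, 0.5, 1.0])
for latency in [0.05, 0.3, 0.3, 0.7, 2.0]:
    hist.observe(latency)
print(hist.cumulative())  # [('0.1', 1), ('0.5', 3), ('1.0', 4), ('+Inf', 5)]
```

Memory stays constant no matter how many requests are observed, at the cost of only knowing which bucket a value fell into.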

v0.4.3

30 Mar 15:29

OlivierDehaene

fef1a1c

Fix

router: fix OTLP distributed tracing initialization

v0.4.2

30 Mar 15:10

OlivierDehaene

84722f3

Features

benchmark: TUI-based benchmarking tool
router: clear cache on error
server: add mypy-protobuf
server: reduce the MLP and attention outputs in a single op for flash NeoX
image: AWS SageMaker-compatible image
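
One plausible reading of the flash NeoX reduction entry, offered as an assumption rather than a description of the actual server code: GPT-NeoX can use a parallel residual, where the attention and MLP outputs are added to the same residual stream. Under tensor parallelism each shard holds only a partial sum of both outputs, and because a sum all-reduce is linear, reducing `attn + mlp` once gives the same result as reducing each output separately, with one collective call instead of two. The sketch simulates shards with plain lists; `all_reduce` is a hypothetical stand-in, not `torch.distributed`.

```python
def all_reduce(shard_tensors):
    """Stand-in for a sum all-reduce: every shard ends up with the elementwise sum."""
    total = [sum(values) for values in zip(*shard_tensors)]
    return [list(total) for _ in shard_tensors]


def two_reductions(attn_parts, mlp_parts):
    # Reduce attention and MLP partial outputs separately: two collective calls.
    attn = all_reduce(attn_parts)[0]
    mlp = all_reduce(mlp_parts)[0]
    return [a + m for a, m in zip(attn, mlp)]


def one_reduction(attn_parts, mlp_parts):
    # Add the partial outputs locally on each shard first, then reduce once.
    summed = [[a + m for a, m in zip(attn, mlp)] for attn, mlp in zip(attn_parts, mlp_parts)]
    return all_reduce(summed)[0]


# Two hypothetical shards, each holding a partial (row-parallel) output of length 3.
attn_parts = [[1, 2, 3], [4, 5, 6]]
mlp_parts = [[10, 20, 30], [40, 50, 60]]
print(two_reductions(attn_parts, mlp_parts))  # [55, 77, 99]
print(one_reduction(attn_parts, mlp_parts))   # [55, 77, 99]
```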

Fix

server: avoid try/except to determine the kind of AutoModel
server: fix flash neox rotary embedding

v0.4.1

26 Mar 14:38

OlivierDehaene

ab5fd8c

Features

server: new, faster GPT-NeoX implementation based on flash attention

Fix

server: fix input-length discrepancy between Rust and Python tokenizers

v0.4.0

09 Mar 15:10

OlivierDehaene

411d624

Features

router: support best_of sampling
router: support left truncation
server: support typical sampling
launcher: allow local models
clients: add text-generation Python client
launcher: allow parsing num_shard from CUDA_VISIBLE_DEVICES
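
Typical sampling (from "Locally Typical Sampling" by Meister et al.) keeps the tokens whose surprisal, -log p, is closest to the entropy of the next-token distribution, until the kept tokens cover a probability mass `typical_p`. The plain-Python sketch below follows the same filtering rule as the `TypicalLogitsWarper` in transformers, but it is an illustration on a toy distribution, not the server's code; the function name is hypothetical.

```python
import math


def typical_filter(logits, typical_p):
    """Return the sorted token ids kept by locally typical filtering."""
    # Softmax with the max subtracted for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    # Entropy of the distribution, i.e. the expected surprisal.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)

    def distance(i):
        # How far this token's surprisal is from the entropy (underflowed tokens go last).
        return abs(-math.log(probs[i]) - entropy) if probs[i] > 0 else math.inf

    kept, mass = [], 0.0
    for token_id in sorted(range(len(probs)), key=distance):
        kept.append(token_id)
        mass += probs[token_id]
        # Stop once the kept tokens cover at least typical_p of the probability mass.
        if mass >= typical_p:
            break
    return sorted(kept)


print(typical_filter([10.0, 0.0, 0.0], typical_p=0.5))  # [0]
```

Sampling then proceeds only over the kept ids, after renormalizing their probabilities.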

Fix

server: do not warp prefill logits
server: fix formatting issues in generate_stream tokens
server: fix galactica batch
server: fix index out of range issue with watermarking

v0.3.2

03 Mar 17:42

OlivierDehaene

1c19b09

Features

router: add support for huggingface api-inference
server: add logits watermarking based on "A Watermark for Large Language Models"
server: use a fixed transformers commit
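
The watermark entry refers to the scheme from "A Watermark for Large Language Models" (Kirchenbauer et al.): the previous token seeds a pseudo-random split of the vocabulary, and a bias `delta` is added to the logits of the "green" fraction `gamma` before sampling. A detector can later count how often tokens land in their predecessor's green list. The sketch uses Python's `random` module for the seeded split; the seeding scheme, function names, and default values are assumptions for illustration, not TGI's implementation.

```python
import random


def green_list(previous_token, vocab_size, gamma):
    """Pseudo-randomly select a gamma fraction of the vocabulary, seeded by the previous token."""
    rng = random.Random(previous_token)  # Hypothetical seeding: seed = previous token id.
    permutation = list(range(vocab_size))
    rng.shuffle(permutation)
    return set(permutation[: int(gamma * vocab_size)])


def watermark_logits(logits, previous_token, gamma=0.5, delta=2.0):
    """Add delta to every green-list logit; the remaining logits are left unchanged."""
    green = green_list(previous_token, len(logits), gamma)
    return [x + delta if i in green else x for i, x in enumerate(logits)]


def green_fraction(tokens, vocab_size, gamma=0.5):
    """Detection side: share of tokens found in the green list of their predecessor."""
    hits = sum(
        token in green_list(prev, vocab_size, gamma)
        for prev, token in zip(tokens, tokens[1:])
    )
    return hits / max(len(tokens) - 1, 1)


print(watermark_logits([0.0] * 8, previous_token=42))
```

Unwatermarked text should land in the green list roughly a `gamma` fraction of the time, so a fraction well above `gamma` is evidence of the watermark.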

Fix

launcher: add missing parameters to launcher
server: update to hf_transfer==0.1.2 to fix corrupted files issue

v0.3.1

24 Feb 12:27

OlivierDehaene

4b1c972

Features

server: allocate full attention mask to decrease latency
server: enable hf-transfer for much faster weight downloads
router: add CORS options

Fix

server: remove position_ids from galactica forward

v0.3.0

16 Feb 16:33

OlivierDehaene

c720555

Features

server: support t5 models
router: add max_total_tokens and empty_input validation
launcher: add the possibility to disable custom CUDA kernels
server: add automatic safetensors conversion
router: add prometheus scrape endpoint
server, router: add distributed tracing
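
The `max_total_tokens` and empty-input validation can be pictured as a pre-check run before a request is queued: the prompt must not be empty, and its token count plus the requested `max_new_tokens` must fit within `max_total_tokens`. The function name, the error messages, and the use of already-tokenized ids below are assumptions for illustration; the router's actual checks are written in Rust and may differ in detail.

```python
class ValidationError(ValueError):
    """Raised when a request would be rejected before reaching the model."""


def validate_request(input_ids, max_new_tokens, max_total_tokens):
    """Hypothetical router-side check; returns the total token budget on success."""
    if len(input_ids) == 0:
        raise ValidationError("inputs must not be empty")
    if max_new_tokens <= 0:
        raise ValidationError("max_new_tokens must be strictly positive")
    total = len(input_ids) + max_new_tokens
    if total > max_total_tokens:
        raise ValidationError(
            f"input length plus max_new_tokens ({total}) exceeds max_total_tokens ({max_total_tokens})"
        )
    return total


print(validate_request([101, 2009, 2003], max_new_tokens=20, max_total_tokens=1024))  # 23
```

Rejecting oversized requests up front keeps the server from allocating memory for a sequence it could never finish.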

Fix

launcher: copy current env vars to subprocesses
docker: add note about shared memory

v0.2.1

07 Feb 14:41

OlivierDehaene

2fe5e1b

Fix

server: fix bug with repetition penalty when using GPUs and inference mode

v0.2.0

03 Feb 11:56

OlivierDehaene

20c3c59

Features

router: support token streaming using Server-Sent Events
router: support seeding
server: support gpt-neox
server: support santacoder
server: support repetition penalty
server: allow the server to use a local weight cache
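
Repetition penalty, introduced with the CTRL model and implemented in transformers as `RepetitionPenaltyLogitsProcessor`, rescales the logits of tokens that already appear in the sequence: positive logits are divided by the penalty and negative ones multiplied by it, so a penalty above 1.0 always makes a repeated token less likely. The plain-Python sketch below shows that rule; the function name is hypothetical and this is not the server's code.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty):
    """Penalize every token id that already occurs in previous_token_ids."""
    if penalty <= 0:
        raise ValueError("penalty must be strictly positive")
    penalized = list(logits)
    for token_id in set(previous_token_ids):  # each seen token is penalized once
        score = penalized[token_id]
        # With penalty > 1, dividing a positive logit or multiplying a negative one lowers it.
        penalized[token_id] = score * penalty if score < 0 else score / penalty
    return penalized


print(apply_repetition_penalty([2.0, -1.0, 0.5, 3.0], previous_token_ids=[0, 1, 1], penalty=2.0))
# [1.0, -2.0, 0.5, 3.0]
```

Branching on the sign matters: dividing a negative logit by the penalty would move it toward zero and make the repeated token more likely instead of less.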

Breaking changes

router: refactor Token API
router: modify /generate API to only return generated text

Misc

router: use background task to manage request queue
ci: docker build/push on update
