v0.4.0
Features
- router: support best_of sampling
- router: support left truncation
- server: support typical sampling
- launcher: allow local models
- clients: add text-generation Python client
- launcher: allow parsing num_shard from CUDA_VISIBLE_DEVICES
Fix
- server: do not warp prefill logits
- server: fix formatting issues in generate_stream tokens
- server: fix galactica batch
- server: fix index out of range issue with watermarking