## What's Changed
- xpu: refactor XPU worker & executor by @AlpinDale in #861
- build: add jinja2 to requirements file by @AlpinDale in #862
- attention: add `AttentionState` abstraction by @AlpinDale in #863
- xpu: disable punica kernels for XPU by @AlpinDale in #864
- executor: pipe `worker_class_fn` arg in executor by @AlpinDale in #865
- server: log the process occupying our port by @AlpinDale in #866
- feat: AWQ quantization for InternVL by @AlpinDale in #867
- Rewrite DRY sampler to be a lot faster by @50h100a in #868
- fix: ROCm build by @Naomiusearch in #817
- fix: temp_last warning being repeated for every output token by @AlpinDale in #869
- feat: add support for chunked prefill + prefix caching by @AlpinDale in #871
- async: avoid premature exit in the async generator by @AlpinDale in #872
- cpu: fix `mm_limits` initialization by @AlpinDale in #873
- spec decoding: set the draft model ctxlen to target model by @AlpinDale in #874
- sampler: pad dry sequence breakers tensor by @AlpinDale in #875
- fix: `add_generation_template` -> `add_generation_prompt` in llm by @AlpinDale in #877
- Update README.md by @NoahBPeterson in #876
- api: fix crashes under very high loads by @AlpinDale in #878
- build: pass `PYTHONPATH` from setup.py to cmake by @AlpinDale in #879
- async: disable multi-step scheduling for sync engine by @AlpinDale in #880
- api: better startup failure UX by @AlpinDale in #881
- chore: consolidate environment variables within one file by @AlpinDale in #882
- core: fix spec decode metrics and envs circular import by @AlpinDale in #889
- feat: add support for audio models by @AlpinDale in #891
- distributed: fix issue for when nodes have multiple network interfaces by @AlpinDale in #892
- rocm: fix compile issues with rocm 6.2 by @AlpinDale in #893
- build: fix invalid path for envs.py in setup by @AlpinDale in #894
- kernel: use `cub::BlockReduce` instead of custom impl by @AlpinDale in #895
- fix: Phi 3.5 Vision model loading by @AlpinDale in #896
- api: add client timeouts for the ZeroMQ server by @AlpinDale in #897
- feat: add torch.compile for GemmaRMSNorm by @AlpinDale in #898
- spec decode: add support for EAGLE by @AlpinDale in #899
- fix: `ShardedStateLoader` with fp8 quant by @AlpinDale in #900
- kernel: do not compile machete for cuda 11 and below by @AlpinDale in #901
- chore: add AphroditeParameter support for FP8 quant by @AlpinDale in #902
- spec decode: fix logprobs when using speculative decoding by @AlpinDale in #904
- api: error suppression cleanup + timeout suppression on aborts by @AlpinDale in #905
- ray: better error when placement group topology is incorrect by @AlpinDale in #906
- xpu: refactor the model runner for tensor parallelism by @AlpinDale in #910
- fix: empty prompt crashing the server by @AlpinDale in #912
- quantization: update marlin to use `AphroditeParameters` by @AlpinDale in #913
- core: add multi-step scheduling support for the synchronous engine by @AlpinDale in #914
- api: add json_schema to OpenAI server by @AlpinDale in #915
- fix: phi3v crash with unusual image sizes by @AlpinDale in #916
- feat: multi-image input support for Phi3V by @AlpinDale in #917
- spec decode: streamline batch expansion tensor manipulation by @AlpinDale in #918
- api: use fp32 for base64 embeddings by @AlpinDale in #919
- core: improve warmup times for prefix caching in block manager v2 by @AlpinDale in #920
- quants: update `qqq` and `gptq_marlin_24` to use `AphroditeParameters` by @AlpinDale in #921
- distributed: fix custom allreduce p2p cache file generation by @AlpinDale in #922
- neuron: add support for tensor parallelism by @AlpinDale in #923
- quants: update compressed tensors lifecycle to remove `prefix` from `create_weights` by @AlpinDale in #924
- feat: add async postprocessor by @AlpinDale in #925
- api: add endpoint for loading and unloading the model by @AlpinDale in #926
- feat: add single user mode by @AlpinDale in #927
- api: add inline model loading by @AlpinDale in #928
- api: support aphrodite_config.yaml with inline loading by @AlpinDale in #929
- fix: inline model loading conflicts with lora by @AlpinDale in #930
- core: do not compile for profiling by @AlpinDale in #931
- xpu: support pipeline parallel by @AlpinDale in #932
- fix: phi3v image_idx in async server by @AlpinDale in #933
- feat: add fused Marlin MoE kernel by @AlpinDale in #934
- chore: multi-image support for llava-next by @AlpinDale in #935
- model: add support for paligemma2 by @AlpinDale in #936
- vlm: stack multimodal tensors to represent multiple images within each prompt by @AlpinDale in #937
- core: do not compile ScalarType for torch < 2.4.0 by @AlpinDale in #938
- core: add virtual engine for async outproc by @AlpinDale in #939
- api: log prompt truncation by @AlpinDale in #940
- vlm: fix incompatibility between nested tensors and multi-image llava-next by @AlpinDale in #941
- vlm: fix persimmon and fuyu issues with transformers 4.45 by @AlpinDale in #942
- Fix SentencePieceTokenizer error when generating on Mistral Large 2411 with `--tokenizer-mode mistral` by @khanonnie in #943
- core: use flashinfer for FP8 KV when available by @AlpinDale in #944
- tests: update flashinfer test for #944 by @AlpinDale in #945
- quants: add triton kernels for AWQ by @AlpinDale in #946
- tests: add kernel tests for causal_conv1d and mamba_ssm by @AlpinDale in #947
- fix: do not register punica with torch if using older torch by @AlpinDale in #948
- tpu: avoid dynamo guard eval overhead by @AlpinDale in #949
- fix: issues with flashinfer fp8 kv by @AlpinDale in #950
- api: optimize zeromq frontend performance by @AlpinDale in #951
- tpu: remove torch._dynamo.reset() by @AlpinDale in #952
- vlm: fix errors on ragged NestedTensors by @AlpinDale in #953
- spec decode: match the original rank computation impl for spec decoding by @AlpinDale in #954
- core: support multi-step scheduling w/ async post-processor by @AlpinDale in #955
- Revert "fix: issues with flashinfer fp8 kv (#950)" by @AlpinDale in #956
- misc: extend cuda graph capture size for H200 by @AlpinDale in #957
- fix: gguf vocab embeddings in TP by @AlpinDale in #958
- quant: update tpu_int8 to use AphroditeParameters by @AlpinDale in #959
- neuron: support for context length and token bucketing by @AlpinDale in #960
- quant: support pre-quantized bitsandbytes checkpoints by @AlpinDale in #961
- vlm: do not allow max_model_len overflow by @AlpinDale in #962
- core: support logprobs with multi-step scheduling by @AlpinDale in #963
- ci: bump aphrodite version to 0.6.5 by @AlpinDale in #964
## New Contributors
- @NoahBPeterson made their first contribution in #876
- @khanonnie made their first contribution in #943
**Full Changelog**: v0.6.4.post1...v0.6.5