Releases · li-plus/chatglm.cpp
v0.2.8
- Metal backend support for all models (ChatGLM & ChatGLM2 & Baichuan-7B & Baichuan-13B)
- Fix GLM generation on CUDA for long context
v0.2.7
- Support Baichuan-7B model architecture (works for both Baichuan v1 & v2).
- Minor bug fixes and enhancements.
v0.2.6
- Support Baichuan-13B on CPU & CUDA backends
- Bug fixes for Windows and the Metal backend
v0.2.5
- Optimize context computation (GEMM) for the Metal backend
- Support a repetition penalty option for generation (see the sketch after this list)
- Update the Dockerfile for CPU & CUDA backends with full functionality; images are hosted on GHCR
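
As a rough illustration of how the new repetition penalty option can be used from the Python binding, here is a minimal sketch. The model path is a placeholder, and the exact keyword name (`repetition_penalty`) and the `chat` call signature are assumptions, not taken verbatim from this release.

```python
import chatglm_cpp

# Load a previously converted GGML model (the path is a placeholder).
pipeline = chatglm_cpp.Pipeline("./chatglm2-ggml.bin")

# Values > 1.0 discourage the model from repeating tokens it has already emitted.
# The keyword name is assumed here; check the binding's docs for the exact spelling.
reply = pipeline.chat(
    ["Write a short poem about the sea."],
    max_length=256,
    temperature=0.8,
    repetition_penalty=1.1,
)
print(reply)
```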
v0.2.4
- Python binding enhancement: support loading and converting directly from original Hugging Face models, so intermediate GGML model files are no longer necessary (see the sketch below).
- Small fix for the CLI demo on Windows.
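
A minimal sketch of the new load-and-convert path, assuming the `Pipeline` constructor accepts a Hugging Face model id and a quantization type; the exact keyword (`dtype` here) and the model id are placeholders, not confirmed by this release note.

```python
import chatglm_cpp

# Point the pipeline directly at the original Hugging Face repo; conversion
# happens on the fly, so no intermediate GGML .bin file needs to be kept.
# The quantization keyword ("dtype") is an assumption; consult the binding docs.
pipeline = chatglm_cpp.Pipeline("THUDM/chatglm2-6b", dtype="q4_0")

print(pipeline.chat(["Hello!"]))
```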
v0.2.3
- Windows support: enable AVX/AVX2 for better performance, fix stdout encoding issues, and support the Python binding on Windows.
- API server: support LangChain integration and an OpenAI-compatible API server (see the sketch after this list).
- New model: support CodeGeeX2 inference in native C++ and the Python binding.
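
Since the server speaks the OpenAI chat-completions protocol, any OpenAI-style client can talk to it. The sketch below uses plain `requests`; the host, port, and route prefix are assumptions about a default local deployment.

```python
import requests

# Placeholder address; start the bundled API server first and adjust as needed.
url = "http://127.0.0.1:8000/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "temperature": 0.7,
    "stream": False,
}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```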
v0.2.2
- Support MPS (Metal Performance Shaders) backend on Apple silicon devices for ChatGLM2.
- Support Volta, Turing and Ampere CUDA architectures.
v0.2.1
- 3x speedup for the CUDA implementation.
- Increase the scratch buffer size to accommodate contexts of up to 2k tokens.
v0.2.0
First release:
- Accelerated CPU inference for ChatGLM-6B and ChatGLM2-6B, enabling real-time chatting on a MacBook.
- Support int4/int5/int8 quantization, KV cache, efficient sampling, parallel computing, and streaming generation (streaming is illustrated in the sketch below).
- Python binding, web demo, and more possibilities.
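
To illustrate streaming generation from the Python binding, here is a hedged sketch; the model path is a placeholder and the `stream=True` flag on `chat` is an assumption about the binding's interface rather than something stated in this release.

```python
import chatglm_cpp

# Placeholder path to a converted GGML model.
pipeline = chatglm_cpp.Pipeline("./chatglm-ggml.bin")

# Print tokens as they are generated instead of waiting for the full reply.
# The stream=True flag is an assumption about the binding's interface.
for piece in pipeline.chat(["Tell me a joke."], stream=True):
    print(piece, end="", flush=True)
print()
```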