Releases · li-plus/chatglm.cpp
v0.4.2
- Apply flash attention to the vision encoder for lower first-token latency.
- Fix Metal compilation error on Apple Silicon chips.
v0.4.1
- Support GLM4V, the first vision-language model in the GLM series
- Fix NaN/Inf logits by rescheduling attention scaling
v0.4.0
- Allocate memory dynamically on demand to fully utilize device memory; no preset scratch or memory sizes anymore.
- Drop Baichuan/InternLM support since they have been integrated into llama.cpp.
- API changes:
  - CMake CUDA option `-DGGML_CUBLAS` was renamed to `-DGGML_CUDA` (see the build sketch below).
  - CMake CUDA architecture option `-DCUDA_ARCHITECTURES` was renamed to `-DCMAKE_CUDA_ARCHITECTURES`.
  - `num_threads` was removed from `GenerationConfig`: the optimal thread settings are now selected automatically.
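For reference, a minimal sketch of a CUDA build using the renamed options. The architecture value `86` and the `Release` config are illustrative placeholders, not defaults taken from these notes:

```sh
# Before v0.4.0 (no longer valid):
#   cmake -B build -DGGML_CUBLAS=ON -DCUDA_ARCHITECTURES="86"
# From v0.4.0 on: use the ggml CUDA flag and the standard CMake architecture variable.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
cmake --build build -j --config Release
```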
v0.3.4
- Fix regex negative lookahead for code input tokenization
- Fix OpenAI API server by using `apply_chat_template` to calculate tokens
v0.3.3
- Support ChatGLM4 conversation mode
v0.3.2
- Support P-Tuning v2 fine-tuned models for the ChatGLM family
- Fix convert.py for LoRA models & chatglm3-6b-128k
- Fix RoPE theta config for 32k/128k sequence lengths
- Better CUDA CMake script that respects the nvcc version
v0.3.1
- Support function calling in the OpenAI API server
- Faster repetition penalty sampling
- Support the `max_new_tokens` generation option
v0.3.0
- Full functionality of ChatGLM3, including system prompts, function calls, and the code interpreter
- Brand-new OpenAI-style chat API (see the request sketch after this list)
- Add token usage information to the OpenAI API server for compatibility with the LangChain frontend
- Fix conversion error for chatglm3-6b-32k
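As a rough illustration of the OpenAI-style chat API, here is a minimal request sketch assuming a locally running server; the 127.0.0.1:8000 address, the `/v1/chat/completions` route, and the `"model": "default"` placeholder follow the standard OpenAI wire format and are assumptions, not values taken from these notes:

```sh
# Assumes an OpenAI-compatible server is already running (hypothetical address and port).
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "default",
        "messages": [{"role": "user", "content": "Hello"}],
        "temperature": 0.7
      }'
```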
v0.2.10
- Support ChatGLM3 in conversation mode.
- Coming soon: a new prompt format for system messages and function calls.
v0.2.9
- Support InternLM 7B & 20B model architectures