typo in server #573

Merged: 1 commit on Jun 17, 2024

model_servers/llamacpp_python/README.md: 36 changes (18 additions, 18 deletions)
@@ -1,20 +1,20 @@
# Llamacpp_Python Model Server

The llamacpp_python model server images are based on the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) project, which provides Python bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp). This gives us a Python-based, OpenAI API compatible model server that can run LLMs of various sizes locally on Linux, Windows, or Mac.

This model server requires models to be converted from their original format, typically a set of `*.bin` or `*.safetensor` files, into a single GGUF-formatted file. Many models are already available in GGUF format on [huggingface.co](https://huggingface.co). You can also use the [model converter utility](../../convert_models/) available in this repo to convert models yourself.


## Image Options

We currently provide 3 options for the llamacpp_python model server:
* [Base](#base)
* [Cuda](#cuda)
* [Vulkan (experimental)](#vulkan-experimental)
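
If you would rather not build the images yourself, prebuilt versions are published under `quay.io/ai-lab`; these are the same images referenced by the pull commands in the collapsed portions of this diff:

```bash
podman pull quay.io/ai-lab/llamacpp_python
podman pull quay.io/ai-lab/llamacpp_python_cuda
podman pull quay.io/ai-lab/llamacpp_python_vulkan
```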

### Base

The [base image](../llamacpp_python/base/Containerfile) is the standard image that works in both arm64 and amd64 environments. However, it does not include any hardware acceleration and will run on CPU only. If you use the base image, make sure that your container runtime has sufficient resources to run the desired model(s).
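
On Mac or Windows, containers run inside the podman machine VM, so the model also has to fit in that VM's memory. A resize along these lines is often needed before running 7B models; the sizes below are illustrative, not a recommendation from this repo:

```bash
# The machine must be stopped before it can be resized.
podman machine stop
podman machine set --cpus 4 --memory 8192
podman machine start
```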

To build the base model service image:
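
The exact command is in the collapsed portion of this diff; the general shape, assuming you run it from `model_servers/llamacpp_python/` and choose your own tag, is sketched below. The Cuda and Vulkan variants later in this README follow the same pattern with their respective Containerfiles.

```bash
# Assumptions: current directory is model_servers/llamacpp_python/ and the
# local tag "llamacpp_python" matches the run commands later in this README.
podman build -t llamacpp_python -f base/Containerfile .
```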

@@ -30,7 +30,7 @@ podman pull quay.io/ai-lab/llamacpp_python

### Cuda

The [Cuda image](../llamacpp_python/cuda/Containerfile) includes all the extra drivers necessary to run our model server with Nvidia GPUs. This will significantly speed up the model's response time over CPU-only deployments.

To build the Cuda variant image:

@@ -45,9 +45,9 @@ podman pull quay.io/ai-lab/llamacpp_python_cuda

**IMPORTANT!**

To run the Cuda image with GPU acceleration, you need to install the correct [Cuda drivers](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#driver-installation) for your system along with the [Nvidia Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#). Please use the links provided to find installation instructions for your system.

Once those are installed, you can use the container toolkit CLI to discover your Nvidia device(s).
```bash
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```
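
If you want to double-check what the generated spec exposes, the toolkit can also list the CDI device names it found (output varies by system):

```bash
nvidia-ctk cdi list
```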
@@ -57,7 +57,7 @@ Finally, you will also need to add `--device nvidia.com/gpu=all` to your `podman

### Vulkan (experimental)

The [Vulkan image](../llamacpp_python/vulkan/Containerfile) is experimental, but it can be used to gain partial GPU access on an M-series Mac, significantly speeding up model response time over a CPU-only deployment. This image requires that your podman machine provider is "applehv" and that you use krunkit instead of vfkit. Since these tools are not currently supported by Podman Desktop, this image will remain "experimental".

To build the Vulkan model service variant image:

@@ -77,7 +77,7 @@ podman pull quay.io/ai-lab/llamacpp_python_vulkan
There are many models to choose from these days, most of which can be found on [huggingface.co](https://huggingface.co). In order to use a model with the llamacpp_python model server, it must be in GGUF format. You can either download pre-converted GGUF models directly or convert them yourself with the [model converter utility](../../convert_models/) available in this repo.

A performant Apache-2.0 licensed model that we recommend using if you are just getting started is
`granite-7b-lab`. You can use the link below to quickly download a quantized (smaller) GGUF version of this model for use with the llamacpp_python model server.

Download URL: [https://huggingface.co/instructlab/granite-7b-lab-GGUF/resolve/main/granite-7b-lab-Q4_K_M.gguf](https://huggingface.co/instructlab/granite-7b-lab-GGUF/resolve/main/granite-7b-lab-Q4_K_M.gguf)
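
For example, to place it in a local `models/` directory (the directory name is simply the convention used by the run commands below):

```bash
mkdir -p models
# -L follows the Hugging Face redirect to the actual file
curl -L -o models/granite-7b-lab-Q4_K_M.gguf \
  https://huggingface.co/instructlab/granite-7b-lab-GGUF/resolve/main/granite-7b-lab-Q4_K_M.gguf
```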

@@ -108,9 +108,9 @@ To deploy the LLM server you must specify a volume mount `-v` where your models
```bash
podman run --rm -it \
        -p 8001:8001 \
        -v Local/path/to/locallm/models:/locallm/models:ro \
        -e MODEL_PATH=models/granite-7b-lab-Q4_K_M.gguf \
        -e HOST=0.0.0.0 \
        -e PORT=8001 \
        -e MODEL_CHAT_FORMAT=openchat \
        llamacpp_python
```
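
Once the container is up, the server exposes an OpenAI-compatible REST API on the published port, so a quick smoke test looks like this (port and prompt are illustrative):

```bash
curl -s http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one short sentence."}]}'
```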
@@ -122,9 +122,9 @@ podman run --rm -it \

To run with GPU acceleration, add the `--device` flag described above:

```bash
podman run --rm -it \
        --device nvidia.com/gpu=all \
        -p 8001:8001 \
        -v Local/path/to/locallm/models:/locallm/models:ro \
        -e MODEL_PATH=models/granite-7b-lab-Q4_K_M.gguf \
        -e HOST=0.0.0.0 \
        -e PORT=8001 \
        -e MODEL_CHAT_FORMAT=openchat \
        llamacpp_python
```
@@ -154,7 +154,7 @@ Here is an example `models_config.json` with two model options.
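
The example itself is collapsed in this diff. As a rough sketch of llama-cpp-python's config-file format, a two-model config looks something like the following; the file names, aliases, and chat formats here are assumptions for illustration, not values taken from the repo:

```json
{
  "host": "0.0.0.0",
  "port": 8001,
  "models": [
    {
      "model": "models/granite-7b-lab-Q4_K_M.gguf",
      "model_alias": "granite",
      "chat_format": "openchat"
    },
    {
      "model": "models/merlinite-7b-lab-Q4_K_M.gguf",
      "model_alias": "merlinite",
      "chat_format": "openchat"
    }
  ]
}
```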

Now run the container with the specified config file.

```bash
podman run --rm -it -d \
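        -p 8001:8001 \
        -v Local/path/to/locallm/models:/locallm/models:ro \
        -e CONFIG_PATH=models/models_config.json \
        llamacpp_python
# Everything after the first line of this command is collapsed in the diff; the
# flags above are a sketch, and CONFIG_PATH in particular is an assumed name for
# the variable that points the server at models_config.json.
```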