llama stack distributions / templates / docker refactor (#266)
* docker compose ollama
* comment
* update compose file
* readme for distributions
* readme
* move distribution folders
* move distribution/templates to distributions/
* rename
* kill distribution/templates
* readme
* readme
* build/developer cookbook/new api provider
* developer cookbook
* readme
* readme
* [bugfix] fix case for agent when memory bank registered without specifying provider_id (#264)
* fix case where memory bank is registered without provider_id
* memory test
* agents unit test
* Add an option to not use elastic agents for meta-reference inference (#269)
* Allow overriding checkpoint_dir via config
* Small rename
* Make all methods `async def` again; add completion() for meta-reference (#270)

  PR #201 had made several changes while trying to fix issues with getting the stream=False branches of inference and agents API working. As part of this, it made a change which was slightly gratuitous. Namely, making chat_completion() and brethren "def" instead of "async def". The rationale was that this allowed the user (within llama-stack) of this to use it as:

  ```
  async for chunk in api.chat_completion(params)
  ```

  However, it causes unnecessary confusion for several folks. Given that clients (e.g., llama-stack-apps) anyway use the SDK methods (which are completely isolated), this choice was not ideal. Let's revert back so the call now looks like:

  ```
  async for chunk in await api.chat_completion(params)
  ```

  Bonus: Added a completion() implementation for the meta-reference provider. Technically this should have been another PR :)

* Improve an important error message
* update ollama for llama-guard3
* Add vLLM inference provider for OpenAI compatible vLLM server (#178)

  This PR adds a vLLM inference provider for an OpenAI compatible vLLM server.

* Create .readthedocs.yaml

  Trying out readthedocs

* Update event_logger.py (#275)

  spelling error

* vllm
* build templates
* delete templates
* tmp add back build to avoid merge conflicts
* vllm
* vllm

---------

Co-authored-by: Ashwin Bharambe <[email protected]>
Co-authored-by: Ashwin Bharambe <[email protected]>
Co-authored-by: Yuan Tang <[email protected]>
Co-authored-by: raghotham <[email protected]>
Co-authored-by: nehal-a2z <[email protected]>
1 parent c995219 · commit 23210e8
Showing 32 changed files with 850 additions and 335 deletions.
@@ -0,0 +1,11 @@
# Llama Stack Distribution

A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix-and-match providers -- some could be backed by local code and some could be remote. As a hobbyist, you can serve a small model locally while choosing a cloud provider for a large model. Regardless, the higher level APIs your app needs to work with don't need to change at all. You can even imagine moving across the server / mobile-device boundary as well, always using the same uniform set of APIs for developing Generative AI applications.

## Quick Start Llama Stack Distributions Guide

| **Distribution** | **Llama Stack Docker** | **Start This Distribution** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|:----------------: |:------------------------------------------: |:-----------------------: |:------------------: |:------------------: |:------------------: |:------------------: |:------------------: |
| Meta Reference | llamastack/distribution-meta-reference-gpu | [Guide](./meta-reference-gpu/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Ollama | llamastack/distribution-ollama | [Guide](./ollama/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| TGI | llamastack/distribution-tgi | [Guide](./tgi/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
...gs/local-bedrock-conda-example-build.yaml → distributions/bedrock/build.yaml (2 changes: 1 addition & 1 deletion)

.../build_configs/local-fireworks-build.yaml → distributions/fireworks/build.yaml (2 changes: 1 addition & 1 deletion)

...uild_configs/local-hf-endpoint-build.yaml → distributions/hf-endpoint/build.yaml (2 changes: 1 addition & 1 deletion)

...ld_configs/local-hf-serverless-build.yaml → distributions/hf-serverless/build.yaml (2 changes: 1 addition & 1 deletion)
@@ -0,0 +1,33 @@
# Meta Reference Distribution

The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | meta-reference | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |

### Start the Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU for running inference.

> [!NOTE]
> For GPU inference, you need to set an environment variable pointing to the local directory that contains your model checkpoints, and enable GPU access when starting the docker container.

```
export LLAMA_CHECKPOINT_DIR=~/.llama
```

> [!NOTE]
> `~/.llama` should be the path containing the downloaded weights of the Llama models.

To download and start a pre-built docker container, you may use the following command:

```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama --gpus=all llamastack/llamastack-local-gpu
```
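If you exported `LLAMA_CHECKPOINT_DIR` above and want to mount that directory instead of hard-coding `~/.llama`, a sketch of the same invocation using the variable (assuming it points at your checkpoint directory) would be:

```
# same command as above, but mounting the exported checkpoint directory
docker run -it -p 5000:5000 -v $LLAMA_CHECKPOINT_DIR:/root/.llama --gpus=all llamastack/llamastack-local-gpu
```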

### Alternative (Build and start distribution locally via conda)

- You may check out the [Getting Started](../../docs/getting_started.md) guide for more details on starting up a meta-reference distribution.
...build_configs/local-gpu-docker-build.yaml → distributions/meta-reference-gpu/build.yaml (7 changes: 5 additions & 2 deletions)

@@ -0,0 +1,91 @@
# Ollama Distribution

The `llamastack/distribution-ollama` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |---------------- |---------------- |---------------------------------- |---------------- |---------------- |
| **Provider(s)** | remote::ollama | meta-reference | remote::pgvector, remote::chroma | remote::ollama | meta-reference |

### Start a Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU to start an Ollama server with GPU support.

```
$ cd llama-stack/distribution/ollama/gpu
$ ls
compose.yaml run.yaml
$ docker compose up
```

You will see output similar to the following:

```
[ollama] | [GIN] 2024/10/18 - 21:19:41 | 200 | 226.841µs | ::1 | GET "/api/ps"
[ollama] | [GIN] 2024/10/18 - 21:19:42 | 200 | 60.908µs | ::1 | GET "/api/ps"
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
[llamastack] | Resolved 12 providers
[llamastack] | inner-inference => ollama0
[llamastack] | models => __routing_table__
[llamastack] | inference => __autorouted__
```

To stop the server:
```
docker compose down
```

### Start the Distribution (Single Node CPU)

> [!NOTE]
> This will start an Ollama server in CPU-only mode; please see the [Ollama documentation](https://github.com/ollama/ollama) for details on serving models on CPU.

```
$ cd llama-stack/distribution/ollama/cpu
$ ls
compose.yaml run.yaml
$ docker compose up
```

### (Alternative) ollama run + llama stack run

If you wish to spin up an Ollama server separately and connect it with Llama Stack, you may use the following commands.

#### Start the Ollama server
- Please check the [Ollama documentation](https://github.com/ollama/ollama) for more details.

**Via Docker**
```
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

**Via CLI**
```
ollama run <model_id>
```
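As a concrete illustration, assuming you want to serve a Llama 3.1 8B model and that the corresponding tag is available in your Ollama installation (the tag below is only an illustrative assumption, not part of this distribution), the CLI invocation would look like:

```
# pulls the model if it is not already present, then starts serving it (tag is illustrative)
ollama run llama3.1:8b
```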

#### Start the Llama Stack server pointing to the Ollama server

**Via Docker**
```
docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./ollama-run.yaml:/root/llamastack-run-ollama.yaml --gpus=all llamastack-local-cpu --yaml_config /root/llamastack-run-ollama.yaml
```

Make sure that in your `ollama-run.yaml` file, your inference provider is pointing to the correct Ollama endpoint, e.g.:
```
inference:
  - provider_id: ollama0
    provider_type: remote::ollama
    config:
      url: http://127.0.0.1:14343
```
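Before starting the stack, you can sanity-check that the configured URL actually reaches a running Ollama server; the `/api/ps` route seen in the server logs above should respond. The host and port below simply mirror the example config and are an assumption about your setup:

```
# expect a JSON list of running models if the endpoint is reachable
curl http://127.0.0.1:14343/api/ps
```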

**Via Conda**

```
llama stack build --config ./build.yaml
llama stack run ./gpu/run.yaml
```
@@ -0,0 +1,13 @@
name: distribution-ollama
distribution_spec:
  description: Use ollama for running LLM inference
  providers:
    inference: remote::ollama
    memory:
    - meta-reference
    - remote::chromadb
    - remote::pgvector
    safety: meta-reference
    agents: meta-reference
    telemetry: meta-reference
image_type: conda
@@ -0,0 +1,30 @@
services:
  ollama:
    image: ollama/ollama:latest
    network_mode: "host"
    volumes:
      - ollama:/root/.ollama # this solution synchronizes with the docker volume and loads the model rocket fast
    ports:
      - "11434:11434"
    command: []
  llamastack:
    depends_on:
      - ollama
    image: llamastack/llamastack-local-cpu
    network_mode: "host"
    volumes:
      - ~/.llama:/root/.llama
      # Link to ollama run.yaml file
      - ./run.yaml:/root/my-run.yaml
    ports:
      - "5000:5000"
    # Hack: wait for ollama server to start before starting docker
    entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
    deploy:
      restart_policy:
        condition: on-failure
        delay: 3s
        max_attempts: 5
        window: 60s
volumes:
  ollama:
@@ -0,0 +1,46 @@
version: '2'
built_at: '2024-10-08T17:40:45.325529'
image_name: local
docker_image: null
conda_env: local
apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
providers:
  inference:
  - provider_id: ollama0
    provider_type: remote::ollama
    config:
      url: http://127.0.0.1:14343
  safety:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      llama_guard_shield:
        model: Llama-Guard-3-1B
        excluded_categories: []
        disable_input_check: false
        disable_output_check: false
      prompt_guard_shield:
        model: Prompt-Guard-86M
  memory:
  - provider_id: meta0
    provider_type: meta-reference
    config: {}
  agents:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      persistence_store:
        namespace: null
        type: sqlite
        db_path: ~/.llama/runtime/kvstore.db
  telemetry:
  - provider_id: meta0
    provider_type: meta-reference
    config: {}
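For reference, this run config is the kind of file the compose services above and below hand to the server process via `--yaml_config`; a sketch of running the server directly against it, outside of docker, would look like this (the relative path is an assumption about where you saved the file):

```
# start the Llama Stack server against this run config (path is an assumption)
python -m llama_stack.distribution.server.server --yaml_config ./run.yaml
```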
@@ -0,0 +1,48 @@
services:
  ollama:
    image: ollama/ollama:latest
    network_mode: "host"
    volumes:
      - ollama:/root/.ollama # this solution synchronizes with the docker volume and loads the model rocket fast
    ports:
      - "11434:11434"
    devices:
      - nvidia.com/gpu=all
    environment:
      - CUDA_VISIBLE_DEVICES=0
    command: []
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              # that's the closest analogue to --gpus; provide
              # an integer amount of devices or 'all'
              count: 1
              # Devices are reserved using a list of capabilities, making
              # capabilities the only required field. A device MUST
              # satisfy all the requested capabilities for a successful
              # reservation.
              capabilities: [gpu]
    runtime: nvidia
  llamastack-local-cpu:
    depends_on:
      - ollama
    image: llamastack/llamastack-local-cpu
    network_mode: "host"
    volumes:
      - ~/.llama:/root/.llama
      # Link to ollama run.yaml file
      - ./ollama-run.yaml:/root/llamastack-run-ollama.yaml
    ports:
      - "5000:5000"
    # Hack: wait for ollama server to start before starting docker
    entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-ollama.yaml"
    deploy:
      restart_policy:
        condition: on-failure
        delay: 3s
        max_attempts: 5
        window: 60s
volumes:
  ollama: