llama stack distributions / templates / docker refactor (#266)
* docker compose ollama

* comment

* update compose file

* readme for distributions

* readme

* move distribution folders

* move distribution/templates to distributions/

* rename

* kill distribution/templates

* readme

* readme

* build/developer cookbook/new api provider

* developer cookbook

* readme

* readme

* [bugfix] fix case for agent when memory bank registered without specifying provider_id (#264)

* fix case where memory bank is registered without provider_id

* memory test

* agents unit test
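
As background, a hedged sketch of the kind of fallback such a fix usually involves (illustrative names only, not the actual llama-stack routing code): when no `provider_id` is given, an unambiguous single-provider setup can be resolved automatically.

```
from typing import Optional

# Illustrative helper; not llama-stack's actual registration logic.
def resolve_provider_id(provider_id: Optional[str], registered: list[str]) -> str:
    if provider_id is not None:
        if provider_id not in registered:
            raise ValueError(f"unknown provider_id: {provider_id}")
        return provider_id
    if len(registered) == 1:
        # Only one provider is registered, so the choice is unambiguous.
        return registered[0]
    raise ValueError(f"provider_id required; multiple providers registered: {registered}")

# A memory bank registered without provider_id resolves to the sole provider.
print(resolve_provider_id(None, ["meta0"]))  # meta0
```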

* Add an option to not use elastic agents for meta-reference inference (#269)

* Allow overriding checkpoint_dir via config

* Small rename

* Make all methods `async def` again; add completion() for meta-reference (#270)

PR #201 made several changes while trying to fix issues with getting the stream=False branches of the inference and agents APIs working. As part of this, it made a change that was slightly gratuitous: namely, making chat_completion() and its brethren plain "def" instead of "async def".

The rationale was that this allowed callers within llama-stack to use it as:

```
async for chunk in api.chat_completion(params)
```

However, it caused unnecessary confusion for several folks. Given that clients (e.g., llama-stack-apps) use the SDK methods (which are completely isolated) anyway, this choice was not ideal. Let's revert so the call now looks like:

```
async for chunk in await api.chat_completion(params)
```
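
For illustration, a minimal self-contained sketch of why the extra `await` is needed (the names below are hypothetical stand-ins, not the actual llama-stack signatures):

```
import asyncio
from typing import AsyncIterator

# Hypothetical stand-ins for illustration; not the actual llama-stack types.
class Chunk:
    def __init__(self, text: str) -> None:
        self.text = text

async def chat_completion(prompt: str) -> AsyncIterator[Chunk]:
    # An `async def` whose awaited result is an async iterator of chunks.
    async def _stream() -> AsyncIterator[Chunk]:
        for word in prompt.split():
            yield Chunk(word)
    return _stream()

async def main() -> None:
    # Two steps: `await` resolves the coroutine, `async for` consumes the stream.
    async for chunk in await chat_completion("hello llama stack"):
        print(chunk.text)

asyncio.run(main())
```

Had `chat_completion()` stayed a plain `def` returning the iterator directly, the single-step `async for chunk in api.chat_completion(params)` form shown above would apply instead.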

Bonus: Added a completion() implementation for the meta-reference provider. Technically this should have been another PR :)

* Improve an important error message

* update ollama for llama-guard3

* Add vLLM inference provider for OpenAI compatible vLLM server (#178)

This PR adds a vLLM inference provider for OpenAI-compatible vLLM servers.
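
As a rough, hedged sketch of what "OpenAI compatible" means here, a client can point the standard `openai` Python package at a locally running vLLM server. The base URL, port, and model name below are assumptions for illustration, and this is not the llama-stack provider code itself:

```
# Sketch only: query an OpenAI-compatible vLLM server directly.
# Assumes a server started with something like:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # vLLM's OpenAI-compatible endpoint (assumed port)
    api_key="not-needed-for-local-vllm",   # typically ignored by a local vLLM server
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```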

* Create .readthedocs.yaml

Trying out readthedocs

* Update event_logger.py (#275)

spelling error

* vllm

* build templates

* delete templates

* tmp add back build to avoid merge conflicts

* vllm

* vllm

---------

Co-authored-by: Ashwin Bharambe <[email protected]>
Co-authored-by: Ashwin Bharambe <[email protected]>
Co-authored-by: Yuan Tang <[email protected]>
Co-authored-by: raghotham <[email protected]>
Co-authored-by: nehal-a2z <[email protected]>
6 people authored Oct 21, 2024
1 parent c995219 commit 23210e8
Showing 32 changed files with 850 additions and 335 deletions.
11 changes: 11 additions & 0 deletions distributions/README.md
@@ -0,0 +1,11 @@
# Llama Stack Distribution

A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix-and-match providers -- some could be backed by local code and some could be remote. As a hobbyist, you can serve a small model locally but choose a cloud provider for a large model. Regardless, the higher-level APIs your app works with don't need to change at all. You can even imagine moving across the server / mobile-device boundary while always using the same uniform set of APIs for developing Generative AI applications.
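
Purely as an illustrative sketch (plain Python, not llama-stack code), the point is that application code targets a fixed API while the provider behind it can be swapped without any application changes:

```
from typing import Protocol

# Illustrative only: a stand-in for one high-level API with swappable providers.
class Inference(Protocol):
    def chat_completion(self, prompt: str) -> str: ...

class LocalProvider:
    def chat_completion(self, prompt: str) -> str:
        return f"[local small model] {prompt}"

class RemoteProvider:
    def chat_completion(self, prompt: str) -> str:
        return f"[cloud-hosted large model] {prompt}"

def app(inference: Inference) -> None:
    # The application sees only the Inference API, never the provider behind it.
    print(inference.chat_completion("Hello!"))

app(LocalProvider())   # hobbyist setup: small model served locally
app(RemoteProvider())  # swap in a cloud provider; app code is unchanged
```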


## Quick Start Llama Stack Distributions Guide
| **Distribution** | **Llama Stack Docker** | Start This Distribution | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|:----------------: |:------------------------------------------: |:-----------------------: |:------------------: |:------------------: |:------------------: |:------------------: |:------------------: |
| Meta Reference | llamastack/distribution-meta-reference-gpu | [Guide](./meta-reference-gpu/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Ollama | llamastack/distribution-ollama | [Guide](./ollama/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| TGI | llamastack/distribution-tgi | [Guide](./tgi/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
@@ -1,4 +1,4 @@
name: local-bedrock-conda-example
name: bedrock
distribution_spec:
description: Use Amazon Bedrock APIs.
providers:
@@ -1,4 +1,4 @@
name: local-databricks
name: databricks
distribution_spec:
description: Use Databricks for running LLM inference
providers:
@@ -7,4 +7,4 @@ distribution_spec:
safety: meta-reference
agents: meta-reference
telemetry: meta-reference
image_type: conda
image_type: conda
@@ -1,4 +1,4 @@
name: local-fireworks
name: fireworks
distribution_spec:
description: Use Fireworks.ai for running LLM inference
providers:
@@ -1,4 +1,4 @@
name: local-hf-endpoint
name: hf-endpoint
distribution_spec:
description: "Like local, but use Hugging Face Inference Endpoints for running LLM inference.\nSee https://hf.co/docs/api-endpoints."
providers:
@@ -1,4 +1,4 @@
name: local-hf-serverless
name: hf-serverless
distribution_spec:
description: "Like local, but use Hugging Face Inference API (serverless) for running LLM inference.\nSee https://hf.co/docs/api-inference."
providers:
33 changes: 33 additions & 0 deletions distributions/meta-reference-gpu/README.md
@@ -0,0 +1,33 @@
# Meta Reference Distribution

The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations.


| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | meta-reference | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |


### Start the Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU to run this distribution's meta-reference inference server.

> [!NOTE]
> For GPU inference, you need to set the following environment variable to specify the local directory containing your model checkpoints and to enable GPU inference when starting the Docker container.
```
export LLAMA_CHECKPOINT_DIR=~/.llama
```

> [!NOTE]
> `~/.llama` should be the path containing the downloaded weights of the Llama models.

To download and run a pre-built Docker container, you may use the following command:

```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama --gpus=all llamastack/llamastack-local-gpu
```

### Alternative (Build and start distribution locally via conda)
- You may check out the [Getting Started](../../docs/getting_started.md) guide for more details on starting up a meta-reference distribution.
@@ -1,9 +1,12 @@
name: local-gpu
name: distribution-meta-reference-gpu
distribution_spec:
description: Use code from `llama_stack` itself to serve all llama stack APIs
providers:
inference: meta-reference
memory: meta-reference
memory:
- meta-reference
- remote::chromadb
- remote::pgvector
safety: meta-reference
agents: meta-reference
telemetry: meta-reference
@@ -13,7 +13,7 @@ apis:
- safety
providers:
inference:
- provider_id: meta-reference
- provider_id: meta0
provider_type: meta-reference
config:
model: Llama3.1-8B-Instruct
Expand All @@ -22,7 +22,7 @@ providers:
max_seq_len: 4096
max_batch_size: 1
safety:
- provider_id: meta-reference
- provider_id: meta0
provider_type: meta-reference
config:
llama_guard_shield:
Expand All @@ -33,18 +33,18 @@ providers:
prompt_guard_shield:
model: Prompt-Guard-86M
memory:
- provider_id: meta-reference
- provider_id: meta0
provider_type: meta-reference
config: {}
agents:
- provider_id: meta-reference
- provider_id: meta0
provider_type: meta-reference
config:
persistence_store:
namespace: null
type: sqlite
db_path: ~/.llama/runtime/kvstore.db
telemetry:
- provider_id: meta-reference
- provider_id: meta0
provider_type: meta-reference
config: {}
91 changes: 91 additions & 0 deletions distributions/ollama/README.md
@@ -0,0 +1,91 @@
# Ollama Distribution

The `llamastack/distribution-ollama` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |---------------- |---------------- |---------------------------------- |---------------- |---------------- |
| **Provider(s)** | remote::ollama | meta-reference | remote::pgvector, remote::chroma | remote::ollama | meta-reference |


### Start a Distribution (Single Node GPU)

> [!NOTE]
> This assumes you have access to a GPU to start an Ollama server with GPU access.
```
$ cd llama-stack/distributions/ollama/gpu
$ ls
compose.yaml run.yaml
$ docker compose up
```

You will see output similar to the following:
```
[ollama] | [GIN] 2024/10/18 - 21:19:41 | 200 | 226.841µs | ::1 | GET "/api/ps"
[ollama] | [GIN] 2024/10/18 - 21:19:42 | 200 | 60.908µs | ::1 | GET "/api/ps"
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
[llamastack] | Resolved 12 providers
[llamastack] | inner-inference => ollama0
[llamastack] | models => __routing_table__
[llamastack] | inference => __autorouted__
```

To stop the server:
```
docker compose down
```

### Start the Distribution (Single Node CPU)

> [!NOTE]
> This will start an Ollama server in CPU-only mode. Please see the [Ollama documentation](https://github.com/ollama/ollama) for details on serving models on CPU.
```
$ cd llama-stack/distributions/ollama/cpu
$ ls
compose.yaml run.yaml
$ docker compose up
```

### (Alternative) ollama run + llama stack run

If you wish to separately spin up an Ollama server and connect it with Llama Stack, you may use the following commands.

#### Start Ollama server
- Please check the [Ollama documentation](https://github.com/ollama/ollama) for more details.

**Via Docker**
```
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

**Via CLI**
```
ollama run <model_id>
```

#### Start Llama Stack server pointing to Ollama server

**Via Docker**
```
docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./ollama-run.yaml:/root/llamastack-run-ollama.yaml --gpus=all llamastack/llamastack-local-cpu --yaml_config /root/llamastack-run-ollama.yaml
```

Make sure that in your `ollama-run.yaml` file, your inference provider is pointing to the correct Ollama endpoint, e.g.:
```
inference:
- provider_id: ollama0
provider_type: remote::ollama
config:
url: http://127.0.0.1:11434
```

**Via Conda**

```
llama stack build --config ./build.yaml
llama stack run ./gpu/run.yaml
```
13 changes: 13 additions & 0 deletions distributions/ollama/build.yaml
@@ -0,0 +1,13 @@
name: distribution-ollama
distribution_spec:
description: Use ollama for running LLM inference
providers:
inference: remote::ollama
memory:
- meta-reference
- remote::chromadb
- remote::pgvector
safety: meta-reference
agents: meta-reference
telemetry: meta-reference
image_type: conda
30 changes: 30 additions & 0 deletions distributions/ollama/cpu/compose.yaml
@@ -0,0 +1,30 @@
services:
ollama:
image: ollama/ollama:latest
network_mode: "host"
volumes:
- ollama:/root/.ollama # this solution synchronizes with the docker volume and loads the model rocket fast
ports:
- "11434:11434"
command: []
llamastack:
depends_on:
- ollama
image: llamastack/llamastack-local-cpu
network_mode: "host"
volumes:
- ~/.llama:/root/.llama
# Link to ollama run.yaml file
- ./run.yaml:/root/my-run.yaml
ports:
- "5000:5000"
# Hack: wait for the ollama server to start before starting the llamastack container
entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
deploy:
restart_policy:
condition: on-failure
delay: 3s
max_attempts: 5
window: 60s
volumes:
ollama:
46 changes: 46 additions & 0 deletions distributions/ollama/cpu/run.yaml
@@ -0,0 +1,46 @@
version: '2'
built_at: '2024-10-08T17:40:45.325529'
image_name: local
docker_image: null
conda_env: local
apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
providers:
inference:
- provider_id: ollama0
provider_type: remote::ollama
config:
url: http://127.0.0.1:11434
safety:
- provider_id: meta0
provider_type: meta-reference
config:
llama_guard_shield:
model: Llama-Guard-3-1B
excluded_categories: []
disable_input_check: false
disable_output_check: false
prompt_guard_shield:
model: Prompt-Guard-86M
memory:
- provider_id: meta0
provider_type: meta-reference
config: {}
agents:
- provider_id: meta0
provider_type: meta-reference
config:
persistence_store:
namespace: null
type: sqlite
db_path: ~/.llama/runtime/kvstore.db
telemetry:
- provider_id: meta0
provider_type: meta-reference
config: {}
48 changes: 48 additions & 0 deletions distributions/ollama/gpu/compose.yaml
@@ -0,0 +1,48 @@
services:
ollama:
image: ollama/ollama:latest
network_mode: "host"
volumes:
- ollama:/root/.ollama # this solution synchronizes with the docker volume and loads the model rocket fast
ports:
- "11434:11434"
devices:
- nvidia.com/gpu=all
environment:
- CUDA_VISIBLE_DEVICES=0
command: []
deploy:
resources:
reservations:
devices:
- driver: nvidia
# that's the closest analogue to --gpus; provide
# an integer amount of devices or 'all'
count: 1
# Devices are reserved using a list of capabilities, making
# capabilities the only required field. A device MUST
# satisfy all the requested capabilities for a successful
# reservation.
capabilities: [gpu]
runtime: nvidia
llamastack-local-cpu:
depends_on:
- ollama
image: llamastack/llamastack-local-cpu
network_mode: "host"
volumes:
- ~/.llama:/root/.llama
# Link to ollama run.yaml file
- ./ollama-run.yaml:/root/llamastack-run-ollama.yaml
ports:
- "5000:5000"
# Hack: wait for the ollama server to start before starting the llamastack container
entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-ollama.yaml"
deploy:
restart_policy:
condition: on-failure
delay: 3s
max_attempts: 5
window: 60s
volumes:
ollama: