[#391] Add support for json structured output for vLLM #528
base: main
Conversation
Force-pushed from 3cc464e to 1801aa1
if fmt.type == ResponseFormatType.json_schema.value:
    input_dict["extra_body"] = {
        "guided_json": request.response_format.json_schema
    }
Note that using

input_dict["response_format"] = {
    "type": "json_schema",
    "json_schema": {
        "name": "example_name",
        "strict": True,
        "schema": request.response_format.json_schema
    }
}

as in the usual OpenAI API doesn't work. I've written a more detailed explanation here.
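For reference, a minimal sketch of what this change amounts to on the wire, assuming a local vLLM OpenAI-compatible server and the openai Python client (the endpoint URL, model name, prompt, and schema below are illustrative assumptions, not taken from this PR):

from openai import OpenAI

# Assumed local vLLM OpenAI-compatible endpoint; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

schema = {
    "type": "object",
    "properties": {"recipe_name": {"type": "string"}},
    "required": ["recipe_name"],
}

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",  # assumed model name
    messages=[{"role": "user", "content": "Name one pasta dish as JSON."}],
    # vLLM-specific: guided decoding parameters go in extra_body,
    # not in the standard response_format field.
    extra_body={"guided_json": schema},
)
print(response.choices[0].message.content)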
Script to play around with:

import os
from llama_stack_client import LlamaStackClient
from pydantic import BaseModel
client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")
class CompletionMessage(BaseModel):
    recipe_name: str
    preamble: str
    ingredients: list[str]
    steps: list[str]

response = client.inference.chat_completion(
    model_id=os.environ["INFERENCE_MODEL"],
    messages=[
        {
            "role": "system",
            "content": "You are a chef, passionate about educating the world about delicious home cooked meals.",
        },
        {
            "role": "user",
            "content": "Give me a recipe for spaghetti bolognaise. Start with the recipe name, a preamble describing your childhood stories about spaghetti bolognaise, an ingredients list, and then the recipe steps.",
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": CompletionMessage.model_json_schema(),
    },
    sampling_params={"max_tokens": 8000},
)
print(response.completion_message.content)

(Outputs the example in the PR description.)
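A natural follow-up, not part of the original script, is to validate the returned JSON against the same Pydantic model:

# Hypothetical extra step: parse the model output back into the schema class.
recipe = CompletionMessage.model_validate_json(response.completion_message.content)
print(recipe.recipe_name)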
    raise NotImplementedError("Grammar response format not supported yet")
else:
    raise ValueError(f"Unknown response format {fmt.type}")
return {
    "model": request.model,
    **input_dict,
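Piecing the hunks above together, the request-building logic added by this PR looks roughly like the sketch below. The surrounding function name, the outer response_format check, and the grammar branch's condition are assumptions inferred from the visible fragments, not the actual adapter code:

# Simplified reconstruction of the diff hunks above; illustrative only.
# ResponseFormatType is llama-stack's enum of response format types
# (import path assumed, omitted here).
def _build_input_dict(request) -> dict:
    input_dict = {}
    if request.response_format:
        fmt = request.response_format
        if fmt.type == ResponseFormatType.json_schema.value:
            # vLLM's OpenAI-compatible server takes guided decoding
            # parameters via extra_body rather than response_format.
            input_dict["extra_body"] = {
                "guided_json": request.response_format.json_schema
            }
        elif fmt.type == ResponseFormatType.grammar.value:
            raise NotImplementedError("Grammar response format not supported yet")
        else:
            raise ValueError(f"Unknown response format {fmt.type}")
    return {"model": request.model, **input_dict}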
Prompt adaptation is also working. This prompt:
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about coding"},
Becomes:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Write a haiku about coding<|eot_id|><|start_header_id|>user<|end_header_id|>
Please respond in JSON format with the schema: {properties: {content: {title: Content, type: string}, additional_info: {title: Additional Info, type: string}}, required: [content, additional_info], title: CompletionMessage, type: object}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
That is, this instruction gets appended: Please respond in JSON format with the schema: {properties: {content: {title: Content, type: string}, additional_info: {title: Additional Info, type: string}}, required: [content, additional_info], title: CompletionMessage, type: object}
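As a rough illustration of that adaptation (the helper name is hypothetical and the exact serialization in llama-stack's prompt adapter differs, e.g. in how the schema dict is stringified), the idea is to append a schema instruction as an extra user turn:

import json

# Illustrative only: append a schema instruction as an additional user message.
def add_schema_instruction(messages: list[dict], schema: dict) -> list[dict]:
    instruction = f"Please respond in JSON format with the schema: {json.dumps(schema)}"
    return messages + [{"role": "user", "content": instruction}]

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about coding"},
]
schema = {
    "properties": {
        "content": {"title": "Content", "type": "string"},
        "additional_info": {"title": "Additional Info", "type": "string"},
    },
    "required": ["content", "additional_info"],
    "title": "CompletionMessage",
    "type": "object",
}
print(add_schema_instruction(messages, schema)[-1]["content"])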
What does this PR do?
Addresses issue (#391)
Generated with the Llama-3.2-3B-Instruct model - pretty good for a 3B parameter model 👍
Test Plan
pytest -v -s llama_stack/providers/tests/inference/test_text_inference.py -k llama_3b-vllm_remote
With the following setup:
Results:
Sources
Before submitting
[N/A] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case)
Pull Request section?
[N/A?] Updated relevant documentation. Couldn't find any relevant documentation. Lmk if I've missed anything.