[#391] Add support for json structured output for vLLM #528

Open

aidando73 wants to merge 1 commit into main from aidand-391-guided-decoding-vllm_3

Conversation

@aidando73 commented Nov 26, 2024

What does this PR do?

Addresses issue (#391)

  • Adds json structured output for vLLM
  • Enables structured output tests for vLLM

Give me a recipe for Spaghetti Bolognaise:

{
  "recipe_name": "Spaghetti Bolognaise",
  "preamble": "Ah, spaghetti bolognaise - the quintessential Italian dish that fills my kitchen with the aromas of childhood nostalgia. As a child, I would watch my nonna cook up a big pot of spaghetti bolognaise every Sunday, filling our small Italian household with the savory scent of simmering meat and tomatoes. The way the sauce would thicken and the spaghetti would al dente - it was love at first bite. And now, as a chef, I want to share that same love with you, so you can recreate these warm, comforting memories at home.",
  "ingredients": [
    "500g minced beef",
    "1 medium onion, finely chopped",
    "2 cloves garlic, minced",
    "1 carrot, finely chopped",
    " celery, finely chopped",
    "1 (28 oz) can whole peeled tomatoes",
    "1 tbsp tomato paste",
    "1 tsp dried basil",
    "1 tsp dried oregano",
    "1 tsp salt",
    "1/2 tsp black pepper",
    "1/2 tsp sugar",
    "1 lb spaghetti",
    "Grated Parmesan cheese, for serving",
    "Extra virgin olive oil, for serving"
  ],
  "steps": [
    "Heat a large pot over medium heat and add a generous drizzle of extra virgin olive oil.",
    "Add the chopped onion, garlic, carrot, and celery and cook until the vegetables are soft and translucent, about 5-7 minutes.",
    "Add the minced beef and cook until browned, breaking it up with a spoon as it cooks.",
    "Add the tomato paste and cook for 1-2 minutes, stirring constantly.",
    "Add the canned tomatoes, dried basil, dried oregano, salt, black pepper, and sugar. Stir well to combine.",
    "Bring the sauce to a simmer and let it cook for 20-30 minutes, stirring occasionally, until the sauce has thickened and the flavors have melded together.",
    "While the sauce cooks, bring a large pot of salted water to a boil and cook the spaghetti according to the package instructions until al dente. Reserve 1 cup of pasta water before draining the spaghetti.",
    "Add the reserved pasta water to the sauce and stir to combine.",
    "Combine the cooked spaghetti and sauce, tossing to coat the pasta evenly.",
    "Serve hot, topped with grated Parmesan cheese and a drizzle of extra virgin olive oil.",
    "Enjoy!"
  ]
}

Generated with the Llama-3.2-3B-Instruct model - pretty good for a 3B parameter model 👍

Test Plan

pytest -v -s llama_stack/providers/tests/inference/test_text_inference.py -k llama_3b-vllm_remote

With the following setup:

# Environment
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export INFERENCE_PORT=8000
export VLLM_URL=http://localhost:8000/v1

# vLLM server
sudo docker run --gpus all \
    -v $STORAGE_DIR/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$(cat ~/.cache/huggingface/token)" \
    -p 8000:$INFERENCE_PORT \
    --ipc=host \
    --net=host \
    vllm/vllm-openai:v0.6.3.post1 \
    --model $INFERENCE_MODEL

# llama-stack server
llama stack build --template remote-vllm --image-type conda && llama stack run distributions/remote-vllm/run.yaml \
  --port 5001 \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct

Results:

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_3b-vllm_remote] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completions_structured_output[llama_3b-vllm_remote] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_3b-vllm_remote] PASSED

================================ 6 passed, 2 skipped, 120 deselected, 2 warnings in 13.26s ================================

Sources

Before submitting

[N/A] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case)

  • Ran pre-commit to handle lint / formatting issues.
  • Read the contributor guideline, Pull Request section?

[N/A?] Updated relevant documentation. Couldn't find any relevant documentation. Lmk if I've missed anything.

  • Wrote necessary unit or integration tests.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 26, 2024
@aidando73 aidando73 marked this pull request as draft November 26, 2024 09:09
@aidando73 aidando73 changed the title from "." to "Add support for guided json decoding for vLLM provider" Nov 26, 2024
@aidando73 aidando73 changed the title from "Add support for guided json decoding for vLLM provider" to "[#391] Add support for json structured output for vLLM provider" Nov 26, 2024
@aidando73 aidando73 changed the title from "[#391] Add support for json structured output for vLLM provider" to "[#391] Add support for json structured output for vLLM" Nov 26, 2024
@aidando73 aidando73 force-pushed the aidand-391-guided-decoding-vllm_3 branch from 3cc464e to 1801aa1 November 26, 2024 09:40
if fmt.type == ResponseFormatType.json_schema.value:
    input_dict["extra_body"] = {
        "guided_json": request.response_format.json_schema
    }
@aidando73 commented Nov 26, 2024

Note that using

input_dict["response_format"] = {
  "type": "json_schema",
  "json_schema": {
    "name": "example_name",
    "strict": True,
    "schema": request.response_format.json_schema
  }
}

as per the usual OpenAI API, doesn't work. I've written a more detailed explanation here.
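
For reference, this is roughly the request shape the adapter now produces. Here's a minimal sketch that hits the vLLM OpenAI-compatible server from the test plan directly with the openai client (the schema here is illustrative, not part of the PR):

from openai import OpenAI

# Point the standard OpenAI client at the vLLM server from the test plan.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

schema = {
    "type": "object",
    "properties": {"recipe_name": {"type": "string"}},
    "required": ["recipe_name"],
}

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Give me a recipe for spaghetti bolognaise."}],
    # extra_body passes fields the OpenAI spec doesn't define; vLLM picks up
    # guided_json and constrains decoding to match the schema.
    extra_body={"guided_json": schema},
)
print(response.choices[0].message.content)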

@aidando73 commented Nov 26, 2024

Script to play around with

import os
from llama_stack_client import LlamaStackClient
from pydantic import BaseModel

client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")

class CompletionMessage(BaseModel):
    recipe_name: str
    preamble: str
    ingredients: list[str]
    steps: list[str]

response = client.inference.chat_completion(
    model_id=os.environ["INFERENCE_MODEL"],
    messages=[
        {
            "role": "system",
            "content": "You are a chef, passionate about educating the world about delicious home cooked meals.",
        },
        {
            "role": "user",
            "content": "Give me a recipe for spaghetti bolognaise. Start with the recipe name, a preamble describing your childhood stories about spaghetti bolognaise, an ingredients list, and then the recipe steps.",
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": CompletionMessage.model_json_schema(),
    },
    sampling_params={"max_tokens": 8000},
)
print(response.completion_message.content)

(Outputs the example in the PR description.)
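
Since the schema comes from a pydantic model, a quick sanity check is to parse the response back into it (a hypothetical follow-up, not part of the script above):

# Round-trip the structured output through the same pydantic model.
recipe = CompletionMessage.model_validate_json(response.completion_message.content)
print(recipe.recipe_name)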

    raise NotImplementedError("Grammar response format not supported yet")
else:
    raise ValueError(f"Unknown response format {fmt.type}")

return {
    "model": request.model,
    **input_dict,
@aidando73 commented Nov 26, 2024

Prompt adaptation is also working. This prompt:

        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding"},

Becomes:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Write a haiku about coding<|eot_id|><|start_header_id|>user<|end_header_id|>

Please respond in JSON format with the schema: {properties: {content: {title: Content, type: string}, additional_info: {title: Additional Info, type: string}}, required: [content, additional_info], title: CompletionMessage, type: object}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

This gets added: Please respond in JSON format with the schema: {properties: {content: {title: Content, type: string}, additional_info: {title: Additional Info, type: string}}, required: [content, additional_info], title: CompletionMessage, type: object}
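
A rough sketch of that adaptation step (function and argument names are illustrative, not the actual llama-stack internals):

import json

def append_json_schema_instruction(messages: list[dict], json_schema: dict) -> list[dict]:
    # Append the schema as one extra user turn before the Llama 3 chat
    # template is applied, so it lands as the final instruction.
    instruction = f"Please respond in JSON format with the schema: {json.dumps(json_schema)}"
    return messages + [{"role": "user", "content": instruction}]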

@aidando73 aidando73 marked this pull request as ready for review November 26, 2024 11:08