[#391] Add support for json structured output for vLLM #528
base: main
Conversation
Force-pushed from 3cc464e to 1801aa1
if fmt.type == ResponseFormatType.json_schema.value:
    input_dict["extra_body"] = {
        "guided_json": request.response_format.json_schema
    }
Note that using

input_dict["response_format"] = {
    "type": "json_schema",
    "json_schema": {
        "name": "example_name",
        "strict": True,
        "schema": request.response_format.json_schema
    }
}

as in the usual OpenAI API doesn't work. I've written a more detailed explanation here.
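For reference, a minimal sketch of what this change amounts to on the wire, assuming a local vLLM OpenAI-compatible server and the openai Python client (the endpoint URL, model name, prompt, and schema below are illustrative assumptions, not taken from this PR):

from openai import OpenAI

# Assumed local vLLM OpenAI-compatible endpoint; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

schema = {
    "type": "object",
    "properties": {"recipe_name": {"type": "string"}},
    "required": ["recipe_name"],
}

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",  # assumed model name
    messages=[{"role": "user", "content": "Name one pasta dish as JSON."}],
    # vLLM-specific: guided decoding parameters go in extra_body,
    # not in the standard response_format field.
    extra_body={"guided_json": schema},
)
print(response.choices[0].message.content)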
Script to play around with:

import os
from llama_stack_client import LlamaStackClient
from pydantic import BaseModel
client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")
class CompletionMessage(BaseModel):
    recipe_name: str
    preamble: str
    ingredients: list[str]
    steps: list[str]

response = client.inference.chat_completion(
    model_id=os.environ["INFERENCE_MODEL"],
    messages=[
        {
            "role": "system",
            "content": "You are a chef, passionate about educating the world about delicious home cooked meals.",
        },
        {
            "role": "user",
            "content": "Give me a recipe for spaghetti bolognaise. Start with the recipe name, a preamble describing your childhood stories about spaghetti bolognaise, an ingredients list, and then the recipe steps.",
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": CompletionMessage.model_json_schema(),
    },
    sampling_params={"max_tokens": 8000},
)
print(response.completion_message.content)

(Outputs the example in the PR description.)
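A natural follow-up, not part of the original script, is to validate the returned JSON against the same Pydantic model:

# Hypothetical extra step: parse the model output back into the schema class.
recipe = CompletionMessage.model_validate_json(response.completion_message.content)
print(recipe.recipe_name)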
    raise NotImplementedError("Grammar response format not supported yet")
else:
    raise ValueError(f"Unknown response format {fmt.type}")
return {
    "model": request.model,
    **input_dict,
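Piecing the hunks above together, the request-building logic added by this PR looks roughly like the sketch below. The surrounding function name, the outer response_format check, and the grammar branch's condition are assumptions inferred from the visible fragments, not the actual adapter code:

# Simplified reconstruction of the diff hunks above; illustrative only.
# ResponseFormatType is llama-stack's enum of response format types
# (import path assumed, omitted here).
def _build_input_dict(request) -> dict:
    input_dict = {}
    if request.response_format:
        fmt = request.response_format
        if fmt.type == ResponseFormatType.json_schema.value:
            # vLLM's OpenAI-compatible server takes guided decoding
            # parameters via extra_body rather than response_format.
            input_dict["extra_body"] = {
                "guided_json": request.response_format.json_schema
            }
        elif fmt.type == ResponseFormatType.grammar.value:
            raise NotImplementedError("Grammar response format not supported yet")
        else:
            raise ValueError(f"Unknown response format {fmt.type}")
    return {"model": request.model, **input_dict}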
Prompt adaptation is also working. This prompt:
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about coding"},
Becomes:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Write a haiku about coding<|eot_id|><|start_header_id|>user<|end_header_id|>
Please respond in JSON format with the schema: {properties: {content: {title: Content, type: string}, additional_info: {title: Additional Info, type: string}}, required: [content, additional_info], title: CompletionMessage, type: object}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
That is, this instruction gets appended: Please respond in JSON format with the schema: {properties: {content: {title: Content, type: string}, additional_info: {title: Additional Info, type: string}}, required: [content, additional_info], title: CompletionMessage, type: object}
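As a rough illustration of that adaptation (the helper name is hypothetical and the exact serialization in llama-stack's prompt adapter differs, e.g. in how the schema dict is stringified), the idea is to append a schema instruction as an extra user turn:

import json

# Illustrative only: append a schema instruction as an additional user message.
def add_schema_instruction(messages: list[dict], schema: dict) -> list[dict]:
    instruction = f"Please respond in JSON format with the schema: {json.dumps(schema)}"
    return messages + [{"role": "user", "content": instruction}]

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about coding"},
]
schema = {
    "properties": {
        "content": {"title": "Content", "type": "string"},
        "additional_info": {"title": "Additional Info", "type": "string"},
    },
    "required": ["content", "additional_info"],
    "title": "CompletionMessage",
    "type": "object",
}
print(add_schema_instruction(messages, schema)[-1]["content"])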
What does this PR do?
Addresses issue (#391)
Generated with the Llama-3.2-3B-Instruct model - pretty good for a 3B parameter model 👍
Test Plan
pytest -v -s llama_stack/providers/tests/inference/test_text_inference.py -k llama_3b-vllm_remote
With the following setup:
Results:
Sources
Before submitting
[N/A] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case)
Pull Request section?
[N/A?] Updated relevant documentation. Couldn't find any relevant documentation. Lmk if I've missed anything.