
add NVIDIA NIM inference adapter #355

Merged: 19 commits merged into meta-llama:main from the add-nvidia-inference-adapter branch on Nov 23, 2024

Conversation

@mattf (Contributor) commented Nov 1, 2024

What does this PR do?

this PR adds a basic inference adapter to NVIDIA NIMs

what it does -

  • chat completion api
    • tool calls
    • streaming
    • structured output
    • logprobs
  • support hosted NIM on integrate.api.nvidia.com (see the configuration sketch below)
  • support downloaded NIM containers

what it does not do -

  • completion api
  • embedding api
  • vision models
  • builtin tools
  • have certainty that sampling strategies are correct
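
A minimal sketch of wiring this up against the hosted endpoint; the import path and the `url`/`api_key` fields on `NVIDIAConfig` are assumptions inferred from this PR's description, not taken from its code:

```python
import os

# hypothetical import path; the PR's actual package location may differ
from llama_stack.providers.adapters.inference.nvidia import (
    NVIDIAConfig,
    NVIDIAInferenceAdapter,
)

config = NVIDIAConfig(
    url="https://integrate.api.nvidia.com",  # hosted NIM endpoint; point at a downloaded NIM container instead if running locally
    api_key=os.environ.get("NVIDIA_API_KEY"),  # same key the tests pass via --env
)
adapter = NVIDIAInferenceAdapter(config)
```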

Feature/Issue validation/testing/test plan

pytest -s -v --providers inference=nvidia llama_stack/providers/tests/inference/ --env NVIDIA_API_KEY=...

all tests should pass. there are pydantic v1 warnings.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Thanks for contributing 🎉!

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Nov 1, 2024
@mattf force-pushed the add-nvidia-inference-adapter branch from a5760c0 to 2a25ace on November 19, 2024 16:38
# the root directory of this source tree.

from ._config import NVIDIAConfig
from ._nvidia import NVIDIAInferenceAdapter
Contributor:
this should be a dynamic import within get_adapter_impl() -- we want configs to be manipulated without needing implementation dependencies.
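
A sketch of the lazy-import shape being requested, assuming the module keeps only the config import at top level (the exact `get_adapter_impl` signature is an assumption):

```python
from ._config import NVIDIAConfig


async def get_adapter_impl(config: NVIDIAConfig, _deps):
    # defer importing the implementation until an adapter instance is needed,
    # so configs can be manipulated without the implementation's dependencies
    from ._nvidia import NVIDIAInferenceAdapter

    return NVIDIAInferenceAdapter(config)
```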

mattf (Contributor Author):
ptal

ashwinb (Contributor) left a comment:
Thank you for this PR. So good!

Re: testing, I'd like to have a reproducible e2e test (ala what we have in providers/tests/inference/test_text_inference.py and providers/tests/inference/test_vision_inference.py) -- just having an nvidia specific fixture there which could then be invoked as

pytest -s -v --providers inference=nvidia test_text_inference.py --env ...

would be great.
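
Roughly, the kind of fixture being asked for might look like the sketch below; `ProviderFixture`/`Provider`, the import paths, and the `remote::nvidia` provider_type are assumptions modeled on how the other inference fixtures are typically structured, not details from this PR:

```python
import os

import pytest

from llama_stack.distribution.datatypes import Provider  # hypothetical import path
from .fixtures import ProviderFixture  # hypothetical import path


@pytest.fixture(scope="session")
def inference_nvidia() -> ProviderFixture:
    # registers the nvidia remote provider so --providers inference=nvidia can select it
    return ProviderFixture(
        providers=[
            Provider(
                provider_id="nvidia",
                provider_type="remote::nvidia",
                config={"api_key": os.getenv("NVIDIA_API_KEY")},
            )
        ],
    )
```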

)

@property
def is_hosted(self) -> bool:
Contributor:
this could be is_nvidia_hosted perhaps?

mattf (Contributor Author):
it's really an internal thing. i've removed it from the NVIDIAConfig api entirely.
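
For illustration, one way to keep the check internal to the adapter module rather than on the config's public API; the helper name and the `config.url` field are assumptions:

```python
from urllib.parse import urlparse


def _is_nvidia_hosted(config) -> bool:
    # treat integrate.api.nvidia.com as the NVIDIA-hosted service; anything
    # else is assumed to be a downloaded NIM container
    return urlparse(config.url).hostname == "integrate.api.nvidia.com"
```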

@@ -0,0 +1,182 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Contributor:
this is a nitpick, you can ignore it if you feel strongly. we don't usually do underscores in files in the repo - at least not yet. we don't even strongly enforce what symbols get exported out of a module (that part is a bit sad, admittedly.) could you make the files not have starting underscores?

mattf (Contributor Author):
my inclination is to be cautious about the exported symbols, but it's important to be cohesive w/ the project. i'll change these. ptal.


from llama_models.datatypes import SamplingParams
from llama_models.llama3.api.datatypes import (
InterleavedTextMedia,
Contributor:
thanks for the explicit imports. we will be code-modding all our other code to do this sane thing soon :)

mattf (Contributor Author):
i spent so much time trying to figure out which classes were coming from which packages 😆

CoreModelId.llama3_2_90b_vision_instruct.value,
),
# TODO(mf): how do we handle Nemotron models?
# "Llama3.1-Nemotron-51B-Instruct" -> "meta/llama-3.1-nemotron-51b-instruct",
Contributor:
is there a "base" llama model this model would correspond most closely with? we like to know it because we try to format tools, etc. in a way which the model will work best with. this isn't strictly necessary if the provider / API works very robustly with tool calling, etc. but so far given our experience with various "openai" wrapper APIs, it has been spotty.
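
Purely illustrative (not part of this PR): a Nemotron entry could carry a pointer to the base Llama model it derives from so prompt/tool formatting follows the closest base model. The dict name is hypothetical, the import path is assumed, and the 70B lineage is an assumption about Nemotron-51B:

```python
from llama_models.datatypes import CoreModelId  # import path assumed

# hypothetical mapping: Nemotron NIM id -> closest base Llama model
NEMOTRON_BASE_MODELS = {
    "meta/llama-3.1-nemotron-51b-instruct": CoreModelId.llama3_1_70b_instruct.value,
}
```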

stream: Optional[bool] = False,
logprobs: Optional[LogProbConfig] = None,
) -> Union[CompletionResponse, AsyncIterator[CompletionResponseStreamChunk]]:
raise NotImplementedError()
Contributor:
any chance this could be done? it's OK if not, but we have gone back and filled in many of the missing completion() methods now

mattf (Contributor Author):
let me come back and add it in another PR, same for embedding

ChatCompletionResponse, AsyncIterator[ChatCompletionResponseStreamChunk]
]:
if tool_prompt_format:
warnings.warn("tool_prompt_format is not supported by NVIDIA NIM, ignoring")
Contributor:
❤️

@mattf (Contributor Author) commented Nov 21, 2024

> Thank you for this PR. So good!
>
> Re: testing, I'd like to have a reproducible e2e test (ala what we have in providers/tests/inference/test_text_inference.py and providers/tests/inference/test_vision_inference.py) -- just having an nvidia specific fixture there which could then be invoked as
>
> pytest -s -v --providers inference=nvidia test_text_inference.py --env ...
>
> would be great.

i've updated the PR description to note that this does not cover structured output, vision models, embedding or completion apis.

if it's ok, i'll follow up with PRs to add those features.

➜ pytest -s -v --providers inference=nvidia llama_stack/providers/tests/inference/test_{text,vision}_inference.py --env NVIDIA_API_KEY=... --inference-model Llama3.1-8B-Instruct
/home/matt/.conda/envs/stack/lib/python3.10/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
===================================================================================== test session starts ======================================================================================
platform linux -- Python 3.10.15, pytest-8.3.3, pluggy-1.5.0 -- /home/matt/.conda/envs/stack/bin/python
cachedir: .pytest_cache
rootdir: /home/matt/Documents/Repositories/meta-llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.6.2.post1, asyncio-0.24.0, httpx-0.34.0
asyncio: mode=strict, default_loop_scope=None
collected 11 items                                                                                                                                                                             

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-nvidia] Resolved 4 providers
 inner-inference => nvidia
 models => __routing_table__
 inference => __autorouted__
 inspect => __builtin__

Initializing NVIDIAInferenceAdapter(https://integrate.api.nvidia.com)...
Models: Llama3.1-8B-Instruct served by nvidia

PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-nvidia] SKIPPED (Other inference providers don't support completion() yet)
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completions_structured_output[-nvidia] SKIPPED (This test is not quite robust)
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-nvidia] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-nvidia] SKIPPED (Other inference providers don't support structured output yet)
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-nvidia] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-nvidia] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-nvidia] PASSED
llama_stack/providers/tests/inference/test_vision_inference.py::TestVisionModelInference::test_vision_chat_completion_non_streaming[-nvidia-image0-expected_strings0] SKIPPED (Other...)
llama_stack/providers/tests/inference/test_vision_inference.py::TestVisionModelInference::test_vision_chat_completion_non_streaming[-nvidia-image1-expected_strings1] SKIPPED (Other...)
llama_stack/providers/tests/inference/test_vision_inference.py::TestVisionModelInference::test_vision_chat_completion_streaming[-nvidia] SKIPPED (Other inference providers don't su...)

========================================================================== 5 passed, 6 skipped, 10 warnings in 29.23s ==========================================================================

@mattf (Contributor Author) commented Nov 22, 2024

@ashwinb i find test_structured_output to be flaky. it's both a functionality and an accuracy test -

        answer = AnswerFormat.model_validate_json(response.completion_message.content)
        assert answer.first_name == "Michael"
        assert answer.last_name == "Jordan"
        assert answer.year_of_birth == 1963
        assert answer.num_seasons_in_nba == 15

it's an accuracy test because it checks the value of first/last name, birth year, and num seasons.

i find that -

  • llama-3.1-8b-instruct and llama-3.2-3b-instruct pass the functionality portion
  • llama-3.2-3b-instruct consistently fails the accuracy portion (thinking MJ was in the NBA for 14 seasons)
  • llama-3.1-8b-instruct occasionally fails the accuracy portion

suggestions (not mutually exclusive) -

  1. turn the test into functionality only, skip the value checks
  2. split the test into a functionality version and an xfail accuracy version
  3. add context to the prompt so the llm can answer without relying on memorized knowledge (see the sketch below)
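
A sketch of what option 3 could look like; AnswerFormat's fields come from the snippet above, the prompt wording is hypothetical, and the chat_completion call is elided since its exact signature isn't shown here:

```python
from pydantic import BaseModel


class AnswerFormat(BaseModel):
    first_name: str
    last_name: str
    year_of_birth: int
    num_seasons_in_nba: int


# facts are supplied in the prompt, so the test checks extraction and JSON
# formatting rather than the model's recall
PROMPT = (
    "Michael Jordan was born in 1963. He played basketball for 15 seasons in the NBA. "
    "Return his first name, last name, year of birth, and number of NBA seasons "
    "in the requested JSON format."
)

# response = await inference_impl.chat_completion(...)  # request AnswerFormat's JSON schema
# answer = AnswerFormat.model_validate_json(response.completion_message.content)
# assert answer.num_seasons_in_nba == 15
```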

@mattf mattf requested a review from ashwinb November 22, 2024 20:52
@ashwinb (Contributor) commented Nov 23, 2024

@mattf I agree with your comments on test_structured_output completely. I think the third option makes the most sense. I will update the test pronto.

ashwinb (Contributor) left a comment:

Looks good to me. Merging!

@ashwinb ashwinb merged commit 4e6c984 into meta-llama:main Nov 23, 2024
2 checks passed
SLR722 pushed a commit that referenced this pull request Nov 27, 2024