
passthrough arg from api to model.forward #838

Closed · wanted to merge 3 commits

Conversation

qpwo (Contributor) commented Nov 26, 2024

Please feel welcome to make any and all changes without consulting me!

Addresses #836

About this PR (and my limited understanding of how aphrodite works)

  • A passthrough will come in from the API and get to either SamplingParams or PoolingParams.
  • The goal is to pass it around and get it all the way to model_executable(**execute_model_kwargs).
  • So we need to get passthrough onto ModelInputForXPU, the OpenVINO model input, ModelInputForGPU, ModelInputForNeuron, and ModelInputForTPU.
  • I only implemented distributed stuff for GPU I think.
  • A function has access to the passthrough if it gets SamplingParams or PoolingParams or LLMInputs or EncoderDecoderLLMInputs.
    • I tried to keep redundant passing of passthrough to a minimum
  • I think `passthrough` travels between GPUs and nodes/machines along ModelInputForGPUBuilder.InterDataForSeqGroup?
  • Not sure this is how you're supposed to do all this.
  • Seems to work for me.

There are a couple of small TODOs that might still need doing; I can't tell.
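To make the flow above concrete, here is a minimal toy sketch of the intended data path: a passthrough dict rides along on the sampling params, gets copied onto the per-step model input, and is finally splatted into the model call. All class and function names here are illustrative stand-ins, not aphrodite's actual API.

```python
# Toy sketch of the passthrough data path (illustrative names, not aphrodite code).
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class SamplingParams:
    temperature: float = 1.0
    # Opaque, user-supplied extras that the engine itself never interprets.
    passthrough: Optional[Dict[str, Any]] = None

@dataclass
class ModelInputForGPU:
    input_ids: List[int] = field(default_factory=list)
    passthrough: Optional[Dict[str, Any]] = None

def prepare_model_input(params: SamplingParams, input_ids: List[int]) -> ModelInputForGPU:
    # The input builder copies passthrough onto the per-step model input.
    return ModelInputForGPU(input_ids=input_ids, passthrough=params.passthrough)

def model_forward(input_ids, passthrough=None):
    # The model can branch on arbitrary user-supplied keys.
    if passthrough and passthrough.get("save_activation"):
        return {"ids": input_ids, "saved": True}
    return {"ids": input_ids, "saved": False}

params = SamplingParams(passthrough={"save_activation": True})
model_input = prepare_model_input(params, [1, 2, 3])
execute_model_kwargs = {
    "input_ids": model_input.input_ids,
    "passthrough": model_input.passthrough,
}
result = model_forward(**execute_model_kwargs)
print(result["saved"])  # → True
```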

qpwo (Contributor, Author) commented Nov 26, 2024

Here's a (rather long) video of it working

_00005.mp4

(Again, if I drop off, feel free to close it or rewrite it or whatever. No need to ask me for permission.)

qpwo (Contributor, Author) commented Nov 26, 2024

I should have time for one or two rounds of improvements, though.

AlpinDale self-requested a review on November 26, 2024
qpwo (Contributor, Author) commented Nov 26, 2024

Lmk what I should do with this

qpwo (Contributor, Author) commented Nov 26, 2024

just rebased

AlpinDale (Member) left a comment

Left a few comments, mostly just nits for now.

This PR doesn't seem to be complete yet, so I can't make a decisive approval. Can you commit an example usage script for this? I'm still struggling to see how it can be used in its current state 😅

return passthrough

from aphrodite.common.pooling_params import PoolingParams # noqa: E402
from aphrodite.common.sampling_params import SamplingParams # noqa: E402
AlpinDale (Member):

Why are these imported at the bottom?

qpwo (Contributor, Author):

To prevent a circular import.

@@ -92,6 +92,8 @@ class LLM:
serving, use the :class:`~aphrodite.AsyncAphrodite` class instead.
"""

# TODO:Luke do we need `enable_passthrough_param` stuff in llm.py?
AlpinDale (Member):

You can leave it for now, since passthrough seems to be first-class in the REST API.

@@ -340,6 +341,11 @@ def forward(
hidden_states, _ = self.norm(hidden_states, residual)
return hidden_states

_printed = set()
def print_once(key, *args, **kwargs):
AlpinDale (Member):

We have a log_once() function in aphrodite.common.logger. You can use that instead of this.

50h100a (Collaborator) commented Nov 27, 2024

Setting aside for a moment ways the PR could be improved, I'm also struggling to find a motivating use-case for this as it is now.

Do you have a specific example of something this would allow you to do, that would be much harder without it?

qpwo (Contributor, Author) commented Nov 27, 2024

Do you have a specific example of something this would allow you to do, that would be much harder without it?

I am adding steering vectors to Llama 405B, so I can just put save_activation=True or steer_with=1234 in passthrough. If the PR is not a good fit, then I'll close it and just use my fork.

AlpinDale (Member) commented Nov 27, 2024

If you need Steering/Control Vectors in particular, I've been working on this PR to enable support for it: #604

I've been neglecting it for a while but I'll get back to it as soon as possible.

qpwo (Contributor, Author) commented Nov 27, 2024

Oh sweet! But yeah, I wanted something a bit more flexible: the ability to quickly make changes in the <model>.py code itself. Hmm. So should I finish this PR or not?

qpwo (Contributor, Author) commented Nov 27, 2024

I was also adding custom parameters to the samplers. It's nice not to have to change code in 20 places to try something out. But this is your repo, not mine! Lmk

AlpinDale (Member):

If you think you can make the changes as unintrusive as possible, I'm not against it. What do you think @50h100a ?

In case we do merge the PR, I'd appreciate it if you helped with the maintenance related to passthrough when needed. Since we don't have many maintainers, we try and add only the features we think we can comfortably maintain.

qpwo (Contributor, Author) commented Nov 27, 2024

Would be happy to try and patch it up when issues arise. I must admit I am still a bit confused about how the scheduling, queuing, inter-node comms, etc. work. Do you have a doc anywhere explaining, like, model_runner.py and scheduler.py?

qpwo (Contributor, Author) commented Nov 30, 2024

Everything that takes a passthrough has passthrough: Optional[Passthrough] = None, so I think this is a relatively safe PR to merge as-is, despite the big diff. (Biggest thing is I haven't tested non-cuda arches.) I have been using it the last few days and I've found it pretty handy.

My latest pattern is to have something like passthrough.action: Save | Steer; then in layer.forward I can do if action.type == 'steer': ..., which is a bit more ergonomic than using hooks or wrapper classes for everything. But it's really only practical for running little experiments.
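That pattern might look something like the toy sketch below. The `Save` and `Steer` dataclasses and the layer are illustrative stand-ins, not actual aphrodite or fork code.

```python
# Toy sketch of dispatching on a passthrough "action" inside a layer's forward.
from dataclasses import dataclass

@dataclass
class Save:
    type: str = "save"

@dataclass
class Steer:
    type: str = "steer"
    scale: float = 1.0

class ToyLayer:
    def __init__(self):
        self.saved_activations = []

    def forward(self, hidden, passthrough=None):
        action = passthrough.get("action") if passthrough else None
        if action is not None and action.type == "save":
            # Record the activations for later inspection.
            self.saved_activations.append(list(hidden))
        elif action is not None and action.type == "steer":
            # Stand-in for adding a steering vector to the hidden state.
            hidden = [h + action.scale for h in hidden]
        return hidden

layer = ToyLayer()
out = layer.forward([1.0, 2.0], passthrough={"action": Steer(scale=0.5)})
print(out)  # → [1.5, 2.5]
```

The appeal is that one opaque dict threads through the engine unchanged, so trying a new experiment only touches the layer code and the request, not every intermediate plumbing layer.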

I won't mind at all if you close it; it's your project! I am not attached to this.

qpwo (Contributor, Author) commented Nov 30, 2024

Oh, by the way, the reason I am using aphrodite for my experiments is that it runs twice as fast as the leanest, meanest llama inference code I could come up with! I realize that quick experiments are not your main target use case. But thanks a lot for all the CUDA kernels.

AlpinDale (Member) commented Dec 1, 2024

Would be happy to try and patch it up when issues arise. I must admit I am still a bit confused about how the scheduling, queuing, inter-node comms, etc still work. Do you have a doc anywhere explaining like model_runner.py and scheduler.py?

There's unfortunately no explicit documentation for the internal sequence processing code, I'll get to it as soon as I can. For now, you'd have to rely on the docstrings and comments, which I think should be descriptive enough for most cases.

50h100a (Collaborator) commented Dec 4, 2024

This PR can be radically simplified by using kwargs unpacking in the model_executable calls in the model runners - no need to make changes to every single model implementation. Just use **passthrough, treating it as a possibly-empty dict of extra args.
As it's a debug feature, failing ungracefully when unsupported kwargs are passed to a model seems acceptable.

I suspect the things upstream can be simplified as well, and certainly generalized into an anonymous dict, but I'm on less familiar ground there.
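The suggested simplification could look roughly like this. The function names are illustrative; the point is only the `**`-unpacking at the call site, with a possibly-empty dict standing in for passthrough.

```python
# Sketch of 50h100a's suggestion: splat passthrough into the model call so
# only models that care about a given kwarg need to declare it.
# Names here are illustrative, not aphrodite's actual model-runner API.
def run_model(model_executable, input_ids, passthrough=None):
    extra_kwargs = passthrough or {}
    # Unsupported keys raise TypeError inside the model; per the comment
    # above, failing ungracefully is acceptable for a debug feature.
    return model_executable(input_ids, **extra_kwargs)

def toy_model(input_ids, steer_with=None):
    # A model that opts in to exactly one passthrough kwarg.
    if steer_with is not None:
        return [i + steer_with for i in input_ids]
    return input_ids

print(run_model(toy_model, [1, 2], passthrough={"steer_with": 10}))  # → [11, 12]
print(run_model(toy_model, [1, 2]))  # → [1, 2]
```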

qpwo (Contributor, Author) commented Dec 5, 2024

Yeah, I would've done that if I'd known how. I don't really understand the code, so I was just mimicking the style of the other stuff that's passed explicitly. If you did the kwargs thing, the diff might improve my understanding.

qpwo (Contributor, Author) commented Dec 5, 2024

I'm also confused about why most stuff gets explicitly tensorified/detensorified, yet the code still ran when I just didn't do that at all. Maybe it relates to torch.compile or XPUs or something?

@qpwo qpwo closed this Dec 14, 2024