-
Thank you for posting this! I experienced exactly the same initial pain when I first looked at LLamaSharp about a year ago. That was actually what motivated me to join the project, contribute improvements and eventually develop the BatchedExecutor. Now that you've used the system, how would you improve it? My long term plan has always been to build higher level executors over the batched executor (probably wrapping a single conversation object). Does that seem like it would improve things to you?
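Roughly this kind of shape is what I have in mind, purely as a sketch (none of these names are a committed API):

```csharp
using System;
using LLama.Batched;

// Purely a sketch of the idea: a hypothetical higher level executor that owns a
// single Conversation on top of a shared BatchedExecutor. Nothing here is a
// committed API; the names are placeholders.
public sealed class SingleConversationExecutor : IDisposable
{
    private readonly BatchedExecutor _executor;
    private readonly Conversation _conversation;

    public SingleConversationExecutor(BatchedExecutor executor)
    {
        _executor = executor;
        _conversation = executor.Create();
    }

    // The wrapper would hide the Infer()/Sample()/Prompt() plumbing behind a
    // familiar "prompt in, text out" loop, while still exposing the underlying
    // Conversation for rewind/fork when you need the low level control.
    public Conversation Conversation => _conversation;

    public void Dispose() => _conversation.Dispose();
}
```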
-
Hey, I'm a bit confused about the use case for the BatchedExecutor. I get the concept for the others, and I personally use the StatelessExecutor, handling prompt formatting, token counting, logging, and instruct formats myself (my system prompt changes dynamically, making stateful executors a bit pointless for me). From what I gathered from the docs (btw, thank you for having coherent documentation; it's beyond rare in the AI field, sadly), it seems that it's used to process multiple prompts at the same time? Unless it's designed to run on multiple graphics cards, I don't really get the benefit. And if it's sequential instead of parallel, I'm literally lost :D I get that it's a lower level way to access the model, but I don't understand the use case. I'm just curious, as I'm fundamentally unable to learn something without a concrete use case / example, no matter the amount of documentation available. I could maybe see it for when I'm running a small embedding model to feed a ton of message history and arbitrary text into a vector DB, at a vaguely higher speed than when using the StatelessExecutor in a loop, but I feel like I'm missing something important.
-
Initial pain
I've experimented with LlamaSharp for a while now, checking if any new features were added every once in a while.
But something has always frustrated me every time I came back to it: the sheer insanity of getting the high level executors to work.
It's all so, so messy and almost impossible to get working cleanly.
Things like not being able to directly prompt the models with tokens, or how tokenizing special tokens used to be disabled by default (which would've been fine if only you were able to pass in your own tokens!).
I work on my laptop, which isn't the most powerful machine out there, so I always test small language models like phi-3 and other models whose parameter sizes are around 7b.
Now, these models, being so small, are an absolute pain to control. Oftentimes they don't output the end token, or they begin writing user or system messages.
Meaning the context would get all messy with incorrectly generated tokens and no way to correct it because the default executors didn't allow it.
In fact, you couldn't even use antiprompts to fix that! Because first, the generated token would get added to the cache anyway (at least that's what I think it did, from looking at the source code).
And second, the antiprompts were strings checked against the string output, so special tokens didn't even work!
Updates
Now, some of these issues were slowly improved... But not completely solved.
For example, special tokens are now handled by the high level executors, allowing for proper prompt formats like ChatML.
And the StreamingTokenDecoder now allows displaying special tokens, which is also nice, since before we had to resort to the NativeApi class...
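For example, something like this works now (the parameter and property names here are from memory, so double-check them against the version you're on):

```csharp
using LLama;

static void TokenizeWithSpecials(LLamaContext context)
{
    // Tokenize a ChatML-formatted prompt while keeping the special tokens intact
    // instead of having them broken up into plain-text pieces.
    var tokens = context.Tokenize(
        "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n",
        addBos: true,
        special: true);

    // The decoder can also be told to emit special tokens as text,
    // instead of silently dropping them.
    var decoder = new StreamingTokenDecoder(context)
    {
        DecodeSpecialTokens = true
    };
}
```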
I was devastated.
BatchedExecutor to the rescue
Recently I decided to give LLamaSharp another shot, more specifically the new BatchedExecutor, which caught my attention.
I looked at the examples and everything slowly clicked into place. I read the source code, and the API was so much clearer.
I proceeded to refactor an old project of mine that used LLamaSharp, and in a matter of minutes I had phi-3 under control, working as intended and far more reliably than ever before.
To say the least, this thing was incredible.
I really like the fact that I actually got control over the model's output, instead of leaving it to the old executors' messy implementation.
I also like that it doesn't come with some weird state handling shenanigans; that way I can keep track of generated tokens manually. I even came up with a way to handle the context running out, take a look below!
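The gist of it is something like this (simplified and from memory, so treat it as a rough sketch rather than my exact code; the helpers and parameter names are just placeholders):

```csharp
using LLama;
using LLama.Batched;

// Rough sketch, not the exact code: when my own running token count gets close to
// the context limit, throw the conversation away, create a fresh one and re-prompt
// it with the system prompt plus the recent history I'm already tracking myself.
static Conversation RebuildIfNearLimit(
    BatchedExecutor executor,
    Conversation conversation,
    int promptedTokens,      // my own running count of prompted + generated tokens
    string rebuiltPrompt)    // system prompt + trimmed recent history, pre-formatted
{
    const int reserve = 256; // leave headroom for the next response

    if (promptedTokens + reserve < executor.Context.ContextSize)
        return conversation; // still plenty of room, keep going

    conversation.Dispose();

    var fresh = executor.Create();
    fresh.Prompt(executor.Context.Tokenize(rebuiltPrompt, addBos: true, special: true));
    return fresh;
}
```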
I haven't tested it thoroughly yet, but it just feels right.
Absolutely love the fact that to continue generation, you have to manually prompt the executor with the last generated token. This allows me to actually control the stuff that gets sent to the model, meaning I can stop generation the moment the model tries to write a user message it shouldn't, and the context won't get messy!
This is how it should've been from the start.
My inference loop looks something like this; it's not the best, but it's way cleaner than before.
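(Simplified, and the sampling calls are from memory of the BatchedExecutor examples, so the exact method names may not match the version you're on; `Generate` and `maxTokens` are just my own names here.)

```csharp
using System;
using System.Threading.Tasks;
using LLama;
using LLama.Batched;
using LLama.Native;
using LLama.Sampling;

static async Task<string> Generate(
    BatchedExecutor executor,
    Conversation conversation,
    LLamaWeights model,
    string prompt,
    int maxTokens)
{
    var decoder = new StreamingTokenDecoder(executor.Context) { DecodeSpecialTokens = true };
    var sampler = new DefaultSamplingPipeline();

    // Prompt with real tokens, special tokens (ChatML markers etc.) included.
    conversation.Prompt(executor.Context.Tokenize(prompt, addBos: false, special: true));

    for (var i = 0; i < maxTokens; i++)
    {
        // Run the batch; this is where the actual inference happens.
        await executor.Infer();

        // Sample the next token from this conversation's logits.
        var token = sampler.Sample(
            executor.Context.NativeHandle,
            conversation.Sample(),
            Array.Empty<LLamaToken>());

        // Stop cleanly when the model signals the end of generation
        // (this is also where I catch it trying to start a user/system turn).
        if (model.Tokens.IsEndOfGeneration(token))
            break;

        decoder.Add(token);

        // Crucially: *I* decide what goes back into the context.
        conversation.Prompt(token);
    }

    return decoder.Read();
}
```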
These code snippets are taken out of context, so don't expect them to work by just copy-pasting; they're just for demonstration.
Where are we now
On a more serious note, I decided to write this with the purpose of giving feedback to the project and hopefully improving the experience for developers even more.
I may have exaggerated things a bit, but I genuinely think this library has evolved a lot over time, and I wish it the best.
Please feel free to correct me on any mistakes or incorrect assumptions that I may have made as I'm not a perfect being.
As for the unfortunate souls who have decided to read this all the way through: thank you.
And I have a question for you: have you had a similar experience with this library? I can't be the only one, right?
I would like to hear your stories!
Best regards, Folly :3