-
Thank you for posting this! I experienced exactly the same initial pain when I first looked at LLamaSharp about a year ago. That was actually what motivated me to join the project, contribute improvements and eventually develop the BatchedExecutor. Now that you've used the system, how would you improve it? My long term plan has always been to build higher level executors over the batched executor (probably wrapping a single conversation object). Does that seem like it would improve things to you?
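Roughly this kind of shape is what I have in mind, purely as a sketch (none of these names are a committed API):

```csharp
using System;
using LLama.Batched;

// Purely a sketch of the idea: a hypothetical higher level executor that owns a
// single Conversation on top of a shared BatchedExecutor. Nothing here is a
// committed API; the names are placeholders.
public sealed class SingleConversationExecutor : IDisposable
{
    private readonly BatchedExecutor _executor;
    private readonly Conversation _conversation;

    public SingleConversationExecutor(BatchedExecutor executor)
    {
        _executor = executor;
        _conversation = executor.Create();
    }

    // The wrapper would hide the Infer()/Sample()/Prompt() plumbing behind a
    // familiar "prompt in, text out" loop, while still exposing the underlying
    // Conversation for rewind/fork when you need the low level control.
    public Conversation Conversation => _conversation;

    public void Dispose() => _conversation.Dispose();
}
```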
-
Hey, I'm a bit confused about the use case for the BatchedExecutor. I get the concept for the others, and I personally use the StatelessExecutor, handling prompt formatting, token counting, logging, and instruct formats myself (my system prompt changes dynamically, making stateful executors a bit pointless for me). From what I gathered from the docs (btw, thank you for having coherent documentation; it's beyond rare in the AI field, sadly), it seems that it's used to process multiple prompts at the same time? Unless it's designed to run on multiple graphics cards, I don't really get the benefit. And if it's sequential instead of parallel, I'm literally lost :D I get that it's a lower level way to access the model, but I don't understand the use case. I'm just curious, as I'm fundamentally unable to learn something without a concrete use case / example, no matter the amount of documentation available. I could maybe see it for when I'm running a small embedding model to feed a ton of message history and arbitrary text into a vector DB, at a vaguely higher speed than when using the StatelessExecutor in a loop, but I feel like I'm missing something important.
-
Initial pain
I've experimented with LlamaSharp for a while now, checking if any new features were added every once in a while.
But something has always frustrated me every time I came back to it: the sheer insanity of getting the high level executors to work.
It's all so, so messy and almost impossible to get working cleanly.
Things like not being able to directly prompt the models with tokens, or how tokenizing special tokens used to be disabled by default (which would've been fine if only you were able to pass in your own tokens!).
I work on my laptop, which isn't the most powerful machine out there, so I always test small language models like phi-3 and other models whose parameter sizes are around 7b.
Now, these models, being so small, are an absolute pain to control. Oftentimes they don't output the end token, or they begin writing user or system messages.
Meaning the context would get all messy with incorrectly generated tokens and no way to correct it because the default executors didn't allow it.
In fact, you couldn't even use antiprompts to fix that! Because first, the generated token would get added to the cache anyway (at least that's what I think it did, from looking at the source code).
And second, the antiprompts were strings checked against the string output, so special tokens didn't even work!
Updates
Now, some of these issues were slowly improved... But not completely solved.
For example, special tokens are now handled by the high level executors, allowing for proper prompt formats like ChatML.
And the StreamingTokenDecoder now allows displaying special tokens, which is also nice, since before we had to resort to the NativeApi class...
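For example, something like this works now (the parameter and property names here are from memory, so double-check them against the version you're on):

```csharp
using LLama;

static void TokenizeWithSpecials(LLamaContext context)
{
    // Tokenize a ChatML-formatted prompt while keeping the special tokens intact
    // instead of having them broken up into plain-text pieces.
    var tokens = context.Tokenize(
        "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n",
        addBos: true,
        special: true);

    // The decoder can also be told to emit special tokens as text,
    // instead of silently dropping them.
    var decoder = new StreamingTokenDecoder(context)
    {
        DecodeSpecialTokens = true
    };
}
```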
I was devastated.
BatchedExecutor to the rescue
Recently I decided to give LLamaSharp another shot, more specifically the new BatchedExecutor, which caught my attention.
I looked at the examples and everything slowly clicked into place. I read the source code, and the API was so much clearer.
I proceeded to refactor an old project of mine that used LLamaSharp, and in a matter of minutes I had phi-3 under control, working as intended and far more reliably than ever before.
To say the least, this thing was incredible.
I really like the fact that I actually got control over the model's output, instead of leaving it to the old executors' messy implementation.
I also like that it doesn't come with some weird state handling shenanigans; that way I can keep track of generated tokens manually. I even came up with a way to handle the context running out, take a look below!
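The gist of it is something like this (simplified and from memory, so treat it as a rough sketch rather than my exact code; the helpers and parameter names are just placeholders):

```csharp
using LLama;
using LLama.Batched;

// Rough sketch, not the exact code: when my own running token count gets close to
// the context limit, throw the conversation away, create a fresh one and re-prompt
// it with the system prompt plus the recent history I'm already tracking myself.
static Conversation RebuildIfNearLimit(
    BatchedExecutor executor,
    Conversation conversation,
    int promptedTokens,      // my own running count of prompted + generated tokens
    string rebuiltPrompt)    // system prompt + trimmed recent history, pre-formatted
{
    const int reserve = 256; // leave headroom for the next response

    if (promptedTokens + reserve < executor.Context.ContextSize)
        return conversation; // still plenty of room, keep going

    conversation.Dispose();

    var fresh = executor.Create();
    fresh.Prompt(executor.Context.Tokenize(rebuiltPrompt, addBos: true, special: true));
    return fresh;
}
```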
I haven't tested it thoroughly yet, but it just feels right.
Absolutely love the fact that to continue generation, you have to manually prompt the executor with the last generated token. This allows me to actually control the stuff that gets sent to the model, meaning I can stop generation the moment the model tries to write a user message it shouldn't, and the context won't get messy!
This is how it should've been from the start.
My inference loop looks something like this; it's not the best, but it's way cleaner than before.
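(Simplified, and the sampling calls are from memory of the BatchedExecutor examples, so the exact method names may not match the version you're on; `Generate` and `maxTokens` are just my own names here.)

```csharp
using System;
using System.Threading.Tasks;
using LLama;
using LLama.Batched;
using LLama.Native;
using LLama.Sampling;

static async Task<string> Generate(
    BatchedExecutor executor,
    Conversation conversation,
    LLamaWeights model,
    string prompt,
    int maxTokens)
{
    var decoder = new StreamingTokenDecoder(executor.Context) { DecodeSpecialTokens = true };
    var sampler = new DefaultSamplingPipeline();

    // Prompt with real tokens, special tokens (ChatML markers etc.) included.
    conversation.Prompt(executor.Context.Tokenize(prompt, addBos: false, special: true));

    for (var i = 0; i < maxTokens; i++)
    {
        // Run the batch; this is where the actual inference happens.
        await executor.Infer();

        // Sample the next token from this conversation's logits.
        var token = sampler.Sample(
            executor.Context.NativeHandle,
            conversation.Sample(),
            Array.Empty<LLamaToken>());

        // Stop cleanly when the model signals the end of generation
        // (this is also where I catch it trying to start a user/system turn).
        if (model.Tokens.IsEndOfGeneration(token))
            break;

        decoder.Add(token);

        // Crucially: *I* decide what goes back into the context.
        conversation.Prompt(token);
    }

    return decoder.Read();
}
```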
These code snippets are taken out of context, so don't expect them to work by just copy-pasting; they're just for demonstration.
Where are we now
On a more serious note, I decided to write this with the purpose of giving feedback to the project and hopefully improving the experience for developers even more.
I may have exaggerated things a bit, but I genuinely think this library has evolved a lot over time, and I wish it the best.
Please feel free to correct me on any mistakes or incorrect assumptions that I may have made as I'm not a perfect being.
As for the unfortunate souls who have decided to read this all the way through: thank you.
And I have a question for you: have you had a similar experience with this library? I can't be the only one, right?
I would like to hear your stories!
Best regards, Folly :3