# Feature :all_but_last cache #254
**Codecov Report**

All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##             main     #254      +/-   ##
==========================================
+ Coverage   92.02%   92.03%   +0.01%
==========================================
  Files          49       49
  Lines        4776     4786      +10
==========================================
+ Hits         4395     4405      +10
  Misses        381      381
```

☔ View full report in Codecov by Sentry.
So here is the MVP:
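A minimal sketch of what such an MVP could look like, assuming the same aigenerate API, cache keyword, and "claudeh" model alias that the maintainer uses in the example further down (the point being that :last re-marks the newest message on every call):

```julia
using PromptingTools

# Long system prompt so the prefix exceeds Anthropic's minimum cacheable size.
sys_msg = SystemMessage("You are a helpful assistant. "^500)
conversation = [
    sys_msg,
    UserMessage("What is your favorite color? ONLY write 1 color")
];

# First call: cache = :last puts the cache breakpoint on the latest message.
conversation = aigenerate(
    conversation; model = "claudeh", cache = :last, return_all = true);

# Follow-up: the conversation has grown, so the breakpoint lands on a NEW message
# and the previously written cache is never re-read.
conversation = aigenerate(
    "Tell me more about that color"; model = "claudeh", cache = :last,
    return_all = true, conversation = conversation);
```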
What is the conclusion? :) Is :last NEVER reading the cache? Pretty much... 🤔
Here is a refined MVP:
There is ONE thing that amazes me... why does :all_but_last have the highest cache_read_input_tokens? I mean, :all should be on par with it, but also have the new tokens written. 🤔
Initializing the conversation (with :all):

Next conversation message comes in:

(Token-usage screenshots comparing the current solution with each cache setting were attached here.)
Thanks for the PR. You are right that the cache behaved slightly differently than I thought, especially :last.
The cache must be marked at the same point each time to be written/read. So if you use :last and continue your conversation (adding messages), you'll never have a cache hit -- you keep marking a different message each time. So for the most beginner-friendly experience, I've changed that.

I've also changed the implementation slightly to mark ONLY the user messages; the previous implementation could mark AI messages if you use manual aiprefills (which would be useless for the caching).

As for your MWE, besides the above note, it works as expected. The odd artifacts you observed (eg, :all_but_last having higher cache read tokens) were purely because of the order you ran them in. I minified the example further and added functionality to print the cache markers (0 is the system msg):

```julia
using PromptingTools

cache = :all
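# Random prefix to bust Anthropic's cache between runs of this script,
# so leftovers from earlier experiments don't skew the token counts.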
rand_cache_breaker = rand(1:1000000)
sys_msg = SystemMessage("$rand_cache_breaker " * "You are a helpful assistant. "^500)
conversation = [
sys_msg,
UserMessage("What is your favorite color? ONLY write 1 color")
];
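# First call: nothing is cached yet, so the marked prefix gets written
# (see cache_creation_input_tokens in the log below).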
conversation = aigenerate(
conversation; model = "claudeh", cache, return_all = true);
# [ Info: Tokens: 24 @ Cost: $0.0 in 1.5 seconds (Metadata: cache_creation_input_tokens => 3003)
conversation = aigenerate(
"Tell me more about the number $(rand(1:1000))"; model = "claudeh", cache,
return_all = true, conversation = conversation);
# [ Info: Tokens: 94 @ Cost: $0.0004 in 3.0 seconds (Metadata: cache_read_input_tokens => 3003, cache_creation_input_tokens => 16)
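# Third call: the prefix cached by the previous call should be read again,
# with only the newest turn written on top.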
conversation = aigenerate(
"Tell me more about the number $(rand(1:1000))"; model = "claudeh", cache,
return_all = true, conversation = conversation);
```

Which results in the following with the :all cache:
And this for :all_but_last:
All in all, I've updated the documentation to hopefully make it clearer when to use which:
In addition, I've also:
Does it make sense?
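To make the "mark the same point each time" rule concrete, here is a small standalone sketch (not PromptingTools code; the exact breakpoint placement per option is my assumption, based on the option names and the note above that markers land only on user messages, with 0 denoting the system message):

```julia
# Standalone illustration: which 0-based message indices get a cache breakpoint.
# Assumptions: markers go only on user messages; index 0 is the system message.
function marker_positions(roles::Vector{Symbol}, cache::Symbol)
    user_idxs = [i - 1 for (i, r) in pairs(roles) if r == :user]  # 0-based indices
    isempty(user_idxs) && return Int[]
    cache == :last && return user_idxs[end:end]
    cache == :all && return [0; user_idxs[end:end]]
    cache == :all_but_last && return length(user_idxs) >= 2 ? [0; user_idxs[end-1:end-1]] : [0]
    return Int[]
end

turn1 = [:system, :user]               # initial conversation
turn2 = [:system, :user, :ai, :user]   # after one exchange plus a new user message

for cache in (:last, :all, :all_but_last)
    println(rpad(cache, 14), "turn 1 -> ", marker_positions(turn1, cache),
        "   turn 2 -> ", marker_positions(turn2, cache))
end
# :last moves its only breakpoint every turn (1 -> 3), so the previously
# written prefix is never re-read once the conversation grows.
```

The takeaway matches the explanation above: :last re-marks a new message every turn, while :all and :all_but_last keep a breakpoint on a stable prefix that later calls can actually re-read.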
Going to show an MVP to demonstrate why it seems we will need a cache option like this.