
Feature :all_but_last cache #254

Merged: 3 commits, Dec 10, 2024

Conversation

@Sixzero (Collaborator) commented Dec 10, 2024

Going to show an MVP, to demonstrate why it seems like we will need a cache option like this.

codecov bot commented Dec 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.03%. Comparing base (1208290) to head (715a1b7).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #254      +/-   ##
==========================================
+ Coverage   92.02%   92.03%   +0.01%     
==========================================
  Files          49       49              
  Lines        4776     4786      +10     
==========================================
+ Hits         4395     4405      +10     
  Misses        381      381              


@Sixzero (Collaborator, Author) commented Dec 10, 2024

So here is the MVP:

using Test
using EasyContext
using PromptingTools: SystemMessage, UserMessage, AIMessage
using PromptingTools

rand_cache_breaker = rand(1:1000000)

# provide enough tokens to trigger cache.
sys_msg = SystemMessage("$rand_cache_breaker " *"You are a helpful assistant. " ^ 500)
# Create conversation with multiple messages
conversation = [
    sys_msg,
    UserMessage("What is your favorite color? ONLY write 1 color"),
]

msg = aigenerate(
    conversation;
    model="claudeh",
    cache=:all,
)
push!(conversation, AIMessage(msg.content))

@testset "Test continuation with :system, :last, :all_but_last, :all" begin
    msg1 = aigenerate(
        [conversation..., UserMessage("1. Tell me 3 times the percentage how much you like it.")];
        model="claudeh",
        cache=:system,
    )
    @test msg1.extras[:cache_creation_input_tokens] == 0
    msg2 = aigenerate(
        [conversation..., UserMessage("2. Tell me 3 times the percentage how much you like it.")];
        model="claudeh",
        cache=:last,
    )
    # this hasn't even used cache_read_input_tokens! Pretty much no reason to use it.
    @test msg2.extras[:cache_creation_input_tokens] < 200
    msg3 = aigenerate(
        [conversation..., UserMessage("3. Tell me 3 times the percentage how much you like it.")];
        model="claudeh",
        cache=:all_but_last,
    )
    @test msg3.extras[:cache_creation_input_tokens] == 0
    # actually all_but_last saved more tokens than :system
    @test msg3.extras[:cache_read_input_tokens] > msg1.extras[:cache_read_input_tokens]
    msg4 = aigenerate(
        [conversation..., UserMessage("4. Tell me 3 times the percentage how much you like it.")];
        model="claudeh",
        cache=:all,
    )
    @test msg4.extras[:cache_creation_input_tokens] > 0
end

What is the conclusion? :)

:last is NEVER reading cache? Pretty much... 🤔
:all_but_last does exactly what we need. But I just realised that if we don't add ephemeral to the last AI message, we can spare 4 more cache writes, because that marker causes 4 cache writes (see the sketch below for what an ephemeral marker looks like). 🤔
:all does what it is supposed to do. It reads the existing cache and writes the new tokens (which is what is intended).
:system also does what it is supposed to do.
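For reference, this is roughly how an ephemeral cache marker looks on a content block in the raw Anthropic Messages API payload (shape based on Anthropic's prompt-caching docs; PromptingTools attaches it for us when `cache` is set, so this Dict is purely illustrative):

# Illustrative only: a cache-marked content block in the shape of Anthropic's
# Messages API (per their prompt-caching docs), expressed as a Julia Dict.
marked_block = Dict(
    "type" => "text",
    "text" => "You are a helpful assistant. ...",
    "cache_control" => Dict("type" => "ephemeral"),
)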

@Sixzero (Collaborator, Author) commented Dec 10, 2024

Here is a refined MVP:

  • the conversation can be made longer more easily
  • temperature = 0 for less randomness

using Test
using EasyContext
using PromptingTools: SystemMessage, UserMessage, AIMessage
using PromptingTools

rand_cache_breaker = rand(1:1000000)

# provide enough tokens to trigger cache.
sys_msg = SystemMessage("$rand_cache_breaker " *"You are a helpful assistant. " ^ 400)
# Create conversation with multiple messages
conversation = [
    sys_msg,
    UserMessage("What is your favorite color? ONLY write 1 color"),
]

msg = aigenerate(
    conversation;
    model="claudeh",
    cache=:all,
    verbose=false,
    api_kwargs=(;temperature=0.0)
)
push!(conversation, AIMessage(msg.content))
function add_conv!(conversation)
    push!(conversation, UserMessage("Tell me once more."))
    msg = aigenerate(
        conversation;
        model="claudeh",
        cache=:all,
        api_kwargs=(;temperature=0.0)
    )
    push!(conversation, AIMessage(msg.content))
end
add_conv!(conversation)
add_conv!(conversation)
add_conv!(conversation)
println("Tests")
@testset "Test continuation with :system, :last, :all_but_last, :all" begin
    msg1 = aigenerate(
        [conversation..., UserMessage("1. Tell me 10 times just the percentage how much you like it.")];
        model="claudeh",
        cache=:system,
        api_kwargs=(;temperature=0.0)
    )
    @show msg1.content
    @test msg1.extras[:cache_creation_input_tokens] == 0
    msg2 = aigenerate(
        [conversation..., UserMessage("2. Tell me 3 times the percentage how much you like it.")];
        model="claudeh",
        cache=:last,
    )
    # this hasn't even used cache_read_input_tokens! Pretty much no reason to use it. THIS IS NOT GOOD!
    # @test msg2.extras[:cache_creation_input_tokens] < 200
    msg3 = aigenerate(
        [conversation..., UserMessage("3. Tell me 10 times just the percentage how much you like it.")];
        model="claudeh",
        cache=:all_but_last,
        api_kwargs=(;temperature=0.0)
    )
    @show msg3.content
    @test msg3.extras[:cache_creation_input_tokens] == 0
    # # actually all_but_last saved more tokens than :system
    # @test msg3.extras[:cache_read_input_tokens] > msg1.extras[:cache_read_input_tokens]
    msg4 = aigenerate(
        [conversation..., UserMessage("4. Tell me 10 times just the percentage how much you like it.")];
        model="claudeh",
        cache=:all,
        api_kwargs=(;temperature=0.0)
    )
    @show msg4.content
    @test msg4.extras[:cache_creation_input_tokens] > 0
end

There is ONE thing that amazes me... why does :all_but_last have the highest cache_read_input_tokens? I mean, :all should be on par with it, but also have the new tokens written. 🤔

@Sixzero (Collaborator, Author) commented Dec 10, 2024

Initializing the conversation (with :all)

System < cache
User
AI
User < cache (going to write everything to this point)

Next conversation message comes in:

  • current solution with :last:
System
User
AI
User (not going to read cache to this point)
AI
User < cache (This will be written to cache, with the FULL conversation, even though we had that previous cache point!)

current solution with :all:

System < cache (it is going to be read, since we have this saved in initialization)
User
AI
User (no ephemeral here, so it won't read to this cache point!!!)
AI
User < cache (To write this cache point)

with :all_but_last:

System < cache (we can do this, not really needed tho)
User
AI
User < cache (this actually gets used if it has been written! that is why we needed this!)
AI
User (not going to cache and pay +25% for no reason)
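
To put the "+25%" in numbers, here is a back-of-the-envelope cost sketch, assuming Anthropic's published multipliers at the time of writing (cache writes cost about 1.25x the base input price, cache reads about 0.1x); the price constant and the :input_tokens key are placeholders, not values from this package:

# Rough cost sketch, NOT package code: assumes Anthropic's published multipliers
# (cache write ~1.25x base input price, cache read ~0.1x) and a placeholder price.
function estimated_input_cost(extras; price_per_input_token = 3e-6)
    written = get(extras, :cache_creation_input_tokens, 0) * 1.25
    read    = get(extras, :cache_read_input_tokens, 0)     * 0.10
    fresh   = get(extras, :input_tokens, 0)                * 1.00  # key name assumed
    return (written + read + fresh) * price_per_input_token
end
# e.g. estimated_input_cost(msg4.extras)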

@svilupp (Owner) commented Dec 10, 2024

Thanks for the PR.

You are right that the cache behaved slightly differently than I thought, especially the LAST.
It's because of this rule:

Ensure cached sections are identical and marked with cache_control in the same locations across calls
Source: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

The cache must be marked at the same point each time to be written/read. So if you use LAST and continue your conversation (adding message), you'll never have a cache hit -- you keep marking a different message each time.

So for the most beginner-friendly behaviour, I've changed it so that cache=:all marks BOTH the last and the one-before-last user message, to maximize the cache read in multi-turn conversations.

I've also changed the implementation slightly to mark ONLY the user messages; the previous implementation could mark AI messages if you use manual AI prefills (which would be useless for caching).
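
A rough sketch of that selection rule (not the exact implementation), showing which user messages become marker candidates for each `cache` value; indices are positions in the conversation vector itself, with the system message at index 1:

# Rough sketch, not the actual implementation: only user messages are candidates
# for cache_control markers; :all marks the last two of them.
using PromptingTools: UserMessage

function cache_marker_candidates(conversation, cache::Symbol)
    user_idx = findall(m -> m isa UserMessage, conversation)
    isempty(user_idx) && return Int[]
    cache === :last         && return user_idx[end:end]
    cache === :all_but_last && return (length(user_idx) > 1 ? user_idx[end-1:end-1] : Int[])
    cache === :all          && return user_idx[max(end - 1, 1):end]
    return Int[]  # :system / :tools do not mark user messages
end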


As for your MWE, besides the above note, it works as expected. The odd artifacts you observed (e.g., "all_but_last" having higher cache read tokens) were purely because of the order you ran them in.

I minified the example further and added functionality to print the cache markers (0 is system msg):

cache = :all
rand_cache_breaker = rand(1:1000000)
sys_msg = SystemMessage("$rand_cache_breaker " * "You are a helpful assistant. "^500)
conversation = [
    sys_msg,
    UserMessage("What is your favorite color? ONLY write 1 color")
];
conversation = aigenerate(
    conversation; model = "claudeh", cache, return_all = true);
# [ Info: Tokens: 24 @ Cost: $0.0 in 1.5 seconds (Metadata: cache_creation_input_tokens => 3003)
conversation = aigenerate(
    "Tell me more about the number $(rand(1:1000))"; model = "claudeh", cache,
    return_all = true, conversation = conversation);
# [ Info: Tokens: 94 @ Cost: $0.0004 in 3.0 seconds (Metadata: cache_read_input_tokens => 3003, cache_creation_input_tokens => 16)
conversation = aigenerate(
    "Tell me more about the number $(rand(1:1000))"; model = "claudeh", cache,
    return_all = true, conversation = conversation);

Which results in the following with ALL cache:

## "All" in a growing multi-turn conversation
# [ Info: Cache control markers at message positions: [0, 1]
# [ Info: Tokens: 8 @ Cost: $0.0 in 1.7 seconds (Metadata: cache_creation_input_tokens => 3019)
# [ Info: Cache control markers at message positions: [0, 1, 3]
# [ Info: Tokens: 72 @ Cost: $0.0003 in 2.3 seconds (Metadata: cache_read_input_tokens => 3019, cache_creation_input_tokens => 15)
# [ Info: Cache control markers at message positions: [0, 3, 5]
# [ Info: Tokens: 101 @ Cost: $0.0005 in 2.5 seconds (Metadata: cache_read_input_tokens => 3034, cache_creation_input_tokens => 79)

And this for "all_but_last"

## "All but last" in a growing multi-turn conversation
# [ Info: Cache control markers at message positions: [0]
# [ Info: Tokens: 24 @ Cost: $0.0 in 1.0 seconds (Metadata: cache_creation_input_tokens => 3003)
# [ Info: Cache control markers at message positions: [0, 1]
# [ Info: Tokens: 101 @ Cost: $0.0004 in 2.6 seconds (Metadata: cache_read_input_tokens => 3003, cache_creation_input_tokens => 16)
# [ Info: Cache control markers at message positions: [0, 3]
# [ Info: Tokens: 213 @ Cost: $0.0007 in 3.5 seconds (Metadata: cache_read_input_tokens => 3003, cache_creation_input_tokens => 31)
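
For completeness, a hypothetical helper in the spirit of that marker printout; it is not the helper used above, and the payload shape is assumed from Anthropic's public Messages API (a "system" block list plus "messages" whose content blocks may carry "cache_control"):

# Hypothetical helper, not the one used above: reports which positions of an
# already-rendered Anthropic-style payload carry cache_control markers (0 = system).
function cache_marker_positions(payload::AbstractDict)
    positions = Int[]
    sys = get(payload, "system", nothing)
    if sys isa AbstractVector && any(b -> b isa AbstractDict && haskey(b, "cache_control"), sys)
        push!(positions, 0)
    end
    for (i, msg) in enumerate(get(payload, "messages", Any[]))
        blocks = get(msg, "content", Any[])
        blocks isa AbstractVector || continue
        if any(b -> b isa AbstractDict && haskey(b, "cache_control"), blocks)
            push!(positions, i)
        end
    end
    @info "Cache control markers at message positions: $positions"
    return positions
end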

All in all, I've updated the documentation to hopefully make it clearer when to use which:

- `cache`: A symbol representing the caching strategy to be used. Currently only `nothing` (no caching), `:system`, `:tools`, `:last`, `:all_but_last`, and `:all` are supported.
    - `:system`: Mark only the system message as cacheable. Best default if you have a large system message and you will be sending short conversations (no replies / multi-turn conversations).
    - `:all`: Mark the SYSTEM, the one-before-last, and the LAST user message as cacheable. Best for multi-turn conversations (you write the cache point as "last" and it will be read in the next turn as the "preceding" cache mark).
    - `:last`: Mark only the last message as cacheable. Use ONLY if you want to send the SAME REQUEST multiple times (and want to save up to the last USER message). This will not work for multi-turn conversations, as the "last" message keeps moving.
    - `:all_but_last`: Mark the SYSTEM and the one-before-last USER message. Use if you have a longer conversation that you want to re-use, but you will NOT CONTINUE it (no subsequent messages/follow-ups).
    - In short, use `:all` for multi-turn conversations, `:system` for repeated single-turn conversations with the same system message, and `:all_but_last` for longer conversations that you want to re-use but not continue.
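
The same guidance in call form, using the `aigenerate` pattern from the examples above (`prompt` and `conversation` are placeholders):

aigenerate(conversation; model = "claudeh", cache = :all)           # growing multi-turn conversation
aigenerate(prompt; model = "claudeh", cache = :system)              # repeated single-turn calls with a large system prompt
aigenerate(conversation; model = "claudeh", cache = :all_but_last)  # re-use a long conversation without continuing it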

In addition, I've also:

  • updated the documentation, cache assertions, etc. across all ai* functions (they can also use cache, not just aigenerate)
  • updated the tests to capture the new symbol and changes in behavior / refinement (user-messages only)

Does it make sense?

@svilupp merged commit 50f0475 into svilupp:main on Dec 10, 2024
6 checks passed