
Why does the tiktokenizer demo output completely different token numbers than the actual npm package? #12

Closed
Evertt opened this issue Jul 3, 2023 · 3 comments


Evertt commented Jul 3, 2023

So in the tiktokenizer demo, the textarea box looks like this (I'm using the gpt-3.5-turbo model):

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Yes, please help<|im_end|>
<|im_start|>assistant

And the token array looks like this:

[100264, 9125, 198, 2675, 527, 264, 11190, 18328, 100265, 198, 100264, 882, 198, 9642, 11, 4587, 1520, 100265, 198, 100264, 78191, 198]

But when I use the exact same text in my JavaScript file:

[screenshot: the same prompt tokenized in a JavaScript file, with the resulting tokens shown in a terminal panel on the right]

I get completely different tokens, as you can see in the terminal panel on the right side.
And no, it's not because you see triangles in the string in my screenshot; that's just because of the font I'm using.

Can you explain what's going on here?


dqbd commented Jul 4, 2023

Hi @Evertt!

The reason you are seeing different tokens is the handling of special tokens: by default, no special tokens are registered with the tokeniser.

To get the same token counts, you might want to do something like this:

// Register the ChatML markers as extra special tokens for chat models:
if (model.startsWith("gpt-4") || model.startsWith("gpt-3.5-turbo")) {
  return encoding_for_model(model, {
    "<|im_start|>": 100264,
    "<|im_end|>": 100265,
    "<|im_sep|>": 100266,
  });
}
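As a toy illustration (not the real BPE algorithm), the key difference is that a registered special token is matched as one opaque unit before ordinary tokenization runs, whereas unregistered marker text is broken into normal sub-word pieces:

```typescript
// Toy sketch only -- NOT the real BPE tokenizer. Ordinary text is
// stood in for by single characters; the point is that a registered
// special token collapses to one id, while unregistered marker text
// gets chopped up like any other string.
function toyTokenize(
  text: string,
  specialTokens: Map<string, number>
): (number | string)[] {
  const out: (number | string)[] = [];
  let i = 0;
  outer: while (i < text.length) {
    for (const [tok, id] of specialTokens) {
      if (text.startsWith(tok, i)) {
        out.push(id); // one token id for the whole marker
        i += tok.length;
        continue outer;
      }
    }
    out.push(text[i]); // stand-in for ordinary BPE merges
    i++;
  }
  return out;
}

const special = new Map([["<|im_start|>", 100264]]);

// With the special token registered, the marker collapses to one id:
toyTokenize("<|im_start|>system", special);
// → [100264, "s", "y", "s", "t", "e", "m"]

// Without it, the marker is tokenized as plain text and splits apart:
toyTokenize("<|im_start|>system", new Map());
```

This is why the playground (which registers the ChatML markers) and a default `encoding_for_model` call produce different token arrays for the same string.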


Evertt commented Jul 8, 2023

Hi @dqbd, I don't know why, but instantiating Tiktoken with an object that maps new special tokens to integers actually makes my app crash.

This is the code I use:

import type { ChatCompletionRequestMessage } from "npm:[email protected]"
import tiktoken, { type TiktokenModel } from "npm:[email protected]"

const messages: ChatCompletionRequestMessage[] = [
  { role: "system", content: "You are a helpful assistant" },
  { role: "user", content: "Hey, can you see my name?", name: "Evert" },
  { role: "assistant", content: "Yes I can, your name is Evert." },
]

// This is just a utility type to extract all the gpt-3.5 and gpt-4 models from the `TiktokenModel` union
type ExtractGPT<T extends string> = T extends `gpt-${infer R}` ? `gpt-${R}` : never
type GPTModel = ExtractGPT<TiktokenModel>

function getChatGPTEncoding(
  messages: ChatCompletionRequestMessage[],
  model: GPTModel
) {
  const isGpt3 = model.startsWith("gpt-3.5")

  const msgSep = isGpt3 ? "\n" : ""
  const roleSep = isGpt3 ? "\n" : "<|im_sep|>"

  return [
    messages
      .map(({ name, role, content = "" }) => {
        return `<|im_start|>${name || role}${roleSep}${content}<|im_end|>`
      })
      .join(msgSep),
    `<|im_start|>assistant${roleSep}`,
  ].join(msgSep)
}

const input = getChatGPTEncoding(messages, "gpt-3.5-turbo")

// See? I added the special tokens just like you recommended
const tik = tiktoken.encoding_for_model("gpt-3.5-turbo", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
  "<|im_sep|>": 100266,
})

const encoded = tik.encode(input)
const decoded = tik.decode(encoded)

const string = new TextDecoder().decode(decoded)
console.log("input:", input)
console.log("encoded:", encoded)
console.log("decoded:", string)
tik.free()
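For reference, here is what the formatting helper above produces for the three sample messages (a standalone re-run of just the string logic, no tokenizer involved; note that the second message's `name` replaces the `user` role in its header):

```typescript
// Re-running the string-formatting logic of getChatGPTEncoding for
// gpt-3.5-turbo, where both msgSep and roleSep are "\n".
type Msg = { role: string; content: string; name?: string };

const msgs: Msg[] = [
  { role: "system", content: "You are a helpful assistant" },
  { role: "user", content: "Hey, can you see my name?", name: "Evert" },
  { role: "assistant", content: "Yes I can, your name is Evert." },
];

const body = msgs
  .map((m) => `<|im_start|>${m.name ?? m.role}\n${m.content}<|im_end|>`)
  .join("\n");
const prompt = [body, "<|im_start|>assistant\n"].join("\n");

console.log(prompt);
// <|im_start|>system
// You are a helpful assistant<|im_end|>
// <|im_start|>Evert
// Hey, can you see my name?<|im_end|>
// <|im_start|>assistant
// Yes I can, your name is Evert.<|im_end|>
// <|im_start|>assistant
```

Every `<|im_start|>` / `<|im_end|>` marker in this string is exactly what the extra special-token map is supposed to turn into ids 100264 and 100265.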

It used to work before, in the sense that it didn't crash with an error; I just got different integers than the playground gets. (I assume the token count would probably still be the same?)

But since I've added that object mapping the extra special tokens, when I run the code I get this error:

Uncaught (in promise) Error: The text contains a special token that is not allowed: <|im_start|>

Edit

Oh, I just realized: I need to call tik.encode(input, "all", []) instead of tik.encode(input), so that the special tokens are allowed in the input.


dqbd commented Aug 7, 2023

Closing this issue for now, assuming the discussion follows here: dqbd/tiktoken#65
