
Why does the tiktokenizer demo output completely different token numbers than the actual npm package? #12

Closed
Evertt opened this issue Jul 3, 2023 · 3 comments


Evertt commented Jul 3, 2023

So in the tiktokenizer demo, the textarea box looks like this (I'm using the gpt-3.5-turbo model):

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Yes, please help<|im_end|>
<|im_start|>assistant

And the token array looks like this:

[100264, 9125, 198, 2675, 527, 264, 11190, 18328, 100265, 198, 100264, 882, 198, 9642, 11, 4587, 1520, 100265, 198, 100264, 78191, 198]

But when I use the exact same text in my JavaScript file:

[screenshot: the same prompt tokenized in a JavaScript file, with the resulting tokens shown in a terminal panel on the right]

I get completely different tokens, as you can see in the terminal panel on the right side.
And no, it's not because you see triangles in the string in my screenshot; that's just because of the font I'm using.

Can you explain what's going on here?


dqbd commented Jul 4, 2023

Hi @Evertt!

The reason you are seeing different tokens is the handling of special tokens: by default, no special tokens are registered with the tokeniser.

To get the same token counts, you might want to do something like this:

// Register the ChatML markers as extra special tokens for chat models:
if (model.startsWith("gpt-4") || model.startsWith("gpt-3.5-turbo")) {
  return encoding_for_model(model, {
    "<|im_start|>": 100264,
    "<|im_end|>": 100265,
    "<|im_sep|>": 100266,
  });
}
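As a toy illustration (not the real BPE algorithm), the key difference is that a registered special token is matched as one opaque unit before ordinary tokenization runs, whereas unregistered marker text is broken into normal sub-word pieces:

```typescript
// Toy sketch only -- NOT the real BPE tokenizer. Ordinary text is
// stood in for by single characters; the point is that a registered
// special token collapses to one id, while unregistered marker text
// gets chopped up like any other string.
function toyTokenize(
  text: string,
  specialTokens: Map<string, number>
): (number | string)[] {
  const out: (number | string)[] = [];
  let i = 0;
  outer: while (i < text.length) {
    for (const [tok, id] of specialTokens) {
      if (text.startsWith(tok, i)) {
        out.push(id); // one token id for the whole marker
        i += tok.length;
        continue outer;
      }
    }
    out.push(text[i]); // stand-in for ordinary BPE merges
    i++;
  }
  return out;
}

const special = new Map([["<|im_start|>", 100264]]);

// With the special token registered, the marker collapses to one id:
toyTokenize("<|im_start|>system", special);
// → [100264, "s", "y", "s", "t", "e", "m"]

// Without it, the marker is tokenized as plain text and splits apart:
toyTokenize("<|im_start|>system", new Map());
```

This is why the playground (which registers the ChatML markers) and a default `encoding_for_model` call produce different token arrays for the same string.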


Evertt commented Jul 8, 2023

Hi @dqbd, I don't know why, but instantiating Tiktoken with an object that maps new special tokens to integers actually makes my app crash.

This is the code I use:

import type { ChatCompletionRequestMessage } from "npm:[email protected]"
import tiktoken, { type TiktokenModel } from "npm:[email protected]"

const messages: ChatCompletionRequestMessage[] = [
  { role: "system", content: "You are a helpful assistant" },
  { role: "user", content: "Hey, can you see my name?", name: "Evert" },
  { role: "assistant", content: "Yes I can, your name is Evert." },
]

// This is just a utility type to extract all the gpt-3.5 and gpt-4 models from the `TiktokenModel` union
type ExtractGPT<T extends string> = T extends `gpt-${infer R}` ? `gpt-${R}` : never
type GPTModel = ExtractGPT<TiktokenModel>

function getChatGPTEncoding(
  messages: ChatCompletionRequestMessage[],
  model: GPTModel
) {
  const isGpt3 = model.startsWith("gpt-3.5")

  const msgSep = isGpt3 ? "\n" : ""
  const roleSep = isGpt3 ? "\n" : "<|im_sep|>"

  return [
    messages
      .map(({ name, role, content = "" }) => {
        return `<|im_start|>${name || role}${roleSep}${content}<|im_end|>`
      })
      .join(msgSep),
    `<|im_start|>assistant${roleSep}`,
  ].join(msgSep)
}

const input = getChatGPTEncoding(messages, "gpt-3.5-turbo")

// See? I added the special tokens just like you recommended
const tik = tiktoken.encoding_for_model("gpt-3.5-turbo", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
  "<|im_sep|>": 100266,
})

const encoded = tik.encode(input)
const decoded = tik.decode(encoded)

const string = new TextDecoder().decode(decoded)
console.log("input:", input)
console.log("encoded:", encoded)
console.log("decoded:", string)
tik.free()
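For reference, here is what the formatting helper above produces for the three sample messages (a standalone re-run of just the string logic, no tokenizer involved; note that the second message's `name` replaces the `user` role in its header):

```typescript
// Re-running the string-formatting logic of getChatGPTEncoding for
// gpt-3.5-turbo, where both msgSep and roleSep are "\n".
type Msg = { role: string; content: string; name?: string };

const msgs: Msg[] = [
  { role: "system", content: "You are a helpful assistant" },
  { role: "user", content: "Hey, can you see my name?", name: "Evert" },
  { role: "assistant", content: "Yes I can, your name is Evert." },
];

const body = msgs
  .map((m) => `<|im_start|>${m.name ?? m.role}\n${m.content}<|im_end|>`)
  .join("\n");
const prompt = [body, "<|im_start|>assistant\n"].join("\n");

console.log(prompt);
// <|im_start|>system
// You are a helpful assistant<|im_end|>
// <|im_start|>Evert
// Hey, can you see my name?<|im_end|>
// <|im_start|>assistant
// Yes I can, your name is Evert.<|im_end|>
// <|im_start|>assistant
```

Every `<|im_start|>` / `<|im_end|>` marker in this string is exactly what the extra special-token map is supposed to turn into ids 100264 and 100265.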

It used to work before, in the sense that it didn't crash with an error; I just got different integers than the playground gets. (I assume the token count would probably still be the same?)

But since I've added that object mapping the extra special tokens, when I run the code I get this error:

Uncaught (in promise) Error: The text contains a special token that is not allowed: <|im_start|>

Edit

Oh, I just realized: I need to call tik.encode(input, "all", []) instead of tik.encode(input), so that the special tokens are allowed in the input.


dqbd commented Aug 7, 2023

Closing this issue for now, assuming the discussion follows here: dqbd/tiktoken#65
