Why does the tiktokenizer demo output completely different token numbers than the actual npm package? #12
Hi @Evertt! The reason you are seeing different tokens is the behaviour of special tokens. By default, no special tokens are added to the tokeniser. To get the same token counts, you might want to do something like this:
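Something along these lines (a minimal sketch against the tiktoken npm package's API, not necessarily the exact snippet from the original comment): register the ChatML markers as extended special tokens when creating the encoder, then explicitly allow them at encode time.

```ts
import { encoding_for_model } from "npm:tiktoken"

// Register the ChatML markers as extended special tokens with their
// ids from the cl100k_base vocabulary.
const enc = encoding_for_model("gpt-3.5-turbo", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
  "<|im_sep|>": 100266,
})

// encode() disallows special tokens by default and throws when it
// meets one; pass "all" to allow every registered special token.
const tokens = enc.encode("<|im_start|>user\nHello<|im_end|>", "all")
console.log(tokens)

enc.free() // release the WASM-backed encoder
```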
Hi @dqbd, I don't know why, but instantiating Tiktoken with an object that maps new special tokens to integers actually makes my app crash. This is the code I use:

```ts
import type { ChatCompletionRequestMessage } from "npm:[email protected]"
import tiktoken, { type TiktokenModel } from "npm:[email protected]"
const messages: ChatCompletionRequestMessage[] = [
{ role: "system", content: "You are a helpful assistant" },
{ role: "user", content: "Hey, can you see my name?", name: "Evert" },
{ role: "assistant", content: "Yes I can, your name is Evert." },
]
// This is just a utility type to extract all the gpt-3.5 and gpt-4 models from the `TiktokenModel` union
type ExtractGPT<T extends string> = T extends `gpt-${infer R}` ? `gpt-${R}` : never
type GPTModel = ExtractGPT<TiktokenModel>
function getChatGPTEncoding(
messages: ChatCompletionRequestMessage[],
model: GPTModel
) {
const isGpt3 = model.startsWith("gpt-3.5")
const msgSep = isGpt3 ? "\n" : ""
const roleSep = isGpt3 ? "\n" : "<|im_sep|>"
return [
messages
.map(({ name, role, content = "" }) => {
return `<|im_start|>${name || role}${roleSep}${content}<|im_end|>`
})
.join(msgSep),
`<|im_start|>assistant${roleSep}`,
].join(msgSep)
}
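// For the sample messages above with "gpt-3.5-turbo", this serialises
// the chat into a ChatML-style string, roughly:
//   <|im_start|>system\nYou are a helpful assistant<|im_end|>\n
//   <|im_start|>Evert\nHey, can you see my name?<|im_end|>\n
//   <|im_start|>assistant\nYes I can, your name is Evert.<|im_end|>\n
//   <|im_start|>assistant\n
// (the optional `name` field replaces the role when present)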
const input = getChatGPTEncoding(messages, "gpt-3.5-turbo")
// See? I added the special tokens just like you recommended
const tik = tiktoken.encoding_for_model("gpt-3.5-turbo", {
"<|im_start|>": 100264,
"<|im_end|>": 100265,
"<|im_sep|>": 100266,
})
const encoded = tik.encode(input)
const decoded = tik.decode(encoded)
const string = new TextDecoder().decode(decoded)
console.log("input:", input)
console.log("encoded:", encoded)
console.log("decoded:", string)
tik.free()
```

It used to work, as in it didn't crash with an error. I just got different integers than the playground gets, but I assume the token count would probably still be the same? Since I've added that object to map the extra special tokens, though, when I run the code I get this error:
Edit: Oh, I just realized, I then need to call `encode` with the special tokens explicitly allowed.
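With the tiktoken package's `encode` signature, allowing the special tokens would look something like this (a minimal sketch; instead of `"all"` you can also pass an explicit list such as `["<|im_start|>"]`):

```ts
// The second argument names the special tokens that are allowed to
// appear in the input; without it, encode() throws as soon as it
// encounters <|im_start|> etc.
const encoded = tik.encode(input, "all")
```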
Closing this issue for now, assuming the discussion follows here: dqbd/tiktoken#65
So in the tiktokenizer demo, the textarea box looks like this (I'm using the gpt-3.5-turbo model):

[screenshot of the tiktokenizer textarea]

And the token array looks like this:

[screenshot of the token array]
But when I use the exact same text in my JavaScript file:

[screenshot of the code and terminal output]

I get completely different tokens, as you can see in the terminal panel on the right side.
And no, it's not because you see triangles in the string in my screenshot; that's just because of the font I'm using.
Can you explain what's going on here?