
When I look at cl100k_base.json, I don't see the special tokens like <|im_start|>, but when I add them manually I get an error from the wasm #65

Open
Evertt opened this issue Aug 5, 2023 · 2 comments

Evertt commented Aug 5, 2023

Using:

```typescript
const tik = tiktoken.encoding_for_model("gpt-3.5-turbo", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
  "<|im_sep|>": 100266,
});
```

And then trying to encode this string:

```
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>Evert
Hey, can you see my name?<|im_end|>
<|im_start|>assistant
Yes I can, your name is Evert.<|im_end|>
<|im_start|>assistant
```

I get the following error:

```
error: Uncaught Error: The text contains a special token that is not allowed: <|im_start|>
    at module.exports.__wbindgen_error_new (file:///Users/evert/Library/Caches/deno/npm/registry.npmjs.org/tiktoken/1.0.10/tiktoken_bg.cjs:418:17)
    at <anonymous> (wasm://wasm/00b9ab1a:1:100710)
    at <anonymous> (wasm://wasm/00b9ab1a:1:314157)
    at Tiktoken.encode (file:///Users/evert/Library/Caches/deno/npm/registry.npmjs.org/tiktoken/1.0.10/tiktoken_bg.cjs:270:18)
    at file:///Users/evert/Sites/chat-nvc-telegram-bot/src/tokenizer.ts:42:21
```

You'd think that might imply the wasm already contains those special tokens. However, when I encode the same string without adding the special tokens manually, the output is this:

```
encoded: Uint32Array(72) [
    27,   91,   318,  5011,    91,   29,  9125,   198,
  2675,  527,   264, 11190, 18328,   27,    91,   318,
  6345,   91,   397,    27,    91,  318,  5011,    91,
    29,   36,  1653,   198, 19182,   11,   649,   499,
  1518,  856,   836, 76514,    91,  318,  6345,    91,
   397,   27,    91,   318,  5011,   91,    29, 78191,
   198, 9642,   358,   649,    11,  701,   836,   374,
   469, 1653, 16134,    91,   318, 6345,    91,   397,
    27,   91,   318,  5011,    91,   29, 78191,   198
]
```

This doesn't line up with what I get in the tiktokenizer demo on Vercel:

[screenshot from the tiktokenizer demo]

Evertt commented Aug 5, 2023

Never mind, apparently I had to call `const encoded = tik.encode(input, "all")` instead of `const encoded = tik.encode(input)`.

I'm still a little confused about why I need to do that, though, and whether `"all"` is really the best option. Should I use `"all"`, or should I use `["<|im_start|>", "<|im_end|>", "<|im_sep|>"]`?
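For anyone else hitting this, the gating can be sketched as follows. This is a toy stand-in based on the error above, assuming semantics mirroring openai/tiktoken (not the actual tiktoken internals): a special token in the input is rejected unless it is in the allowed set, and `"all"` allows every registered special token.

```typescript
// Toy sketch of the allowed-special gating (assumed semantics inferred from
// the error in this issue; NOT the actual tiktoken implementation).
const SPECIAL_TOKENS: Record<string, number> = {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
  "<|im_sep|>": 100266,
};

// Returns the IDs of the special tokens found in `text`, throwing if one of
// them is not in the allowed set — mirroring the "not allowed" error above.
function specialIdsIn(text: string, allowed: "all" | string[] = []): number[] {
  const allowedSet = new Set(
    allowed === "all" ? Object.keys(SPECIAL_TOKENS) : allowed,
  );
  const ids: number[] = [];
  for (const [token, id] of Object.entries(SPECIAL_TOKENS)) {
    if (!text.includes(token)) continue;
    if (!allowedSet.has(token)) {
      throw new Error(
        `The text contains a special token that is not allowed: ${token}`,
      );
    }
    ids.push(id);
  }
  return ids;
}

// Like encode(input): no specials allowed, so "<|im_start|>" throws.
// Like encode(input, "all"): every registered special token is permitted.
console.log(specialIdsIn("<|im_start|>system", "all")); // [ 100264 ]
```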

dqbd (owner) commented Aug 7, 2023

Hi @Evertt!

The reason why `"all"` is opt-in here is to mimic the behaviour of the official openai/tiktoken library, where it is assumed that we're encoding user input directly.

`"all"` will include the other special tokens as well, which may not be desired depending on the input being received. Namely:

```
"<|endoftext|>": 100257,
"<|fim_prefix|>": 100258,
"<|fim_middle|>": 100259,
"<|fim_suffix|>": 100260,
"<|endofprompt|>": 100276
```
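The trade-off can be illustrated with a toy check (assumed semantics, not the tiktoken internals): under `"all"`, user input that happens to contain e.g. `<|endoftext|>` would be encoded as the reserved ID 100257, whereas an explicit chat-only allow-list flags it instead.

```typescript
// Toy illustration of "all" vs. an explicit allow-list (assumed semantics,
// NOT tiktoken internals). Token lists come from this thread.
const CHAT_SPECIALS = ["<|im_start|>", "<|im_end|>", "<|im_sep|>"];
const OTHER_SPECIALS = [
  "<|endoftext|>",
  "<|fim_prefix|>",
  "<|fim_middle|>",
  "<|fim_suffix|>",
  "<|endofprompt|>",
];

// Returns the first special token in `text` that the allow-list rejects,
// or null if the text is safe to encode under that list.
function firstDisallowed(
  text: string,
  allowed: "all" | string[],
): string | null {
  const all = [...CHAT_SPECIALS, ...OTHER_SPECIALS];
  const allowedSet = new Set(allowed === "all" ? all : allowed);
  for (const token of all) {
    if (text.includes(token) && !allowedSet.has(token)) return token;
  }
  return null;
}

// "all" lets a user smuggle "<|endoftext|>" through as a real special token:
console.log(firstDisallowed("hi <|endoftext|>", "all")); // null
// An explicit chat-only list flags it instead:
console.log(firstDisallowed("hi <|endoftext|>", CHAT_SPECIALS)); // <|endoftext|>
```

So passing `["<|im_start|>", "<|im_end|>", "<|im_sep|>"]` is the stricter choice when the encoded text can contain untrusted user input.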
