
How can I get the correct output at the edge? #15

Closed

gablabelle opened this issue Aug 19, 2023 · 1 comment

gablabelle commented Aug 19, 2023

Using issues for a question... Sorry about that.

When using the following code, the output doesn't match the output from the online Tiktokenizer, and it has a different length:

      import model from "tiktoken/encoders/cl100k_base";
      import { init, Tiktoken } from "tiktoken/lite/init";
      // @ts-expect-error
      import wasm from "tiktoken/lite/tiktoken_bg.wasm?module";

      export const runtime = "edge";
      // ...

      // Initialize the WASM module for the edge runtime
      await init((imports) => WebAssembly.instantiate(wasm, imports));
      // Build the ChatML-formatted prompt string from the chat messages
      const inputText = getChatGPTEncoding(messages, "gpt-3.5-turbo");
      const encoding = new Tiktoken(
        model.bpe_ranks,
        model.special_tokens,
        model.pat_str
      );
      const tokens = encoding.encode(inputText);
      encoding.free();
      return new Response(`${tokens}`);
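(getChatGPTEncoding isn't shown above; a minimal sketch of what such a helper might look like, modeled on the ChatML string below — the Message type and the function body are illustrative assumptions, not code from this issue:)

      // Hypothetical helper: joins messages into a ChatML string and
      // appends the assistant prompt prefix, as in the input below.
      type Message = { role: "system" | "user" | "assistant"; content: string };

      function getChatGPTEncoding(messages: Message[], model: "gpt-3.5-turbo") {
        // gpt-3.5-turbo separates role and content with "\n"
        return [
          ...messages.map(
            ({ role, content }) => `<|im_start|>${role}\n${content}<|im_end|>`
          ),
          "<|im_start|>assistant\n",
        ].join("\n");
      }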

For the following input (the value saved into the inputText variable):

<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?<|im_end|>\n<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n

I get the following tokens for gpt-3.5-turbo at https://tiktokenizer.vercel.app/:

[100264, 882, 1734, 15339, 100265, 1734, 100264, 78191, 1734, 9906, 0, 2650, 649, 358, 7945, 499, 3432, 30, 100265, 1734, 100264, 882, 1734, 13347, 100265, 1734, 100264, 78191, 1734]

But when running the code above, I get the following tokens:

[27,91,318,5011,91,29,882,198,15339,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198,9906,0,2650,649,358,7945,499,3432,76514,91,318,6345,91,397,27,91,318,5011,91,29,882,198,13347,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198]
gablabelle changed the title from "How can I get encoding_for_model at the edge?" to "How can I get the correct output at the edge?" on Aug 21, 2023
gablabelle (Author) commented Aug 23, 2023

After reading a few related issues (#12 and dqbd/tiktoken#65), here is what worked:

      const encoding = new Tiktoken(
        model.bpe_ranks,
        {
          // Register the ChatML markers, which are not part of the
          // special tokens bundled with the base cl100k_base encoder
          ...model.special_tokens,
          "<|im_start|>": 100264,
          "<|im_end|>": 100265,
          "<|im_sep|>": 100266,
        },
        model.pat_str
      );
      // Pass "all" so every registered special token is allowed in the
      // input and encoded to its single token id
      const tokens = encoding.encode(inputText, "all");
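This works because the special_tokens shipped with cl100k_base don't include the ChatML markers; without registering them, the encoder BPE-encodes the marker characters as plain text, which is where the 27 ("<"), 91 ("|"), 318, 5011, 91, 29 run in my output above comes from. A minimal sketch contrasting the two constructions (assuming the same model import as above):

      // Without the extra entries the marker is treated as plain text:
      const plain = new Tiktoken(model.bpe_ranks, model.special_tokens, model.pat_str);
      console.log(plain.encode("<|im_start|>")); // [27, 91, 318, 5011, 91, 29]
      plain.free();

      // With the extra entry (and "all") it maps to a single token id:
      const chat = new Tiktoken(
        model.bpe_ranks,
        { ...model.special_tokens, "<|im_start|>": 100264 },
        model.pat_str
      );
      console.log(chat.encode("<|im_start|>", "all")); // [100264]
      chat.free();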
