
How can I get the correct output at the edge? #15

Closed

gablabelle opened this issue Aug 19, 2023 · 1 comment

gablabelle commented Aug 19, 2023

Using issues for a question... Sorry about that.

When using the following code, the output doesn't match the output from the online Tiktokenizer, and it has a different length:

      import model from "tiktoken/encoders/cl100k_base";
      import { init, Tiktoken } from "tiktoken/lite/init";
      // @ts-expect-error
      import wasm from "tiktoken/lite/tiktoken_bg.wasm?module";

      export const runtime = "edge";
      // ...

      // Initialize the WASM module for the edge runtime
      await init((imports) => WebAssembly.instantiate(wasm, imports));
      // Build the ChatML-formatted prompt string from the chat messages
      const inputText = getChatGPTEncoding(messages, "gpt-3.5-turbo");
      const encoding = new Tiktoken(
        model.bpe_ranks,
        model.special_tokens,
        model.pat_str
      );
      const tokens = encoding.encode(inputText);
      encoding.free();
      return new Response(`${tokens}`);
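(getChatGPTEncoding isn't shown above; a minimal sketch of what such a helper might look like, modeled on the ChatML string below — the Message type and the function body are illustrative assumptions, not code from this issue:)

      // Hypothetical helper: joins messages into a ChatML string and
      // appends the assistant prompt prefix, as in the input below.
      type Message = { role: "system" | "user" | "assistant"; content: string };

      function getChatGPTEncoding(messages: Message[], model: "gpt-3.5-turbo") {
        // gpt-3.5-turbo separates role and content with "\n"
        return [
          ...messages.map(
            ({ role, content }) => `<|im_start|>${role}\n${content}<|im_end|>`
          ),
          "<|im_start|>assistant\n",
        ].join("\n");
      }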

For the following input (the value saved into the inputText variable):

<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?<|im_end|>\n<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n

I get the following tokens for gpt-3.5-turbo at https://tiktokenizer.vercel.app/:

[100264, 882, 1734, 15339, 100265, 1734, 100264, 78191, 1734, 9906, 0, 2650, 649, 358, 7945, 499, 3432, 30, 100265, 1734, 100264, 882, 1734, 13347, 100265, 1734, 100264, 78191, 1734]

But when running the code above, I get the following tokens:

[27,91,318,5011,91,29,882,198,15339,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198,9906,0,2650,649,358,7945,499,3432,76514,91,318,6345,91,397,27,91,318,5011,91,29,882,198,13347,27,91,318,6345,91,397,27,91,318,5011,91,29,78191,198]
gablabelle changed the title from "How can I get encoding_for_model at the edge?" to "How can I get the correct output at the edge?" on Aug 21, 2023
gablabelle (Author) commented Aug 23, 2023

After reading a few related issues (#12 and dqbd/tiktoken#65), here is what worked:

      const encoding = new Tiktoken(
        model.bpe_ranks,
        {
          // Register the ChatML markers, which are not part of the
          // special tokens bundled with the base cl100k_base encoder
          ...model.special_tokens,
          "<|im_start|>": 100264,
          "<|im_end|>": 100265,
          "<|im_sep|>": 100266,
        },
        model.pat_str
      );
      // Pass "all" so every registered special token is allowed in the
      // input and encoded to its single token id
      const tokens = encoding.encode(inputText, "all");
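This works because the special_tokens shipped with cl100k_base don't include the ChatML markers; without registering them, the encoder BPE-encodes the marker characters as plain text, which is where the 27 ("<"), 91 ("|"), 318, 5011, 91, 29 run in my output above comes from. A minimal sketch contrasting the two constructions (assuming the same model import as above):

      // Without the extra entries the marker is treated as plain text:
      const plain = new Tiktoken(model.bpe_ranks, model.special_tokens, model.pat_str);
      console.log(plain.encode("<|im_start|>")); // [27, 91, 318, 5011, 91, 29]
      plain.free();

      // With the extra entry (and "all") it maps to a single token id:
      const chat = new Tiktoken(
        model.bpe_ranks,
        { ...model.special_tokens, "<|im_start|>": 100264 },
        model.pat_str
      );
      console.log(chat.encode("<|im_start|>", "all")); // [100264]
      chat.free();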
