docs: ✏️ add tokenizers reference documentation
pelikhan committed Oct 23, 2024
1 parent d42dbb7 commit c5c424f
Showing 1 changed file with 58 additions and 0 deletions.
58 changes: 58 additions & 0 deletions docs/src/content/docs/reference/scripts/tokenizers.md
@@ -0,0 +1,58 @@
---
title: Tokenizers
description: Tokenizers are used to split text into tokens.
sidebar:
order: 60
---

The `tokenizers` helper module provides a set of functions to split text into tokens.

```ts
const n = await tokenizers.count("hello world")
```

## Choosing your tokenizer

By default, the `tokenizers` module uses the `large` tokenizer. You can change the tokenizer by passing a model identifier in the options.

```ts 'model: "gpt-4o-mini"'
const n = await tokenizers.count("hello world", { model: "gpt-4o-mini" })
```

## `count`

Counts the number of tokens in a string.

```ts wrap
const n = await tokenizers.count("hello world")
```

## `truncate`

Drops part of the string so that the result fits within a token budget.

```ts wrap
const truncated = await tokenizers.truncate("hello world", 5)
```
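If `truncate` accepts the same options as `count` (an assumption; the option shape is mirrored from the `count` example above), you can target a specific model's tokenizer for the budget:

```ts wrap
// assumption: truncate takes the same { model } option as count
const short = await tokenizers.truncate("hello world", 5, { model: "gpt-4o-mini" })
```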

## `chunk`

Splits the text into chunks of a given token size. The chunker tries to find
appropriate chunking boundaries based on the document type.

```ts
const chunks = await tokenizers.chunk(env.files[0])
for (const chunk of chunks) {
    // ... process each chunk
}
```

You can configure the chunk size and overlap, and add line numbers to each chunk.

```ts wrap
const chunks = await tokenizers.chunk(env.files[0], {
chunkSize: 128,
  chunkOverlap: 10,
lineNumbers: true
})
```
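
Combining `chunk` with `count` gives a quick sanity check that chunks stay within the configured budget. This is a minimal sketch, assuming each chunk exposes its text as a `content` field:

```ts wrap
const chunks = await tokenizers.chunk(env.files[0], { chunkSize: 128 })
for (const chunk of chunks) {
    // chunk.content is assumed to hold the chunk text
    const n = await tokenizers.count(chunk.content)
    console.log(`chunk tokens: ${n}`)
}
```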
