docs: ✏️ add tokenizers reference documentation
pelikhan committed Oct 23, 2024
1 parent d42dbb7 commit c5c424f
Showing 1 changed file with 58 additions and 0 deletions.
58 changes: 58 additions & 0 deletions docs/src/content/docs/reference/scripts/tokenizers.md
@@ -0,0 +1,58 @@
---
title: Tokenizers
description: Tokenizers are used to split text into tokens.
sidebar:
order: 60
---

The `tokenizers` helper module provides a set of functions to split text into tokens.

```ts
const n = await tokenizers.count("hello world")
```

## Choosing your tokenizer

By default, the `tokenizers` module uses the `large` tokenizer. You can change the tokenizer by passing a model identifier in the options.

```ts 'model: "gpt-4o-mini"'
const n = await tokenizers.count("hello world", { model: "gpt-4o-mini" })
```

## `count`

Counts the number of tokens in a string.

```ts wrap
const n = await tokenizers.count("hello world")
```

## `truncate`

Drops part of the string so that the result fits within a token budget.

```ts wrap
const truncated = await tokenizers.truncate("hello world", 5)
```
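If `truncate` accepts the same options as `count` (an assumption; the option shape is mirrored from the `count` example above), you can target a specific model's tokenizer for the budget:

```ts wrap
// assumption: truncate takes the same { model } option as count
const short = await tokenizers.truncate("hello world", 5, { model: "gpt-4o-mini" })
```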

## `chunk`

Splits the text into chunks of a given token size. The chunker tries to find
appropriate chunking boundaries based on the document type.

```ts
const chunks = await tokenizers.chunk(env.files[0])
for (const chunk of chunks) {
    // ... process each chunk
}
```

You can configure the chunk size and overlap, and add line numbers to each chunk.

```ts wrap
const chunks = await tokenizers.chunk(env.files[0], {
chunkSize: 128,
  chunkOverlap: 10,
lineNumbers: true
})
```
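
Combining `chunk` with `count` gives a quick sanity check that chunks stay within the configured budget. This is a minimal sketch, assuming each chunk exposes its text as a `content` field:

```ts wrap
const chunks = await tokenizers.chunk(env.files[0], { chunkSize: 128 })
for (const chunk of chunks) {
    // chunk.content is assumed to hold the chunk text
    const n = await tokenizers.count(chunk.content)
    console.log(`chunk tokens: ${n}`)
}
```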
