feat: Async tokenizer #86

Merged
merged 28 commits on Jun 18, 2024

Changes from 24 commits
Commits: 28
54b8c0d
feat: support async, wip
miri-bar Jun 16, 2024
23ecd60
feat: fix and add tests, examples, update readme
miri-bar Jun 16, 2024
a594d26
fix: poetry lock
miri-bar Jun 16, 2024
db4c4e4
fix: anyio -> aiofiles
miri-bar Jun 16, 2024
b6bd196
fix: try 3.8
miri-bar Jun 16, 2024
715961a
fix: remove 3.7 from tests
miri-bar Jun 16, 2024
db41c47
fix: poetry lock
miri-bar Jun 16, 2024
acbc002
fix: add 3.7 back
miri-bar Jun 16, 2024
1fa7d2f
fix: poetry lock
miri-bar Jun 16, 2024
63faa9f
fix: poetry.lock
asafgardin Jun 16, 2024
0b27027
ci: pipenv
asafgardin Jun 16, 2024
15366b8
fix: pipenv
asafgardin Jun 16, 2024
aa87f08
fix: pipenv
asafgardin Jun 16, 2024
172afce
fix: pyproject
asafgardin Jun 16, 2024
241fe6c
fix: lock
asafgardin Jun 16, 2024
abb40da
fix: version
asafgardin Jun 16, 2024
07b83b5
fix: Removed aiofiles
asafgardin Jun 16, 2024
18df1d6
ci: update python version,
miri-bar Jun 16, 2024
297bc04
Merge branch 'main' into async_tokenizer
miri-bar Jun 16, 2024
0e3ef22
fix: switch from aiofiles to anyio, remove redundant comments
miri-bar Jun 16, 2024
c0930ac
chore: poetry lock
miri-bar Jun 16, 2024
707b253
fix: disable initializing async classes directly, cr comments
miri-bar Jun 17, 2024
a4c976c
test: fix import
miri-bar Jun 17, 2024
4afa657
ci: add asyncio-mode to test workflow
miri-bar Jun 17, 2024
70c0e42
fix: to_thread -> run_in_executor
miri-bar Jun 17, 2024
fdbe9a8
ci: add asyncio
miri-bar Jun 17, 2024
9fe658f
fix: cr comments
miri-bar Jun 18, 2024
1f09ac0
fix: cr comments
miri-bar Jun 18, 2024
2 changes: 1 addition & 1 deletion .github/workflows/test.yaml
@@ -65,7 +65,7 @@ jobs:
poetry install --no-root --without dev
- name: Run Tests
run: |
-          poetry run pytest
+          poetry run pytest --asyncio-mode=auto
- name: Upload pytest test results
uses: actions/upload-artifact@v3
with:
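For context, `--asyncio-mode=auto` makes pytest-asyncio treat every `async def` test as an asyncio test, so the new async tests need no per-function `@pytest.mark.asyncio` marker. A hypothetical test showing the effect, assuming the async API documented in the README below:

```python
# test_async_example.py -- illustrative only, not a file from this PR.
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers


async def test_async_encode_decode_roundtrip():
    # Collected and run by pytest-asyncio because of --asyncio-mode=auto.
    tokenizer = Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
    token_ids = await tokenizer.encode("apple orange banana")
    decoded = await tokenizer.decode(token_ids)
    assert decoded  # the round trip produced text
```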
82 changes: 82 additions & 0 deletions README.md
@@ -35,6 +35,46 @@ poetry add ai21-tokenizer

### Tokenizer Creation

### Jamba Tokenizer

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
```

Another way would be to use our Jamba tokenizer directly:

```python
from ai21_tokenizer import JambaInstructTokenizer

model_path = "<Path to your vocabs file>"
tokenizer = JambaInstructTokenizer(model_path=model_path)
# Your code here
```

#### Async usage

```python
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

tokenizer = Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
```

Another way would be to use our async Jamba tokenizer's `create` class method:

```python
from ai21_tokenizer import AsyncJambaInstructTokenizer

model_path = "<Path to your vocabs file>"
tokenizer = AsyncJambaInstructTokenizer.create(model_path=model_path)
# Your code here
```

### J2 Tokenizer

```python
from ai21_tokenizer import Tokenizer

@@ -52,6 +92,26 @@ config = {} # "dictionary object of your config.json file"
tokenizer = JurassicTokenizer(model_path=model_path, config=config)
```

#### Async usage

```python
from ai21_tokenizer import Tokenizer

tokenizer = Tokenizer.get_async_tokenizer()
# Your code here
```

Another way would be to use our async Jurassic tokenizer's `create` class method:

```python
from ai21_tokenizer import AsyncJurassicTokenizer

model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {} # "dictionary object of your config.json file"
tokenizer = AsyncJurassicTokenizer.create(model_path=model_path, config=config)
# Your code here
```

### Functions

#### Encode and Decode
@@ -67,6 +127,18 @@ decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
```

#### Async

```python
# Assuming you have created an async tokenizer
text_to_encode = "apple orange banana"
encoded_text = await tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")

decoded_text = await tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
```

#### What if you want to convert your tokens to ids or vice versa?

```python
@@ -76,4 +148,14 @@ print(f"IDs correspond to Tokens: {tokens}")
ids = tokenizer.convert_tokens_to_ids(tokens)
```

#### Async

```python
# Assuming you have created an async tokenizer
tokens = await tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs corresponds to Tokens: {tokens}")

ids = await tokenizer.convert_tokens_to_ids(tokens)
```

**For more examples, please see our [examples](examples) folder.**
9 changes: 6 additions & 3 deletions ai21_tokenizer/__init__.py
@@ -1,6 +1,6 @@
-from ai21_tokenizer.base_tokenizer import BaseTokenizer
-from ai21_tokenizer.jamba_instruct_tokenizer import JambaInstructTokenizer
-from ai21_tokenizer.jurassic_tokenizer import JurassicTokenizer
+from ai21_tokenizer.base_tokenizer import BaseTokenizer, AsyncBaseTokenizer
+from ai21_tokenizer.jamba_instruct_tokenizer import JambaInstructTokenizer, AsyncJambaInstructTokenizer
+from ai21_tokenizer.jurassic_tokenizer import JurassicTokenizer, AsyncJurassicTokenizer
from ai21_tokenizer.tokenizer_factory import TokenizerFactory as Tokenizer, PreTrainedTokenizers
from .version import VERSION

@@ -9,8 +9,11 @@
 __all__ = [
     "Tokenizer",
     "JurassicTokenizer",
+    "AsyncJurassicTokenizer",
     "BaseTokenizer",
+    "AsyncBaseTokenizer",
     "__version__",
     "PreTrainedTokenizers",
     "JambaInstructTokenizer",
+    "AsyncJambaInstructTokenizer",
 ]
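
A quick, illustrative smoke test that the expanded public surface resolves after this change (not part of the diff):

```python
# Verifies the three new async exports are importable from the package root.
from ai21_tokenizer import (
    AsyncBaseTokenizer,
    AsyncJambaInstructTokenizer,
    AsyncJurassicTokenizer,
)

print(AsyncBaseTokenizer, AsyncJambaInstructTokenizer, AsyncJurassicTokenizer)
```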
48 changes: 48 additions & 0 deletions ai21_tokenizer/base_jamba_instruct_tokenizer.py
@@ -0,0 +1,48 @@
from __future__ import annotations

import os
import tempfile
from pathlib import Path
from typing import List, Union, Optional
from abc import ABC, abstractmethod

from tokenizers import Tokenizer

from ai21_tokenizer.file_utils import PathLike

_TOKENIZER_FILE = "tokenizer.json"
_DEFAULT_MODEL_CACHE_DIR = Path(tempfile.gettempdir()) / "jamba_instruct"


class BaseJambaInstructTokenizer(ABC):
_tokenizer: Optional[Tokenizer] = None

@abstractmethod
def _load_from_cache(self, cache_file: Path) -> Tokenizer:
pass

def _is_cached(self, cache_dir: PathLike) -> bool:
return Path(cache_dir).exists() and _TOKENIZER_FILE in os.listdir(cache_dir)

def _cache_tokenizer(self, tokenizer: Tokenizer, cache_dir: PathLike) -> None:
# create cache directory for caching the tokenizer and save it
Path(cache_dir).mkdir(parents=True, exist_ok=True)
tokenizer.save(str(Path(cache_dir) / _TOKENIZER_FILE))

def _encode(self, text: str, **kwargs) -> List[int]:
return self._tokenizer.encode(text, **kwargs).ids

def _decode(self, token_ids: List[int], **kwargs) -> str:
return self._tokenizer.decode(token_ids, **kwargs)

def _convert_tokens_to_ids(self, tokens: Union[str, List[str]]) -> Union[int, List[int]]:
if isinstance(tokens, str):
return self._tokenizer.token_to_id(tokens)

return [self._tokenizer.token_to_id(token) for token in tokens]

def _convert_ids_to_tokens(self, token_ids: Union[int, List[int]]) -> Union[str, List[str]]:
if isinstance(token_ids, int):
return self._tokenizer.id_to_token(token_ids)

return [self._tokenizer.id_to_token(token_id) for token_id in token_ids]
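
To make the intent of this base class concrete: a subclass supplies `_load_from_cache` (the async variant can push the blocking file read onto an executor, matching the run_in_executor commit above) and wires the caching helpers together. A minimal sketch reusing the module's own names; the subclass itself is illustrative, not the PR's actual `JambaInstructTokenizer`:

```python
# Illustrative subclass of BaseJambaInstructTokenizer, assumed to live in
# this module; it is not the PR's real implementation.
class SketchJambaInstructTokenizer(BaseJambaInstructTokenizer):
    def __init__(self, model_path: PathLike, cache_dir: PathLike = _DEFAULT_MODEL_CACHE_DIR):
        if self._is_cached(cache_dir):
            self._tokenizer = self._load_from_cache(Path(cache_dir) / _TOKENIZER_FILE)
        else:
            # First run: load from the given tokenizer file, then cache it.
            self._tokenizer = Tokenizer.from_file(str(model_path))
            self._cache_tokenizer(self._tokenizer, cache_dir)

    def _load_from_cache(self, cache_file: Path) -> Tokenizer:
        # Synchronous read; an async subclass would run this in an executor.
        return Tokenizer.from_file(str(cache_file))

    def encode(self, text: str, **kwargs) -> List[int]:
        return self._encode(text, **kwargs)

    def decode(self, token_ids: List[int], **kwargs) -> str:
        return self._decode(token_ids, **kwargs)
```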