This module provides NLP support through an implementation of HuggingFace tokenizers.
It is built on the HuggingFace tokenizers Rust API.
The latest javadocs can be found here.
You can also build the latest javadocs locally using the following command:
./gradlew javadoc
The javadocs output is generated in the build/doc/javadoc folder.
You can pull the module from the central Maven repository by including the following dependency in your pom.xml file:
<dependency>
    <groupId>ai.djl.huggingface</groupId>
    <artifactId>tokenizers</artifactId>
    <version>0.31.0</version>
</dependency>
To convert a complete HuggingFace (transformers) model for use in Java, you can use our all-in-one conversion solution.
Currently, this converter supports the following tasks:
- fill-mask
- question-answering
- sentence-similarity
- text-classification
- token-classification
You can install a release version of djl-converter, install it from the djl master branch, or clone the repository and install from source:
# install release version of djl-converter
pip install https://publish.djl.ai/djl_converter/djl_converter-0.30.0-py3-none-any.whl
# install from djl master branch
pip install "git+https://github.com/deepjavalibrary/djl.git#subdirectory=extensions/tokenizers/src/main/python"
# install djl-converter from local djl repo
git clone https://github.com/deepjavalibrary/djl.git
cd djl/extensions/tokenizers/src/main/python
python3 -m pip install -e .
# install optimum if you want to convert to OnnxRuntime
pip install optimum
# convert a single model to TorchScript, OnnxRuntime or Rust
djl-convert --help
# import models as DJL Model Zoo
djl-import --help
djl-convert -m deepset/bert-base-cased-squad2
The converted model is written to the model/bert-base-cased-squad2/ folder. The -f option selects the output format:
djl-convert -m deepset/bert-base-cased-squad2
djl-convert -m deepset/bert-base-cased-squad2 -f OnnxRuntime
djl-convert -m deepset/bert-base-cased-squad2 -f Rust
Then all you need to do is load this model in DJL:
Criteria<QAInput, String> criteria = Criteria.builder()
        .setTypes(QAInput.class, String.class)
        .optModelPath(Paths.get("model/bert-base-cased-squad2/"))
        .optTranslatorFactory(new DeferredTranslatorFactory())
        .optProgress(new ProgressBar())
        .build();
djl-import -m deepset/bert-base-cased-squad2
This will generate a zip file in your local djl model zoo folder structure:
model/nlp/question_answer/ai/djl/huggingface/pytorch/deepset/bert-base-cased-squad2/0.0.1/bert-base-cased-squad2.zip
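The path above appears to follow a fixed pattern. As a minimal sketch, assuming the layout seen in this single example generalizes (an application folder such as nlp/question_answer, the group folder ai/djl/huggingface/pytorch, and a version folder such as 0.0.1), the helper below rebuilds the zip location from a HuggingFace model id. `ZooPaths` and `zipPath` are hypothetical names used for illustration and are not part of DJL.

```java
final class ZooPaths {
    private ZooPaths() {}

    /**
     * Builds the expected zip path for a model id such as
     * "deepset/bert-base-cased-squad2". The folder layout is an assumption
     * inferred from the one example path shown in this document.
     */
    static String zipPath(String application, String modelId, String version) {
        // The artifact name is the last segment of the model id.
        String name = modelId.substring(modelId.lastIndexOf('/') + 1);
        return String.join(
                "/",
                "model",
                application,
                "ai/djl/huggingface/pytorch",
                modelId,
                version,
                name + ".zip");
    }
}
```

Calling `ZooPaths.zipPath("nlp/question_answer", "deepset/bert-base-cased-squad2", "0.0.1")` reproduces the path shown above, which can be handy when checking that an import landed where expected.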
In most cases, you can easily use a pre-existing tokenizer in DJL:
Python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-dot-v5")
Java
HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("sentence-transformers/msmarco-distilbert-dot-v5");
This approach requires a network connection to the HuggingFace repository.
To determine whether you can use it, open the "Files and versions" tab of the model page on HuggingFace and check whether it contains a tokenizer.json file.
If it does, you can load the tokenizer directly through DJL. Otherwise, use one of the approaches below to obtain a tokenizer.json file.
If you are trying to get the tokenizer from a HuggingFace pipeline, you can use the following to extract the tokenizer.json file.
Python
pipeline.tokenizer.save_pretrained("./")
You will find a tokenizer.json file in your local directory.
Java
HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance(Paths.get("./tokenizer.json"));
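Before handing the path to DJL as in the Java line above, it can help to confirm that the export actually produced the file. The following is a minimal sketch that uses only java.nio; `TokenizerFiles` and `find` are hypothetical helper names, not part of DJL, and the DJL call itself appears only as a comment.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

final class TokenizerFiles {
    private TokenizerFiles() {}

    /**
     * Returns the tokenizer.json inside the given directory, or an empty
     * Optional if the export did not create one.
     */
    static Optional<Path> find(Path dir) {
        Path file = dir.resolve("tokenizer.json");
        // Only accept a regular file, not a directory with the same name.
        return Files.isRegularFile(file) ? Optional.of(file) : Optional.empty();
    }

    // A returned path can then be passed on, for example:
    // HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance(path);
}
```

An empty result suggests the tokenizer was saved in a format without a tokenizer.json, in which case the file has to be obtained another way.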
As in the step above, save your tokenizer to tokenizer.json (this is done by HuggingFace).