token type ids can be set by optional argument up to python wrapper #1418

hachall · 2023-08-14T11:46:39Z

Description

This PR adds an optional token_type_ids argument to Encoder.forward_batch, to support BERT models that take multiple input segments (sentence-pair entailment, context-and-question answering, etc.). The token_type_ids vector is optional at the pybind level and is handled by overloads of the forward_batch_async functions; all previous endpoints are preserved for backwards compatibility.

Closes #1383.

...
output = ct2_model.forward_batch(input_ids, token_type_ids=token_type_ids)
pooler_output = output.pooler_output
pooler_output = np.array(pooler_output)
pooler_output = torch.as_tensor(pooler_output)
ct_logits = classifier(pooler_output)
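
For readers unfamiliar with token type ids, the sketch below shows one common way to build them for a BERT-style sentence pair: 0 for the first segment (including [CLS] and the first [SEP]) and 1 for the second segment and its closing [SEP]. The helper name build_pair_inputs and the example tokens are hypothetical and not part of this PR; in practice a tokenizer would normally produce these values.

```python
from typing import List, Optional, Sequence, Tuple


def build_pair_inputs(
    tokens_a: Sequence[str],
    tokens_b: Optional[Sequence[str]] = None,
    cls: str = "[CLS]",
    sep: str = "[SEP]",
) -> Tuple[List[str], List[int]]:
    """Hypothetical helper: return (tokens, token_type_ids) for one example.

    Follows the usual BERT layout: [CLS] A [SEP] B [SEP].
    """
    # First segment, including [CLS] and the first [SEP], is typed 0.
    tokens = [cls] + list(tokens_a) + [sep]
    token_type_ids = [0] * len(tokens)

    if tokens_b is not None:
        # Second segment and its closing [SEP] are typed 1.
        second = list(tokens_b) + [sep]
        tokens += second
        token_type_ids += [1] * len(second)

    return tokens, token_type_ids


# Illustrative sentence pair (made-up tokens).
tokens, type_ids = build_pair_inputs(
    ["the", "cat", "sat"], ["a", "cat", "sat", "down"]
)
```

One list of type ids per example, aligned position by position with the input tokens, is the shape assumed here for the token_type_ids argument.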

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Docs update
Refactor

Testing

The code passes all C++ and Python tests. For validation, the modified encoder was tested with both bert-base-uncased and a fine-tuned bert-tiny, running inference on the train split of the MRPC dataset (sentence pairs). At full precision, all output logits were verified equal (<10e-12) to those of the same model loaded with Hugging Face, when passing token_type_ids to both. When quantized, the logits are no longer equal, but passing token_type_ids demonstrably improves classifier accuracy.
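
The equality check described above could be sketched roughly as follows. This is not the script used for the PR: the helper max_abs_diff is hypothetical, and the logit values are made up purely to illustrate the comparison against the 10e-12 threshold.

```python
from typing import Sequence

TOLERANCE = 10e-12  # threshold quoted in the testing notes


def max_abs_diff(
    a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]
) -> float:
    """Hypothetical helper: largest element-wise absolute difference
    between two batches of logits with identical shapes."""
    if len(a) != len(b):
        raise ValueError("batches have different sizes")
    worst = 0.0
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            raise ValueError("rows have different lengths")
        for x, y in zip(row_a, row_b):
            worst = max(worst, abs(x - y))
    return worst


# Made-up values for illustration only; not results from this PR.
reference_logits = [[0.25, -1.5], [2.0, 0.125]]
converted_logits = [[0.25, -1.5 + 1e-13], [2.0, 0.125]]

logits_match = max_abs_diff(reference_logits, converted_logits) < TOLERANCE
```

A maximum absolute difference is a stricter check than comparing predicted classes, which is why it is only expected to hold at full precision and not after quantization.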

Checklist:

My code follows the style guidelines of this project
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works

hachall · 2023-08-14T13:48:54Z

#1383

include/ctranslate2/encoder.h

include/ctranslate2/models/language_model.h

python/cpp/encoder.cc

src/models/language_model.cc

python/cpp/encoder.cc

hachall · 2023-08-18T16:25:39Z

Thank you for the notes @guillaumekln! I have implemented the changes; let me know if there's anything else I can do.

token type ids can be set by optional argument at python wrapper

dc9357f

guillaumekln reviewed Aug 17, 2023

hachall and others added 2 commits August 18, 2023 14:36

changed function overloads to function defaults

8baff13

space styling

67182ef

Fix formatting and default argument values

0d8e1ff

guillaumekln force-pushed the token_type_ids branch from cc34ad6 to 0d8e1ff Compare August 28, 2023 08:31

guillaumekln merged commit e3d0bb0 into OpenNMT:master Aug 28, 2023
17 checks passed

hachall commented Aug 14, 2023 • edited by guillaumekln
