
Add parametrization for the detokenization/decoding #1246

Open
wants to merge 3 commits into base: master

Conversation

Contributor

@pavel-esir commented Nov 21, 2024


Tokenizer IRs should be converted after openvinotoolkit/openvino_tokenizers#325 is merged.

Ticket CVS-154151

@pavel-esir added the enhancement (New feature or request) label on Nov 21, 2024
@pavel-esir added this to the 2025.0 milestone on Nov 21, 2024
@github-actions bot added the labels category: LLM (LLM pipeline (stateful, static)), category: GHA (CI based on Github actions), category: tokenizers (Tokenizer class or submodule update), category: Python API (Python API for GenAI), category: GenAI C++ API (Changes in GenAI C++ public headers), and no-match-files on Nov 21, 2024
src/cpp/include/openvino/genai/tokenizer.hpp (outdated review threads, resolved)
// If the user requested an add_special_tokens mode different from the current one,
// the state variable needs to be set.
// If the requested mode matches the stored state, don't touch the states.
-    if (add_special_tokens == m_add_special_tokens) {
+    if (add_special_tokens_flag == m_add_special_tokens && skip_special_tokens_flag == m_skip_special_tokens) {
         return;
     }
     if (m_older_than_24_5) {
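The state-caching pattern in the snippet above can be sketched in plain Python: the tokenizer's state is rewritten only when the requested flags differ from the cached ones. The class and method names here are illustrative, not the actual GenAI API:

```python
# Minimal sketch of the flag-caching logic, assuming both flags default
# to True in the converted IR (an assumption for this example only).
class TokenizerState:
    def __init__(self):
        self.add_special_tokens = True
        self.skip_special_tokens = True
        self.state_updates = 0  # counts actual state writes

    def set_state_if_necessary(self, add_special_tokens, skip_special_tokens):
        # Requested mode matches the stored state: don't touch the states.
        if (add_special_tokens == self.add_special_tokens
                and skip_special_tokens == self.skip_special_tokens):
            return
        # Otherwise cache the new flags (the real code would also write
        # them into the model's state variables at this point).
        self.add_special_tokens = add_special_tokens
        self.skip_special_tokens = skip_special_tokens
        self.state_updates += 1
```

Skipping redundant state writes matters because each write on the real tokenizer touches inference-request state, which is wasted work when the caller keeps passing the same flags.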
Contributor:

Does m_skip_special_tokens work with IRs produced by 2024.4 and older?

Contributor:

If this PR depends on openvinotoolkit/openvino_tokenizers#325, then do we need to add a condition for 2025.0 for skip_special_tokens?

Contributor Author:

They will not work with IRs from 2024.4 and older.

Contributor Author:

About the 2025.0 condition: I need to discuss it with @apaniukov.

src/cpp/src/tokenizer.cpp (resolved)
@@ -217,3 +217,25 @@ def test_add_special_tokens(add_special_tokens, prompt):
res_genai = genai_tokenzier.encode(prompt, add_special_tokens).input_ids.data
res_hf = hf_tokenizer(prompt, return_tensors="np", add_special_tokens=add_special_tokens)["input_ids"]
assert np.all(res_genai == res_hf)

@pytest.mark.precommit
@pytest.mark.xfail(reason="Need to turn them back on when openvino_tokenizers is updated.")
Contributor Author:

These tests will be green on CI only after GenAI is updated to the latest openvino_tokenizers. In the meantime I have checked them locally and will enable them again as soon as openvino_tokenizers is updated.
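What skip_special_tokens means for decoding can be modeled in a few lines of plain Python. The vocabulary and token ids below are made up for illustration only:

```python
# Toy model of detokenization with and without special tokens.
# The ids and strings are invented for this example.
SPECIAL_IDS = {0, 1, 2}  # e.g. <pad>, <s>, </s>
VOCAB = {0: "<pad>", 1: "<s>", 2: "</s>", 3: "Hello", 4: "world"}

def decode(token_ids, skip_special_tokens=True):
    # With skip_special_tokens, special-token ids are dropped
    # before the pieces are joined into text.
    if skip_special_tokens:
        token_ids = [t for t in token_ids if t not in SPECIAL_IDS]
    return " ".join(VOCAB[t] for t in token_ids)

decode([1, 3, 4, 2])                             # -> "Hello world"
decode([1, 3, 4, 2], skip_special_tokens=False)  # -> "<s> Hello world </s>"
```

This is the behavior a parametrized decode test would compare against the HF tokenizer's output, mirroring the existing test_add_special_tokens test for encoding.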

@@ -87,23 +87,59 @@ class OPENVINO_GENAI_EXPORTS Tokenizer {
/**
* @brief decode sequence of tokens
* @param tokens vector storing tokens
* @param tokenization_params AnyMap with detokenization parameters, e.g. {"skip_special_tokens", false}
Contributor:

Suggested change
* @param tokenization_params AnyMap with detokenization parameters, e.g. {"skip_special_tokens", false}
* @param detokenization_params AnyMap with detokenization parameters, e.g. ov::genai::skip_special_tokens(false)

/**
* @brief decode tokens.
* @param tokens ov::Tensor with tokens with shape [batch_size, seq_len]
* @param tokenization_params AnyMap with detokenization parameters, e.g. {"skip_special_tokens", false}
Contributor:

Please correct the docs in all places.

     },
-    py::arg("tokens"),
+    py::arg("tokens"), py::arg("skip_special_tokens") = false,
Contributor:

It's also incorrect, because the actual default depends on how the detokenizer was converted.
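The concern can be modeled in plain Python: a hard-coded False in the binding masks whatever default was baked into the converted detokenizer. One common fix is a None sentinel meaning "use the model's own default". The names and the assumed model default below are illustrative, not the actual pybind11 signature:

```python
# Assumed default baked into the converted IR (illustrative only).
MODEL_DEFAULT_SKIP = True

def decode(tokens, skip_special_tokens=None):
    # None means "the caller did not override": fall back to whatever
    # default the converted detokenizer carries, instead of a value
    # hard-coded in the binding.
    if skip_special_tokens is None:
        skip_special_tokens = MODEL_DEFAULT_SKIP
    return {"tokens": tokens, "skip_special_tokens": skip_special_tokens}

decode([3, 4])["skip_special_tokens"]                             # -> True
decode([3, 4], skip_special_tokens=False)["skip_special_tokens"]  # -> False
```

With a sentinel, the binding never claims a default it cannot guarantee; only an explicit caller override changes the behavior.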
