`\n` in vocab.txt
It seems `\n` is causing token index shifting after line 10295 in vocab.txt.
```
$ less -N vocab.txt
...
10294 ##錄
10295
10296
10297 する
```
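As a minimal sketch of how to find such lines, the file can be scanned for blank (or whitespace-only) entries; the file name and contents below are a toy example, not the real vocab:

```python
# Sketch: detect blank or whitespace-only lines in a BERT-style vocab file.
# A stray "\n" token shows up as a blank line and shifts the index of every
# token after it. The demo file below is a toy reproduction of the symptom.

def find_blank_vocab_lines(path):
    """Return 1-based line numbers whose token is empty after stripping."""
    blanks = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if line.strip() == "":
                blanks.append(lineno)
    return blanks

# Toy vocab with one blank line between real tokens.
with open("demo_vocab.txt", "w", encoding="utf-8") as f:
    f.write("##錄\n\nする\n")

print(find_blank_vocab_lines("demo_vocab.txt"))  # → [2]
```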
Fortunately, I did not find any performance degradation in downstream tasks caused by this index shifting, but I got an error message while executing `save_pretrained()`: https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/bert/tokenization_bert.py#L357
```
Saving vocabulary to vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted!
```
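A minimal sketch (my reconstruction, not the library's exact code) of how the gap arises: the vocab is loaded into a token→index mapping, so duplicate lines (here, two blank ones) collapse into a single `""` entry, and the save path then warns when the sorted indices skip a value:

```python
# Toy slice of vocab.txt around line 10295. The two blank lines both map
# to the token "", so they collapse into one dictionary entry.
lines = ["##錄", "", "", "する"]
vocab = {tok: i for i, tok in enumerate(lines)}
print(vocab)  # "" keeps only the last index, leaving a gap at index 1

# Walking tokens sorted by index, as save_vocabulary does, exposes the gap.
expected_index = 0
for token, token_index in sorted(vocab.items(), key=lambda kv: kv[1]):
    if expected_index != token_index:
        print(f"vocabulary indices are not consecutive at index {token_index}")
        expected_index = token_index
    expected_index += 1
```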
I think line 10295 in vocab.txt should be some non-existent word like `!!!DIFECTED!!!`.
Also see #57.
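A sketch of the proposed fix: rewrite the vocab so every blank line gets a distinct non-existent placeholder, keeping one unique token per index. The placeholder string follows the suggestion above; the numeric suffix is my addition so that multiple blank lines stay distinct in a token→index map:

```python
# Sketch: replace each blank vocab line with a unique placeholder token,
# so indices stay consecutive when the file is reloaded and saved.
# "!!!DIFECTED!!!" and the file names here are illustrative.

def patch_blank_lines(src, dst, placeholder="!!!DIFECTED!!!"):
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for i, line in enumerate(fin):
            token = line.rstrip("\n")
            if token.strip() == "":
                token = f"{placeholder}_{i}"  # unique per blank line
            fout.write(token + "\n")

# Toy demonstration on a 4-line vocab containing two blank lines.
with open("broken_vocab.txt", "w", encoding="utf-8") as f:
    f.write("##錄\n\n\nする\n")
patch_blank_lines("broken_vocab.txt", "fixed_vocab.txt")
print(open("fixed_vocab.txt", encoding="utf-8").read().splitlines())
```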