The entry of \n in vocab.txt is causing token index shifting #64

Open
hiroshi-matsuda-rit opened this issue May 8, 2023 · 0 comments

Comments

@hiroshi-matsuda-rit
Collaborator

It seems that the \n entry shifts the index of every token after line 10295 in vocab.txt.

$ less -N vocab.txt
...
  10294 ##錄
  10295 
  10296 
  10297 する
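The same check can be scripted instead of paging through the file. The following is a minimal sketch, assuming each line of vocab.txt holds one token and that the trailing newline is not part of the token; `find_duplicate_tokens` and the five-line sample are hypothetical stand-ins, not part of this repository.

```python
from collections import defaultdict


def find_duplicate_tokens(lines):
    """Return {token: [1-based line numbers]} for tokens that occur more than once."""
    positions = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        # Strip only the line terminator so that blank lines become the empty token.
        positions[line.rstrip("\n")].append(line_no)
    return {tok: nums for tok, nums in positions.items() if len(nums) > 1}


# Tiny stand-in for the real vocab.txt: two consecutive blank lines, as in the listing.
sample = ["[PAD]\n", "##錄\n", "\n", "\n", "する\n"]
duplicates = find_duplicate_tokens(sample)
print(duplicates)
```

For the real file, the lines could be read with `open("vocab.txt", encoding="utf-8").readlines()` and passed to the same function.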

Fortunately, I did not observe any performance degradation in downstream tasks caused by this index shift, but I got an error message when executing save_pretrained().
https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/bert/tokenization_bert.py#L357

Saving vocabulary to vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted!
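A possible explanation for this message, sketched under assumptions: if the loader strips the trailing newline from each line and uses the line position as the token index, two blank lines collapse into a single empty token and one index is lost. The code below is a simplified model of that behavior, not a copy of the library code; `load_vocab_lines` and `find_index_gaps` are hypothetical names.

```python
from collections import OrderedDict


def load_vocab_lines(lines):
    """Map each token to its 0-based line index; a repeated token keeps the later index."""
    vocab = OrderedDict()
    for index, line in enumerate(lines):
        vocab[line.rstrip("\n")] = index
    return vocab


def find_index_gaps(vocab):
    """Return (expected, actual) pairs wherever the sorted indices skip a value."""
    gaps = []
    expected = 0
    for _token, actual in sorted(vocab.items(), key=lambda kv: kv[1]):
        if actual != expected:
            gaps.append((expected, actual))
            expected = actual
        expected += 1
    return gaps


# Two consecutive blank lines, mirroring lines 10295 and 10296 of the listing.
lines = ["[PAD]\n", "##錄\n", "\n", "\n", "する\n"]
vocab = load_vocab_lines(lines)
gaps = find_index_gaps(vocab)
print(len(lines), len(vocab), gaps)
```

In this model the vocabulary ends up one entry shorter than the file, and the missing index is exactly where a consecutive-index check would complain.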

I think line 10295 in vocab.txt should be replaced with a placeholder token that never occurs in real text, such as !!!DIFECTED!!!.
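One way to apply such a fix is sketched below, assuming the file has been read into a list of lines. `replace_blank_line` is a hypothetical helper, and the sample list stands in for the real vocab.txt; the placeholder string is the one suggested above.

```python
def replace_blank_line(lines, line_no, placeholder="!!!DIFECTED!!!"):
    """Return a copy of lines with the blank entry at 1-based line_no replaced."""
    tokens = [line.rstrip("\n") for line in lines]
    if tokens[line_no - 1] != "":
        raise ValueError(f"line {line_no} is not blank")
    if placeholder in tokens:
        # A placeholder that already exists would create a new duplicate.
        raise ValueError("placeholder already exists in the vocabulary")
    patched = list(lines)
    patched[line_no - 1] = placeholder + "\n"
    return patched


# Line 3 of this sample plays the role of line 10295 in the real file.
sample = ["[PAD]\n", "##錄\n", "\n", "\n", "する\n"]
patched = replace_blank_line(sample, 3)
print(patched)
```

Only the first of the two blank lines is replaced, so the number of lines and the position of every later token stay unchanged while each line becomes a distinct token.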

Also see #57.
