The entry of `\n` in `vocab.txt` is causing token index shifting #64

Open

hiroshi-matsuda-rit opened this issue May 8, 2023 · 0 comments

Collaborator

hiroshi-matsuda-rit commented May 8, 2023

It seems that a \n entry is shifting the token indices after line 10295 of vocab.txt.

$ less -N vocab.txt
...
  10294 ##錄
  10295 
  10296 
  10297 する
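
The same check can be scripted rather than done by eye in less. The following is a minimal sketch, not part of the project: the function name `find_blank_entries` is hypothetical, and the four-line sample only imitates the excerpt above.

```python
def find_blank_entries(lines):
    """Return the 1-based numbers of lines whose token is empty
    once the trailing newline is stripped."""
    return [n for n, line in enumerate(lines, start=1)
            if line.rstrip("\r\n") == ""]


# Tiny in-memory sample imitating the excerpt above (not the real file).
sample = ["##錄\n", "\n", "\n", "する\n"]
print(find_blank_entries(sample))  # [2, 3]
```

To run it on the real file, pass the result of `open("vocab.txt", encoding="utf-8").readlines()` instead of `sample`.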

Fortunately, I did not find any performance degradation in downstream tasks caused by this index shift, but I got the following message while executing save_pretrained().
https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/bert/tokenization_bert.py#L357

Saving vocabulary to vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted!
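
A likely explanation for this message, sketched under assumptions: if the loader strips the trailing newline from each line and maps each token to its line index, two consecutive blank lines both become the empty token, the second overwrites the first, and one index ends up unused. The helper names below (`load_vocab_lines`, `missing_indices`) are hypothetical and only approximate the loading logic; they are not copied from transformers.

```python
from collections import OrderedDict


def load_vocab_lines(lines):
    """Map each token to its 0-based line index (assumed loader behavior).
    A duplicate token keeps only the index of its last occurrence."""
    vocab = OrderedDict()
    for index, token in enumerate(lines):
        vocab[token.rstrip("\n")] = index
    return vocab


def missing_indices(vocab):
    """Return the indices below the maximum that no token maps to."""
    used = set(vocab.values())
    return sorted(set(range(max(used) + 1)) - used)


# Two blank lines in a row, as at lines 10295 and 10296 of vocab.txt.
sample = ["##錄\n", "\n", "\n", "する\n"]
vocab = load_vocab_lines(sample)
print(missing_indices(vocab))  # [1]
```

In this sample the empty token ends up at index 2 and index 1 is unused, which is the kind of gap a consecutive-index check would report on save.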

I think line 10295 of vocab.txt should be replaced with some non-existent word such as !!!DIFECTED!!!.
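
One way to apply that suggestion without moving any other token is to overwrite only the first of the two blank lines. This is a hedged sketch: `replace_blank_line` is a hypothetical helper, the placeholder spelling is taken verbatim from the suggestion above, and the sample again stands in for the real file.

```python
PLACEHOLDER = "!!!DIFECTED!!!"  # spelling as written in this issue


def replace_blank_line(lines, line_no, placeholder=PLACEHOLDER):
    """Return a copy of `lines` with the blank 1-based line `line_no`
    replaced by `placeholder`; all other lines keep their positions."""
    if lines[line_no - 1].rstrip("\n") != "":
        raise ValueError(f"line {line_no} is not blank")
    if any(line.rstrip("\n") == placeholder for line in lines):
        raise ValueError("placeholder is already in the vocabulary")
    patched = list(lines)
    patched[line_no - 1] = placeholder + "\n"
    return patched


# Sample: line 2 plays the role of line 10295 in the real vocab.txt.
sample = ["##錄\n", "\n", "\n", "する\n"]
patched = replace_blank_line(sample, 2)
tokens = [line.rstrip("\n") for line in patched]
print(len(tokens) == len(set(tokens)))  # True
```

Because the line count is unchanged, every token after the patched line keeps the index the model was trained with, and each line now holds a distinct token.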

Also see #57.
