Originally posted by Calvinnncy97 September 4, 2023
Hey guys,
Specifically, I would like to ask which flags were set to train these 2 tokenizers. I can't find any flag that forces the tokenizer to have only one-word tokens.
Thank you.
The oneword vocabularies (none of which are released) were built with the -words-per-token 1 parameter set in getalltokens. Currently the -words-per-token parameter is only implemented for the strict and consistent modes.
I'd also recommend the flags -mode consistent -charset UTF8 -only-latin -only-valid when running getalltokens.
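To make the combination concrete, here is a minimal sketch that assembles the flags mentioned above into a single getalltokens argument list. The -dataset and -output flag names, the binary path and the file names are assumptions for illustration and are not taken from this thread; check them against the getalltokens help output before running anything. The function only builds the command and does not execute it.

```python
import shlex


def build_getalltokens_command(dataset_path, output_path, binary="./getalltokens"):
    """Return the getalltokens invocation as a list of arguments.

    The -dataset and -output flag names and the default binary path are
    assumptions; the remaining flags are the ones quoted in the thread.
    """
    return [
        binary,
        "-dataset", dataset_path,   # assumed flag name for the input text
        "-output", output_path,     # assumed flag name for the tokens file
        "-words-per-token", "1",    # restrict tokens to a single word
        "-mode", "consistent",      # -words-per-token needs strict or consistent
        "-charset", "UTF8",
        "-only-latin",
        "-only-valid",
    ]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the resulting command line.
    cmd = build_getalltokens_command("dataset.txt", "oneword_tokens.vc")
    print(shlex.join(cmd))
```

Once the flag names are confirmed, the same list can be passed to subprocess.run(cmd, check=True).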
Discussed in #22