I'm using guwenbert from Hugging Face, but the tokenization result just splits a sentence into individual Chinese characters. I'd like to know whether this is expected. Thanks!
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ethanyt/guwenbert-base')

text = '贪生养命事皆同,独坐闲居意颇慵。入夏驱驰巢树鹊,经春劳役探花蜂。石炉香尽寒灰薄,铁磬声微古锈浓。寂寂虚怀无一念,任从苍藓没行踪。'

# Tokenize without special tokens, then map tokens to vocabulary ids
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```
Result:
```
['贪', '生', '养', '命', '事', '皆', '同', ',', '独', '坐', '闲', '居', '意', '颇', '慵', '。', '入', '夏', '驱', '驰', '巢', '树', '鹊', ',', '经', '春', '劳', '役', '探', '花', '蜂', '。', '石', '炉', '香', '尽', '寒', '灰', '薄', ',', '铁', '磬', '声', '微', '古', '锈', '浓', '。', '寂', '寂', '虚', '怀', '无', '一', '念', ',', '任', '从', '苍', '藓', '没', '行', '踪', '。']
[1225, 38, 546, 190, 42, 94, 105, 5, 427, 424, 819, 231, 181, 1251, 4388, 4, 106, 452, 1571, 1367, 1779, 666, 2659, 5, 124, 224, 771, 980, 1806, 278, 2740, 4, 198, 2090, 389, 255, 353, 1864, 965, 5, 1148, 2761, 243, 547, 202, 7507, 2072, 4, 1185, 1185, 373, 843, 18, 10, 480, 5, 347, 122, 1155, 4338, 833, 49, 2353, 4]
```
Hi, this is expected. guwenbert's tokenizer operates at the character level, so each Chinese character becomes its own token.
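For a quick sanity check (my own sketch, not from the maintainer), you can compare `tokenize()` with `encode()`: the latter should produce the same per-character ids plus the model's special tokens. I'm assuming RoBERTa-style `<s>`/`</s>` markers here, since guwenbert is RoBERTa-based.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ethanyt/guwenbert-base')

text = '贪生养命事皆同'

# encode() adds special tokens by default; tokenize() does not
ids = tokenizer.encode(text)
print(tokenizer.convert_ids_to_tokens(ids))
# Expected: one token per character, wrapped in the tokenizer's
# special tokens (assumed to be <s> ... </s> for this model)
```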