
Tokenizer behaves abnormally #25

Open
Lizi-12 opened this issue Feb 19, 2024 · 1 comment

Comments


Lizi-12 commented Feb 19, 2024

I used guwenbert from Hugging Face, but the tokenization result simply splits a sentence into individual Chinese characters. I'd like to know whether this is normal. Thanks!

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ethanyt/guwenbert-base')

text = '贪生养命事皆同,独坐闲居意颇慵。入夏驱驰巢树鹊,经春劳役探花蜂。石炉香尽寒灰薄,铁磬声微古锈浓。寂寂虚怀无一念,任从苍藓没行踪。'

tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```

Result:

```
['贪', '生', '养', '命', '事', '皆', '同', ',', '独', '坐', '闲', '居', '意', '颇', '慵', '。', '入', '夏', '驱', '驰', '巢', '树', '鹊', ',', '经', '春', '劳', '役', '探', '花', '蜂', '。', '石', '炉', '香', '尽', '寒', '灰', '薄', ',', '铁', '磬', '声', '微', '古', '锈', '浓', '。', '寂', '寂', '虚', '怀', '无', '一', '念', ',', '任', '从', '苍', '藓', '没', '行', '踪', '。']

[1225, 38, 546, 190, 42, 94, 105, 5, 427, 424, 819, 231, 181, 1251, 4388, 4, 106, 452, 1571, 1367, 1779, 666, 2659, 5, 124, 224, 771, 980, 1806, 278, 2740, 4, 198, 2090, 389, 255, 353, 1864, 965, 5, 1148, 2761, 243, 547, 202, 7507, 2072, 4, 1185, 1185, 373, 843, 18, 10, 480, 5, 347, 122, 1155, 4338, 833, 49, 2353, 4]
```
Ethan-yt (Owner) commented Jul 1, 2024

Hi, this is normal. guwenbert's tokenizer works at the character level, so each Chinese character is one token.
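
As a quick sanity check (a minimal sketch along the lines of the snippet above, assuming the same `ethanyt/guwenbert-base` checkpoint), you can confirm that every token produced for this input is a single character and that the token/id mapping round-trips:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ethanyt/guwenbert-base')

text = '贪生养命事皆同,独坐闲居意颇慵。'
tokens = tokenizer.tokenize(text)

# Character-level vocabulary: each token for this input is one character
assert all(len(token) == 1 for token in tokens)

# convert_tokens_to_ids and convert_ids_to_tokens are inverses here
ids = tokenizer.convert_tokens_to_ids(tokens)
assert tokenizer.convert_ids_to_tokens(ids) == tokens
```

If both assertions pass, the character-by-character output is the tokenizer working as intended rather than a bug.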
