
Tokenizer behaves abnormally #25

Open
Lizi-12 opened this issue Feb 19, 2024 · 1 comment

Comments


Lizi-12 commented Feb 19, 2024

I used guwenbert from Hugging Face, but the tokenization result simply splits a sentence into individual Chinese characters. I'd like to know whether this is normal. Thanks!

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ethanyt/guwenbert-base')

text = '贪生养命事皆同,独坐闲居意颇慵。入夏驱驰巢树鹊,经春劳役探花蜂。石炉香尽寒灰薄,铁磬声微古锈浓。寂寂虚怀无一念,任从苍藓没行踪。'

tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```

Result:

```
['贪', '生', '养', '命', '事', '皆', '同', ',', '独', '坐', '闲', '居', '意', '颇', '慵', '。', '入', '夏', '驱', '驰', '巢', '树', '鹊', ',', '经', '春', '劳', '役', '探', '花', '蜂', '。', '石', '炉', '香', '尽', '寒', '灰', '薄', ',', '铁', '磬', '声', '微', '古', '锈', '浓', '。', '寂', '寂', '虚', '怀', '无', '一', '念', ',', '任', '从', '苍', '藓', '没', '行', '踪', '。']

[1225, 38, 546, 190, 42, 94, 105, 5, 427, 424, 819, 231, 181, 1251, 4388, 4, 106, 452, 1571, 1367, 1779, 666, 2659, 5, 124, 224, 771, 980, 1806, 278, 2740, 4, 198, 2090, 389, 255, 353, 1864, 965, 5, 1148, 2761, 243, 547, 202, 7507, 2072, 4, 1185, 1185, 373, 843, 18, 10, 480, 5, 347, 122, 1155, 4338, 833, 49, 2353, 4]
```
Ethan-yt (Owner) commented Jul 1, 2024

Hi, this is normal. guwenbert's tokenizer works at the character level, so each Chinese character is one token.
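
As a quick sanity check (a minimal sketch along the lines of the snippet above, assuming the same `ethanyt/guwenbert-base` checkpoint), you can confirm that every token produced for this input is a single character and that the token/id mapping round-trips:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ethanyt/guwenbert-base')

text = '贪生养命事皆同,独坐闲居意颇慵。'
tokens = tokenizer.tokenize(text)

# Character-level vocabulary: each token for this input is one character
assert all(len(token) == 1 for token in tokens)

# convert_tokens_to_ids and convert_ids_to_tokens are inverses here
ids = tokenizer.convert_tokens_to_ids(tokens)
assert tokenizer.convert_ids_to_tokens(ids) == tokens
```

If both assertions pass, the character-by-character output is the tokenizer working as intended rather than a bug.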
