How to build Chinese spaCy models from public corpora

Download links for required resources

FastText cc.zh.300.vec: word vectors trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives.

Tencent AILab Chinese Embedding: provides 200-dimensional vector representations (embeddings) for over 8 million Chinese words and phrases, pre-trained on large-scale, high-quality data.

UD_Chinese-GSDSimp: the Simplified Chinese version of UD_Chinese-GSD, converted and corrected.

CLUENER2020: built on THUCTC, a text-classification dataset open-sourced by Tsinghua University; a subset of the data was selected and annotated with fine-grained named entities. The original data comes from Sina News RSS. It covers 10 label categories, with 10,748 training examples and 1,343 validation examples. A Google Drive download link is available, and a copy is included in this repository.
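For orientation, each line of CLUENER2020 is one JSON object; span offsets are character indices with an inclusive end (this sample reflects the published format):

    {"text": "浙商银行企业信贷部叶老桂博士", "label": {"company": {"浙商银行": [[0, 3]]}, "name": {"叶老桂": [[9, 11]]}}}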

Word vectors, part-of-speech tagging (UD_Chinese-GSDSimp), dependency parsing (UD_Chinese-GSDSimp), and named entity recognition (CLUENER2020)

  1. Convert to spaCy format

    • Initialize a spaCy model with the fasttext or Tencent AILab Chinese Embedding vectors:
    python -m spacy init-model zh ./zh_vectors_init -v cc.zh.300.vec.gz
    or
    python -m spacy init-model zh ./zh_vectors_init -v Tencent_AILab_ChineseEmbedding.tar.gz
    • Convert the UD corpus to spaCy's JSON training format:
    python -m spacy convert UD_Chinese-GSDSimp-master\zh_gsdsimp-ud-train.conllu ./ -t json
    
    python -m spacy convert UD_Chinese-GSDSimp-master\zh_gsdsimp-ud-dev.conllu ./ -t json
    • Convert the CLUE NER annotations (a sketch of this conversion follows this list):
    python scripts/convert2spacy.py
  2. Train

    Train the tagger and parser on top of the vectors model:
    python -m spacy train zh ./zh_vectors_web_ud_lg zh_gsdsimp-ud-train.json zh_gsdsimp-ud-dev.json --base-model ./zh_vectors_init
    
    Train the NER model (a sketch follows below):
    python scripts/train_ner.py
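scripts/convert2spacy.py is not reproduced in this README. As a rough sketch, a conversion into spaCy v2's NER training tuples could look like this, assuming the one-object-per-line CLUENER format shown earlier (the file path is illustrative):

    # Sketch: CLUENER2020 -> spaCy v2 NER training tuples.
    # CLUENER span ends are inclusive; spaCy expects exclusive ends.
    import json

    def convert(path):
        examples = []
        with open(path, encoding="utf8") as f:
            for line in f:
                item = json.loads(line)
                entities = []
                for label, mentions in item.get("label", {}).items():
                    for spans in mentions.values():
                        for start, end in spans:
                            entities.append((start, end + 1, label))
                examples.append((item["text"], {"entities": entities}))
        return examples

    train_data = convert("CLUENER2020/train.json")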
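scripts/train_ner.py is likewise not shown; a minimal spaCy v2 NER training loop over the converted data might look like the following (model paths, epoch count, and batch size are assumptions; convert() comes from the sketch above):

    # Sketch: add and train an NER pipe in spaCy v2.
    import random
    import spacy
    from spacy.util import minibatch

    nlp = spacy.load("./zh_vectors_init")   # vectors model initialized earlier
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)

    train_data = convert("CLUENER2020/train.json")  # convert() from the sketch above
    for _, annotations in train_data:
        for _, _, label in annotations["entities"]:
            ner.add_label(label)

    optimizer = nlp.begin_training()
    for epoch in range(10):
        random.shuffle(train_data)
        losses = {}
        for batch in minibatch(train_data, size=8):
            texts, golds = zip(*batch)
            nlp.update(texts, golds, sgd=optimizer, drop=0.2, losses=losses)
        print(epoch, losses)
    nlp.to_disk("./zh_ner_model")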

Notes

  • Windows users: to train on GPU with spaCy 2.2.3, upgrade thinc to 7.4.0:

    pip install -U thinc
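After upgrading, a quick sanity check (not part of this repo) that spaCy can actually use the GPU:

    import spacy
    print(spacy.prefer_gpu())   # True if a usable GPU was found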

Transformers model download URLs

For reasons that some younger readers may not know about, the pretrained models sometimes cannot be downloaded directly, so a download tool that supports resuming interrupted transfers is recommended.

  • bert-base-chinese config
  • bert-base-chinese model bin
  • bert-base-chinese vocab

If you only want the download URLs for the PyTorch model (without a VPN), use the script below; if you want everything, go to https://huggingface.co/models.

python ./script/get_transformers_models_url.py bert-base-chinese -mk -local

⚠ ./trf_models/bert-base-chinese already exists
⚠  ================url================
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-config.json
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-pytorch_model.bin
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt
⚠  ================url================
✔  After downloading with a download tool, put the model files into the cache folder.
ValueError: configuration for 't5-3b' not found in the local classes; try removing -local.
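The URL pattern is visible in the output above; a minimal sketch of how such a script can assemble the links (the helper name is illustrative):

    # Sketch: build the S3 download URLs for a transformers shortcut name,
    # following the pattern printed above.
    S3_PREFIX = "https://s3.amazonaws.com/models.huggingface.co/bert"

    def model_urls(name):
        return [
            f"{S3_PREFIX}/{name}-config.json",
            f"{S3_PREFIX}/{name}-pytorch_model.bin",
            f"{S3_PREFIX}/{name}-vocab.txt",
        ]

    for url in model_urls("bert-base-chinese"):
        print(url)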

Initializing a Chinese model with spacy-transformers

Remove the bert-base-chinese- prefix from the downloaded file names (e.g. bert-base-chinese-config.json becomes config.json).

python ./spacy-transformers/init_model.py

ℹ Creating model for 'bert-base-chinese' (zh)
✔ Initialized the model pipeline
✔ Saved 'bert-base-chinese' (zh)
Pipeline: ['sentencizer', 'trf_wordpiecer', 'trf_tok2vec']
Location: ./spacy_trf_zh
✔ Model loads!
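init_model.py is not reproduced here either; with spacy-transformers v0.x, the core of such a script would look roughly like this (the class names match the pipeline printed above; the local model path is an assumption):

    # Sketch: build a zh pipeline from the local bert-base-chinese files
    # with spacy-transformers v0.x and save it as a base model.
    from spacy_transformers import (
        TransformersLanguage,
        TransformersTok2Vec,
        TransformersWordPiecer,
    )

    name = "bert-base-chinese"
    path = "./trf_models/bert-base-chinese"  # config.json, pytorch_model.bin, vocab.txt

    nlp = TransformersLanguage(trf_name=name, meta={"lang": "zh"})
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
    nlp.add_pipe(TransformersWordPiecer.from_pretrained(nlp.vocab, path))
    nlp.add_pipe(TransformersTok2Vec.from_pretrained(nlp.vocab, path))
    nlp.to_disk("./spacy_trf_zh")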
Then train the tagger and parser on top of the transformer base model:
python -m spacy train zh ./zh_bert_ud zh_gsdsimp-ud-train.json zh_gsdsimp-ud-dev.json --base-model ./spacy_trf_zh

Demo

Dependency parse and NER demo images (see the repository).

Todo

  • Add the Tencent AI Lab Embedding download link
  • Train on the MSRA and OntoNotes 5 corpora
  • A spacy-transformers zh model

License: CC BY-SA 4.0
