Pre-trained Word2Vec models for Vietnamese

haohoang/word2vecVN

 
 

I. word2vecVN

Word2Vec models for Vietnamese

Download models:

  • Model trained on Vietnamese Wiki: click here.
  • Model trained on Le et al.'s data (window-size 5, 400 dims): click here.
  • Model trained on Le et al.'s data (window-size 2, 300 dims): click here.
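Once a model is downloaded, it can be queried for nearest neighbours. The sketch below is only an illustration and makes several assumptions: it assumes the file is in the plain-text word2vec format (a header line with the vocabulary size and dimension, then one word and its vector per line), which this README does not confirm; the function names `load_text_vectors`, `cosine`, and `most_similar` are hypothetical helpers, and the three toy tokens and their vectors are invented. If gensim is installed, `KeyedVectors.load_word2vec_format` (with `binary=True` for binary files) is the usual way to load such files.

```python
import io
import math
from typing import Dict, List, TextIO, Tuple


def load_text_vectors(handle: TextIO) -> Dict[str, List[float]]:
    """Parse word2vec text format: 'vocab_size dim' header, then 'word v1 ... vd'."""
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the vocabulary size, unused here
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(
    vectors: Dict[str, List[float]], word: str, topn: int = 3
) -> List[Tuple[str, float]]:
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy stand-in for a downloaded model file (invented tokens and values).
SAMPLE = "3 2\nhà_nội 1.0 0.0\nthủ_đô 0.9 0.1\nchuối 0.0 1.0\n"
toy = load_text_vectors(io.StringIO(SAMPLE))
print(most_similar(toy, "hà_nội", topn=1))
```

For a real file, replace the `io.StringIO` handle with `open(path, encoding="utf-8")`. A brute-force scan like this is slow on a vocabulary of over a million words, so a vectorised library is preferable in practice.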

Visualization:

  • word2vec-visualization (using TensorBoard):
    • Download tf_files: TBA
    • Run $ tensorboard --logdir=./ --port=10001
  • word2vec-simple-visualization: working. See the README file inside that folder for instructions on testing the model.
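While the tf_files are not yet available, one possible way to inspect vectors visually is to export them as tab-separated files, which the standalone TensorFlow Embedding Projector can load (one file of vectors, one file of labels with no header row when there is a single column). The sketch below is not part of this repository: `write_projector_tsv` is a hypothetical helper, the output file names are arbitrary, and the toy vectors are invented.

```python
import os
import tempfile
from typing import Dict, List, Tuple


def write_projector_tsv(vectors: Dict[str, List[float]], out_dir: str) -> Tuple[str, str]:
    """Write vectors.tsv (one vector per line) and metadata.tsv (one word per line)."""
    vec_path = os.path.join(out_dir, "vectors.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    with open(vec_path, "w", encoding="utf-8") as vec_file, \
            open(meta_path, "w", encoding="utf-8") as meta_file:
        # Both files are written in the same order so row i of one matches row i of the other.
        for word, vec in vectors.items():
            vec_file.write("\t".join(str(x) for x in vec) + "\n")
            meta_file.write(word + "\n")
    return vec_path, meta_path


# Toy vectors (invented values) written to a temporary directory.
toy = {"hà_nội": [1.0, 0.0], "chuối": [0.0, 1.0]}
out_dir = tempfile.mkdtemp()
vec_path, meta_path = write_projector_tsv(toy, out_dir)
print(vec_path, meta_path)
```

Exporting only the most frequent few thousand words keeps the files small enough to load comfortably in a browser.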

Note:

  • This model is trained on the data of Le et al.: http://mim.hus.vnu.edu.vn/phuonglh/node/72
    • Data information: 7.1 GB of text containing 1,675,819 unique words, drawn from a corpus of 974,393,244 raw words in 97,440 documents. Note that all word counts refer to tokenized words.

II. How do I cite?

Please cite the arXiv paper whenever ETNLP (or the pre-trained embeddings) is used to produce published results or is incorporated into other software:

@article{vu:2019n,
  title={ETNLP: A Toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings},
  author={Xuan-Son Vu and Thanh Vu and Son N. Tran and Lili Jiang},
  journal={Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP)},
  year={2019}
}

@misc{word2vecvn_2016,
  author = {Xuan-Son Vu},
  title = {Pre-trained Word2Vec models for Vietnamese},
  year = {2016},
  howpublished = {\url{https://github.com/sonvx/word2vecVN}},
  note = {commit xxxxxxx}
}

Cited in papers:

  1. Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras and Mark Johnson. 2018. VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, NAACL 2018, pages 56-60. bibtex, github

III. Some screenshots:

Screenshot of word2vec

Screenshot of spacy-n-fastext

Attributions/Thanks
