- 2016-01-13 fix for #46, thanks to @buhrmann for reporting
- 2016-01-16 format of vocabulary changed.
- do not keep
doc_proportions
. see #52. - add
stop_words
argument toprune_vocabulary
. signature also was changed.
- do not keep
- 2016-01-17 fix for #51. if iterator over tokens returns list with names, these names will be:
- stored as
attr(corpus, 'ids')
- rownames in dtm
- names for dtm list in
lda_c
format
- stored as
First CRAN release of text2vec.
- Fast text vectorization with stable streaming API on arbitrary n-grams.
- Functions for vocabulary extraction and management
- Hash vectorizer (based on digest murmurhash3)
- Vocabulary vectorizer
- GloVe algorithm word embeddings.
- Fast term-cooccurence matrix factorization via parallel async AdaGrad.
- All core functions written in C++.