riverbed

Tools for content data mining and NLP at scale.

motivation

Given a set of text content in human language, or code, we would like to:

Filter for quality, NSFW and potentially illegal text.
Label the content and create classifiers for it.
Search, store and share the content with users and with other AI models.

installation

git clone https://github.com/ontocord/riverbed/
chmod ugo+x /content/riverbed/bin/lmplz
pip install https://github.com/kpu/kenlm/archive/master.zip
pip install dataset datasets fasttext indexed_gzip whoosh transformers sentencepiece spacy nltk fast-pytorch-kmeans mmh3 tqdm
git clone --recursive https://github.com/seomoz/simhash-py
rm simhash-py/simhash/*.cpp
python simhash-py/setup.py install build_ext --inplace
pip install tsnecuda==3.0.1+cu112 -f https://tsnecuda.isx.ai/tsnecuda_stable.html
python -m spacy download en_core_web_md
python -m nltk.downloader stopwords
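
After running the steps above, a quick check that the packages can be found may save debugging time later. The following is a minimal sketch, not part of riverbed itself: the helper `missing_modules` and the list `EXPECTED_MODULES` are hypothetical, and the import names are assumptions inferred from the package names in the install commands (for example, `fast-pytorch-kmeans` is assumed to import as `fast_pytorch_kmeans`).

```python
import importlib.util

# Assumed import names, inferred from the packages installed above.
# Some packages import under a different name than they install under,
# so adjust this list if your environment differs.
EXPECTED_MODULES = [
    "kenlm", "dataset", "datasets", "fasttext", "indexed_gzip", "whoosh",
    "transformers", "sentencepiece", "spacy", "nltk",
    "fast_pytorch_kmeans", "mmh3", "tqdm", "simhash", "tsnecuda",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    # find_spec looks a module up without importing it, so this stays fast
    # even for heavy packages such as transformers.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All expected modules were found.")
```

This only confirms that each module can be located. It does not verify that the spaCy model or the NLTK stopwords were downloaded, or that `tsnecuda` matches your CUDA version.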

history

Originally written by Ontocord, LLC. Donated to LAION for the open source community.
