MinHash explanation: http://infolab.stanford.edu/~ullman/mmds/book.pdf (chapter 3, also archived here: https://perma.cc/K9B4-QTX3) A simple take here: https://moz.com/devblog/near-duplicate-detection/
This implementation borrows from Chris McCormick's MinHash tutorial. https://github.com/chrisjmccormick/MinHash
pip install -e "git+git://github.com/anastasia/minhash.git@master#egg=minhash"
python minhash.py doc1 doc2
import minhash
minhash.calculate(string_a, string_b)