This is a code plagiarism detector that finds plagiarism across a large corpus of source code. The main idea is to compute embeddings of the input texts using several methods and to feed these embeddings to a fully connected neural network, which makes the prediction.
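To make the embeddings-plus-classifier idea concrete, here is a minimal sketch in pure Python. It is not the code from this repository: the hashed bag-of-tokens embedding is a stand-in for the actual embedding methods, the weights are random (untrained), and the names `embed`, `pair_features`, `TinyMLP` and `DIM` are hypothetical and chosen only for illustration.

```python
import hashlib
import math
import random
import re

DIM = 64  # embedding size; an arbitrary choice for this sketch


def embed(code, dim=DIM):
    """Hashed bag-of-tokens embedding, L2-normalised (a stand-in, not the repo's method)."""
    vec = [0.0] * dim
    # Split into identifiers, numbers and single punctuation characters.
    for token in re.findall(r"[A-Za-z_]\w*|\d+|\S", code):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def pair_features(a, b):
    """Combine two embeddings into one input vector: |a - b| followed by a * b."""
    return [abs(x - y) for x, y in zip(a, b)] + [x * y for x, y in zip(a, b)]


class TinyMLP:
    """One hidden ReLU layer and a sigmoid output; weights are random, i.e. untrained."""

    def __init__(self, n_in, n_hidden=16, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.gauss(0, 0.1) for _ in range(n_hidden)]
        self.b2 = 0.0

    def predict(self, x):
        # Fully connected hidden layer with ReLU activation.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        # Single output unit squashed to (0, 1) by a sigmoid.
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))


original = "def add(a, b):\n    return a + b\n"
suspect = "def add(x, y):\n    return x + y\n"
model = TinyMLP(n_in=2 * DIM)
score = model.predict(pair_features(embed(original), embed(suspect)))
print(f"untrained plagiarism score: {score:.3f}")
```

Because the weights are untrained, the printed score carries no meaning; in a real setup the network would be fitted on labelled pairs of plagiarised and independent code before its output is interpreted.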
The package is tested under Python 3. It can be installed via:
git clone https://github.com/tttonyalpha/plagiasm_detector
Put all the files you want to check in one folder, then run the program from the source code directory using the command:
python3 ./predict.py path/to/folder/with/input/code
The program uses tree-sitter as its parsing backend and therefore supports every language for which a tree-sitter grammar is available.
[1] Nikolay Babakov, David Dale, Varvara Logacheva, Alexander Panchenko. A large-scale computational study of content preservation measures for text style transfer and paraphrase generation. aclanthology.org