This project is part of the Master's thesis "Clickbait anatomy: Identify clickbait with machine learning" for the Research Master in Humanities, specialising in Human Language Technology, at Vrije Universiteit Amsterdam.
This project analyses the linguistic features of clickbait in order to distinguish clickbait from non-clickbait and to engineer features for three machine learning classifiers: Logistic Regression, Random Forest and Support Vector Machine (SVM).
The analysis shows that syntactic and semantic features are important for detecting clickbait headlines. In this project, 100-dimensional word embeddings and encoded sequences of part-of-speech and dependency tags are used to represent clickbait headlines, while a 100-dimensional document embedding model is trained to represent the contents of clickbait. The best performance is achieved by the SVM classifier with word embeddings, with a precision and recall of 0.82.
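To illustrate what "encoded sequential part-of-speech tags" can look like in practice, here is a minimal sketch that maps tag sequences to fixed-length integer vectors. The scheme (integer ids, padding with 0, an id for unknown tags) and the names `build_tag_vocab` and `encode_tags` are assumptions made for illustration; the encoding used in this project may differ.

```python
from typing import Dict, List

PAD_ID = 0  # reserved for padding (assumed convention)
UNK_ID = 1  # reserved for tags not seen when the vocabulary was built


def build_tag_vocab(tag_sequences: List[List[str]]) -> Dict[str, int]:
    """Map each distinct tag to an integer id, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tags in tag_sequences:
        for tag in tags:
            if tag not in vocab:
                vocab[tag] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def encode_tags(tags: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Turn a tag sequence into a fixed-length list of ids (truncate or pad)."""
    ids = [vocab.get(tag, UNK_ID) for tag in tags[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Hypothetical, hand-tagged headlines (Universal POS tags), for illustration only.
headlines = [
    ["NUM", "NOUN", "PRON", "VERB"],
    ["PROPN", "VERB", "NOUN"],
]
vocab = build_tag_vocab(headlines)
encoded = [encode_tags(tags, vocab, max_len=5) for tags in headlines]
```

Dependency tags could be encoded the same way with their own vocabulary, and the resulting vectors concatenated with the embedding features.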
The data for this project consists of two datasets: the Clickbait Challenge 2017 dataset and the clickbait headline dataset from Chakraborty et al. (2016). The first dataset contains clickbait and non-clickbait headlines and contents. The second one consists only of headlines.
Download the two datasets from https://zenodo.org/record/3346491 and https://github.com/bhargaviparanjape/clickbait
Create a directory "Data" in the same directory as the code
Unzip the data files from the two links above and put the data folders in the "Data" folder
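The setup steps above can be checked with a small helper before running the scripts. This is only a sketch: `missing_data_folders` is a hypothetical helper that is not part of this repository, and the folder names passed to it are placeholders that should be replaced with the names of the unzipped dataset folders.

```python
from pathlib import Path
from typing import List


def missing_data_folders(code_dir: Path, expected: List[str]) -> List[str]:
    """Create <code_dir>/Data if needed and list expected subfolders not yet present."""
    data_dir = code_dir / "Data"
    data_dir.mkdir(exist_ok=True)
    return [name for name in expected if not (data_dir / name).is_dir()]


# Example usage (placeholder folder names, run from the directory holding the code):
#   for name in missing_data_folders(Path.cwd(), ["first_dataset", "second_dataset"]):
#       print(f"Missing data folder: Data/{name}")
```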
Run the scripts in this order:
I. For the linguistic analysis of clickbait and non-clickbait
python preprocessing_data.py
python analyse_data.py
The results of the analysis are saved in PDF format in the folder "Figures"
II. For the feature extraction from the two corpora
python balanced_data.py
python extract_features.py
The results are two embedding models stored in the folder "Model" and feature vectors for training stored in the folder "Vector"
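For readers unfamiliar with class balancing, the sketch below shows one common approach: randomly downsampling so that every label has as many examples as the rarest one. This is an assumption made for illustration; balanced_data.py may balance the corpora differently, and the function name `downsample` and the (headline, label) tuple layout are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, int]  # (headline, label); assumed layout


def downsample(examples: List[Example], seed: int = 0) -> List[Example]:
    """Randomly drop examples so every label has as many items as the rarest label."""
    if not examples:
        return []
    by_label: Dict[int, List[Example]] = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced: List[Example] = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced
```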
III. For training and evaluating the machine learning algorithms
python classifier
The results are reports on the performance of each classifier in txt format, stored in "Results"
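The steps above can also be chained in a single driver script. The following is a sketch and not part of the repository: `build_commands` and `run_pipeline` are hypothetical names, the script names are copied from the steps as written, and the driver is assumed to be started from the directory containing the code.

```python
import subprocess
import sys
from typing import List

# Script names copied from the steps above, in the documented order.
PIPELINE = [
    "preprocessing_data.py",
    "analyse_data.py",
    "balanced_data.py",
    "extract_features.py",
    "classifier",  # copied as written in the steps above
]


def build_commands(python: str = sys.executable) -> List[List[str]]:
    """Return one command per pipeline step, using the given interpreter."""
    return [[python, script] for script in PIPELINE]


def run_pipeline() -> None:
    """Run every step in order and stop at the first failure."""
    for command in build_commands():
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)  # raises CalledProcessError on failure
```

Calling `run_pipeline()` stops at the first failing step, so a later step never runs on missing intermediate files.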