Clickbait anotomy: Identify clickbait with machine learning

Overview

This project is part of the Master's thesis "Clickbait anotomy: Identify clickbait with machine learning", written for the Research Master in Humanities, specialising in Human Language Technology, at Vrije Universiteit Amsterdam.

This project analyses the linguistic features of clickbait in order to distinguish clickbait from non-clickbait and to engineer features for three machine learning classifiers: Logistic Regression, Random Forest and Support Vector Machine (SVM).

The results of the analysis show that syntactic and semantic features are important for detecting clickbait headlines. In this project, 100-dimensional word embeddings and encoded sequences of part-of-speech and dependency tags are used to represent clickbait headlines, while a 100-dimensional document embedding model is trained to represent the contents of clickbait. The best performance is achieved by the SVM classifier with word embeddings, with 0.82 precision and recall.
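To illustrate what "encoded sequential part-of-speech tags" can look like, the sketch below maps tag sequences to fixed-length integer vectors. It is a minimal example under assumptions, not the encoding used in extract_features.py: the function names, the reserved padding and unknown ids, the maximum length and the example tags are all hypothetical.

```python
from typing import Dict, List

PAD_ID = 0  # reserved for padding (assumption)
UNK_ID = 1  # reserved for tags not seen when the vocabulary was built (assumption)


def build_tag_vocab(tag_sequences: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct tag an integer id, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tags in tag_sequences:
        for tag in tags:
            if tag not in vocab:
                vocab[tag] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def encode_tags(tags: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tags to ids, then truncate or right-pad to exactly max_len."""
    ids = [vocab.get(tag, UNK_ID) for tag in tags[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Hypothetical POS tags for two headlines (not taken from the datasets).
headlines_pos = [
    ["PRON", "AUX", "ADV", "VERB", "PRON"],
    ["NOUN", "VERB", "NOUN"],
]
vocab = build_tag_vocab(headlines_pos)
encoded = [encode_tags(tags, vocab, max_len=6) for tags in headlines_pos]
print(encoded)
```

Dependency tags could be encoded the same way with a separate vocabulary, so that every headline yields vectors of equal length for the classifiers.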

Data

This project uses two datasets: the Clickbait Challenge 2017 dataset and the clickbait headline dataset from Chakraborty et al. (2016). The first dataset contains clickbait and non-clickbait headlines and contents. The second consists only of headlines.

README

Download the two datasets from https://zenodo.org/record/3346491 and https://github.com/bhargaviparanjape/clickbait

Create a directory "Data" in the same directory as the code

Unzip the data files from the two links above and put the data folders in the "Data" folder
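Before running the scripts, it can help to confirm that the "Data" directory is where the code expects it. The check below is a small sketch: the helper name list_data_folders is hypothetical, and because the names of the unzipped folders are not specified here, it only lists whatever subfolders it finds.

```python
from pathlib import Path
from typing import List


def list_data_folders(code_dir: Path) -> List[str]:
    """Return the sorted names of the subfolders found in <code_dir>/Data."""
    data_dir = code_dir / "Data"
    if not data_dir.is_dir():
        raise FileNotFoundError(f"Expected a 'Data' directory at {data_dir}")
    return sorted(p.name for p in data_dir.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Assumes the check is run from the directory that holds the scripts.
    try:
        folders = list_data_folders(Path.cwd())
    except FileNotFoundError as err:
        print(err)
    else:
        print(f"Found {len(folders)} data folder(s): {folders}")
```

Two folders, one per downloaded dataset, would be the expected outcome of the steps above.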

Run the scripts in this order:

I. For the linguistic analysis of clickbait and non-clickbait

  python preprocess_data.py

  python analyse_data.py

The results of the analysis are saved in PDF format in the "Figures" folder

II. For the feature extraction from the two corpora

  python balanced_data.py

  python extract_features.py

The results are two embedding models, stored in the "Model" folder, and feature vectors for training, stored in the "Vector" folder
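The balancing step is not described in detail here. One common way to balance a two-class corpus is to randomly downsample the larger class, which the sketch below illustrates. This is an assumption about the general technique, not a description of what balanced_data.py does; the function name, the label convention and the toy headlines are hypothetical.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (headline, label); label 1 = clickbait (assumed convention)


def downsample(examples: List[Example], seed: int = 0) -> List[Example]:
    """Randomly drop examples from the larger class until both classes match."""
    rng = random.Random(seed)
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


# Toy corpus: 2 clickbait and 5 non-clickbait headlines (made up for illustration).
toy = [("You won't believe this", 1), ("10 tricks doctors hate", 1)]
toy += [(f"Council approves budget item {i}", 0) for i in range(5)]
balanced = downsample(toy)
print(len(balanced), sum(label for _, label in balanced))
```

A fixed seed keeps the selection reproducible across runs.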

III. For the training and evaluation of the machine learning algorithms

  python classifer.py

The results are reports on the performance of each classifier in txt format, stored in "Results"
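The three stages above can also be run from a single driver script. The sketch below is one way to do that, assuming the script file names as they appear in the repository (preprocess_data.py and classifer.py, spelled as in the repository) and that the driver sits in the same directory as the scripts. The names PIPELINE, build_commands and run_pipeline are hypothetical.

```python
import subprocess
import sys
from typing import List

# Script file names as they appear in the repository, in the order given above.
PIPELINE: List[str] = [
    "preprocess_data.py",   # I.   linguistic analysis
    "analyse_data.py",
    "balanced_data.py",     # II.  feature extraction
    "extract_features.py",
    "classifer.py",         # III. training and evaluation
]


def build_commands(python: str = sys.executable) -> List[List[str]]:
    """Build one command per script, using the given Python interpreter."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(dry_run: bool = True) -> None:
    """Print each command; execute it too when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

With dry_run=True the driver only prints the commands, which is a cheap way to verify the order before starting the longer training runs.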
