# Author Attribution of the "DerStandard Forum Writing Style"
Note: the stopword list has to be downloaded first, before preprocessing.
## Setup

First you need to install all the packages from the requirements.txt file. It is recommended to create a virtual environment first and install them there, or alternatively install the packages locally. The easiest way to install the required packages is:
pip install -r requirements.txt
For German lemmatization you also need to download the lemmatizer model from spaCy:
python -m spacy download de_core_news_md
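The stopwords mentioned at the top also need to be downloaded before preprocessing. A minimal sketch, assuming the German stopword list comes from NLTK (the actual source is not documented here):

```python
# Sketch: download the German stopword list via NLTK.
# Assumption: the preprocessing uses NLTK stopwords; adapt if another source is used.
import nltk

nltk.download("stopwords")

from nltk.corpus import stopwords
german_stopwords = stopwords.words("german")  # list of German stopwords
```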
Executing the system is easy; simply call:
python main.py
In main.py, you can set the VECTORIZATIONTYPE flag to one of the vectorizer types (stylometry, bag of words, TF-IDF). Depending on the chosen value, a different feature matrix is built and the corresponding results described in the report are produced.
Changing the value FIXED_NUMBER_COMMENTS controls which authors are used for further processing: only authors with FIXED_NUMBER_COMMENTS comments are kept.
There are also additional flags; these are mainly useful if you want to save the preprocessed dataframe locally so that the execution time is reduced.
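As an illustration, the configuration block in main.py might look roughly like this; the concrete values and the names of the save/load flags are assumptions rather than the actual code:

```python
# Sketch of the configuration section in main.py (values are examples only).
VECTORIZATIONTYPE = "tfidf"        # one of: "stylometry", "bow", "tfidf" (exact strings assumed)
FIXED_NUMBER_COMMENTS = 1000       # keep only authors with this many comments
SAVE_PREPROCESSED_DF = True        # hypothetical flag: cache the preprocessed dataframe locally
LOAD_PREPROCESSED_DF = True        # hypothetical flag: reuse the cached dataframe to save time
```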
If you run all classifiers, expect a runtime of up to ~3 min when a preprocessed dataset is loaded from the hard drive; if the data still has to be preprocessed, it takes about ~5 min.
PC configuration used for these timings:
CPU: AMD Ryzen 5 5600X
RAM: 16 GB DDR4 3200 MHz
The assets folder contains the images we used for the report.
The dataset folder contains the Million Post Corpus dataset and the database we obtained from it.
The dataset is available at https://ofai.github.io/million-post-corpus/#data-set-description
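As an orientation for working with the data: the Million Post Corpus ships as an SQLite database, so the comments can be read, for example, with pandas. The file name and column selection below are assumptions and should be checked against the actual files in the dataset folder:

```python
# Sketch: read the posts from the Million Post Corpus SQLite database.
# The file name "corpus.sqlite3" and the selected columns are assumptions.
import sqlite3
import pandas as pd

con = sqlite3.connect("dataset/corpus.sqlite3")
posts = pd.read_sql_query(
    "SELECT ID_Post, ID_User, Headline, Body FROM Posts",
    con,
)
con.close()
print(posts.head())
```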
The models folder contains the classifiers we used, as well as the deep-learning network we tried but did not use in the later stages.
The papers folder contains the papers we refer to and read to understand the state of the art.
The preprocess folder contains two files:
- data_preprocessing.py contains the logic for reducing the number of authors to those with the requested number of comments, e.g. 1000 comments per author (a sketch of this follows below).
- nlp_preprocessing.py contains all the preprocessing steps applied to the comments.
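A minimal sketch of the kind of author filtering data_preprocessing.py performs, assuming the comments are held in a pandas dataframe with ID_User and Body columns (the actual implementation in the repository may differ):

```python
# Sketch: keep only authors with at least a fixed number of comments and
# sample that many comments per author (column names are assumptions).
import pandas as pd

def reduce_to_fixed_number(df: pd.DataFrame, fixed_number_comments: int) -> pd.DataFrame:
    counts = df.groupby("ID_User")["Body"].count()
    authors = counts[counts >= fixed_number_comments].index
    reduced = df[df["ID_User"].isin(authors)]
    # take the first fixed_number_comments comments of each remaining author
    return reduced.groupby("ID_User").head(fixed_number_comments)
```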
The vectorization folder contains the feature extraction methods we applied (stylometry, TF-IDF, bag of words) in two files: vectorization.py and stylometry.py.
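For orientation, the bag-of-words and TF-IDF variants could be built with scikit-learn roughly as sketched below; the stylometry features are hand-crafted in stylometry.py and not reproduced here. The function name, parameter values, and type strings are assumptions, not taken from vectorization.py:

```python
# Sketch: build a feature matrix from a list of comments, depending on the
# chosen vectorization type (parameter values are illustrative assumptions).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def vectorize(comments, vectorization_type="tfidf"):
    if vectorization_type == "bow":
        vectorizer = CountVectorizer(max_features=5000)
    elif vectorization_type == "tfidf":
        vectorizer = TfidfVectorizer(max_features=5000)
    else:
        raise ValueError(f"Unsupported vectorization type: {vectorization_type}")
    return vectorizer.fit_transform(comments), vectorizer
```

Something like vectorize(posts["Body"].tolist(), VECTORIZATIONTYPE) would then return the sparse feature matrix that the classifiers are trained on.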