This repository contains the source code of the models submitted by the NTUA-SLP team for SemEval 2018 Tasks 1, 2 and 3.
- Task 1: Affect in Tweets https://arxiv.org/abs/1804.06658
- Task 2: Multilingual Emoji Prediction https://arxiv.org/abs/1804.06657
- Task 3: Irony Detection in English Tweets https://arxiv.org/abs/1804.06659
Please follow the steps below to train our models. First, install the project dependencies:

```
pip install -r ./requirements.txt
```
The models were trained on top of word2vec embeddings pre-trained on a large collection of Twitter messages. We collected a dataset of 550M English Twitter messages posted from 12/2014 to 06/2017.
For training the word embeddings we used Gensim's implementation of word2vec. For preprocessing the tweets we used ekphrasis. Finally, we used the following parameters for training the embeddings: `window_size = 6`, `negative_sampling = 5` and `min_count = 20`.
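For reference, here is a minimal sketch of that setup. Only the window size, negative sampling, minimum count and the 300d vector size of the shared embeddings come from the description in this README; the corpus file, the ekphrasis configuration and the CBOW/skip-gram choice are assumptions, not our exact training script.

```python
# A minimal sketch, NOT the exact training script: the corpus file, the ekphrasis
# settings and the model architecture (CBOW vs. skip-gram) are assumptions; only
# window=6, negative=5, min_count=20 and the 300d size come from this README.
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from gensim.models import Word2Vec

# hypothetical ekphrasis pipeline for Twitter text
text_processor = TextPreProcessor(
    normalize=['url', 'email', 'user', 'number'],
    segmenter="twitter",
    corrector="twitter",
    unpack_hashtags=True,
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

# hypothetical corpus: one tweet per line
with open("tweets.txt", encoding="utf-8") as f:
    sentences = [text_processor.pre_process_doc(line) for line in f]

model = Word2Vec(
    sentences,
    vector_size=300,   # called `size` in gensim < 4.0
    window=6,
    negative=5,
    min_count=20,
    workers=4,
)
model.wv.save_word2vec_format("ntua_twitter_300.txt", binary=False)
```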
We freely share our pre-trained embeddings:
- ntua_twitter_300.txt: 300 dimensional embeddings.
- ntua_twitter_affect_310.txt: 310 dimensional embeddings, consisting of 300d word2vec embeddings + 10 affective dimensions.
Finally, you should put the downloaded embeddings file in the `/embeddings` folder.
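If you want to sanity-check the downloaded file, gensim can load it directly, assuming it is in the standard plain-text word2vec format (the `.txt` extension suggests this, but it is an assumption):

```python
# Quick sanity check of the downloaded embeddings (assumes the file is in the
# plain-text word2vec format with a "<vocab_size> <dim>" header line).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings/ntua_twitter_300.txt", binary=False)
print(vectors["happy"][:5])                    # first dimensions of one word vector
print(vectors.most_similar("happy", topn=5))   # nearest neighbours in embedding space
```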
Our model definitions are stored in a Python configuration file. Each config contains the model parameters, as well as settings such as the batch size, number of epochs and embeddings file. You should update the `embeddings_file` parameter in the model's configuration in `model/params.py`.
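For illustration, such a configuration entry might look like the sketch below; only `embeddings_file` is mentioned above, so the other keys and their values are hypothetical and may not match the actual contents of `model/params.py`.

```python
# Hypothetical configuration entry; only the `embeddings_file` key is documented
# in this README, the remaining keys and values are illustrative placeholders.
TASK1_MODEL = {
    "embeddings_file": "ntua_twitter_affect_310",  # file placed in the /embeddings folder
    "embeddings_dim": 310,
    "batch_size": 64,
    "epochs": 50,
}
```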
You can test that you have a working setup by training a sentiment analysis model on SemEval 2017 Task 4A, which is used as a source task for transfer learning in Task 1.
First, start the visdom server, which is needed for visualizing the training progress:

```
python -m visdom.server
```

Then run the experiment:

```
python model/pretraining/sentiment2017.py
```
- If you only care about the source code of our deep-learning models, then look at the PyTorch modules in `modules/nn/`.
- In particular, `modules/nn/attention.py` contains an implementation of a self-attention mechanism, which supports multi-layer attention (a rough sketch of the idea is shown after this list).
- The scripts for running an experiment are stored in `model/taskX`.
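The sketch below illustrates the general idea of such a self-attention layer. It is written from scratch for this README and is not the implementation in `modules/nn/attention.py`; the layer sizes, the additive scoring function and the masking scheme are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Minimal additive self-attention over a sequence of hidden states."""
    def __init__(self, hidden_size, layers=1):
        super().__init__()
        # a small feed-forward scorer; stacking layers gives "multi-layer" attention
        blocks = []
        for _ in range(layers - 1):
            blocks.append(nn.Linear(hidden_size, hidden_size))
            blocks.append(nn.Tanh())
        blocks.append(nn.Linear(hidden_size, 1))
        self.scorer = nn.Sequential(*blocks)

    def forward(self, hidden, mask):
        # hidden: (batch, seq_len, hidden_size); mask: (batch, seq_len), 1 for real tokens
        scores = self.scorer(hidden).squeeze(-1)               # (batch, seq_len)
        scores = scores.masked_fill(mask == 0, float("-inf"))  # ignore padding positions
        weights = F.softmax(scores, dim=-1)                    # attention distribution
        context = torch.bmm(weights.unsqueeze(1), hidden).squeeze(1)  # (batch, hidden_size)
        return context, weights

# usage: pool the outputs of a (fake) BiLSTM into a single sentence representation
attn = SelfAttention(hidden_size=300, layers=2)
h = torch.randn(8, 50, 300)   # pretend BiLSTM outputs
m = torch.ones(8, 50)         # no padding in this toy batch
sentence_vec, attention_weights = attn(h, m)
```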
To make our codebase more accessible and easier to extend, we provide an overview of the structure of our project below. The most important parts are covered in greater detail.
- `datasets`: contains the datasets for the pretraining (SemEval 2017 - Task 4A)
- `dataloaders`: contains scripts for loading the datasets and for tasks 1, 2 and 3
- `embeddings`: in this folder you should put the word embedding files.
- `logger`: contains the source code for the `Trainer` class and the accompanying helper functions for experiment management, including experiment logging, a checkpoint and early-stopping mechanism, and visualization of the training process.
- `model`: experiment runner scripts (dataset loading, training pipeline etc.).
  - `pretraining`: the scripts for training the TL models
  - `task1`: the scripts for running the models for Task 1
  - `task2`: the scripts for running the models for Task 2
  - `task3`: the scripts for running the models for Task 3
- `modules`: the source code of the PyTorch deep-learning models and the baseline models.
  - `nn`: the source code of the PyTorch modules
  - `sklearn`: scikit-learn Transformers for implementing the baseline bag-of-words and neural bag-of-words models (a minimal bag-of-words sketch is shown after this list)
- `out`: this directory contains the generated model predictions and their corresponding attention files
- `predict`: scripts for generating predictions from saved models.
- `trained`: this is where all the model checkpoints are saved.
- `utils`: contains helper functions
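As a rough illustration of the kind of baseline implemented in `modules/sklearn`, here is a minimal bag-of-words sketch built from standard scikit-learn components; the feature extraction and classifier shown are assumptions, not the repository's exact configuration.

```python
# A minimal bag-of-words baseline sketch; the n-gram range and the classifier
# are illustrative choices, not the repository's exact setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

bow_baseline = Pipeline([
    ("bow", CountVectorizer(ngram_range=(1, 2), min_df=2)),  # unigram + bigram counts
    ("clf", LogisticRegression(max_iter=1000)),               # linear classifier on top
])

# toy usage: fit on a list of tweets with their labels, then predict
train_texts = ["i love this :)", "this is so ironic", "worst day ever"]
train_labels = [1, 0, 0]
bow_baseline.fit(train_texts, train_labels)
print(bow_baseline.predict(["what a great day"]))
```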
Note: Full documentation of the source code will be posted soon.