This is the official code for our paper "Unsupervised hard Negative Augmentation for contrastive learning".
This repository is tested on Python 3.8+
We present Unsupervised hard Negative Augmentation (UNA), a method that generates synthetic negative instances based on the term frequency-inverse document frequency (TF-IDF) retrieval model.
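To give an intuition for the idea, here is a simplified, self-contained sketch (not the exact UNA procedure from the paper): tokens with high TF-IDF weight carry the most discriminative content, so replacing one of them produces a sentence that stays lexically close to the anchor while differing in its most informative term. The `replacement` placeholder token is an assumption for illustration; in practice a substitute term would be drawn from the vocabulary.

```python
# Simplified illustration of TF-IDF-driven hard negative generation.
# Pure-Python TF-IDF with smoothed IDF; whitespace tokenisation.
import math
from collections import Counter

def tfidf_weights(sentence, corpus):
    """TF-IDF weight for each token of one whitespace-tokenised sentence."""
    tokens = sentence.lower().split()
    tf = Counter(tokens)
    n_docs = len(corpus)
    weights = {}
    for tok, count in tf.items():
        df = sum(1 for doc in corpus if tok in doc.lower().split())
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        weights[tok] = (count / len(tokens)) * idf
    return weights

def hard_negative(sentence, corpus, replacement="random"):
    """Swap out the highest-TF-IDF token to build a synthetic hard negative."""
    weights = tfidf_weights(sentence, corpus)
    target = max(weights, key=weights.get)  # most informative token
    out = [replacement if t.lower() == target else t for t in sentence.split()]
    return " ".join(out)

corpus = [
    "a cat sat on the mat",
    "the dog sat on the rug",
    "a bird flew over the house",
]
print(hard_negative(corpus[0], corpus))  # → "a random sat on the mat"
```

The stopword "the" gets a low weight because it appears in every document, so the content word "cat" is the one replaced.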
conda create -n una python=3.8
conda activate una
cd UNA
pip install -r requirements.txt
We used the training dataset from SimCSE, which can be downloaded by running the following script.
cd data/
bash download_wiki.sh
To create the paraphrased sentences, run the following script:
cd data/augment/
python paraphrase.py
Run the following script to prepare the TF-IDF matrix. If you prefer not to prepare the matrix offline, uncomment lines 94-101 in data/dataset.py.
cd data/augment/
python create_dict.py
Change the mode to 'para' to produce the TF-IDF matrix that also covers the paraphrased sentences.
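For reference, offline preparation along these lines can be sketched as follows. This is a hypothetical illustration in the spirit of data/augment/create_dict.py (the actual script may differ); the output file names and the `out_prefix` default are assumptions.

```python
# Hypothetical sketch: fit a TF-IDF vectoriser once and persist the sparse
# matrix plus vocabulary, so training runs do not pay the fitting cost.
import pickle
from scipy.sparse import save_npz
from sklearn.feature_extraction.text import TfidfVectorizer

def build_tfidf(sentences, out_prefix="tfidf/ori/tfidf"):
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(sentences)      # (n_sentences, vocab_size)
    save_npz(f"{out_prefix}_matrix.npz", matrix)      # sparse TF-IDF matrix
    with open(f"{out_prefix}_vocab.pkl", "wb") as f:  # token -> column index
        pickle.dump(vectorizer.vocabulary_, f)
    return matrix, vectorizer
```

At load time the matrix can be restored with `scipy.sparse.load_npz`, avoiding any refitting during training.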
The evaluation set can be downloaded by running the following script:
cd data/downstream/
bash download_dataset.sh
After preparing the datasets, the structure of the code should look like this:
.
├── data
│   ├── augment
│   │   ├── paraphrase.py # code for creating the paraphrased sentences
│   │   └── create_dict.py # code for creating the matrices under folder tfidf/
│   ├── downstream # folder containing the evaluation dataset
│   ├── stsbenchmark # folder containing the validation dataset
│   └── training # folder containing the training dataset
├── evaluate # evaluation code
├── function
│   ├── metrics.py
│   ├── seed.py # initialize random seeds
│   └── tfidf_una.py # calculates the TF-IDF matrices and vectors for UNA
├── model
│   ├── lambda_scheduler.py # contains different schedulers
│   └── models.py # backbone BERT/RoBERTa model
├── script # folder containing .sh scripts to run the pre-training file
├── tfidf
│   ├── ori # pre-saved TF-IDF representation of the original training dataset
│   └── para # pre-saved TF-IDF representation of the original training and paraphrased datasets
├── run.py # run pretraining with UNA
├── una.py # implementation of the UNA augmentation
└── utils.py
To reproduce our results on the STS tasks with the UNA framework, run the training script here.
Models can be downloaded from here.