This is the official code for our paper "Unsupervised hard Negative Augmentation for contrastive learning".
This repository is tested on Python 3.8+
We present Unsupervised hard Negative Augmentation (UNA), a method that generates synthetic negative instances based on the term frequency-inverse document frequency (TF-IDF) retrieval model.
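To give an intuition for the idea, here is a simplified, self-contained sketch (not the exact UNA procedure from the paper): tokens with high TF-IDF weight carry the most discriminative content, so replacing one of them produces a sentence that stays lexically close to the anchor while differing in its most informative term. The `replacement` placeholder token is an assumption for illustration; in practice a substitute term would be drawn from the vocabulary.

```python
# Simplified illustration of TF-IDF-driven hard negative generation.
# Pure-Python TF-IDF with smoothed IDF; whitespace tokenisation.
import math
from collections import Counter

def tfidf_weights(sentence, corpus):
    """TF-IDF weight for each token of one whitespace-tokenised sentence."""
    tokens = sentence.lower().split()
    tf = Counter(tokens)
    n_docs = len(corpus)
    weights = {}
    for tok, count in tf.items():
        df = sum(1 for doc in corpus if tok in doc.lower().split())
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        weights[tok] = (count / len(tokens)) * idf
    return weights

def hard_negative(sentence, corpus, replacement="random"):
    """Swap out the highest-TF-IDF token to build a synthetic hard negative."""
    weights = tfidf_weights(sentence, corpus)
    target = max(weights, key=weights.get)  # most informative token
    out = [replacement if t.lower() == target else t for t in sentence.split()]
    return " ".join(out)

corpus = [
    "a cat sat on the mat",
    "the dog sat on the rug",
    "a bird flew over the house",
]
print(hard_negative(corpus[0], corpus))  # → "a random sat on the mat"
```

The stopword "the" gets a low weight because it appears in every document, so the content word "cat" is the one replaced.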
conda create -n una python=3.8
conda activate una
cd UNA
pip install -r requirements.txt
We used the training dataset from SimCSE, which can be downloaded by running the following script.
cd data/
bash download_wiki.sh
To create the paraphrased sentences, run the following script:
cd data/augment/
python paraphrase.py
Run the following script to prepare the TF-IDF matrix. If you prefer not to prepare the matrix offline, uncomment lines 94-101 in data/dataset.py.
cd data/augment/
python create_dict.py
Change the mode to 'para' to produce the TF-IDF matrix that also covers the paraphrased sentences.
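For reference, offline preparation along these lines can be sketched as follows. This is a hypothetical illustration in the spirit of data/augment/create_dict.py (the actual script may differ); the output file names and the `out_prefix` default are assumptions.

```python
# Hypothetical sketch: fit a TF-IDF vectoriser once and persist the sparse
# matrix plus vocabulary, so training runs do not pay the fitting cost.
import pickle
from scipy.sparse import save_npz
from sklearn.feature_extraction.text import TfidfVectorizer

def build_tfidf(sentences, out_prefix="tfidf/ori/tfidf"):
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(sentences)      # (n_sentences, vocab_size)
    save_npz(f"{out_prefix}_matrix.npz", matrix)      # sparse TF-IDF matrix
    with open(f"{out_prefix}_vocab.pkl", "wb") as f:  # token -> column index
        pickle.dump(vectorizer.vocabulary_, f)
    return matrix, vectorizer
```

At load time the matrix can be restored with `scipy.sparse.load_npz`, avoiding any refitting during training.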
The evaluation set can be downloaded by running the following script:
cd data/downstream/
bash download_dataset.sh
After preparing the datasets, the structure of the code should look like this:
.
├── data
│   ├── augment
│   │   ├── paraphrase.py # code for creating the paraphrased sentences
│   │   └── create_dict.py # code for creating the matrices under folder tfidf/
│   ├── downstream # folder containing the evaluation dataset
│   ├── stsbenchmark # folder containing the validation dataset
│   └── training # folder containing the training dataset
├── evaluate # evaluation code
├── function
│   ├── metrics.py
│   ├── seed.py # initialize random seeds
│   └── tfidf_una.py # calculates the TF-IDF matrices and vectors for UNA
├── model
│   ├── lambda_scheduler.py # contains different schedulers
│   └── models.py # backbone BERT/RoBERTa model
├── script # folder containing .sh scripts to run the pre-training file
├── tfidf
│   ├── ori # pre-saved TF-IDF representation of the original training dataset
│   └── para # pre-saved TF-IDF representation of the original training and paraphrased datasets
├── run.py # run pretraining with UNA
├── una.py # implementation of the UNA augmentation
└── utils.py
To reproduce our results on the STS tasks with the UNA framework, run the training script here.
Models can be downloaded from here.