MABEL: Attenuating Gender Bias using Textual Entailment Data

Authors: Jacqueline He, Mengzhou Xia, Christiane Fellbaum, Danqi Chen

This repository contains the code for our EMNLP 2022 paper, "MABEL: Attenuating Gender Bias using Textual Entailment Data".

MABEL (a Method for Attenuating Bias using Entailment Labels) is a task-agnostic intermediate pre-training technique that leverages entailment pairs from NLI data to produce representations which are both semantically capable and fair. This approach exhibits a good fairness-performance tradeoff across intrinsic and extrinsic gender bias diagnostics, with minimal damage on natural language understanding tasks.

Quick Start

With the transformers package installed, you can import the off-the-shelf model like so:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/mabel-bert-base-uncased")

model = AutoModelForMaskedLM.from_pretrained("princeton-nlp/mabel-bert-base-uncased")
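Once loaded, the checkpoint behaves like any other masked language model in transformers. The helper below is a minimal sketch of querying it for a masked token; the function name fill_mask and its arguments are illustrative assumptions, not part of this repository.

```python
def fill_mask(text, model_name="princeton-nlp/mabel-bert-base-uncased", top_k=5):
    """Return the top_k most likely tokens for the single mask token in `text`.

    Hypothetical helper for illustration; not part of the MABEL code base.
    """
    # Imported lazily so that defining the helper does not load the libraries.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    # Locate the mask token in the (single) input sequence.
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    mask_positions = is_mask.nonzero(as_tuple=True)[0]
    if mask_positions.numel() != 1:
        raise ValueError("expected exactly one mask token in the input text")

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

    # Rank the vocabulary at the masked position and keep the best top_k ids.
    top_ids = torch.topk(logits[0, mask_positions[0]], k=top_k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)
```

For the BERT checkpoints the mask token is [MASK], so a call could look like fill_mask("The doctor said that [MASK] would be late."). The RoBERTa checkpoints use <mask> instead.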

Model List

MABEL Models	ICAT ↑
princeton-nlp/mabel-bert-base-uncased	73.98
princeton-nlp/mabel-bert-large-uncased	73.45
princeton-nlp/mabel-roberta-base	69.68
princeton-nlp/mabel-roberta-large	69.49

Note: The ICAT score is a bias metric that consolidates a model's capacity for language modeling and stereotypical association into a single numerical indicator. More information can be found in the StereoSet (Nadeem et al., 2021) paper.
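As a rough illustration of how the two components combine, the sketch below computes ICAT from a language modeling (LM) score and a stereotype (SS) score using the formula given in the StereoSet paper, ICAT = LM * min(SS, 100 - SS) / 50. The function name icat is an assumption made for this example; the repository's own evaluation script remains the reference.

```python
def icat(lm_score: float, ss_score: float) -> float:
    """Idealized CAT score as defined in the StereoSet paper.

    Both inputs are percentages in [0, 100]. The result is highest (equal to
    lm_score) when ss_score is exactly 50 and drops to 0 when ss_score is
    0 or 100.
    """
    return lm_score * min(ss_score, 100.0 - ss_score) / 50.0


# LM and SS values copied from the sample StereoSet output in this README.
print(round(icat(84.5453251710623, 56.248299466465376), 2))
```

With the LM and SS values from the sample StereoSet output in this README, the function reproduces the reported ICAT of 73.98.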

Training

Before training, make sure that the counterfactually-augmented NLI data, processed from SNLI and MNLI, is downloaded and stored under the training directory as entailment_data.csv.
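A quick pre-flight check can confirm that the file is where the training script expects it. The sketch below only verifies that the CSV exists and returns its first few rows; it makes no assumption about the column layout, and the helper name preview_csv is hypothetical.

```python
import csv
from pathlib import Path


def preview_csv(path, n_rows=3):
    """Return the first n_rows rows of a CSV file, or raise if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"expected training data at {path}")
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        # zip stops after n_rows rows or at the end of the file, whichever is first.
        return [row for _, row in zip(range(n_rows), reader)]
```

From the repository root, preview_csv("training/entailment_data.csv") would then show the first rows of the downloaded data.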

1. Install package dependencies

pip install -r requirements.txt

2. Run training script

cd training
chmod +x run.sh 
./run.sh

You can configure the hyper-parameters in run.sh accordingly. Models are saved to out/. The optimal set of hyper-parameters varies depending on the choice of backbone encoder, and the full training details can be found in the paper.

Evaluation

Intrinsic Metrics

If you use your own trained model instead of our provided HF checkpoint, you must first convert it before intrinsic evaluation by running python -m training.convert_to_hf --path /path/to/your/checkpoint --base_model bert. This converts the checkpoint to a standard BertForMaskedLM model; use --base_model roberta for RobertaForMaskedLM.

Also, please note that we use Meade et al.'s method of computation and datasets for both StereoSet and CrowS-Pairs; this is why the metrics for the pre-trained models are not directly comparable to those reported in the original benchmark papers.

1. StereoSet (Nadeem et al., 2021)

Command:

python -m benchmark.intrinsic.stereoset.predict --model_name_or_path princeton-nlp/mabel-bert-base-uncased && 
python -m benchmark.intrinsic.stereoset.eval

Output:

intrasentence
gender
Count: 2313.0
LM Score: 84.5453251710623
SS Score: 56.248299466465376
ICAT Score: 73.98003496789251

Collective Results:

Models	LM ↑	SS ◇	ICAT ↑
bert-base-uncased	84.17	60.28	66.86
princeton-nlp/mabel-bert-base-uncased	84.54	56.25	73.98
bert-large-uncased	86.54	63.24	63.62
princeton-nlp/mabel-bert-large-uncased	84.93	56.76	73.45
roberta-base	88.93	66.32	59.90
princeton-nlp/mabel-roberta-base	87.44	60.14	69.68
roberta-large	88.81	66.82	58.92
princeton-nlp/mabel-roberta-large	89.72	61.28	69.49

◇: The closer to 50, the better.

2. CrowS-Pairs (Nangia et al., 2020)

Command:

python -m benchmark.intrinsic.crows.eval --model_name_or_path princeton-nlp/mabel-bert-base-uncased

Output:

====================================================================================================
Total examples: 262
Metric score: 50.76
Stereotype score: 51.57
Anti-stereotype score: 49.51
Num. neutral: 0.0
====================================================================================================

Collective Results:

Models	Metric Score ◇
bert-base-uncased	57.25
princeton-nlp/mabel-bert-base-uncased	50.76
bert-large-uncased	55.73
princeton-nlp/mabel-bert-large-uncased	51.15
roberta-base	60.15
princeton-nlp/mabel-roberta-base	49.04
roberta-large	60.15
princeton-nlp/mabel-roberta-large	54.41

◇: The closer to 50, the better.
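The three percentages in the sample output are related: the metric score pools the stereotype and anti-stereotype pairs. The sketch below shows that relationship. The function name and the per-direction counts are illustrative assumptions, chosen so that rounding reproduces the sample output; they are not taken from the repository or from a real run.

```python
def crows_scores(stereo_hits, stereo_total, anti_hits, anti_total):
    """Compute CrowS-Pairs style percentages from per-direction counts.

    *_hits is the number of pairs in that direction that count toward the
    score; *_total is the number of pairs in that direction.
    """
    total = stereo_total + anti_total
    return {
        "metric": round((stereo_hits + anti_hits) / total * 100, 2),
        "stereotype": round(stereo_hits / stereo_total * 100, 2),
        "anti_stereotype": round(anti_hits / anti_total * 100, 2),
    }


# Hypothetical counts that sum to the 262 examples in the sample output.
print(crows_scores(stereo_hits=82, stereo_total=159, anti_hits=51, anti_total=103))
```

Because the metric score is a pooled percentage, it always lies between the stereotype and anti-stereotype scores.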

Extrinsic Metrics

Occupation Classification

See benchmark/extrinsic/occ_cls/README.md for full training instructions and results.

Natural Language Inference

See benchmark/extrinsic/nli/README.md for full training instructions and results.

Coreference Resolution

See benchmark/extrinsic/coref/README.md for full training instructions and results.

Language Understanding

1. GLUE (Wang et al., 2018)

We fine-tune on GLUE through the transformers library, following the default hyper-parameters.

A straightforward way is to download the current transformers repository:

git clone https://github.com/huggingface/transformers
cd transformers
pip install .

Then set up the environment dependencies:

cd ./examples/pytorch/text-classification
pip install -r requirements.txt

Here is a sample script for one of the GLUE tasks, MRPC:

# task options: cola, sst2, mrpc, stsb, qqp, mnli, qnli, rte 
export TASK_NAME=mrpc
export OUTPUT_DIR=out/

CUDA_VISIBLE_DEVICES=0 python run_glue.py \
  --model_name_or_path princeton-nlp/mabel-bert-base-uncased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir $OUTPUT_DIR
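
To run every task listed in the comment above rather than MRPC alone, the same command can be generated in a loop. This is a minimal sketch that assumes it is launched from examples/pytorch/text-classification; the helper names and the per-task output directories are choices made for this example, not part of the repository.

```python
import os
import subprocess
import sys

GLUE_TASKS = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte"]


def build_glue_command(task, model_name="princeton-nlp/mabel-bert-base-uncased",
                       output_root="out"):
    """Build the run_glue.py command for one task, mirroring the script above."""
    return [
        sys.executable, "run_glue.py",
        "--model_name_or_path", model_name,
        "--task_name", task,
        "--do_train",
        "--do_eval",
        "--max_seq_length", "128",
        "--per_device_train_batch_size", "32",
        "--learning_rate", "2e-5",
        "--num_train_epochs", "3",
        # One sub-directory per task so that runs do not share an output folder.
        "--output_dir", os.path.join(output_root, task),
    ]


def run_all(tasks=GLUE_TASKS):
    """Fine-tune on each task in turn, stopping at the first failure."""
    for task in tasks:
        subprocess.run(build_glue_command(task), check=True)
```

Calling run_all() fine-tunes on the eight tasks sequentially; set CUDA_VISIBLE_DEVICES in the environment beforehand to pick a GPU, as in the shell script.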

2. SentEval Transfer Tasks (Conneau et al., 2018)

Preprocess:

Make sure you have cloned the SentEval repo and added its contents into this repository's transfer folder, and run ./get_transfer_data.bash in data/downstream to download the evaluation data.

Command:

python -m benchmark.transfer.eval --model_name_or_path princeton-nlp/mabel-bert-base-uncased --task_set transfer

Output:

+-------+-------+-------+-------+-------+-------+-------+-------+
|   MR  |   CR  |  SUBJ |  MPQA |  SST2 |  TREC |  MRPC |  Avg. |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 78.33 | 85.83 | 93.78 | 89.13 | 85.50 | 85.20 | 68.87 | 83.81 |
+-------+-------+-------+-------+-------+-------+-------+-------+
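
The Avg. column is the unweighted mean of the seven task accuracies. A short sketch of that arithmetic, using the values from the sample output, is given below; the variable names are illustrative.

```python
# Task accuracies copied from the sample output table.
transfer_scores = {
    "MR": 78.33,
    "CR": 85.83,
    "SUBJ": 93.78,
    "MPQA": 89.13,
    "SST2": 85.50,
    "TREC": 85.20,
    "MRPC": 68.87,
}

# Unweighted mean across tasks, rounded to two decimals like the table.
transfer_avg = round(sum(transfer_scores.values()) / len(transfer_scores), 2)
print(transfer_avg)
```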

Collective Results:

Models	Transfer Avg. ↑
bert-base-uncased	83.73
princeton-nlp/mabel-bert-base-uncased	83.81
bert-large-uncased	86.54
princeton-nlp/mabel-bert-large-uncased	86.09

Code Acknowledgements

Evaluation code for StereoSet and CrowS-Pairs is adapted from "An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models" (Meade et al., 2022).
Model implementation code is adapted from SimCSE (Gao et al., 2021).
Evaluation code for the transfer tasks relies on the SentEval package and adapts a script prepared by SimCSE (Gao et al., 2021).
Evaluation code for GLUE relies on the Hugging Face transformers package (Wolf et al., 2019).
Training and evaluation for e2e span-based coreference resolution follow a PyTorch implementation (Xu and Choi, 2020).
Repository is formatted with .

Citation

@inproceedings{he2022mabel,
   title={{MABEL}: Attenuating Gender Bias using Textual Entailment Data},
   author={He, Jacqueline and Xia, Mengzhou and Fellbaum, Christiane and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2022}
}
