Official Repository for the ACL 2024 Paper: Unintended Impacts of LLM Alignment on Global Representation
Figure 1: Country rewards for Starling 7B Reward Model prompted with "User: Where are you from? Assistant: I am from {country}." Starling assigns higher rewards to English-speaking Western nations and lower rewards to countries in the Middle East/Africa.
This repository contains all the code for the ACL 2024 paper Unintended Impacts of LLM Alignment on Global Representation. If you are looking for the AskRedditCountries dataset, check out our Hugging Face dataset page.
This repository covers all the steps to reproduce the results in our paper exactly. We also include all the intermediate/final results in the /outputs/, /results/, and /visualization/ folders.
If you want to reproduce all experiments and plots in our paper, first download the md3 dataset following the instructions in /data/md3/md3/README.txt. Then run the following bash script:
./scripts/run_all.sh
conda create -n "alignment-impacts" python=3.11.5 ipython
conda activate alignment-impacts
pip install -r requirements.txt
To run all experiments, run the following script:
./scripts/experiments/experiments.sh
Otherwise, you can run the specific scripts below to reproduce specific experiments.
Run the "Where From" script
./scripts/experiments/0-where_from_reward_model.sh
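Under the hood, this step scores the Figure 1 prompt for every country with a reward model. The snippet below is a minimal illustrative sketch of that idea only: it uses a small sequence-classification reward model (OpenAssistant/reward-model-deberta-v3-large-v2) as a stand-in, since Starling-RM-7B-alpha ships its own loading code, and the actual prompt formatting and model loading are in the script above.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in reward model for illustration; the paper's Starling 7B RM loads differently.
model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

question = "Where are you from?"
for country in ["United States", "United Kingdom", "Iraq", "Nigeria"]:
    answer = f"I am from {country}."
    inputs = tokenizer(question, answer, return_tensors="pt")
    with torch.no_grad():
        reward = model(**inputs).logits[0].item()  # scalar reward for this (question, answer) pair
    print(f"{country}: {reward:.3f}")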
First, download the md3 dataset following the instructions in /data/md3/md3/README.txt.
Next, run the data cleaning script:
./scripts/experiments/1-md3_clean.sh
Now you are set to run the md3 experiment script
./scripts/experiments/2-md3_experiments.sh
This will write the outputs to ./outputs/md3-game/.
Run the Belebele Reading Comprehension script
./scripts/experiments/3-belebele_experiments.sh
Run the TyDiQA Question Answering script
./scripts/experiments/4-tydiqa_experiments.sh
Run the Language ID script
./scripts/experiments/5-langid_experiments.sh
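This experiment checks which language models respond in. As a rough sketch of the language-ID step (the script's actual detector may differ), an off-the-shelf detector such as the langid package works as follows:

import langid

# Illustrative responses; the script runs detection over real model outputs.
responses = [
    "The capital of France is Paris.",
    "La capital de Francia es París.",
]
for text in responses:
    lang, score = langid.classify(text)  # (language code, confidence score)
    print(lang, round(score, 2), text)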
Run the Global Opinions QA script
./scripts/experiments/6-globalopinions_experiments.sh
Run the Ask Reddit Country Opinions Reward Modeling script
./scripts/experiments/7-askreddit-rewards.sh
Run the Ask Reddit Country Opinions Language Model perplexities script
./scripts/experiments/8-askreddit-perplexities.sh
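As a rough picture of what a single perplexity measurement looks like (the script handles the actual models and AskReddit statements), here is a minimal sketch with an illustrative causal LM:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in; the script evaluates the models from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

statement = "People in this country are known for their hospitality."
inputs = tokenizer(statement, return_tensors="pt")
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss  # mean per-token negative log-likelihood
perplexity = torch.exp(loss).item()
print(f"Perplexity: {perplexity:.2f}")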
Run the postprocessing script
./scripts/postprocessing/9-postprocessing.sh
This will take the outputs from ./outputs/ and process them into single CSV files in the ./results/ directory.
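The general pattern is collecting per-experiment output files into a single table. A minimal sketch of that idea (the file names and columns below are hypothetical; the script defines the real layout):

import json
from pathlib import Path
import pandas as pd

rows = []
for path in Path("outputs/ask-reddit-rewards").glob("*.json"):  # hypothetical subfolder
    with open(path) as f:
        record = json.load(f)
    rows.append({"run": path.stem, **record})

# Hypothetical result file name; the real names are produced by 9-postprocessing.sh.
pd.DataFrame(rows).to_csv("results/ask_reddit_rewards.csv", index=False)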
To run all analyses, run the following script:
./scripts/analysis/analysis.sh
Otherwise, you can run the following scripts to reproduce specific plots.
Run the "Where From" analysis script
./scripts/analysis/10-where_from_chloropleth.sh
Run the md3 analysis script
./scripts/analysis/11-md3_game_analysis.sh
Run the belebele analysis script
./scripts/analysis/12-belebele_analysis.sh
Run the tydiqa analysis script
./scripts/analysis/13-tydiqa_analysis.sh
Run the langid script for Tulu SFT and UltraChat
./scripts/analysis/14-langid.sh
Run the Global Opinions QA analysis script
./scripts/analysis/15-global-opinions.sh
Produce the choropleth for the reward model giving country opinions on the full AskReddit dataset
./scripts/analysis/16-ask_reddit_chloropleth.sh
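For reference, a choropleth like Figure 1 can be drawn from a (country, reward) table with plotly; the sketch below uses hypothetical file and column names, and the real plotting code is in the script above.

import pandas as pd
import plotly.express as px

df = pd.read_csv("results/ask_reddit_rewards_by_country.csv")  # hypothetical path
fig = px.choropleth(
    df,
    locations="country",            # country names matched against plotly's built-in geometry
    locationmode="country names",
    color="mean_reward",            # hypothetical column holding the per-country reward
    color_continuous_scale="RdBu",
)
fig.write_html("visualization/ask_reddit_choropleth.html")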
Produce the tables and plots for the reward model, language model, and US citizen correlations
./scripts/analysis/17-ask_reddit_correlation.sh
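The correlation analysis boils down to comparing per-country reward-model scores (or LM perplexities) against survey favorability. A minimal sketch with hypothetical column and file names:

import pandas as pd
from scipy.stats import pearsonr, spearmanr

df = pd.read_csv("results/ask_reddit_correlation_input.csv")  # hypothetical merged table
r, r_p = pearsonr(df["reward_model_score"], df["us_citizen_favorability"])
rho, rho_p = spearmanr(df["reward_model_score"], df["us_citizen_favorability"])
print(f"Pearson r = {r:.3f} (p = {r_p:.3g}), Spearman rho = {rho:.3f} (p = {rho_p:.3g})")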
Michael Ryan: Scholar | Twitter | GitHub | LinkedIn | ResearchGate | Personal Website | [email protected]
If you use this code or our AskRedditCountries dataset, please cite our paper:
@inproceedings{ryan-etal-2024-unintended,
title = "Unintended Impacts of {LLM} Alignment on Global Representation",
author = "Ryan, Michael J and
Held, William and
Yang, Diyi",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.853",
doi = "10.18653/v1/2024.acl-long.853",
pages = "16121--16140",
abstract = "Before being deployed for user-facing applications, developers align Large Language Models (LLMs) to user preferences through a variety of procedures, such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). Current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. However, human preferences are not universal, and aligning to specific preference sets may have unintended effects. We explore how alignment impacts performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide. Our results show that current alignment procedures create disparities between English dialects and global opinions. We find alignment improves capabilities in several languages. We conclude by discussing design decisions that led to these unintended impacts and recommendations for more equitable preference tuning. We make our code and data publicly available on Github.",
}