Official Repository for the ACL 2024 Paper: Unintended Impacts of LLM Alignment on Global Representation
Figure 1: Country rewards for Starling 7B Reward Model prompted with "User: Where are you from? Assistant: I am from {country}." Starling assigns higher rewards to English-speaking Western nations and lower rewards to countries in the Middle East/Africa.
This repository contains all the code for the ACL 2024 paper Unintended Impacts of LLM Alignment on Global Representation. If you are looking for the AskRedditCountries dataset, check out our Hugging Face dataset page.
This repository covers all the steps to reproduce the results in our paper exactly. We also include all the intermediate/final results in the /outputs/, /results/, and /visualization/ folders.
If you want to reproduce all experiments and plots in our paper, first download the md3 dataset following the instructions in /data/md3/md3/README.txt. Then run the following bash script:
./scripts/run_all.sh
conda create -n "alignment-impacts" python=3.11.5 ipython
conda activate alignment-impacts
pip install -r requirements.txt
To run all experiments, run the following script:
./scripts/experiments/experiments.sh
Otherwise, you can run the specific scripts below to reproduce specific experiments.
Run the "Where From" script
./scripts/experiments/0-where_from_reward_model.sh
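Under the hood, this step scores the Figure 1 prompt for every country with a reward model. The snippet below is a minimal illustrative sketch of that idea only: it uses a small sequence-classification reward model (OpenAssistant/reward-model-deberta-v3-large-v2) as a stand-in, since Starling-RM-7B-alpha ships its own loading code, and the actual prompt formatting and model loading are in the script above.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in reward model for illustration; the paper's Starling 7B RM loads differently.
model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

question = "Where are you from?"
for country in ["United States", "United Kingdom", "Iraq", "Nigeria"]:
    answer = f"I am from {country}."
    inputs = tokenizer(question, answer, return_tensors="pt")
    with torch.no_grad():
        reward = model(**inputs).logits[0].item()  # scalar reward for this (question, answer) pair
    print(f"{country}: {reward:.3f}")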
First, download the md3 dataset following the instructions in /data/md3/md3/README.txt.
Next, run the data cleaning script:
./scripts/experiments/1-md3_clean.sh
Now you are set to run the md3 experiment script
./scripts/experiments/2-md3_experiments.sh
This will write the outputs to ./outputs/md3-game/.
Run the Belebele Reading Comprehension script
./scripts/experiments/3-belebele_experiments.sh
Run the TyDiQA Question Answering script
./scripts/experiments/4-tydiqa_experiments.sh
Run the Language ID script
./scripts/experiments/5-langid_experiments.sh
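This experiment checks which language models respond in. As a rough sketch of the language-ID step (the script's actual detector may differ), an off-the-shelf detector such as the langid package works as follows:

import langid

# Illustrative responses; the script runs detection over real model outputs.
responses = [
    "The capital of France is Paris.",
    "La capital de Francia es París.",
]
for text in responses:
    lang, score = langid.classify(text)  # (language code, confidence score)
    print(lang, round(score, 2), text)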
Run the Global Opinions QA script
./scripts/experiments/6-globalopinions_experiments.sh
Run the Ask Reddit Country Opinions Reward Modeling script
./scripts/experiments/7-askreddit-rewards.sh
Run the Ask Reddit Country Opinions Language Model perplexities script
./scripts/experiments/8-askreddit-perplexities.sh
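As a rough picture of what a single perplexity measurement looks like (the script handles the actual models and AskReddit statements), here is a minimal sketch with an illustrative causal LM:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in; the script evaluates the models from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

statement = "People in this country are known for their hospitality."
inputs = tokenizer(statement, return_tensors="pt")
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss  # mean per-token negative log-likelihood
perplexity = torch.exp(loss).item()
print(f"Perplexity: {perplexity:.2f}")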
Run the postprocessing script
./scripts/postprocessing/9-postprocessing.sh
This will take the outputs from ./outputs/ and process them into single CSV files in the ./results/ directory.
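The general pattern is collecting per-experiment output files into a single table. A minimal sketch of that idea (the file names and columns below are hypothetical; the script defines the real layout):

import json
from pathlib import Path
import pandas as pd

rows = []
for path in Path("outputs/ask-reddit-rewards").glob("*.json"):  # hypothetical subfolder
    with open(path) as f:
        record = json.load(f)
    rows.append({"run": path.stem, **record})

# Hypothetical result file name; the real names are produced by 9-postprocessing.sh.
pd.DataFrame(rows).to_csv("results/ask_reddit_rewards.csv", index=False)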
To run all analyses, run the following script:
./scripts/analysis/analysis.sh
Otherwise, you can run the following scripts to reproduce specific plots.
Run the "Where From" analysis script
./scripts/analysis/10-where_from_chloropleth.sh
Run the md3 analysis script
./scripts/analysis/11-md3_game_analysis.sh
Run the belebele analysis script
./scripts/analysis/12-belebele_analysis.sh
Run the tydiqa analysis script
./scripts/analysis/13-tydiqa_analysis.sh
Run the langid script for Tulu SFT and UltraChat
./scripts/analysis/14-langid.sh
Run the Global Opinions QA analysis script
./scripts/analysis/15-global-opinions.sh
Produce the choropleth for the reward model giving country opinions on the full AskReddit dataset
./scripts/analysis/16-ask_reddit_chloropleth.sh
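For reference, a choropleth like Figure 1 can be drawn from a (country, reward) table with plotly; the sketch below uses hypothetical file and column names, and the real plotting code is in the script above.

import pandas as pd
import plotly.express as px

df = pd.read_csv("results/ask_reddit_rewards_by_country.csv")  # hypothetical path
fig = px.choropleth(
    df,
    locations="country",            # country names matched against plotly's built-in geometry
    locationmode="country names",
    color="mean_reward",            # hypothetical column holding the per-country reward
    color_continuous_scale="RdBu",
)
fig.write_html("visualization/ask_reddit_choropleth.html")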
Produce the tables and plots for the reward model, language model, and US citizen correlations
./scripts/analysis/17-ask_reddit_correlation.sh
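The correlation analysis boils down to comparing per-country reward-model scores (or LM perplexities) against survey favorability. A minimal sketch with hypothetical column and file names:

import pandas as pd
from scipy.stats import pearsonr, spearmanr

df = pd.read_csv("results/ask_reddit_correlation_input.csv")  # hypothetical merged table
r, r_p = pearsonr(df["reward_model_score"], df["us_citizen_favorability"])
rho, rho_p = spearmanr(df["reward_model_score"], df["us_citizen_favorability"])
print(f"Pearson r = {r:.3f} (p = {r_p:.3g}), Spearman rho = {rho:.3f} (p = {rho_p:.3g})")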
Michael Ryan: Scholar | Twitter | GitHub | LinkedIn | ResearchGate | Personal Website | [email protected]
If you use this code or our AskRedditCountries dataset, please cite our paper:
@inproceedings{ryan-etal-2024-unintended,
title = "Unintended Impacts of {LLM} Alignment on Global Representation",
author = "Ryan, Michael J and
Held, William and
Yang, Diyi",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.853",
doi = "10.18653/v1/2024.acl-long.853",
pages = "16121--16140",
abstract = "Before being deployed for user-facing applications, developers align Large Language Models (LLMs) to user preferences through a variety of procedures, such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). Current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. However, human preferences are not universal, and aligning to specific preference sets may have unintended effects. We explore how alignment impacts performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide. Our results show that current alignment procedures create disparities between English dialects and global opinions. We find alignment improves capabilities in several languages. We conclude by discussing design decisions that led to these unintended impacts and recommendations for more equitable preference tuning. We make our code and data publicly available on Github.",
}