This notebook provides examples of using the representation engineering techniques from the paper to detect and mitigate bias in large language models. It loads a pretrained LLaMA model together with pipelines for representation reading and control. On a bias dataset, it shows how to identify representation directions that correlate with race and gender. It then demonstrates using representation control to make the model's outputs more fair and unbiased; for example, it generates clinical vignettes with more balanced gender representation than the unconstrained model. Overall, this shows how the representation reading and control methods from the paper provide handles for understanding and improving fairness and bias issues in LLMs.
For more details, please check out section 6.3 of our RepE paper.
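Below is a minimal sketch of the workflow described above, assuming the `repe` package's pipeline registry and the Hugging Face `transformers` API. The checkpoint name, stimulus pairs, label format, layer choices, and steering coefficient are illustrative assumptions, not the notebook's exact code or data.

```python
# Sketch only: concrete values (model name, stimuli, layers, coeff) are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from repe import repe_pipeline_registry

repe_pipeline_registry()  # registers the "rep-reading" and "rep-control" pipelines

model_name = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# --- Representation reading: fit a direction correlated with a bias-related
# concept (here, gender) from contrasting stimulus pairs (placeholders). ---
pairs = [
    ("The doctor said he would review the chart.",
     "The doctor said she would review the chart."),
]
train_inputs = [s for pair in pairs for s in pair]
train_labels = [[True, False] for _ in pairs]  # assumed pairwise label format

rep_reading_pipeline = pipeline("rep-reading", model=model, tokenizer=tokenizer)
hidden_layers = list(range(-1, -model.config.num_hidden_layers, -1))

rep_reader = rep_reading_pipeline.get_directions(
    train_inputs,
    rep_token=-1,
    hidden_layers=hidden_layers,
    direction_method="pca",
    train_labels=train_labels,
)

# --- Representation control: steer generation along the learned directions
# to encourage more balanced completions (e.g., clinical vignettes). ---
layer_ids = list(range(-11, -30, -1))  # assumed control layers
rep_control_pipeline = pipeline(
    "rep-control", model=model, tokenizer=tokenizer,
    layers=layer_ids, control_method="reading_vec",
)

coeff = 4.0  # assumed steering strength
activations = {
    layer: torch.tensor(
        coeff * rep_reader.directions[layer] * rep_reader.direction_signs[layer]
    ).to(model.device).half()
    for layer in layer_ids
}

prompt = "Write a short clinical vignette about a patient with sarcoidosis."
outputs = rep_control_pipeline(
    [prompt], activations=activations, max_new_tokens=128, do_sample=False
)
print(outputs[0])
```

In the actual notebook the stimuli come from the bias dataset and the layer range and coefficient are tuned; the sign of the coefficient determines whether generation is pushed toward or away from the identified bias direction.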