LongPPL

This repository is the official implementation for What is Wrong with Perplexity for Long-context Language Modeling?

Introduction

Handling long-context inputs is crucial for large language models (LLMs). While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. We find that PPL overlooks key tokens, which are essential for long-context understanding, because it averages over all tokens, thereby obscuring models' true performance in long-context scenarios. To address this, we propose LongPPL, a novel metric that focuses on key tokens, identifying them with a long-short context contrastive method. Additionally, we introduce the LongCE (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens.

[Figure: LongPPL]

Our experiments demonstrate that LongPPL correlates strongly with performance on various long-context benchmarks (e.g., a Pearson correlation of -0.96), significantly outperforming traditional PPL in predictive accuracy. In addition, our experiments show that fine-tuning with LongCE yields consistent improvements as a plug-and-play solution.
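
For intuition, the long-short contrastive idea can be sketched as follows. This is a conceptual illustration only, not the implementation shipped in this repository; the truncation length is a placeholder.

import torch

def key_token_gain(model, tokenizer, text, short_len=1024):
    # Conceptual sketch: score each token by how much the long context improves
    # its log-probability over a truncated short context. Tokens with a large
    # gain rely on long-range information and behave like "key tokens".
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

    def token_logprobs(input_ids):
        # log p(x_t | context) for every next-token position
        with torch.no_grad():
            logits = model(input_ids).logits
        logp = torch.log_softmax(logits[:, :-1], dim=-1)
        return logp.gather(-1, input_ids[:, 1:, None]).squeeze(-1)

    lp_long = token_logprobs(ids)                   # full (long) context
    lp_short = token_logprobs(ids[:, -short_len:])  # truncated (short) context
    n = lp_short.shape[1]
    # Compare the two scores on the final positions only
    return lp_long[:, -n:] - lp_short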

Requirements

Python 3.10 + PyTorch 2.3 + Transformers 4.45

pip install -r requirements.txt

LongPPL

The code supports calculating LongPPL on custom LLMs and datasets. Please run:

pip install longppl

or

git clone https://github.com/PKU-ML/LongPPL.git
cd LongPPL
pip install -e .

and use the following code to calculate LongPPL:

from longppl import compute_longppl

output = compute_longppl(text, model, evaluator_model, tokenizer, evaluator_tokenizer)
print(output['longppl'])
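
For example, the models can be loaded with Hugging Face Transformers and passed to compute_longppl as follows. The model names and the input file are placeholders, not a recommendation from the paper.

from transformers import AutoModelForCausalLM, AutoTokenizer
from longppl import compute_longppl

# Placeholder checkpoints: the model to be evaluated and the evaluator model
# used to identify key tokens.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B", device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
evaluator_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B", device_map="auto", torch_dtype="auto")
evaluator_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

text = open("long_document.txt").read()  # any long-context text
output = compute_longppl(text, model, evaluator_model, tokenizer, evaluator_tokenizer)
print(output['longppl'])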

Reproduce the paper

LongPPL

To reproduce the LongPPL experiments in our paper, please run:

cd perplexity
sh run_ppl.sh

The evaluation data can be downloaded from GovReport (tokenized). Here are our main results.

| Models | LongPPL (Qwen-72B-Instruct) | LongPPL (Mistral Large 2) | LongPPL (Llama-3.1-8B) | PPL |
| --- | --- | --- | --- | --- |
| Mixtral-8x7B | 2.08 | 2.50 | 1.74 | 3.67 |
| FILM-7B | 2.49 | 3.17 | 2.03 | 4.47 |
| Mistral-7B | 2.68 | 3.49 | 2.19 | 4.25 |
| Qwen1.5-14B | 2.97 | 2.93 | 2.33 | 5.23 |
| Qwen2-7B | 2.99 | 2.73 | 2.29 | 4.97 |
| Phi-3-small | 2.98 | 2.86 | 2.41 | 5.42 |
| CLEX-7B | 3.70 | 4.60 | 2.92 | 4.13 |
| Yi-6B | 3.62 | 3.92 | 2.86 | 5.11 |
| Yarn-7B | 3.67 | 4.88 | 3.10 | 4.17 |
  • While perplexity shows almost no correlation with the models' long-context performance as measured by these benchmarks (see our paper for details), LongPPL demonstrates a strong correlation.

LongCE

To conduct long-context fine-tuning with LongCE, run accelerate config and enable DeepSpeed acceleration. deepspeed/zero3.json is the DeepSpeed configuration file we used for training.

cd finetune
sh train.sh

The training data can be downloaded from PG19 and Pile-arxiv. To run models with eabf, please downgrade transformers to version 4.37.0.
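
For intuition, a key-token-weighted cross-entropy of the kind LongCE implements can be sketched as follows. This simplified version assumes per-token weights (e.g., derived from key-token scores) are already available; it is not the training code in finetune/.

import torch.nn.functional as F

def weighted_cross_entropy(logits, labels, token_weights):
    # Re-weighted next-token cross-entropy: tokens with larger weights
    # (e.g. key tokens) contribute more to the loss.
    # logits: [batch, seq, vocab]; labels, token_weights: [batch, seq]
    logits = logits[:, :-1].contiguous()          # position t predicts token t+1
    labels = labels[:, 1:].contiguous()
    weights = token_weights[:, 1:].contiguous().float()

    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
    ).view_as(labels)

    # Normalize by total weight so the scale stays comparable to plain CE
    return (weights * ce).sum() / weights.sum().clamp(min=1e-8)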

Evaluation on Long-context Benchmark

In the paper, we evaluate models on LongBench, LongEval, and RULER. Please refer to their respective code repositories for evaluation details.

Citation

If you use our code, please cite:

@article{fang2024wrong,
      title={What is Wrong with Perplexity for Long-context Language Modeling?}, 
      author={Lizhe Fang and Yifei Wang and Zhaoyang Liu and Chenheng Zhang and Stefanie Jegelka and Jinyang Gao and Bolin Ding and Yisen Wang},
      year={2024},
      journal={arXiv preprint arXiv:2410.23771}
}
