This repository will contain the dataset and models described in NAACL2021 paper: News Headline Grouping as a Challenging NLU Task.
Example Headline Groups from the International Space Station timeline.
In the release, we provide two versions of the HLGD:
- The original annotation of the 10 timelines present in HLGD, with annotator identities anonymized. For each headline, we populate: the URL of the headline (
url
), the headline text (headline
), the publication date (date
), the group annotations of 5 annotators (annot_1_group
, ...annot_5_group
), and an aggregate group (global_group
). - The classification compatible version of HLGD containing ~20,000 headline pairs and binary labels indicating whether the headlines are in the same global group or not. The classification dataset is integrated into HuggingFace's
datasets
library. The dataset can be loaded in the following way:
!pip install datasets
from datasets import load_dataset
data = load_dataset('hlgd')
Note: We considered the legal component of the release of HLGD, and consider that the release of the dataset falls under fair use. See more detail here
We release two models:
cls_elec_base_hlgd_0.74f1.bin
model corresponds to theElectra Finetune on HLGD + Time
in the paper. An example use of the model is provided in model_classifier.pygpt2med_headline_gen_1.645.bin
model corresponds to the headline generator used for theHeadline Generator Swap
results. An example use of the model is provided in model_generator_swap.py
If you make use of the code, models, or algorithm, please cite our paper:
@inproceedings{Laban2021NewsHG,
title={News Headline Grouping as a Challenging NLU Task},
author={Laban, Philippe and Bandarkar, Lucas and Hearst, Marti A},
booktitle={NAACL 2021},
publisher = {Association for Computational Linguistics},
year={2021}
}
If you'd like to contribute, or have questions or suggestions, you can contact us at [email protected]. All contributions welcome!