[ACL 2024 Findings] Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning
Columbia University · University of Macau
This repository holds the CHOCOLATE benchmark for assessing the factuality of chart captioning systems and for facilitating the Chart Caption Factual Error Correction task. The dataset includes error analyses for six different models on two distinct datasets. These include:
- LVLMs: GPT-4V, Bard (before Gemini)
- LLM-based Pipelines: DePlot + GPT-4
- Fine-tuned Models: ChartT5, MatCha, UniChart
Annotations are conducted on the VisText and Chart-to-Text (Pew split) datasets, ensuring coverage of diverse charts and factual error types. For more information, please visit our project page.
Results are summarized in the table below. We found that all captioning models frequently generate captions that are factually inconsistent with the input chart. Even for highly capable LVLMs, the non-factual rate is a whopping 81.27%.
|  | CHOCOLATE-LVLM |  | CHOCOLATE-LLM |  | CHOCOLATE-FT |  |
| --- | --- | --- | --- | --- | --- | --- |
|  | # Factual | # Non-factual | # Factual | # Non-factual | # Factual | # Non-factual |
| Sentence | 1,683 | 1,270 | 518 | 469 | 360 | 1,023 |
| Caption | 74 | 321 | 27 | 169 | 112 | 484 |
- CHOCOLATE - The first factuality benchmark for chart captioning.
- CHOCOLATE is also used to establish the Chart Caption Factual Error Correction task.
- The CHOCOLATE benchmark
- The ChartVE metric (khhuang/chartve)
- The Chart-To-Table model (khhuang/chart-to-table)
- Scripts for table-based error correction (coming soon)
- Evaluation scripts (coming soon)
We release the data for the CHOCOLATE benchmark at `data/chocolate.json`. CHOCOLATE is also available on Hugging Face 🤗.
Each instance in the JSON file corresponds to an annotation for a generated caption. Below, we describe the fields within each instance:
- sentences: A list of caption sentences.
- labels: A list of lists, where the outer list corresponds to the sentences and each inner list contains the errors within the corresponding sentence.
- model: A string that represents the model producing the caption.
- dataset: A string that represents which dataset the chart was sampled from.
- image_path: A URL to the chart image.
- _id: A unique identifier for this instance.
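As an illustration, here is a minimal sketch for loading and inspecting the benchmark locally (it assumes the top-level JSON object is a list of instances, which is our reading of the release format):

```python
import json

# Load the CHOCOLATE annotations released in this repository
with open("data/chocolate.json") as f:
    instances = json.load(f)

# Inspect one annotated caption
example = instances[0]
print(example["model"], example["dataset"], example["image_path"])
for sentence, errors in zip(example["sentences"], example["labels"]):
    print(sentence, "->", errors)
```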
ChartVE is a visual entailment model for evaluating the factuality of a generated caption sentence with regard to the input chart. The model takes in a chart figure and a caption sentence as input, and outputs an entailment probability. The underlying architecture of this model is UniChart.
Note that this model expects a caption sentence as textual input. For captions longer than one sentence, one should split the caption into individual sentences, feed each sentence to ChartVE, and then aggregate the scores (see the sketch after the example below). Below, we provide an example of how to use ChartVE.
import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

model_name = "khhuang/chartve"
model = VisionEncoderDecoderModel.from_pretrained(model_name).cuda()
processor = DonutProcessor.from_pretrained(model_name)

image_path = "PATH_TO_IMAGE"

def format_query(sentence):
    # Query format expected by ChartVE
    return f"Does the image entails this statement: \"{sentence}\"?"

# Format text inputs
caption_sentence = "The state that has the highest number of population is California."
query = format_query(caption_sentence)

# Encode chart figure and tokenize text
img = Image.open(image_path)
pixel_values = processor(img.convert("RGB"), random_padding=False, return_tensors="pt").pixel_values
pixel_values = pixel_values.cuda()
decoder_input_ids = processor.tokenizer(query, add_special_tokens=False, return_tensors="pt", max_length=510).input_ids.cuda()

outputs = model(pixel_values, decoder_input_ids=decoder_input_ids)

# Token 49922 is "yes" and token 2334 is "no" in the decoder vocabulary
# positive_logit = outputs['logits'].squeeze()[-1, 49922]
# negative_logit = outputs['logits'].squeeze()[-1, 2334]

# Probe the probability of generating "yes"
binary_entail_prob_positive = torch.nn.functional.softmax(
    outputs['logits'].squeeze()[-1, [2334, 49922]], dim=-1
)[1].item()

# binary_entail_prob_positive is the probability that the chart entails the caption sentence
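For multi-sentence captions, the per-sentence probabilities can be aggregated into a caption-level score. A minimal sketch, reusing `model`, `processor`, `format_query`, and `Image` from the snippet above (the helper names and the minimum-aggregation choice are ours, not prescribed by the repository; `nltk.sent_tokenize` requires NLTK's punkt data):

```python
import nltk  # requires: nltk.download("punkt")
import torch

def score_sentence(sentence, image_path):
    # Wraps the ChartVE snippet above: entailment probability for one sentence
    img = Image.open(image_path)
    pixel_values = processor(img.convert("RGB"), random_padding=False, return_tensors="pt").pixel_values.cuda()
    decoder_input_ids = processor.tokenizer(
        format_query(sentence), add_special_tokens=False, return_tensors="pt", max_length=510
    ).input_ids.cuda()
    outputs = model(pixel_values, decoder_input_ids=decoder_input_ids)
    return torch.nn.functional.softmax(outputs["logits"].squeeze()[-1, [2334, 49922]], dim=-1)[1].item()

def chartve_caption_score(caption, image_path):
    # Split the caption, score each sentence, and aggregate conservatively:
    # a caption is only as factual as its weakest sentence
    sentences = nltk.sent_tokenize(caption)
    return min(score_sentence(s, image_path) for s in sentences)
```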
The meta-evaluation scripts can be found in `ChartVE Meta-evaluation.ipynb`.
The proposed C2TFEC framework consists of two components: chart-to-table conversion and table-based error rectification.
The Chart-To-Table model (khhuang/chart-to-table) is trained to convert a chart into a structured table. The generated tables use `&&&` to delimit rows and `|` to delimit columns. The underlying architecture of this model is UniChart. Below, we provide an example of how to use our Chart-To-Table model.
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

model_name = "khhuang/chart-to-table"
model = VisionEncoderDecoderModel.from_pretrained(model_name).cuda()
processor = DonutProcessor.from_pretrained(model_name)

image_path = "PATH_TO_IMAGE"

# Format text inputs
input_prompt = "<data_table_generation> <s_answer>"

# Encode chart figure and tokenize text
img = Image.open(image_path)
pixel_values = processor(img.convert("RGB"), random_padding=False, return_tensors="pt").pixel_values
pixel_values = pixel_values.cuda()
decoder_input_ids = processor.tokenizer(input_prompt, add_special_tokens=False, return_tensors="pt", max_length=510).input_ids.cuda()

# Generate a table
outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    early_stopping=True,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    num_beams=4,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

# Decode and strip special tokens
sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")

# Extract the data table
extracted_table = sequence.split("<s_answer>")[1].strip()
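The delimited string can then be parsed into a row-major table. A minimal sketch, assuming `extracted_table` from the snippet above (the `parse_table` helper is ours, for illustration only):

```python
def parse_table(table_str):
    # Rows are separated by "&&&" and cells within a row by "|"
    return [[cell.strip() for cell in row.split("|")] for row in table_str.split("&&&")]

table = parse_table(extracted_table)
header, rows = table[0], table[1:]
```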
@inproceedings{huang-etal-2024-lvlms,
title = "Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning",
author = "Huang, Kung-Hsiang and
Zhou, Mingyang and
Chan, Hou Pong and
Fung, Yi R. and
Wang, Zhenhailong and
Zhang, Lingyu and
Chang, Shih-Fu and
Ji, Heng",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
month = aug,
year = "2024",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.85",
doi = "10.18653/v1/2023.findings-acl.85",
pages = "1314--1326",
}