GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting

Yang, Kaichun, Chen, Jian

arXiv.org Artificial Intelligence 

We present a quantitative evaluation to understand the effect of zero-shot large language models (LLMs) and prompting strategies on chart reading tasks. We asked LLMs to answer 107 visualization questions to compare inference accuracies between the agentic GPT-5 and the multimodal GPT-4V on difficult image instances where GPT-4V failed to produce correct answers. Our results show that model architecture dominates the inference accuracy: GPT-5 largely improved accuracy, while prompt variants yielded only small effects. Pre-registration of this work is available here; the Google Drive materials are here.

Benchmarking visual literacy, i.e., "the ability and skill to read and interpret visually represented data and to extract information from data visualizations" [1], shapes progress in measuring AI's ability to handle visualization images. Often, the same tasks designed to assess visual literacy, traditionally performed by human observers, are now being assigned to algorithms. Following this trend, our goal in this paper is to quantify the new GPT-5's ability to read charts. Specifically, we used questions where GPT-4V failed and other LLMs achieved only low accuracy, as reported in Verma et al.'s CHART-6 benchmark [2].
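To make the zero-shot evaluation setup concrete, the sketch below shows how a single chart question might be posed to a multimodal model through the OpenAI chat API. The model name, image path, and question text are illustrative placeholders under assumed settings; they do not reproduce the exact pipeline or prompts used in this study.

```python
import base64
from openai import OpenAI

# Minimal sketch of a zero-shot chart question: one image, one question,
# no few-shot examples. Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI()


def ask_chart_question(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Send a chart image plus a single question and return the model's answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model=model,  # placeholder model name, not the study's configuration
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content


# Hypothetical usage with placeholder inputs:
# answer = ask_chart_question("chart_042.png", "Which category has the largest value?")
```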