BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning

Santos, João Guilherme Alves; Bonás, Giovana Kerche; Almeida, Thales Sales

arXiv.org Artificial Intelligence 

With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including 2024-2025 exams and automatically generated image captions using state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than doubling the number in the original BLUEX. We evaluated commercial and open-source LLMs and their ability to leverage visual context through captions.
