Evaluating Multimodal Large Language Models on Educational Textbook Question Answering
Alawwad, Hessa A., Zafar, Anas, Alhothali, Areej, Naseem, Usman, Alkhathlan, Ali, Jamal, Amani
arXiv.org Artificial Intelligence
Faculty of Computing and Information Technology & Center of Research Excellence in AI and Data Science, King Abdulaziz University, Jeddah, Saudi Arabia
Abstract: Multimodal large language models (MLLMs) have shown success in vision-language tasks, but their ability to reason over complex educational materials remains largely untested. This work presents the first evaluation of state-of-the-art MLLMs, including LLaVA-1.5 and LLaMA 3.2-Vision, on the textbook question answering (TQA) task using the CK12-QA dataset. We introduce a multimodal retrieval-augmented generation (RAG) pipeline to simulate real-world learning by providing relevant lesson paragraphs and diagrams as context. Our zero-shot experiments reveal a critical trade-off; while retrieved context improves LLaVA's performance on text-based questions, it significantly degrades the accuracy of the more powerful LLaMA 3.2-Vision on diagram-based tasks, dropping its validation accuracy from 74.07% to 25.93%. Furthermore, fine-tuning highlights architectural differences; LLaMA 3.2-Vision's performance substantially improves to 71.16% on the test set, demonstrating its capacity to learn multimodal integration, whereas LLaVA's performance declines, indicating challenges with generalization. Our results underscore the challenges MLLMs face in modality prioritization and context integration, providing a benchmark and pointing to key directions for developing more robust AI-driven educational tools.
Personal use of this material is permitted. This work has been accepted to the 2nd International Generative AI and Computational Language Modelling Conference (GACLM 2025) for publication in the proceedings.
Answering curriculum-related questions in multimodal educational materials is a central challenge in AI for education, requiring systems to reason across complex multimodal contexts such as lengthy lessons, diagrams, and videos.
Jul-16-2025
- Country:
  - Asia
    - Middle East > Saudi Arabia
      - Mecca Province > Jeddah (0.25)
    - Pakistan > Sindh
      - Karachi Division > Karachi (0.04)
  - Europe > Switzerland
  - Oceania > Australia
    - New South Wales > Sydney (0.04)
- Genre:
  - Research Report > New Finding (0.86)
- Industry:
  - Education (1.00)
- Technology: