Evaluating Multimodal Large Language Models on Educational Textbook Question Answering

Alawwad, Hessa A., Zafar, Anas, Alhothali, Areej, Naseem, Usman, Alkhathlan, Ali, Jamal, Amani

Jul-16-2025–arXiv.org Artificial Intelligence

Faculty of Computing and Information Technology & Center of Research Excellence in AI and Data Science, King Abdulaziz University, Jeddah, Saudi Arabia

Abstract: Multimodal large language models (MLLMs) have shown success in vision-language tasks, but their ability to reason over complex educational materials remains largely untested. This work presents the first evaluation of state-of-the-art MLLMs, including LLaVA-1.5 and LLaMA 3.2-Vision, on the textbook question answering (TQA) task using the CK12-QA dataset. We introduce a multimodal retrieval-augmented generation (RAG) pipeline to simulate real-world learning by providing relevant lesson paragraphs and diagrams as context. Our zero-shot experiments reveal a critical trade-off; while retrieved context improves LLaVA's performance on text-based questions, it significantly degrades the accuracy of the more powerful LLaMA 3.2-Vision on diagram-based tasks, dropping its validation accuracy from 74.07% to 25.93%. Furthermore, fine-tuning highlights architectural differences; LLaMA 3.2-Vision's performance substantially improves to 71.16% on the test set, demonstrating its capacity to learn multimodal integration, whereas LLaVA's performance declines, indicating challenges with generalization. Our results underscore the challenges MLLMs face in modality prioritization and context integration, providing a benchmark and pointing to key directions for developing more robust AI-driven educational tools.

Personal use of this material is permitted. This work has been accepted to the 2nd International Generative AI and Computational Language Modelling Conference (GACLM 2025) for publication in the proceedings.

Answering curriculum-related questions in multimodal educational materials is a central challenge in AI for education, requiring systems to reason across complex multimodal contexts such as lengthy lessons, diagrams, and videos.
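The retrieval-augmented setup mentioned in the abstract (retrieve relevant lesson paragraphs, then pass them to the model as context) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the bag-of-words cosine scorer, the function names `retrieve` and `build_prompt`, the prompt template, and the toy lesson text are hypothetical and are not taken from the paper, whose actual retriever and prompts are not described in this abstract. Diagram retrieval is omitted; in a multimodal setup the selected diagram would be passed to the model as an image alongside this text prompt.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, paragraphs, k=1):
    """Return the k lesson paragraphs most similar to the question.

    Hypothetical scorer: a stand-in for whatever retriever a real
    pipeline would use (for example, dense embeddings).
    """
    q = Counter(tokenize(question))
    ranked = sorted(
        paragraphs,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, options, context):
    """Assemble a multiple-choice prompt; the template is hypothetical."""
    lines = ["Context:"] + context + ["", "Question: " + question]
    lines += [f"{label}. {text}" for label, text in options]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


# Toy lesson text, invented for illustration (not from CK12-QA).
lesson = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "Earthquakes occur when stress along a fault is suddenly released.",
]
question = "What does photosynthesis convert light energy into?"
options = [("a", "chemical energy"), ("b", "seismic waves")]

context = retrieve(question, lesson, k=1)
prompt = build_prompt(question, options, context)
print(prompt)
```

Comparing a model's answers with `context` left empty against answers with the retrieved context included is the kind of ablation the abstract reports, where added context helped one model on text questions and hurt another on diagram questions.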



Country:
- Asia > Middle East > Saudi Arabia > Mecca Province > Jeddah (0.25)

Genre:
- Research Report > New Finding (0.86)

Industry:
- Education (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning > Generative AI (0.48)
