Towards a Multimodal Document-grounded Conversational AI System for Education
Taneja, Karan, Singh, Anjali, Goel, Ashok K.
arXiv.org Artificial Intelligence
Multimedia learning using text and images has been shown to improve learning outcomes compared to text-only instruction. However, conversational AI systems in education rely predominantly on text-based interactions, and multimodal conversations for multimedia learning remain unexplored. Moreover, deploying conversational AI in learning contexts requires grounding in reliable sources and verifiability to build trust. We present MuDoC, a Multimodal Document-grounded Conversational AI system based on GPT-4o that leverages both text and visuals from documents to generate responses interleaved with text and images. Its interface allows verification of AI-generated content through seamless navigation to the source. We compare MuDoC to a text-only system to explore differences in learner engagement, trust in the AI system, and learners' performance on problem-solving tasks. Our findings indicate that both visuals and verifiability of content enhance learner engagement and foster trust; however, no significant impact on performance was observed. We draw on theories from cognitive and learning sciences to interpret the findings and derive implications, and we outline future directions for the development of multimodal conversational AI systems in education.
Apr-22-2025
- Country:
  - North America > United States
    - Georgia > Fulton County
      - Atlanta (0.04)
    - Texas > Travis County
      - Austin (0.14)
- Genre:
  - Questionnaire & Opinion Survey (1.00)
  - Research Report
    - Experimental Study (1.00)
    - New Finding (1.00)
- Industry:
  - Education > Educational Setting > Higher Education (0.68)
- Technology: