Towards a Multimodal Document-grounded Conversational AI System for Education