Learning Visually Grounded Domain Ontologies via Embodied Conversation and Explanation