SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering
