Revealing the Illusion of Joint Multimodal Understanding in VideoQA Models
