Revealing the Illusion of Joint Multimodal Understanding in VideoQA Models