Is Multimodal Vision Supervision Beneficial to Language?
