VinVL: Making Visual Representations Matter in Vision-Language Models
