VinVL: Making Visual Representations Matter in Vision-Language Models