A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
