A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models