Embodied Scene Understanding for Vision Language Models via MetaVQA