Embodied Scene Understanding for Vision Language Models via MetaVQA

Open in new window