Distilling Internet-Scale Vision-Language Models into Embodied Agents

Open in new window