ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks
