Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models

Open in new window