CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models
Catherine Glossop, William Chen, Arjun Bhorkar, Dhruv Shah, Sergey Levine
arXiv.org Artificial Intelligence
Figure 1: CAST generates counterfactual action and language labels for uncurated robot trajectory datasets using off-the-shelf VLMs. We use this augmented dataset to train CounterfactualVLA, a navigation policy that can follow complex language instructions in the real world.

Abstract -- Generalist robots should be able to understand and follow user instructions, but current vision-language-action (VLA) models struggle to follow fine-grained commands despite providing a powerful architecture for mapping open-vocabulary natural language instructions to robot actions. One cause is a lack of semantic diversity and language grounding in existing robot datasets, and specifically a lack of fine-grained task diversity for similar observations. To address this, we present a novel method that augments existing robot datasets by leveraging vision-language models to create counterfactual labels. Our method improves the language-following capabilities of VLAs by generating counterfactual language and actions, increasing the diversity and granularity of language grounding in robot datasets. We evaluate the resulting model's ability to follow language instructions, ranging from simple object-centric commands to complex referential tasks, by conducting visual language navigation experiments in three different indoor and outdoor environments. Our experiments demonstrate that counterfactual relabeling, without any additional data collection, significantly improves instruction following in VLA policies, making them competitive with state-of-the-art methods and increasing success rate by 27% on navigation tasks. Large vision-language models (VLMs) are powerful not only because of their diverse capabilities but also because they can be steered with fine-grained instructions to produce specific outputs. Ideally, powerful generalist robot policies should exhibit the same level of controllability on embodied tasks.
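The core idea in the abstract, generating counterfactual (language, action) labels for existing trajectories without new data collection, can be sketched as a simple dataset-augmentation loop. This is a minimal illustration, not the paper's implementation: the `Trajectory` fields, the `stub_vlm_counterfactuals` function, and its outputs are all hypothetical stand-ins for the real observation/action data and the off-the-shelf VLM query.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    observation: str   # stand-in for an image/video observation
    actions: List[str] # stand-in for a robot action sequence
    instruction: str   # natural-language label for the trajectory

# Hypothetical placeholder for the VLM call: given a trajectory, propose
# alternative instructions that are plausible for the same observation,
# each paired with a correspondingly different (counterfactual) action.
def stub_vlm_counterfactuals(traj: Trajectory) -> List[Trajectory]:
    return [
        Trajectory(traj.observation, ["turn_left"], "go toward the blue door"),
        Trajectory(traj.observation, ["turn_right"], "head to the trash can"),
    ]

def augment_with_counterfactuals(
    dataset: List[Trajectory],
    relabel: Callable[[Trajectory], List[Trajectory]],
) -> List[Trajectory]:
    """Return the original trajectories plus VLM-generated counterfactual
    (language, action) pairs, without collecting any new robot data."""
    augmented = list(dataset)
    for traj in dataset:
        augmented.extend(relabel(traj))
    return augmented

dataset = [Trajectory("hallway.jpg", ["go_straight"], "go down the hallway")]
augmented = augment_with_counterfactuals(dataset, stub_vlm_counterfactuals)
```

The augmented dataset keeps every original label and adds several fine-grained alternatives per observation, which is how the method increases language-grounding diversity for similar observations.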
Aug-20-2025