F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
Lv, Qi, Kong, Weijie, Li, Hao, Zeng, Jia, Qiu, Zherui, Qu, Delin, Song, Haoming, Chen, Qizhi, Deng, Xiang, Pang, Jiangmiao
Executing language-conditioned tasks in dynamic visual environments remains a central challenge in embodied AI. Existing Vision-Language-Action (VLA) models predominantly adopt reactive state-to-action mappings, often leading to short-sighted behaviors and poor robustness in dynamic scenes. F1 instead bridges understanding and generation to actions: its training scheme enhances modular reasoning and equips the model with transferable visual foresight, which is critical for complex and dynamic environments.

Vision-Language-Action (VLA) models (Kim et al., 2024; Team et al., 2025a; Black et al., 2024) aim to equip robots with the ability to execute natural language instructions in visually rich environments. By aligning language instructions with perceptual inputs and mapping them to actions, such models enable language-guided manipulation and versatile human-robot interaction. However, reliable performance in realistic settings remains elusive: environments are inherently dynamic, i.e., objects move, contexts shift, and instructions unfold over time, so robots must ground ambiguous language, handle diverse objects, and maintain long-horizon temporal coherence as scenes evolve. These conditions expose a core limitation of purely reactive state-to-action mappings: without predictive foresight about likely future states, policies become short-sighted and brittle under distribution shifts.

Previous efforts on manipulation policy learning can be broadly grouped into three paradigms, as illustrated in Figure 1. The earliest line of work employs only an action expert trained end-to-end from observations to low-level actions, e.g., ACT (Zhao et al., 2023) and DP (Chi et al., 2023); such purely reactive mappings lack semantic grounding and generalization across tasks and embodiments (Figure 1(a)). A second line couples a pre-trained vision-language model with an action expert, improving semantic grounding (Figure 1(b)). A third, e.g., VPP (Hu et al., 2024) and Genie Envisioner (Liao et al., 2025b), leverages video diffusion models to guide action execution through video prediction (Figure 1(c)). As depicted in Figure 1(d), we adopt an integrated architecture of understanding, generation, and execution, empowering the action-execution module with both scene and instruction comprehension and dynamic temporal prediction, as sketched below.
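To make the contrast with purely reactive policies concrete, here is a minimal PyTorch-style sketch (not the authors' code) of a foresight-conditioned policy in the spirit of Figure 1(d): an understanding stage fuses visual and language features, a generation stage predicts a latent for the future observation, and an execution stage decodes actions conditioned on both. All module names, dimensions, and wiring are illustrative assumptions.

# Minimal sketch (illustrative, not the authors' implementation) of the
# integrated understanding -> generation -> execution pipeline.
import torch
import torch.nn as nn

class ForesightConditionedPolicy(nn.Module):
    def __init__(self, d_model=256, action_dim=7, horizon=8):
        super().__init__()
        # Understanding: fuse visual and language features (assumed to
        # come from pre-trained encoders, stubbed here as Linear layers).
        self.vis_proj = nn.Linear(512, d_model)
        self.txt_proj = nn.Linear(512, d_model)
        # Generation: predict a latent for the future observation,
        # standing in for a visual-foresight / video-prediction module.
        self.foresight = nn.GRU(d_model, d_model, batch_first=True)
        # Execution: decode an action chunk conditioned on the current
        # state fused with the predicted future latent.
        self.action_head = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, vis_feat, txt_feat):
        # vis_feat: (B, 512) current visual features
        # txt_feat: (B, 512) instruction features
        state = self.vis_proj(vis_feat) + self.txt_proj(txt_feat)  # (B, d)
        # Roll the latent one step forward as a stand-in for foresight.
        future, _ = self.foresight(state.unsqueeze(1))              # (B, 1, d)
        fused = torch.cat([state, future.squeeze(1)], dim=-1)      # (B, 2d)
        actions = self.action_head(fused)                          # (B, H*A)
        return actions.view(-1, self.horizon, self.action_dim)

# A purely reactive policy (Figure 1(a)) would map `state` to actions
# directly; conditioning on the predicted future latent is what supplies
# the look-ahead that the reactive paradigm lacks.
policy = ForesightConditionedPolicy()
acts = policy(torch.randn(2, 512), torch.randn(2, 512))  # (2, 8, 7)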
arXiv.org Artificial Intelligence
Sep-10-2025
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.68)
- Natural Language > Large Language Model (0.67)
- Robots (1.00)