ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context

Jang, Huiwon, Yu, Sihyun, Kwon, Heeseung, Jeon, Hojin, Seo, Younggyo, Shin, Jinwoo

arXiv.org Artificial Intelligence, Oct-7-2025

Leveraging temporal context is crucial for success in partially observable robotic tasks. However, prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations. In this paper, we introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations. Our approach is motivated by the key observation that Vision-Language-Action models (VLAs), i.e., policy models built upon a Vision-Language Model (VLM), utilize multi-frame observations for action generation more effectively. This suggests that the inherent temporal understanding capability of VLMs enables them to extract more meaningful context from multi-frame observations. However, the high dimensionality of video inputs introduces significant computational overhead, making VLA training and inference inefficient. To address this, ContextVLA compresses past observations into a single context token, allowing the policy to efficiently leverage temporal context for action generation. Our experiments show that ContextVLA consistently improves over single-frame VLAs and achieves the benefits of full multi-frame training with reduced training and inference times.

Many robotic tasks are inherently non-Markovian, i.e., the optimal decision at a given timestep t cannot be determined from the latest observation o_t alone. For instance, an object may become occluded during manipulation (Shi et al., 2025). Solving long-horizon tasks may also require context about the previous motions of a robot, and handling dynamic environments often involves tracking the motion trajectories of moving objects (Zhang et al., 2025; Nasiriany et al., 2024).
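The abstract states that past observations are compressed into a single context token, but this excerpt does not say how. The following is a minimal sketch of that idea under stated assumptions: mean pooling is used as the compression operator, tokens are plain lists of floats, and the function names `compress_past_frames` and `build_policy_input` are hypothetical. It is an illustration of the token-count saving, not the authors' implementation.

```python
from typing import List

Token = List[float]   # one embedding vector
Frame = List[Token]   # all tokens produced for one image frame


def compress_past_frames(past_frames: List[Frame]) -> Token:
    """Mean-pool every token of every past frame into one context token.

    Mean pooling is an assumption made for this sketch; the excerpt only
    says that past observations become a single context token.
    """
    tokens = [tok for frame in past_frames for tok in frame]
    if not tokens:
        raise ValueError("need at least one past frame")
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def build_policy_input(frames: List[Frame]) -> List[Token]:
    """Return [context token] followed by the tokens of the latest frame."""
    *past, current = frames
    if not past:
        # Single-frame case: nothing to compress.
        return list(current)
    return [compress_past_frames(past)] + list(current)


# Toy example: 3 frames, 2 tokens per frame, 2-dimensional embeddings.
frames = [
    [[0.0, 0.0], [2.0, 2.0]],
    [[4.0, 4.0], [6.0, 6.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
policy_input = build_policy_input(frames)
full_len = sum(len(f) for f in frames)
print(len(policy_input), "tokens instead of", full_len)
```

With T frames of N tokens each, the sequence passed onward in this sketch has N + 1 tokens rather than T * N, which is the source of the reduced cost the abstract describes.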

arXiv:2510.04246

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)
