Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception

Yang, Jiashu, Han, Yifan, Xie, Yucheng, Guo, Ning, Lian, Wenzhao

Nov-20-2025–arXiv.org Artificial Intelligence

In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision-language-action (VLA) policy using only minimal real-world data. Experiments show that EyeVLA can effectively understand scenes in real-world environments and actively acquire more accurate visual information through instruction-driven actions of rotation and zoom, thereby achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision paradigm: it dynamically acquires highly informative visual data within given pixel and spatial budgets for environmental perception in multimodal autonomous systems.
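
To make the idea of discretizing camera actions into action tokens concrete, here is a minimal sketch of one possible scheme. It is not the authors' implementation. The abstract says only that the actions are rotation and zoom and that they become tokens in an autoregressive sequence. The `Axis` class, the `<pan_i>`/`<tilt_i>`/`<zoom_i>` token format, the value ranges, and the bin counts are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Axis:
    """One continuous camera axis quantized into uniform bins (hypothetical scheme)."""
    name: str
    low: float
    high: float
    bins: int

    def encode(self, value: float) -> str:
        # Clamp to the valid range, then map to a bin index in [0, bins - 1].
        clamped = min(max(value, self.low), self.high)
        idx = int((clamped - self.low) / (self.high - self.low) * self.bins)
        idx = min(idx, self.bins - 1)  # the upper bound falls into the last bin
        return f"<{self.name}_{idx}>"

    def decode(self, token: str) -> float:
        # Recover the bin index and return the bin center.
        prefix = f"<{self.name}_"
        if not (token.startswith(prefix) and token.endswith(">")):
            raise ValueError(f"not a {self.name} token: {token!r}")
        idx = int(token[len(prefix):-1])
        if not 0 <= idx < self.bins:
            raise ValueError(f"bin index out of range: {idx}")
        width = (self.high - self.low) / self.bins
        return self.low + (idx + 0.5) * width


# Assumed ranges and resolutions; none of these values come from the paper.
PAN = Axis("pan", -90.0, 90.0, 180)    # degrees, 1 degree per bin
TILT = Axis("tilt", -45.0, 45.0, 90)   # degrees, 1 degree per bin
ZOOM = Axis("zoom", 1.0, 10.0, 90)     # magnification, 0.1x per bin


def encode_action(pan: float, tilt: float, zoom: float) -> List[str]:
    """Turn a continuous (pan, tilt, zoom) command into three action tokens."""
    return [PAN.encode(pan), TILT.encode(tilt), ZOOM.encode(zoom)]


def decode_action(tokens: List[str]) -> Tuple[float, float, float]:
    """Turn three action tokens back into a continuous command (bin centers)."""
    pan_tok, tilt_tok, zoom_tok = tokens
    return PAN.decode(pan_tok), TILT.decode(tilt_tok), ZOOM.decode(zoom_tok)
```

In a scheme like this, the tokens would be added to the VLM vocabulary so that a camera command can be generated in the same sequence as text. Decoding returns bin centers, so the round-trip error on each axis is at most half a bin width.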

machine learning, natural language, reinforcement learning

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Robots (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language (1.00)
  - Machine Learning > Reinforcement Learning (1.00)
