Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation
Bai, Yongjie, Wang, Zhouxia, Liu, Yang, Luo, Kaijun, Wen, Yifan, Dai, Mingtong, Chen, Weixing, Chen, Ziliang, Liu, Lingbo, Li, Guanbin, Lin, Liang
arXiv.org Artificial Intelligence
Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-aware Virtual View Exploration (TVVE), a framework designed to overcome these challenges by integrating virtual view exploration with task-specific representation learning. TVVE employs an efficient exploration policy, accelerated by a novel pseudo-environment, to acquire informative views. Furthermore, we introduce a Task-aware Mixture-of-Experts (TaskMoE) visual encoder to disentangle features across different tasks, improving both representation fidelity and task generalization. By learning to see the world in a task-aware way, TVVE generates more complete and discriminative visual representations, which significantly improves action prediction across a wide range of manipulation tasks. To further validate the robustness and generalization capability of TVVE under out-of-distribution (OOD) settings, we construct a challenging benchmark, RLBench-OG, covering various visual perturbations and camera pose variations. Extensive experiments on RLBench and RLBench-OG show that TVVE outperforms state-of-the-art approaches. In real-robot experiments, TVVE performs strongly and generalizes robustly in multiple OOD settings, including visual disturbances and unseen instructions. Visual results and code are provided at: https://hcplab-sysu.github.io/TAVP.
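The idea of routing visual features through task-specific experts can be illustrated with a toy sketch. This is not the TaskMoE encoder from the paper: the class name `TaskGatedMoE`, the gating on a task id alone, the top-k selection, and the linear experts are all assumptions chosen to keep the example small.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TaskGatedMoE:
    """Toy task-conditioned mixture of experts (hypothetical, for illustration)."""

    def __init__(self, num_tasks, num_experts, dim, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Assumption: the gate depends only on the task id (one logit row per task).
        self.gate_logits = [
            [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
            for _ in range(num_tasks)
        ]
        # Assumption: each expert is a plain dim x dim linear map.
        self.experts = [
            [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_experts)
        ]

    def route(self, task_id):
        """Return (expert_index, weight) pairs for the top-k experts of a task."""
        logits = self.gate_logits[task_id]
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = ranked[: self.top_k]
        weights = softmax([logits[i] for i in top])
        return list(zip(top, weights))

    def forward(self, features, task_id):
        """Mix the selected experts' outputs using the task's gate weights."""
        out = [0.0] * len(features)
        for expert_idx, weight in self.route(task_id):
            matrix = self.experts[expert_idx]
            for r, row in enumerate(matrix):
                out[r] += weight * sum(a * x for a, x in zip(row, features))
        return out


moe = TaskGatedMoE(num_tasks=3, num_experts=4, dim=5, top_k=2)
encoded = moe.forward([1.0, 0.5, -0.2, 0.0, 0.3], task_id=1)
```

Because each task activates only its own subset of experts, features for different tasks share fewer parameters, which is one common way to reduce task interference in a shared encoder.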
Nov-25-2025
- Country:
  - Asia > China > Guangdong Province > Shenzhen (0.04)
- Genre:
  - Research Report > New Finding (0.92)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks (0.67)
    - Natural Language > Large Language Model (0.46)
    - Robots (1.00)