FPC-VLA: A Vision-Language-Action Framework with a Supervisor for Failure Prediction and Correction
Yifan Yang, Zhixiang Duan, Tianshi Xie, Fuyu Cao, Pinxi Shen, Peili Song, Piaopiao Jin, Guokang Sun, Shaoqing Xu, Yangwei You, Jingtai Liu
arXiv.org Artificial Intelligence
Robotic manipulation is a fundamental component of automation. However, traditional perception-planning pipelines often fall short in open-ended tasks due to limited flexibility, while a single end-to-end Vision-Language-Action (VLA) model offers promising capabilities but lacks crucial mechanisms for anticipating and recovering from failures. To address these challenges, we propose FPC-VLA, a dual-model framework that couples a VLA policy with a supervisor for failure prediction and correction. The supervisor evaluates action viability through vision-language queries and generates corrective strategies when risks arise; it is trained efficiently, without manual labeling. A dual-stream fusion module further refines actions by leveraging past predictions. Evaluations on multiple simulation platforms (SIMPLER and LIBERO) and robot embodiments (WidowX, Google Robot, Franka) show that FPC-VLA outperforms state-of-the-art models in both zero-shot and fine-tuned settings. Successful real-world deployments on diverse long-horizon tasks confirm FPC-VLA's strong generalization and practical utility for building more reliable autonomous systems.
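The supervise-then-act pattern the abstract describes can be sketched as a simple control loop. The sketch below is purely illustrative and not the authors' implementation: `vla_policy`, `supervisor_check`, and the magnitude-based viability test are hypothetical stand-ins for the VLA model and the vision-language supervisor described in the paper.

```python
# Hypothetical sketch of a failure-prediction-and-correction loop in the
# spirit of FPC-VLA. All names and the toy viability heuristic are
# assumptions, not the paper's API.

def supervisor_check(observation, instruction, proposed_action):
    """Stand-in for the vision-language supervisor.

    Returns (is_viable, corrective_action). Here viability is a toy
    check that each action component stays within a safety bound; the
    real supervisor answers vision-language queries about the scene.
    """
    viable = all(abs(a) <= 1.0 for a in proposed_action)
    correction = None
    if not viable:
        # Toy corrective strategy: clip the action into the safe range.
        correction = [max(-1.0, min(1.0, a)) for a in proposed_action]
    return viable, correction

def fpc_step(observation, instruction, vla_policy):
    """One step of the dual-model loop: propose, predict failure, correct."""
    action = vla_policy(observation, instruction)  # end-to-end VLA proposal
    viable, correction = supervisor_check(observation, instruction, action)
    if not viable:
        action = correction  # apply the supervisor's corrective strategy
    return action

# Usage with a dummy policy that proposes one out-of-range component:
dummy_policy = lambda obs, instr: [0.2, 1.5, -0.3]
print(fpc_step(None, "pick up the cup", dummy_policy))  # → [0.2, 1.0, -0.3]
```

In the paper the supervisor additionally feeds past predictions into a dual-stream fusion module to refine actions; this sketch only captures the outer predict-and-correct structure.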
Dec-4-2025