Cosmos-Surg-dVRK: World Foundation Model-based Automated Online Evaluation of Surgical Robot Policy Learning
Zbinden, Lukas, Nelson, Nigel, Chen, Juo-Tung, Chen, Xinhao, Kim, Ji Woong, Azizian, Mahdi, Krieger, Axel, Huver, Sean
–arXiv.org Artificial Intelligence
We propose a framework for automated policy evaluation using Cosmos-Surg-dVRK, a surgical finetune of the Cosmos world foundation model (WFM), to perform simulated surgical policy rollouts and subsequent automated success-rate evaluation with a video classifier. After each rollout, the generated video is automatically labeled for task success or failure by a trained video classifier, enabling objective selection of the most promising surgical policies for real-robot evaluation and deployment.

The rise of surgical robots and vision-language-action models has accelerated the development of autonomous surgical policies and efficient assessment strategies. However, evaluating these policies directly on physical robotic platforms such as the da Vinci Research Kit (dVRK) remains hindered by high costs, time demands, reproducibility challenges, and variability in execution. World foundation models (WFMs) for physical AI offer a transformative approach to simulating complex real-world surgical tasks, such as soft-tissue deformation, with high fidelity. This work introduces Cosmos-Surg-dVRK, a surgical finetune of the Cosmos WFM, which, together with a trained video classifier, enables fully automated online evaluation and benchmarking of surgical policies. On tabletop suture pad tasks, the automated pipeline achieves strong correlation between online rollouts in Cosmos-Surg-dVRK and policy outcomes on the real dVRK Si platform, as well as good agreement between human labelers and the V-JEPA 2-derived video classifier. Additionally, preliminary experiments with ex-vivo porcine cholecystectomy tasks in Cosmos-Surg-dVRK demonstrate promising alignment with real-world evaluations, highlighting the platform's potential for more complex surgical procedures.

World models have emerged as a foundational approach for enabling intelligent agents to understand and act in complex simulated environments. Building on early work by Ha & Schmidhuber (2018), they learn compact latent representations of spatial and temporal dynamics and predict the consequences of actions in context. Leveraging diffusion processes, recent efforts have introduced large-scale multi-modal world foundation models (WFMs) (Wan et al., 2025; Agarwal et al., 2025; Kong et al., 2024; Xie et al., 2025; Ball et al., 2025). These video generative models serve as scalable, general-purpose learned simulators that encode and synthesize diverse physical phenomena, approximate scene dynamics, render plausible sensory observations, and facilitate policy evaluation and training, enabling progress on sim-to-real transfer. At the intersection of these advances lies the burgeoning field of physical AI: AI systems equipped with sensors and actuators that enable perception, reasoning, and actuation in the physical world.
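The evaluation loop the paper describes (roll a policy out in the world model, classify each generated video as success or failure, aggregate into a success rate, and compare against real-robot outcomes) can be sketched as follows. This is a toy illustration only: the real pipeline uses Cosmos-Surg-dVRK for video rollouts and a V-JEPA 2-derived classifier, whereas every function, signature, and number below is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model_rollout(policy_skill: float, horizon: int = 16) -> np.ndarray:
    """Stand-in for an action-conditioned video rollout in a world model.

    Returns a dummy 'video' (horizon frames of 8x8 values) whose mean
    loosely encodes task progress for this toy example.
    """
    return rng.normal(loc=policy_skill, scale=0.1, size=(horizon, 8, 8))

def video_classifier(video: np.ndarray) -> bool:
    """Stand-in for the trained success/failure video classifier."""
    return bool(video.mean() > 0.5)

def simulated_success_rate(policy_skill: float, n_rollouts: int = 50) -> float:
    """Automated online evaluation: roll out, classify, aggregate."""
    outcomes = [video_classifier(world_model_rollout(policy_skill))
                for _ in range(n_rollouts)]
    return float(np.mean(outcomes))

# Rank several candidate policies by simulated success rate, then check
# agreement with (here: synthetic) real-robot success rates via Pearson r,
# mirroring the sim-to-real correlation analysis described above.
policy_skills = [0.2, 0.4, 0.6, 0.8]
sim_rates = np.array([simulated_success_rate(s) for s in policy_skills])
real_rates = np.array([0.1, 0.35, 0.7, 0.9])  # placeholder measurements
pearson_r = np.corrcoef(sim_rates, real_rates)[0, 1]
```

A policy whose simulated success rate tracks its real success rate can then be selected for physical deployment without exhausting robot time on weak candidates; the correlation coefficient quantifies how trustworthy that selection is.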
Nov-4-2025
- Genre:
  - Research Report > New Finding (1.00)
- Industry:
  - Health & Medicine (0.48)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning (1.00)
    - Representation & Reasoning (1.00)
    - Robots (1.00)