DIJIT: A Robotic Head for an Active Observer

Tabrizi, Mostafa Kamali, Chi, Mingshi, Dey, Bir Bikram, Yuan, Yu Qing, Solbach, Markus D., Liu, Yiqian, Jenkin, Michael, Tsotsos, John K.

arXiv.org Artificial Intelligence

We present DIJIT, a novel binocular robotic head expressly designed for mobile agents that behave as active observers. DIJIT's unique breadth of functionality enables active vision research and the study of human-like eye and head-neck motions, their interrelationships, and how each contributes to visual ability. DIJIT is also being used to explore the differences between how human vision employs eye/head movements to solve visual tasks and current computer vision methods. DIJIT's design features nine mechanical degrees of freedom, while the cameras and lenses provide an additional four optical degrees of freedom. The ranges and speeds of the mechanical design are comparable to human performance. Our design includes the ranges of motion required for convergent stereo, namely, vergence, version, and cyclotorsion. The exploration of the utility of these to both human and machine vision is ongoing. Here, we present the design of DIJIT and evaluate aspects of its performance. We present a new method for saccadic camera movements. In this method, a direct relationship between camera orientation and motor values is developed. The resulting saccadic camera movements are close to human movements in terms of their accuracy.
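The "direct relationship between camera orientation and motor values" could, in its simplest form, be a per-axis calibration fitted from motor/angle pairs and inverted to produce a single open-loop saccade command. The sketch below assumes a linear relationship and uses hypothetical calibration data; the paper's actual mapping and calibration procedure may differ.

```python
# Hypothetical sketch: fit a per-axis linear map angle = a*motor + b from
# calibration pairs, then invert it to command a target gaze angle in one
# open-loop saccade. Data and units are illustrative, not DIJIT's.

def fit_linear(samples):
    """Least-squares fit of angle = a*motor + b from (motor, angle) pairs."""
    n = len(samples)
    sx = sum(m for m, _ in samples)
    sy = sum(a for _, a in samples)
    sxx = sum(m * m for m, _ in samples)
    sxy = sum(m * a for m, a in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def motor_for_angle(angle_deg, a, b):
    """Invert the calibration to get the motor value for a target angle."""
    return (angle_deg - b) / a

# Example calibration: motor ticks vs. measured pan angle (hypothetical data).
pairs = [(0, -30.0), (1000, -15.0), (2000, 0.0), (3000, 15.0), (4000, 30.0)]
a, b = fit_linear(pairs)
target = motor_for_angle(7.5, a, b)   # one saccade to +7.5 degrees
```

Because the map is inverted once rather than servoed, a single motor command realizes the saccade, which is what makes the resulting movements fast and, per the paper's evaluation, comparable to human saccades in accuracy.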



RealMirror: A Comprehensive, Open-Source Vision-Language-Action Platform for Embodied AI

Tai, Cong, Zheng, Zhaoyu, Long, Haixu, Wu, Hansheng, Xiang, Haodong, Long, Zhengbin, Xiong, Jun, Shi, Rong, Zhang, Shizhuang, Qiu, Gang, Wang, He, Li, Ruifeng, Huang, Jun, Chang, Bin, Feng, Shuai, Shen, Tao

arXiv.org Artificial Intelligence

Abstract -- The emerging field of Vision-Language-Action (VLA) for humanoid robots faces several fundamental challenges, including the high cost of data acquisition, the lack of a standardized benchmark, and the significant gap between simulation and the real world. To overcome these obstacles, we propose RealMirror, a comprehensive, open-source embodied AI VLA platform. RealMirror builds an efficient, low-cost data collection, model training, and inference system that enables end-to-end VLA research without requiring a real robot. To facilitate model evolution and fair comparison, we also introduce a dedicated VLA benchmark for humanoid robots, featuring multiple scenarios, extensive trajectories, and various VLA models. In conclusion, with the unification of these critical components, RealMirror provides a robust framework that significantly accelerates the development of VLA models for humanoid robots. I. INTRODUCTION The rapid evolution of Large Language Models (LLMs) like GPT [1], Qwen [2], and Deepseek [3] has significantly advanced the development of Artificial General Intelligence (AGI). While exhibiting remarkable model performance, they lack the ability to perform tasks in the real world. (Jun Xiong is with The Chinese University of Hong Kong, Shenzhen, China.)


Visuomotor Grasping with World Models for Surgical Robots

Lin, Hongbin, Li, Bin, Au, Kwok Wai Samuel

arXiv.org Artificial Intelligence

Grasping is a fundamental task in robot-assisted surgery (RAS), and automating it can reduce surgeon workload while enhancing efficiency, safety, and consistency beyond teleoperated systems. Most prior approaches rely on explicit object pose tracking or handcrafted visual features, limiting their generalization to novel objects, robustness to visual disturbances, and the ability to handle deformable objects. Visuomotor learning offers a promising alternative, but deploying it in RAS presents unique challenges, such as low signal-to-noise ratio in visual observations, demands for high safety and millimeter-level precision, as well as the complex surgical environment. This paper addresses three key challenges: (i) sim-to-real transfer of visuomotor policies to ex vivo surgical scenes, (ii) visuomotor learning using only a single stereo camera pair -- the standard RAS setup, and (iii) object-agnostic grasping with a single policy that generalizes to diverse, unseen surgical objects without retraining or task-specific models. We introduce Grasp Anything for Surgery V2 (GASv2), a visuomotor learning framework for surgical grasping. GASv2 leverages a world-model-based architecture and a surgical perception pipeline for visual observations, combined with a hybrid control system for safe execution. We train the policy in simulation using domain randomization for sim-to-real transfer and deploy it on a real robot in both phantom-based and ex vivo surgical settings, using only a single pair of endoscopic cameras. Extensive experiments show our policy achieves a 65% success rate in both settings, generalizes to unseen objects and grippers, and adapts to diverse disturbances, demonstrating strong performance, generality, and robustness.
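The domain randomization used for sim-to-real transfer can be pictured as drawing a fresh simulation configuration each training episode. The sketch below is a generic illustration of that recipe; the parameter names and ranges are invented for the example and are not GASv2's actual configuration.

```python
# Minimal sketch of per-episode domain randomization. The parameters and
# ranges below are illustrative placeholders, not the authors' settings.
import random

RANDOMIZATION_RANGES = {
    "camera_noise_std": (0.0, 0.05),   # simulated image noise
    "light_intensity":  (0.5, 1.5),    # scene lighting scale
    "object_scale":     (0.9, 1.1),    # grasp-target size jitter
    "friction":         (0.4, 1.0),    # gripper-object friction
}

def sample_domain(rng):
    """Draw one randomized simulation configuration for an episode."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANDOMIZATION_RANGES.items()}

rng = random.Random(0)
cfg = sample_domain(rng)   # new visual/physical conditions for this episode
```

Training across many such draws encourages the policy to rely on features that survive the variation, which is what allows deployment on real endoscopic images without retraining.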


Tackling the 3D Simulation League: an interview with Klaus Dorer and Stefan Glaser

AIHub

A screenshot from the new simulator that will be trialled for a special challenge at RoboCup 2025. The annual RoboCup event, where teams gather from across the globe to take part in competitions across a number of leagues, will this year take place in Brazil, from 15-21 July. In advance of kick-off, we spoke to two members of the RoboCup Soccer 3D Simulation League: Executive Committee Member Klaus Dorer, and Stefan Glaser, who is on the Maintenance Committee and who has been recently developing a new simulator for the League. Could you start by giving us a quick introduction to the Simulation League? Klaus Dorer: There are two Simulation Leagues in Soccer: the 2D Simulation League and the 3D Simulation League. The 2D Simulation League, as the name suggests, is a flat league where the players and ball are simulated with simplified physics and the main focus is on team strategy.


Real-is-Sim: Bridging the Sim-to-Real Gap with a Dynamic Digital Twin

Abou-Chakra, Jad, Sun, Lingfeng, Rana, Krishan, May, Brandon, Schmeckpeper, Karl, Suenderhauf, Niko, Minniti, Maria Vittoria, Herlant, Laura

arXiv.org Artificial Intelligence

We introduce real-is-sim, a new approach to integrating simulation into behavior cloning pipelines. In contrast to real-only methods, which lack the ability to safely test policies before deployment, and sim-to-real methods, which require complex adaptation to cross the sim-to-real gap, our framework allows policies to seamlessly switch between running on real hardware and running in parallelized virtual environments. At the center of real-is-sim is a dynamic digital twin, powered by the Embodied Gaussian simulator, that synchronizes with the real world at 60Hz. This twin acts as a mediator between the behavior cloning policy and the real robot. Policies are trained using representations derived from simulator states and always act on the simulated robot, never the real one. During deployment, the real robot simply follows the simulated robot's joint states, and the simulation is continuously corrected with real world measurements. This setup, where the simulator drives all policy execution and maintains real-time synchronization with the physical world, shifts the responsibility of crossing the sim-to-real gap to the digital twin's synchronization mechanisms, instead of the policy itself. We demonstrate real-is-sim on a long-horizon manipulation task (PushT), showing that virtual evaluations are consistent with real-world results. We further show how real-world data can be augmented with virtual rollouts and compare to policies trained on different representations derived from the simulator state including object poses and rendered images from both static and robot-mounted cameras. Our results highlight the flexibility of the real-is-sim framework across training, evaluation, and deployment stages. Videos available at https://real-is-sim.github.io.
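The mediation loop described above, where the policy always acts on the simulated robot, the real robot tracks the simulated joint states, and the twin is continuously corrected toward real measurements, can be caricatured with toy scalar dynamics. The function and gain below are illustrative assumptions, not the Embodied Gaussian simulator's actual synchronization mechanism.

```python
# Schematic of one real-is-sim cycle (the paper runs this at 60 Hz):
# the policy drives the simulated joint, the twin is nudged toward the
# real measurement, and the real robot follows the twin's state.
# All quantities are toy scalars; names and the gain are illustrative.

def step_twin(sim_q, real_q, action, gain=0.2):
    """One cycle: apply the policy action in sim, then correct toward reality."""
    sim_q = sim_q + action                    # policy acts on the twin only
    sim_q = sim_q + gain * (real_q - sim_q)   # synchronization correction
    real_target = sim_q                       # real robot follows the twin
    return sim_q, real_target

sim_q, real_q = 0.0, 0.05                     # twin slightly out of sync
sim_q, target = step_twin(sim_q, real_q, action=0.1)
```

The key property is visible even in this caricature: the policy never touches the real state directly, so crossing the sim-to-real gap is the correction term's job, not the policy's.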


PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?

Gundawar, Atharva, Sagar, Som, Senanayake, Ransalu

arXiv.org Artificial Intelligence

Vision-Language Models (VLMs) are increasingly pivotal for generalist robot manipulation, enabling tasks such as physical reasoning, policy generation, and failure detection. However, their proficiency in these high-level applications often assumes a deep understanding of low-level physical prerequisites, a capability that remains largely unverified. For robots to perform actions reliably, they must comprehend intrinsic object properties (e.g., material, weight), action affordances (e.g., graspable, stackable), and physical constraints (e.g., stability, reachability, or an object's state, such as being closed). Despite the widespread use of VLMs in manipulation tasks, we argue that off-the-shelf models may lack this granular, physically grounded understanding, as such prerequisites are often overlooked during training. To address this critical gap, we introduce PAC Bench, a comprehensive benchmark designed to systematically evaluate VLMs on their understanding of core Properties, Affordances, and Constraints (PAC) from a task executability perspective. PAC Bench features a diverse dataset with over 30,000 annotations, comprising 673 real-world images (115 object classes, 15 property types, and 1 to 3 affordances defined per class), 100 real-world humanoid-view scenarios, and 120 unique simulated constraint scenarios across four tasks. Our evaluations reveal significant gaps in the ability of current VLMs to grasp fundamental physical concepts, highlighting limitations in their suitability for reliable robot manipulation and pointing to key areas for targeted research. PAC Bench also serves as a standardized benchmark for rigorously evaluating physical reasoning in VLMs and guiding the development of more robust, physically grounded models for robotic applications. Project Page: https://pacbench.github.io/


RoboTwin: A Robotic Teleoperation Framework Using Digital Twins

Yelchuri, Harsha, Singh, Diwakar Kumar, Gnani, Nithish Krishnabharathi, Prabhakar, T V, Singh, Chandramani

arXiv.org Artificial Intelligence

Robotic surgery imposes a significant cognitive burden on the surgeon. This burden increases in remote robotic surgeries, where the patient side and the surgeon side are geographically separated by hundreds to thousands of kilometres, because latency between the two sites can affect the quality of surgery. Real-time teleoperation of robots requires strict latency bounds for control and feedback. We propose a dual digital twin (DT) framework and describe its simulation environment and teleoperation framework. The surgeon visually controls the locally available DT of the patient side and thus experiences minimal latency. The second digital twin serves two purposes: it provides a layer of safety against operator-related mishaps, and it conveys the coordinates of known and unknown objects back to the operator-side digital twin. We show that teleoperation accuracy and user experience are enhanced with our approach. Experimental results using the NASA-TLX metric show that the quality of surgery is vastly improved with DT, likely due to reduced cognitive burden. The network data rate for identifying objects at the operator side is 25x lower than in the conventional setup.
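The latency-hiding idea behind the dual-twin setup can be caricatured with a delay queue: the operator's command moves the local twin immediately, while the remote side receives it only after the network delay. The class, delay, and command values below are illustrative assumptions, not the authors' protocol.

```python
# Toy model of the dual digital twin: the local twin responds instantly,
# the remote robot lags by the link delay. Values are illustrative.
from collections import deque

class DelayedLink:
    """Delivers messages after a fixed number of ticks (network latency)."""
    def __init__(self, delay_ticks):
        self.delay = delay_ticks
        self.queue = deque()

    def send(self, tick, msg):
        self.queue.append((tick + self.delay, msg))

    def receive(self, tick):
        out = []
        while self.queue and self.queue[0][0] <= tick:
            out.append(self.queue.popleft()[1])
        return out

link = DelayedLink(delay_ticks=3)
local_twin, remote_robot = 0.0, 0.0
for t in range(6):
    cmd = 0.1                  # operator command each tick
    local_twin += cmd          # local twin responds with no latency
    link.send(t, cmd)
    for m in link.receive(t):
        remote_robot += m      # remote side trails by the link delay
```

The operator's visual loop closes on `local_twin`, so the perceived latency is near zero even though `remote_robot` lags by the full network delay, which is the framework's central claim.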


Robust Humanoid Walking on Compliant and Uneven Terrain with Deep Reinforcement Learning

Singh, Rohan P., Morisawa, Mitsuharu, Benallegue, Mehdi, Xie, Zhaoming, Kanehiro, Fumio

arXiv.org Artificial Intelligence

Email: rohan-singh@aist.go.jp. Abstract -- For the deployment of legged robots in real-world environments, it is essential to develop robust locomotion control methods for challenging terrains that may exhibit unexpected deformability and irregularity. In this paper, we explore the application of sim-to-real deep reinforcement learning (RL) to the design of bipedal locomotion controllers for humanoid robots on compliant and uneven terrains. Our key contribution is to show that a simple training curriculum exposing the RL agent to randomized terrains in simulation can achieve robust walking on a real humanoid robot using only proprioceptive feedback. We train an end-to-end bipedal locomotion policy using the proposed approach and show extensive real-robot demonstrations on the HRP-5P humanoid over several difficult terrains inside and outside the lab environment. Further, we argue that the robustness of a bipedal walking policy can be improved if the robot is allowed to exhibit aperiodic motion with variable stepping frequency. We propose a new control policy that enables modification of the observed clock signal, leading to adaptive gait frequencies depending on the terrain and command velocity. Through simulation experiments, we show the effectiveness of this policy specifically for walking over challenging terrains by controlling swing and stance durations. This is primarily due to the strict temporal and spatial assumptions placed by such approaches on the foot trajectories and environmental contacts [1], [2]. When faced with an irregular or compliant (i.e.
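The adaptive-clock idea above can be sketched as a phase oscillator whose frequency the policy modulates: the observed clock is a function of the phase, and slowing the phase lengthens swing and stance durations. This is a toy illustration under assumed names and values, not the authors' network or controller.

```python
# Toy gait clock: the policy scales the phase rate; scale > 1 speeds the
# gait, scale < 1 slows it. Frequencies and dt are illustrative.
import math

def advance_phase(phase, base_freq_hz, policy_scale, dt):
    """Advance the gait phase by one control step, wrapping at 2*pi."""
    phase += 2.0 * math.pi * base_freq_hz * policy_scale * dt
    return phase % (2.0 * math.pi)

def clock_observation(phase):
    """The periodic clock signal the policy observes."""
    return (math.sin(phase), math.cos(phase))

phase = 0.0
for _ in range(50):                     # 50 control steps at dt = 0.02 s
    phase = advance_phase(phase, base_freq_hz=1.0, policy_scale=0.5, dt=0.02)
obs = clock_observation(phase)          # half a gait cycle at half speed
```

Because the scale enters the phase rate rather than the observation directly, the gait stays smooth while its period stretches or shrinks, which is how variable stepping frequency emerges without retraining a fixed-period policy.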