Jiang, Yiwei
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
Zhang, Wenqi, Wang, Mengna, Liu, Gangao, Xu, Huixin, Jiang, Yiwei, Shen, Yongliang, Hou, Guiyang, Zheng, Zhe, Zhang, Hang, Li, Xin, Lu, Weiming, Li, Peng, Zhuang, Yueting
Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied-Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning, which relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. Evaluations show that our model significantly outperforms advanced visual reasoning models, exceeding OpenAI o1, o3-mini, and Claude-3.7 by +9%, +24%, and +13%, respectively. Analysis reveals that our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages on complex long-horizon tasks. Experiments in real-world environments confirm these advantages, again showing fewer repeated searches and logical inconsistencies.
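A minimal sketch of the Observation-Thought-Action interaction loop the abstract describes; all names here (OTAStep, env.reset, env.step, model.step) are illustrative assumptions, not the paper's actual interfaces:

```python
from dataclasses import dataclass, field

# Illustrative record for one Observation-Thought-Action step;
# field names are assumptions, not the paper's schema.
@dataclass
class OTAStep:
    observation: bytes  # rendered image from the environment
    thought: str        # analysis / spatial reasoning / reflection / planning / verification
    action: str         # e.g. "navigate to shelf", "open drawer", "pick up cup"

@dataclass
class Trajectory:
    task: str
    steps: list = field(default_factory=list)

def rollout(env, model, task: str, max_steps: int = 30) -> Trajectory:
    """Roll out one interleaved image-action trajectory.

    `env` and `model` are hypothetical interfaces: env.reset/env.step
    return an image observation, and model.step produces a
    (thought, action) pair conditioned on the full interaction history.
    """
    traj = Trajectory(task=task)
    obs = env.reset(task)
    for _ in range(max_steps):
        thought, action = model.step(task, traj.steps, obs)
        traj.steps.append(OTAStep(observation=obs, thought=thought, action=action))
        obs, done = env.step(action)
        if done:
            break
    return traj
```

Under this framing, the rejection-sampling stage would keep only rollouts that complete the task, and reflection tuning would train on trajectories whose thoughts correct earlier mistakes.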
A Preliminary Add-on Differential Drive System for MRI-Compatible Prostate Robotic System
Zhao, Zhanyue, Jiang, Yiwei, Bales, Charles, Wang, Yang, Fischer, Gregory
MRI-targeted biopsy has shown significant advantages over conventional random sextant biopsy, detecting more clinically significant cancers and improving risk stratification. However, needle targeting accuracy, especially in transperineal MRI-guided biopsies, presents a challenge due to needle deflection. This can negatively impact patient outcomes, leading to repeated sampling and inaccurate diagnoses if cancerous tissue is not properly collected. To address this, we developed a novel differential drive prototype designed to improve needle control and targeting precision. This system, featuring a 2-degree-of-freedom (2-DOF) MRI-compatible cooperative needle driver, distances the robot from the MRI imaging area, minimizing image artifacts and distortions. By using two motors for simultaneous needle insertion and rotation without relative movement, the design reduces MRI interference. In this work, we introduced two mechanical differential drive designs, the ball screw/spline and lead screw/bushing types, and explored both hollow-type and side-pulley differentials. Validation through low-resolution rapid prototyping demonstrated the feasibility of differential drives in prostate biopsies, with the custom hollow-type hybrid ultrasonic motor (USM) achieving a rotary speed of 75 rpm. The side-pulley differential further increased the speed to 168 rpm, which is well suited to needle rotation applications. Accuracy assessments showed minimal errors in both insertion and rotation motions, indicating that this proof-of-concept design holds great promise for further development. Ultimately, the differential drive offers a promising solution to the critical issue of needle targeting accuracy in MRI-guided prostate biopsies.
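As a rough illustration of the differential principle, here is a simplified kinematic sketch of an idealized ball screw/spline differential, where one motor turns the screw nut and the other the spline nut. The 5 mm screw lead, sign conventions, and function names are assumptions for illustration, not the prototype's actual parameters:

```python
# Simplified kinematics for an idealized ball screw/spline differential.
# The 5 mm lead is an illustrative value, not the prototype's specification.
SCREW_LEAD_MM = 5.0  # axial travel per full revolution of relative motion

def needle_motion(omega_screw_rpm: float, omega_spline_rpm: float):
    """Map the two motor speeds (rpm) to needle rotation (rpm) and
    insertion speed (mm/s) under the idealized model.

    The spline nut sets the needle's rotation; only the *relative* speed
    between the two nuts produces axial travel, so driving both motors
    in unison rotates the needle without inserting it.
    """
    rotation_rpm = omega_spline_rpm
    relative_rps = (omega_screw_rpm - omega_spline_rpm) / 60.0  # rev/s
    insertion_mm_s = SCREW_LEAD_MM * relative_rps
    return rotation_rpm, insertion_mm_s

# Both motors at 75 rpm: pure rotation, zero insertion.
print(needle_motion(75.0, 75.0))  # (75.0, 0.0)
# Screw nut at 90 rpm, spline at 75 rpm: rotation plus slow insertion.
print(needle_motion(90.0, 75.0))  # (75.0, 1.25)
```

The first example shows the property the design exploits: equal motor speeds rotate the needle with no relative movement between the nuts, and hence no insertion.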
Multimodal Large Language Model Driven Scenario Testing for Autonomous Vehicles
Lu, Qiujing, Wang, Xuanhan, Jiang, Yiwei, Zhao, Guangming, Ma, Mingyue, Feng, Shuo
The generation of corner cases has become increasingly crucial for efficiently testing autonomous vehicles prior to road deployment. However, existing methods struggle to accommodate diverse testing requirements and often lack the ability to generalize to unseen situations, thereby reducing the convenience and usability of the generated scenarios. A method that facilitates easily controllable scenario generation for efficient autonomous vehicle (AV) testing with realistic and challenging situations is greatly needed. To address this, we propose OmniTester, a multimodal Large Language Model (LLM)-based framework that fully leverages the extensive world knowledge and reasoning capabilities of LLMs. OmniTester is designed to generate realistic and diverse scenarios within a simulation environment, offering a robust solution for testing and evaluating AVs. In addition to prompt engineering, we employ tools from Simulation of Urban Mobility (SUMO) to reduce the complexity of the code generated by LLMs. Furthermore, we incorporate Retrieval-Augmented Generation and a self-improvement mechanism to enhance the LLM's understanding of scenarios, thereby increasing its ability to produce more realistic scenes. In our experiments, we demonstrate the controllability and realism of our approach in generating three types of challenging and complex scenarios. Additionally, we showcase its effectiveness in reconstructing new scenarios described in crash reports, driven by the generalization capability of LLMs.
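A hedged sketch of how such a generation loop might be wired together, combining retrieval augmentation with a self-improvement round trip. call_llm, retrieve_similar, and run_in_simulator are hypothetical stubs, not OmniTester's actual API:

```python
# Illustrative LLM-driven scenario-generation loop with RAG and
# self-improvement, in the spirit of the framework described above.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM client call (e.g., a chat-completion API)."""
    raise NotImplementedError

def retrieve_similar(requirement: str, scenario_db: list) -> list:
    """RAG step: fetch stored scenarios whose descriptions resemble the
    testing requirement (a naive keyword match stands in for a real retriever)."""
    return [s for s in scenario_db if any(w in s["desc"] for w in requirement.split())]

def run_in_simulator(scenario_code: str) -> tuple:
    """Placeholder: execute the generated scenario (e.g., via SUMO) and
    return (success, error_message)."""
    raise NotImplementedError

def generate_scenario(requirement: str, scenario_db: list, max_rounds: int = 3) -> str:
    examples = retrieve_similar(requirement, scenario_db)
    prompt = (f"Generate simulator code for this AV test scenario:\n{requirement}\n"
              f"Reference examples:\n{examples}")
    code = call_llm(prompt)
    for _ in range(max_rounds):  # self-improvement: feed errors back to the LLM
        ok, error = run_in_simulator(code)
        if ok:
            return code
        code = call_llm(f"The scenario failed with:\n{error}\nRevise the code:\n{code}")
    return code
```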
Suturing Tasks Automation Based on Skills Learned From Demonstrations: A Simulation Study
Zhou, Haoying, Jiang, Yiwei, Gao, Shang, Wang, Shiyue, Kazanzides, Peter, Fischer, Gregory S.
In this work, we develop an open-source surgical simulation environment that includes a realistic model obtained by MRI-scanning a physical phantom, for the purpose of training and evaluating a Learning from Demonstration (LfD) algorithm for autonomous suturing. The LfD algorithm utilizes Dynamic Movement Primitives (DMP) and Locally Weighted Regression (LWR), but focuses on the needle trajectory, rather than the instruments, to obtain better generality with respect to needle grasps. We conduct a user study to collect multiple suturing demonstrations and perform a comprehensive analysis of the ability of the LfD algorithm to generalize from a demonstration at one location in one phantom to different locations in the same phantom and to a different phantom. Our results indicate good generalization, on the order of 91.5%, when learning from more experienced subjects, suggesting the need to integrate skill assessment in the future.
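For readers unfamiliar with the building blocks, below is a textbook one-dimensional DMP fitted with LWR. This is a generic sketch of the technique named in the abstract, not the paper's implementation, which learns the needle trajectory in full pose space:

```python
import numpy as np

# Minimal 1-D Dynamic Movement Primitive learned from one demonstration
# via Locally Weighted Regression (standard discrete DMP formulation).
class DMP1D:
    def __init__(self, n_basis=20, alpha_z=25.0, alpha_x=4.6):
        self.n, self.az, self.bz, self.ax = n_basis, alpha_z, alpha_z / 4.0, alpha_x
        self.c = np.exp(-alpha_x * np.linspace(0, 1, n_basis))  # basis centers in x
        self.h = 1.0 / np.diff(self.c, append=self.c[-1] * 0.5) ** 2
        self.w = np.zeros(n_basis)

    def _psi(self, x):
        return np.exp(-self.h * (x - self.c) ** 2)

    def fit(self, y, dt):
        """Learn basis weights from one demonstration y(t) sampled at dt."""
        self.y0, self.g, self.tau = y[0], y[-1], dt * (len(y) - 1)
        yd = np.gradient(y, dt)
        ydd = np.gradient(yd, dt)
        t = np.arange(len(y)) * dt
        x = np.exp(-self.ax * t / self.tau)  # canonical system
        # Forcing term the demonstration implies for the transformation system.
        f_target = self.tau**2 * ydd - self.az * (self.bz * (self.g - y) - self.tau * yd)
        s = x * (self.g - self.y0)
        for i in range(self.n):  # LWR: one weighted least-squares fit per basis
            psi = np.exp(-self.h[i] * (x - self.c[i]) ** 2)
            self.w[i] = np.sum(psi * s * f_target) / (np.sum(psi * s * s) + 1e-10)

    def rollout(self, g=None, dt=0.01):
        """Reproduce the motion, optionally generalizing to a new goal g."""
        g = self.g if g is None else g
        y, z, x, out = self.y0, 0.0, 1.0, []
        for _ in range(int(self.tau / dt)):
            psi = self._psi(x)
            f = psi @ self.w / (psi.sum() + 1e-10) * x * (g - self.y0)
            z += dt * (self.az * (self.bz * (g - y) - z) + f) / self.tau
            y += dt * z / self.tau
            x += dt * (-self.ax * x) / self.tau
            out.append(y)
        return np.array(out)
```

In this framing, generalizing to a new suture location amounts to calling rollout with a shifted goal g, which mirrors the kind of cross-location, cross-phantom generalization the study evaluates.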