Ding, Hao
dARt Vinci: Egocentric Data Collection for Surgical Robot Learning at Scale
Liu, Yihao, Ku, Yu-Chun, Zhang, Jiaming, Ding, Hao, Kazanzides, Peter, Armand, Mehran
Data scarcity has long been an issue in the robot learning community. Particularly, in safety-critical domains like surgical applications, obtaining high-quality data can be especially difficult. It poses challenges to researchers seeking to exploit recent advancements in reinforcement learning and imitation learning, which have greatly improved generalizability and enabled robots to conduct tasks autonomously. We introduce dARt Vinci, a scalable data collection platform for robot learning in surgical settings. The system uses Augmented Reality (AR) hand tracking and a high-fidelity physics engine to capture subtle maneuvers in primitive surgical tasks: By eliminating the need for a physical robot setup and providing flexibility in terms of time, space, and hardware resources-such as multiview sensors and actuators-specialized simulation is a viable alternative. At the same time, AR allows the robot data collection to be more egocentric, supported by its body tracking and content overlaying capabilities. Our user study confirms the proposed system's efficiency and usability, where we use widely-used primitive tasks for training teleoperation with da Vinci surgical robots. Data throughput improves across all tasks compared to real robot settings by 41% on average. The total experiment time is reduced by an average of 10%. The temporal demand in the task load survey is improved. These gains are statistically significant. Additionally, the collected data is over 400 times smaller in size, requiring far less storage while achieving double the frequency.
OPG-Policy: Occluded Push-Grasp Policy Learning with Amodal Segmentation
Ding, Hao, Zeng, Yiming, Wan, Zhaoliang, Cheng, Hui
Goal-oriented grasping in dense clutter, a fundamental challenge in robotics, demands an adaptive policy to handle occluded target objects and diverse configurations. Previous methods typically learn policies based on partially observable segments of the occluded target to generate motions. However, these policies often struggle to generate optimal motions due to uncertainties regarding the invisible portions of different occluded target objects across various scenes, resulting in low motion efficiency. To this end, we propose OPG-Policy, a novel framework that leverages amodal segmentation to predict occluded portions of the target and develop an adaptive push-grasp policy for cluttered scenarios where the target object is partially observed. Specifically, our approach trains a dedicated amodal segmentation module for diverse target objects to generate amodal masks. These masks and scene observations are mapped to the future rewards of grasp and push motion primitives via deep Q-learning to learn the motion critic. Afterward, the push and grasp motion candidates predicted by the critic, along with the relevant domain knowledge, are fed into the coordinator to generate the optimal motion implemented by the robot. Extensive experiments conducted in both simulated and real-world environments demonstrate the effectiveness of our approach in generating motion sequences for retrieving occluded targets, outperforming other baseline methods in success rate and motion efficiency.
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team, null, Du, Angang, Gao, Bofei, Xing, Bowei, Jiang, Changjiu, Chen, Cheng, Li, Cheng, Xiao, Chenjun, Du, Chenzhuang, Liao, Chonghua, Tang, Chuning, Wang, Congcong, Zhang, Dehao, Yuan, Enming, Lu, Enzhe, Tang, Fengxiang, Sung, Flood, Wei, Guangda, Lai, Guokun, Guo, Haiqing, Zhu, Han, Ding, Hao, Hu, Hao, Yang, Hao, Zhang, Hao, Yao, Haotian, Zhao, Haotian, Lu, Haoyu, Li, Haoze, Yu, Haozhen, Gao, Hongcheng, Zheng, Huabin, Yuan, Huan, Chen, Jia, Guo, Jianhang, Su, Jianlin, Wang, Jianzhou, Zhao, Jie, Zhang, Jin, Liu, Jingyuan, Yan, Junjie, Wu, Junyan, Shi, Lidong, Ye, Ling, Yu, Longhui, Dong, Mengnan, Zhang, Neo, Ma, Ningchen, Pan, Qiwei, Gong, Qucheng, Liu, Shaowei, Ma, Shengling, Wei, Shupeng, Cao, Sihan, Huang, Siying, Jiang, Tao, Gao, Weihao, Xiong, Weimin, He, Weiran, Huang, Weixiao, Wu, Wenhao, He, Wenyang, Wei, Xianghui, Jia, Xianqing, Wu, Xingzhe, Xu, Xinran, Zu, Xinxing, Zhou, Xinyu, Pan, Xuehai, Charles, Y., Li, Yang, Hu, Yangyang, Liu, Yangyang, Chen, Yanru, Wang, Yejie, Liu, Yibo, Qin, Yidao, Liu, Yifeng, Yang, Ying, Bao, Yiping, Du, Yulun, Wu, Yuxin, Wang, Yuzhi, Zhou, Zaida, Wang, Zhaoji, Li, Zhaowei, Zhu, Zhen, Zhang, Zheng, Wang, Zhexu, Yang, Zhilin, Huang, Zhiqi, Huang, Zihao, Xu, Ziyao, Yang, Zonghan
Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94-th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).
Language-Model Prior Overcomes Cold-Start Items
Wang, Shiyu, Ding, Hao, Gu, Yupeng, Aydore, Sergul, Kalantari, Kousha, Kveton, Branislav
The growth of recommender systems (RecSys) is driven by digitization and the need for personalized content in areas such as e-commerce and video streaming. The content in these systems often changes rapidly and therefore they constantly face the ongoing cold-start problem, where new items lack interaction data and are hard to value. Existing solutions for the cold-start problem, such as content-based recommenders and hybrid methods, leverage item metadata to determine item similarities. The main challenge with these methods is their reliance on structured and informative metadata to capture detailed item similarities, which may not always be available. This paper introduces a novel approach for cold-start item recommendation that utilizes the language model (LM) to estimate item similarities, which are further integrated as a Bayesian prior with classic recommender systems. This approach is generic and able to boost the performance of various recommenders. Specifically, our experiments integrate it with both sequential and collaborative filtering-based recommender and evaluate it on two real-world datasets, demonstrating the enhanced performance of the proposed approach.
Plug-and-Play Performance Estimation for LLM Services without Relying on Labeled Data
Wang, Can, Sui, Dianbo, Sun, Hongliang, Ding, Hao, Zhang, Bolin, Tu, Zhiying
However, the success of ICL varies depending on the task and context, leading to heterogeneous service quality. Directly estimating the performance of LLM services at each invocation can be laborious, especially requiring abundant labeled data or internal information within the LLM. This paper introduces a novel method to estimate the performance of LLM services across different tasks and contexts, which can be "plug-and-play" utilizing only a few unlabeled samples like ICL. Our findings suggest that the negative log-likelihood and perplexity derived from LLM service invocation can function as effective and significant features. Based on these features, we utilize four distinct meta-models to estimate the performance of LLM services. Our proposed method is compared against unlabeled estimation baselines across multiple LLM services and tasks. And it is experimentally applied to two scenarios, demonstrating its effectiveness in the selection and further optimization of LLM services.
Towards Robust Automation of Surgical Systems via Digital Twin-based Scene Representations from Foundation Models
Ding, Hao, Seenivasan, Lalithkumar, Shu, Hongchao, Byrd, Grayson, Zhang, Han, Xiao, Pu, Barragan, Juan Antonio, Taylor, Russell H., Kazanzides, Peter, Unberath, Mathias
Large language model-based (LLM) agents are emerging as a powerful enabler of robust embodied intelligence due to their capability of planning complex action sequences. Sound planning ability is necessary for robust automation in many task domains, but especially in surgical automation. These agents rely on a highly detailed natural language representation of the scene. Thus, to leverage the emergent capabilities of LLM agents for surgical task planning, developing similarly powerful and robust perception algorithms is necessary to derive a detailed scene representation of the environment from visual input. Previous research has focused primarily on enabling LLM-based task planning while adopting simple yet severely limited perception solutions to meet the needs for bench-top experiments but lack the critical flexibility to scale to less constrained settings. In this work, we propose an alternate perception approach -- a digital twin-based machine perception approach that capitalizes on the convincing performance and out-of-the-box generalization of recent vision foundation models. Integrating our digital twin-based scene representation and LLM agent for planning with the dVRK platform, we develop an embodied intelligence system and evaluate its robustness in performing peg transfer and gauze retrieval tasks. Our approach shows strong task performance and generalizability to varied environment settings. Despite convincing performance, this work is merely a first step towards the integration of digital twin-based scene representations. Future studies are necessary for the realization of a comprehensive digital twin framework to improve the interpretability and generalizability of embodied intelligence in surgery.
SegSTRONG-C: Segmenting Surgical Tools Robustly On Non-adversarial Generated Corruptions -- An EndoVis'24 Challenge
Ding, Hao, Lu, Tuxun, Zhang, Yuqian, Liang, Ruixing, Shu, Hongchao, Seenivasan, Lalithkumar, Long, Yonghao, Dou, Qi, Gao, Cong, Unberath, Mathias
Accurate segmentation of tools in robot-assisted surgery is critical for machine perception, as it facilitates numerous downstream tasks including augmented reality feedback. While current feed-forward neural network-based methods exhibit excellent segmentation performance under ideal conditions, these models have proven susceptible to even minor corruptions, significantly impairing the model's performance. This vulnerability is especially problematic in surgical settings where predictions might be used to inform high-stakes decisions. To better understand model behavior under non-adversarial corruptions, prior work has explored introducing artificial corruptions, like Gaussian noise or contrast perturbation to test set images, to assess model robustness. However, these corruptions are either not photo-realistic or model/task agnostic. Thus, these investigations provide limited insights into model deterioration under realistic surgical corruptions. To address this limitation, we introduce the SegSTRONG-C challenge that aims to promote the development of algorithms robust to unforeseen but plausible image corruptions of surgery, like smoke, bleeding, and low brightness. We collect and release corruption-free mock endoscopic video sequences for the challenge participants to train their algorithms and benchmark them on video sequences with photo-realistic non-adversarial corruptions for a binary robot tool segmentation task. This new benchmark will allow us to carefully study neural network robustness to non-adversarial corruptions of surgery, thus constituting an important first step towards more robust models for surgical computer vision. In this paper, we describe the data collection and annotation protocol, baseline evaluations of established segmentation models, and data augmentation-based techniques to enhance model robustness.
Logic-Scaffolding: Personalized Aspect-Instructed Recommendation Explanation Generation using LLMs
Rahdari, Behnam, Ding, Hao, Fan, Ziwei, Ma, Yifei, Chen, Zhuotong, Deoras, Anoop, Kveton, Branislav
The unique capabilities of Large Language Models (LLMs), such as the natural language text generation ability, position them as strong candidates for providing explanation for recommendations. However, despite the size of the LLM, most existing models struggle to produce zero-shot explanations reliably. To address this issue, we propose a framework called Logic-Scaffolding, that combines the ideas of aspect-based explanation and chain-of-thought prompting to generate explanations through intermediate reasoning steps. In this paper, we share our experience in building the framework and present an interactive demonstration for exploring our results.
Pre-trained Recommender Systems: A Causal Debiasing Perspective
Lin, Ziqian, Ding, Hao, Hoang, Nghia Trong, Kveton, Branislav, Deoras, Anoop, Wang, Hao
Recent studies on pre-trained vision/language models have demonstrated the practical benefit of a new, promising solution-building paradigm in AI where models can be pre-trained on broad data describing a generic task space and then adapted successfully to solve a wide range of downstream tasks, even when training data is severely limited (e.g., in zero- or few-shot learning scenarios). Inspired by such progress, we investigate in this paper the possibilities and challenges of adapting such a paradigm to the context of recommender systems, which is less investigated from the perspective of pre-trained model. In particular, we propose to develop a generic recommender that captures universal interaction patterns by training on generic user-item interaction data extracted from different domains, which can then be fast adapted to improve few-shot learning performance in unseen new domains (with limited data). However, unlike vision/language data which share strong conformity in the semantic space, universal patterns underlying recommendation data collected across different domains (e.g., different countries or different E-commerce platforms) are often occluded by both in-domain and cross-domain biases implicitly imposed by the cultural differences in their user and item bases, as well as their uses of different e-commerce platforms. As shown in our experiments, such heterogeneous biases in the data tend to hinder the effectiveness of the pre-trained model. To address this challenge, we further introduce and formalize a causal debiasing perspective, which is substantiated via a hierarchical Bayesian deep learning model, named PreRec. Our empirical studies on real-world data show that the proposed model could significantly improve the recommendation performance in zero- and few-shot learning settings under both cross-market and cross-platform scenarios.
Distilling Functional Rearrangement Priors from Large Models
Zeng, Yiming, Wu, Mingdong, Yang, Long, Zhang, Jiyao, Ding, Hao, Cheng, Hui, Dong, Hao
Object rearrangement, a fundamental challenge in robotics, demands versatile strategies to handle diverse objects, configurations, and functional needs. To achieve this, the AI robot needs to learn functional rearrangement priors in order to specify precise goals that meet the functional requirements. Previous methods typically learn such priors from either laborious human annotations or manually designed heuristics, which limits scalability and generalization. In this work, we propose a novel approach that leverages large models to distill functional rearrangement priors. Specifically, our approach collects diverse arrangement examples using both LLMs and VLMs and then distills the examples into a diffusion model. During test time, the learned diffusion model is conditioned on the initial configuration and guides the positioning of objects to meet functional requirements. In this manner, we create a handshaking point that combines the strengths of conditional generative models and large models. Extensive experiments on multiple domains, including real-world scenarios, demonstrate the effectiveness of our approach in generating compatible goals for object rearrangement tasks, significantly outperforming baseline methods.