SAM2Grasp: Resolve Multi-modal Grasping via Prompt-conditioned Temporal Action Prediction

Wu, Shengkai, Yang, Jinrong, Luo, Wenqiu, Gao, Linfeng, Shang, Chaohui, Zhi, Meiyu, Sun, Mingshan, Yang, Fangping, Ren, Liangliang, Zhao, Yong

arXiv.org Artificial Intelligence

Imitation learning for robotic grasping is often plagued by the multimodal problem: when a scene contains multiple valid targets, demonstrations of grasping different objects create conflicting training signals. Standard imitation learning policies fail by averaging these distinct actions into a single, invalid action. In this paper, we introduce SAM2Grasp, a novel framework that resolves this issue by reformulating the task as a uni-modal, prompt-conditioned prediction problem. Our method leverages the frozen SAM2 model for its powerful temporal visual tracking capability and introduces a lightweight, trainable action head that operates in parallel with its native segmentation head. This design allows for training only the small action head on pre-computed temporal-visual features from SAM2. During inference, an initial prompt, such as a bounding box provided by an upstream object detection model, designates the specific object to be grasped. This prompt conditions the action head to predict a unique, unambiguous grasp trajectory for that object alone. In all subsequent video frames, SAM2's built-in temporal tracking capability automatically maintains stable tracking of the selected object, enabling our model to continuously predict the grasp trajectory from the video stream without further external guidance. This temporal-prompted approach effectively eliminates ambiguity from the visuomotor policy. We demonstrate through extensive experiments that SAM2Grasp achieves state-of-the-art performance in cluttered, multi-object grasping tasks.
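The core idea above, training only a small action head on frozen tracker features so that one prompt yields one unambiguous action, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimension, 7-DoF action parameterization, and the `ActionHead` class are assumptions, and the frozen SAM2 tracker is stood in for by a random feature vector per frame.

```python
import numpy as np

rng = np.random.default_rng(0)

class ActionHead:
    """Hypothetical lightweight trainable head: maps the tracked object's
    per-frame temporal-visual feature (from a frozen backbone) to a 7-DoF
    grasp action (position, orientation, gripper). Only these weights train."""
    def __init__(self, feat_dim=256, act_dim=7, hidden=128):
        self.W1 = rng.normal(0.0, 0.02, (feat_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.02, (hidden, act_dim))
        self.b2 = np.zeros(act_dim)

    def __call__(self, feat):
        # Two-layer MLP with ReLU; the prompt has already selected the
        # object, so the input feature describes a single target and the
        # regression target is uni-modal.
        h = np.maximum(feat @ self.W1 + self.b1, 0.0)
        return h @ self.W2 + self.b2

head = ActionHead()
# One action per video frame: the (stand-in) tracker feature changes as the
# prompted object moves, and the head re-predicts the grasp each frame.
trajectory = [head(rng.normal(size=256)) for _ in range(5)]
```

Because the prompt fixes the target before prediction, the head never has to average over demonstrations of grasping different objects, which is how the reformulation removes the multimodality.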




VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

Wang, Yixiao, Huo, Mingxiao, Liang, Zhixuan, Du, Yushi, Sun, Lingfeng, Lin, Haotian, Shang, Jinghuan, Peng, Chensheng, Bansal, Mohit, Ding, Mingyu, Tomizuka, Masayoshi

arXiv.org Artificial Intelligence

Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for a policy can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient finetuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and code can be found at https://yixiaowang7.github.io/ver_page/.
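The two routing mechanisms named above, per-patch top-k expert selection and a curriculum that anneals k, can be sketched generically. This is an illustrative reconstruction under assumed shapes, not VER's code: the function names, the linear annealing schedule, and the softmax-over-surviving-logits choice are assumptions.

```python
import numpy as np

def route_topk(logits, k):
    """Patchwise top-k routing sketch: per patch (last axis = experts),
    keep the k largest router logits, softmax over them, zero the rest."""
    idx = np.argsort(logits, axis=-1)[..., -k:]          # indices of top-k experts
    mask = np.zeros_like(logits, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    masked = np.where(mask, logits, -np.inf)             # drop non-selected experts
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)         # weights sum to 1 per patch

def curriculum_k(step, total_steps, k_start, k_end):
    """Anneal k linearly from a permissive k_start down to a sparse k_end,
    so routing starts soft and gradually commits to fewer experts."""
    frac = min(step / total_steps, 1.0)
    return max(k_end, round(k_start - frac * (k_start - k_end)))
```

A router trained this way starts by blending many experts per patch and, as k shrinks, is forced to pick the few experts that actually matter for that region, which matches the paper's observation that attention concentrates on task-critical regions.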


CLASS: Contrastive Learning via Action Sequence Supervision for Robot Manipulation

Lee, Sung-Wook, Kang, Xuhui, Yang, Brandon, Kuo, Yen-Ling

arXiv.org Artificial Intelligence

Behavior Cloning (BC) has demonstrated strong performance in robotic manipulation by leveraging expressive models and action sequence modeling. Efforts to improve BC have focused on large-scale dataset collection [1, 2] and advances in model architectures [3, 4, 5] to better capture the complex distribution of demonstration data. However, expressive policies often struggle to generalize, especially when trained on demonstrations collected under heterogeneous conditions--that is, where the policy must adapt to additional properties not present in homogeneous data, such as changes in viewpoint or object appearance [6, 7]. This suggests a tendency to overfit individual actions and a limited ability to capture shared structure across demonstrations [8]. To address this, we propose Contrastive Learning via Action Sequence Supervision (CLASS), a framework for learning behaviorally grounded representations from demonstrations using supervised contrastive learning. Rather than relying on direct action prediction, CLASS supervises the encoder by aligning observations based on action sequence similarity, measured via Dynamic Time Warping (DTW), encouraging states that lead to similar future behaviors to cluster in the latent space. This weak supervision enables the model to capture shared structure across demonstrations, improving robustness to variations in visual conditions such as camera pose and object appearance. The learned representation supports both retrieval-based inference and policy fine-tuning, and consistently improves performance across both homogeneous and heterogeneous data settings. Across a range of simulated and real-world robotic manipulation tasks, CLASS achieves strong gains over behavior cloning and representation learning baselines, demonstrating its ability to learn more transferable and composable behavioral representations.
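The supervision signal described above, labeling observation pairs as positives when their future action sequences are close under Dynamic Time Warping, can be sketched as follows. This is a minimal illustration under assumed interfaces (the threshold-based pairing and the Euclidean per-step cost are assumptions; CLASS's actual contrastive loss is not reproduced here).

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two action sequences of shape
    [T, D]; smaller means the two futures are behaviorally similar even if
    they unfold at different speeds."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Standard DTW recurrence: insertion, deletion, or match.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def positive_pairs(action_sequences, threshold):
    """Weak supervision: observation indices whose future action sequences
    fall within a DTW threshold become positives for contrastive learning."""
    pairs = []
    for i in range(len(action_sequences)):
        for j in range(i + 1, len(action_sequences)):
            if dtw_distance(action_sequences[i], action_sequences[j]) < threshold:
                pairs.append((i, j))
    return pairs
```

Because DTW aligns sequences non-linearly in time, two demonstrations of the same behavior recorded at different speeds still count as a positive pair, which is what lets states with similar futures cluster in the latent space.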


A Robust PPO-optimized Tabular Transformer Framework for Intrusion Detection in Industrial IoT Systems

She, Yuanya

arXiv.org Artificial Intelligence

In this paper, we propose a robust and reinforcement-learning-enhanced network intrusion detection system (NIDS) designed for class-imbalanced and few-shot attack scenarios in Industrial Internet of Things (IIoT) environments. Our model integrates a TabTransformer for effective tabular feature representation with Proximal Policy Optimization (PPO) to optimize classification decisions via policy learning. Evaluated on the TON_IoT benchmark, our method achieves a macro F1-score of 97.73% and accuracy of 98.85%. Remarkably, even on extremely rare classes like man-in-the-middle (MITM), our model achieves an F1-score of 88.79%, showcasing strong robustness and few-shot detection capabilities. Extensive ablation experiments confirm the complementary roles of TabTransformer and PPO in mitigating class imbalance and improving generalization. These results highlight the potential of combining transformer-based tabular learning with reinforcement learning for real-world NIDS applications.
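The headline metric here is the macro F1-score, which is the right choice for class-imbalanced intrusion detection because each class contributes equally regardless of frequency. A minimal reference implementation (standard definition, not the paper's code):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class independently, then take the
    unweighted mean, so a rare class (e.g. MITM) weighs the same as a
    frequent one -- unlike accuracy, which a majority class can dominate."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

This is why the per-class MITM F1 of 88.79% is reported separately: a model could reach high accuracy on TON_IoT while missing MITM entirely, but the macro average would expose that failure.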


RoboFlamingo-Plus: Fusion of Depth and RGB Perception with Vision-Language Models for Enhanced Robotic Manipulation

Wang, Sheng

arXiv.org Artificial Intelligence

As robotic technologies advance towards more complex multimodal interactions and manipulation tasks, the integration of advanced Vision-Language Models (VLMs) has become a key driver in the field. Despite progress with current methods, challenges persist in fusing depth and RGB information within 3D environments and executing tasks guided by linguistic instructions. In response to these challenges, we have enhanced the existing RoboFlamingo framework by introducing RoboFlamingo-Plus, which incorporates depth data into VLMs to significantly improve robotic manipulation performance. Our research achieves a nuanced fusion of RGB and depth information by integrating a pre-trained Vision Transformer (ViT) with a resampling technique, closely aligning this combined data with linguistic cues for superior multimodal understanding. The novelty of RoboFlamingo-Plus lies in its adaptation of inputs for depth data processing, leveraging a pre-trained resampler for depth feature extraction, and employing cross-attention mechanisms for optimal feature integration. These improvements allow RoboFlamingo-Plus not only to deeply understand 3D environments but also to perform complex, language-guided tasks in challenging settings. Experimental results show that RoboFlamingo-Plus boosts robotic manipulation by 10-20% over current methods, marking a significant advancement. Code and model weights are public at RoboFlamingo-Plus.
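The cross-attention fusion mentioned above can be sketched generically: one modality's tokens act as queries over the other modality's tokens, so depth features are injected exactly where the RGB/language stream attends to them. This is an illustrative single-head sketch with assumed dimensions and randomly initialized projections, not the RoboFlamingo-Plus architecture.

```python
import numpy as np

def cross_attention(q_tokens, kv_tokens, d_k=32, seed=0):
    """Single-head cross-attention sketch: q_tokens (e.g. RGB/language
    stream, shape [Nq, Dq]) query kv_tokens (e.g. resampled depth features,
    shape [Nkv, Dkv]); projections here are random stand-ins for learned
    weights."""
    rng = np.random.default_rng(seed)
    Wq = rng.normal(0.0, 0.02, (q_tokens.shape[-1], d_k))
    Wk = rng.normal(0.0, 0.02, (kv_tokens.shape[-1], d_k))
    Wv = rng.normal(0.0, 0.02, (kv_tokens.shape[-1], d_k))
    q, k, v = q_tokens @ Wq, kv_tokens @ Wk, kv_tokens @ Wv
    scores = q @ k.T / np.sqrt(d_k)                      # [Nq, Nkv] affinities
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)                  # softmax over depth tokens
    return attn @ v                                      # depth-informed queries
```

Note the asymmetry of the design: the query stream keeps its token count and ordering, so depth can be fused into an existing VLM pipeline without changing the downstream sequence length.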


Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models

Li, Xinghang, Li, Peiyan, Liu, Minghuan, Wang, Dong, Liu, Jirong, Kang, Bingyi, Ma, Xiao, Kong, Tao, Zhang, Hanbo, Liu, Huaping

arXiv.org Artificial Intelligence

By injecting action components into VLMs, Vision-Language-Action models (VLAs) can be naturally formed and show promising performance. Existing work has demonstrated the effectiveness and generalization of VLAs in multiple scenarios and tasks. Nevertheless, the transfer from VLMs to VLAs is not trivial, since existing VLAs differ in their backbones, action-prediction formulations, data distributions, and training recipes. This leaves a missing piece in the systematic understanding of VLA design choices. In this work, we disclose the key factors that significantly influence VLA performance and focus on answering three essential design questions: which backbone to select, how to formulate the VLA architecture, and when to add cross-embodiment data. The obtained results firmly explain why we prefer VLAs and motivate the development of a new family of VLAs, RoboVLMs, which requires very few manual designs and achieves new state-of-the-art performance in three simulation tasks and real-world experiments. Through our extensive experiments, which include over 8 VLM backbones, 4 policy architectures, and over 600 distinct designed experiments, we provide a detailed guidebook for the future design of VLAs. In addition to the study, the highly flexible RoboVLMs framework, which supports easy integration of new VLMs and free combinations of various design choices, is made public to facilitate future research.


Multi-Objective Alignment of Large Language Models Through Hypervolume Maximization

Mukherjee, Subhojyoti, Lalitha, Anusha, Sengupta, Sailik, Deshmukh, Aniket, Kveton, Branislav

arXiv.org Artificial Intelligence

Multi-objective alignment from human feedback (MOAHF) in large language models (LLMs) is a challenging problem as human preferences are complex, multifaceted, and often conflicting. Recent works on MOAHF considered a-priori multi-objective optimization (MOO), where human preferences are known at training or inference time. In contrast, when human preferences are unknown or difficult to quantify, a natural approach is to cover the Pareto front by multiple diverse solutions. We propose an algorithm HaM for learning diverse LLM policies that maximizes their hypervolume. This is the first application of a-posteriori MOO to MOAHF. HaM is computationally and space efficient, and empirically superior across objectives such as harmlessness, helpfulness, humor, faithfulness, and hallucination, on various datasets.
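Hypervolume, the quantity HaM maximizes, measures how much of objective space a set of solutions jointly dominates relative to a reference point; a diverse set covering the Pareto front scores higher than clustered solutions. A standard two-objective computation (illustrative only; HaM operates on more objectives and on LLM policies):

```python
def hypervolume_2d(points, ref):
    """Hypervolume of a set of 2-objective points (both maximized), measured
    against a reference point that is worse in both objectives: the area of
    the union of rectangles [ref, point], summed as a staircase sweep."""
    # Keep only points strictly better than the reference, sweep by first
    # objective in descending order.
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: p[0], reverse=True)
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                      # dominated points add nothing
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv
```

Maximizing this quantity is what makes the approach a-posteriori: instead of committing to one preference weighting up front, the learned set of policies is rewarded for spreading across the front, so a user can pick a trade-off (e.g. helpfulness vs. harmlessness) after training.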


Effective Tuning Strategies for Generalist Robot Manipulation Policies

Zhang, Wenbo, Li, Yang, Qiao, Yanyuan, Huang, Siyuan, Liu, Jiajun, Dayoub, Feras, Ma, Xiao, Liu, Lingqiao

arXiv.org Artificial Intelligence

Generalist robot manipulation policies (GMPs) have the potential to generalize across a wide range of tasks, devices, and environments. However, existing policies continue to struggle with out-of-distribution scenarios due to the inherent difficulty of collecting enough action data to cover extensively diverse domains. While fine-tuning offers a practical way to quickly adapt a GMP to novel domains and tasks with limited samples, we observe that the performance of the resulting policy differs significantly with respect to the design choices of the fine-tuning strategy. In this work, we first conduct an in-depth empirical study to investigate the effect of key factors in GMP fine-tuning strategies, covering the action space, policy head, supervision signal, and the choice of tunable parameters, where 2,500 rollouts are evaluated for a single configuration. We systematically discuss and summarize our findings and identify the key design choices, which we believe give a practical guideline for GMP fine-tuning. We observe that in a low-data regime, with carefully chosen fine-tuning strategies, a GMP significantly outperforms state-of-the-art imitation learning algorithms. The results presented in this work establish a new baseline for future studies on fine-tuned GMPs, and provide a significant addition to the GMP toolbox for the community.