Yao, Yuanqi
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Qu, Delin, Song, Haoming, Chen, Qizhi, Yao, Yuanqi, Ye, Xinyi, Ding, Yan, Wang, Zhigang, Gu, JiaYuan, Zhao, Bin, Wang, Dong, Li, Xuelong
In this paper, we claim that spatial understanding is the key point in robot manipulation, and propose SpatialVLA to explore effective spatial representations for the robot foundation model. Specifically, we introduce Ego3D Position Encoding to inject 3D information into the input observations of the visual-language-action model, and propose Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids, facilitating the learning of generalizable and transferable spatial action knowledge for cross-robot control. SpatialVLA is first pre-trained on top of a vision-language model with 1.1 million real-world robot episodes to learn a generalist manipulation policy across multiple robot environments and tasks. After pre-training, SpatialVLA is directly applied to perform numerous tasks in a zero-shot manner. The superior results in both simulation and on real-world robots demonstrate its advantage in inferring complex robot motion trajectories and its strong in-domain multi-task generalization ability. We further show that the proposed Adaptive Action Grids offer a new and effective way to fine-tune the pre-trained SpatialVLA model for new simulation and real-world setups, where the pre-learned action grids are re-discretized to capture the robot-specific spatial action movements of the new setups. Extensive evaluations further demonstrate exceptional in-distribution generalization and out-of-distribution adaptation capability, highlighting the crucial benefit of the proposed spatial-aware representations for generalist robot policy learning. All details and code will be open-sourced.
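The abstract describes Adaptive Action Grids only at a high level. Below is a minimal sketch of one way adaptive action discretization could work, binning each continuous action dimension by the empirical quantiles of the training data so that bins are denser where actions concentrate; the function names, bin count, and quantile rule are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of quantile-based adaptive action discretization.
# NOT the SpatialVLA implementation; names, shapes, and the binning rule
# are assumptions for exposition.
import numpy as np

def fit_adaptive_grids(actions: np.ndarray, n_bins: int = 256) -> np.ndarray:
    """Fit per-dimension bin edges from empirical quantiles.

    actions: (N, D) continuous robot actions (e.g., end-effector deltas).
    Returns bin edges of shape (D, n_bins + 1); edges are denser where
    the action distribution is denser, i.e., the grid adapts to the data.
    """
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)
    return np.stack([np.quantile(actions[:, d], quantiles)
                     for d in range(actions.shape[1])], axis=0)

def discretize(actions: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Map continuous actions to integer grid tokens per dimension."""
    n_bins = edges.shape[1] - 1
    tokens = np.empty(actions.shape, dtype=np.int64)
    for d in range(actions.shape[1]):
        # digitize against interior edges gives indices in [0, n_bins - 1]
        tokens[:, d] = np.clip(np.digitize(actions[:, d], edges[d, 1:-1]),
                               0, n_bins - 1)
    return tokens

def undiscretize(tokens: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Recover continuous actions as the centers of the selected bins."""
    centers = 0.5 * (edges[:, :-1] + edges[:, 1:])  # (D, n_bins)
    return np.stack([centers[d, tokens[:, d]] for d in range(tokens.shape[1])],
                    axis=1)
```

Under this reading, the re-discretization used for fine-tuning on a new setup would amount to re-running fit_adaptive_grids on the new robot's action data so the grid reflects that robot's movement statistics.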
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
Gu, Tianle, Zhou, Zeyang, Huang, Kexin, Liang, Dandan, Wang, Yixu, Zhao, Haiquan, Yao, Yuanqi, Qiao, Xingge, Wang, Keqing, Yang, Yujiu, Teng, Yan, Qiao, Yu, Wang, Yingchun
Powered by remarkable advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities across a wide range of tasks. However, the practical application scenarios of MLLMs are intricate, exposing them to potentially malicious instructions and thereby posing safety risks. While current benchmarks do incorporate certain safety considerations, they often lack comprehensive coverage and the necessary rigor and robustness. For instance, the common practice of employing GPT-4V as both the evaluator and a model being evaluated lacks credibility, as it tends to favor its own responses. In this paper, we present MLLMGuard, a multi-dimensional safety evaluation suite for MLLMs that includes a bilingual image-text evaluation dataset, inference utilities, and a lightweight evaluator. MLLMGuard's assessment comprehensively covers two languages (English and Chinese) and five important safety dimensions (Privacy, Bias, Toxicity, Truthfulness, and Legality), each with rich corresponding subtasks. Focusing on these dimensions, our evaluation dataset is primarily sourced from platforms such as social media, and it combines text-based and image-based red-teaming techniques with meticulous annotation by human experts. This prevents the inaccurate evaluation caused by data leakage when using open-source datasets and ensures the quality and challenging nature of our benchmark. Additionally, we develop a fully automated lightweight evaluator, termed GuardRank, which achieves significantly higher evaluation accuracy than GPT-4. Our evaluation results across 13 advanced models indicate that MLLMs still have a substantial journey ahead before they can be considered safe and responsible.
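To make the multi-dimensional setup concrete, here is a hedged sketch of an evaluation loop that buckets scores by language and safety dimension. The dataset schema, the model and evaluator interfaces, and all field names are assumptions for illustration; the actual MLLMGuard utilities and the learned GuardRank evaluator are not reproduced here.

```python
# Illustrative multi-dimensional safety evaluation loop.
# Schema and interfaces are assumptions, not the MLLMGuard API.
from collections import defaultdict
from typing import Callable, Dict, Iterable

DIMENSIONS = ["privacy", "bias", "toxicity", "truthfulness", "legality"]
LANGUAGES = ["en", "zh"]

def evaluate(samples: Iterable[dict],
             model: Callable[[str, str], str],
             evaluator: Callable[[dict, str], float]) -> Dict[str, float]:
    """Average evaluator scores per (language, dimension) bucket.

    Each sample is assumed to carry: image_path, prompt, language, dimension.
    `model(image_path, prompt)` returns the MLLM response;
    `evaluator(sample, response)` returns a safety score in [0, 1].
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for s in samples:
        response = model(s["image_path"], s["prompt"])
        key = f'{s["language"]}/{s["dimension"]}'
        totals[key] += evaluator(s, response)
        counts[key] += 1
    return {k: totals[k] / counts[k] for k in totals}
```

In this sketch the evaluator callable could wrap either a rule-based scorer or a learned ranker in the role GuardRank plays in the paper.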
The RoboDepth Challenge: Methods and Advancements Towards Robust Depth Estimation
Kong, Lingdong, Niu, Yaru, Xie, Shaoyuan, Hu, Hanjiang, Ng, Lai Xing, Cottereau, Benoit R., Zhao, Ding, Zhang, Liangjun, Wang, Hesheng, Ooi, Wei Tsang, Zhu, Ruijie, Song, Ziyang, Liu, Li, Zhang, Tianzhu, Yu, Jun, Jing, Mohan, Li, Pengwei, Qi, Xiaohua, Jin, Cheng, Chen, Yingfeng, Hou, Jie, Zhang, Jie, Kan, Zhen, Ling, Qiang, Peng, Liang, Li, Minglei, Xu, Di, Yang, Changpeng, Yao, Yuanqi, Wu, Gang, Kuai, Jian, Liu, Xianming, Jiang, Junjun, Huang, Jiamian, Li, Baojun, Chen, Jiale, Zhang, Shuang, Ao, Sun, Li, Zhenyu, Chen, Runze, Luo, Haiyong, Zhao, Fang, Yu, Jingze
Accurate depth estimation under out-of-distribution (OoD) scenarios, such as adverse weather, sensor failure, and noise contamination, is desirable for safety-critical applications. Existing depth estimation systems, however, inevitably suffer from real-world corruptions and perturbations and struggle to provide reliable depth predictions in such cases. In this paper, we summarize the winning solutions from the RoboDepth Challenge -- an academic competition designed to facilitate and advance robust OoD depth estimation. The challenge was built on the newly established KITTI-C and NYUDepth2-C benchmarks and hosted two stand-alone tracks, emphasizing robust self-supervised and robust fully-supervised depth estimation, respectively. Out of more than two hundred participants, nine unique and top-performing solutions emerged, with novel designs spanning the following aspects: spatial- and frequency-domain augmentations, masked image modeling, image restoration and super-resolution, adversarial training, diffusion-based noise suppression, vision-language pre-training, learned model ensembling, and hierarchical feature enhancement. Extensive experimental analyses and insightful observations are provided to better understand the rationale behind each design. We hope this challenge lays a solid foundation for future research on robust and reliable depth estimation and beyond. The datasets, competition toolkit, workshop recordings, and source code from the winning teams are publicly available on the challenge website.
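As a concrete instance of the frequency-domain augmentations listed among the winning designs, the sketch below mixes the FFT amplitude spectra of two images while preserving phase, which perturbs global appearance (illumination, style) without destroying scene structure. The mixing rule and parameters are assumptions for illustration, not any team's released code.

```python
# Illustrative frequency-domain augmentation for robustness training.
# Not taken from any winning solution; alpha and the mixing rule are assumptions.
import numpy as np

def amplitude_mix(img: np.ndarray, ref: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Blend the FFT amplitude of `img` with that of `ref`, keeping `img`'s phase.

    img, ref: float arrays in [0, 1] with shape (H, W, C).
    Amplitude carries global appearance; phase carries structure, so the
    augmented image keeps geometry while its appearance shifts toward `ref`.
    """
    fft_img = np.fft.fft2(img, axes=(0, 1))
    fft_ref = np.fft.fft2(ref, axes=(0, 1))
    amp = (1.0 - alpha) * np.abs(fft_img) + alpha * np.abs(fft_ref)
    mixed = amp * np.exp(1j * np.angle(fft_img))
    out = np.real(np.fft.ifft2(mixed, axes=(0, 1)))
    return np.clip(out, 0.0, 1.0)
```

Applied on the fly to training images (with depth targets left untouched), such perturbations expose a depth network to appearance shifts resembling the corruptions in KITTI-C and NYUDepth2-C.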