Qiu, Weichao
Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation
Zhang, Lingfeng, Liu, Yuecheng, Zhang, Zhanguang, Aghaei, Matin, Hu, Yaochen, Gu, Hongjian, Alomrani, Mohammad Ali, Bravo, David Gamaliel Arcos, Karimi, Raika, Hamidizadeh, Atia, Xu, Haoping, Huang, Guowei, Zhang, Zhanpeng, Cao, Tongtong, Qiu, Weichao, Quan, Xingyue, Hao, Jianye, Zhuang, Yuzheng, Zhang, Yingxue
Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have made them powerful tools in embodied navigation, enabling agents to leverage commonsense and spatial reasoning for efficient exploration in unfamiliar environments. Existing LLM-based approaches convert global memory, such as semantic or topological maps, into language descriptions to guide navigation. While this improves efficiency and reduces redundant exploration, the loss of geometric information in language-based representations hinders spatial reasoning, especially in intricate environments. To address this, VLM-based approaches directly process ego-centric visual inputs to select optimal directions for exploration. However, relying solely on a first-person perspective makes navigation a partially observed decision-making problem, leading to suboptimal decisions in complex environments. In this paper, we present a novel VLM-based navigation framework that addresses these challenges by adaptively retrieving task-relevant cues from a global memory module and integrating them with the agent's egocentric observations. By dynamically aligning global contextual information with local perception, our approach enhances spatial reasoning and decision-making in long-horizon tasks. Experimental results demonstrate that the proposed method surpasses previous state-of-the-art approaches in object navigation tasks, providing a more effective and scalable solution for embodied navigation.
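As a rough illustration of the idea above, the sketch below shows one way a global memory could be queried for task-relevant cues and fused into an egocentric VLM prompt. It is a minimal sketch rather than the paper's implementation: MemoryEntry, retrieve_cues, and build_prompt are hypothetical names, and the embedding model and VLM are left as placeholders.

# Minimal sketch (not the paper's implementation) of retrieving task-relevant
# cues from a global memory and fusing them with an egocentric observation.
# All names here are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class MemoryEntry:
    position: tuple          # (x, y) cell on the global map
    description: str         # e.g. "sofa near a doorway"
    embedding: np.ndarray    # embedding of the stored cue

def retrieve_cues(memory, goal_embedding, k=3):
    """Return the k memory entries most similar to the task/goal embedding."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return sorted(memory, key=lambda m: cosine(m.embedding, goal_embedding),
                  reverse=True)[:k]

def build_prompt(goal, ego_caption, cues):
    """Fuse retrieved global cues with the agent's current egocentric view."""
    cue_text = "\n".join(f"- {c.description} at {c.position}" for c in cues)
    return (f"Task: navigate to the {goal}.\n"
            f"Current egocentric view: {ego_caption}\n"
            f"Relevant landmarks from the global map:\n{cue_text}\n"
            "Which direction should the agent move next?")

The resulting prompt would then be passed, together with the egocentric image, to whichever VLM drives the agent's next action.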
SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning
Liu, Yuecheng, Chi, Dafeng, Wu, Shiguang, Zhang, Zhanguang, Hu, Yaochen, Zhang, Lingfeng, Zhang, Yingxue, Wu, Shuang, Cao, Tongtong, Huang, Guowei, Huang, Helong, Tian, Guangjian, Qiu, Weichao, Quan, Xingyue, Hao, Jianye, Zhuang, Yuzheng
Spatial reasoning is an essential problem in embodied AI research. Efforts to enhance spatial reasoning abilities through supplementary spatial data and fine-tuning have proven to be of limited effectiveness for complex embodied tasks, largely due to their dependence on language-based outputs. While some approaches have introduced a point-based action space to mitigate this issue, they fall short in managing more intricate tasks within complex environments. This deficiency arises from their failure to fully exploit the inherent thinking and reasoning capabilities that are fundamental strengths of Vision-Language Models (VLMs). To address these limitations, we propose a novel approach named SpatialCoT, specifically designed to bolster the spatial reasoning capabilities of VLMs. Our approach comprises two stages: spatial coordinate bi-directional alignment, which aligns vision-language inputs with spatial coordinates, and chain-of-thought spatial grounding, which harnesses the reasoning capabilities of language models for advanced spatial reasoning. We evaluate SpatialCoT on challenging navigation and manipulation tasks, both in simulation and real-world settings. Experimental results demonstrate that our method significantly outperforms previous state-of-the-art approaches in both tasks.
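As a loose illustration of the second stage, the snippet below pairs a chain-of-thought prompt with a point-based answer format and parses the grounded coordinate back out of the model's reasoning text. The prompt wording and the parse_point helper are assumptions made for illustration, not the SpatialCoT implementation.

# Hedged sketch: chain-of-thought reasoning followed by a point-based answer.
# The prompt format and parsing convention are assumptions, not the paper's.
import re

def make_grounding_prompt(instruction, scene_caption):
    return (f"Scene: {scene_caption}\n"
            f"Instruction: {instruction}\n"
            "First reason step by step about the spatial layout, then answer "
            "with a target point in the form Point: (x, y), using normalized "
            "image coordinates.")

def parse_point(vlm_output):
    """Extract the final grounded coordinate from the model's reasoning text."""
    match = re.search(r"Point:\s*\(\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\)", vlm_output)
    if match is None:
        raise ValueError("no grounded point found in the VLM output")
    return float(match.group(1)), float(match.group(2))

For example, parse_point("... the mug is left of the sink. Point: (0.32, 0.71)") returns (0.32, 0.71), which a downstream controller could convert into a navigation or manipulation action.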
The RoboDrive Challenge: Drive Anytime Anywhere in Any Condition
Kong, Lingdong, Xie, Shaoyuan, Hu, Hanjiang, Niu, Yaru, Ooi, Wei Tsang, Cottereau, Benoit R., Ng, Lai Xing, Ma, Yuexin, Zhang, Wenwei, Pan, Liang, Chen, Kai, Liu, Ziwei, Qiu, Weichao, Zhang, Wei, Cao, Xu, Lu, Hao, Chen, Ying-Cong, Kang, Caixin, Zhou, Xinning, Ying, Chengyang, Shang, Wentao, Wei, Xingxing, Dong, Yinpeng, Yang, Bo, Jiang, Shengyin, Ma, Zeliang, Ji, Dengyi, Li, Haiwen, Huang, Xingliang, Tian, Yu, Kou, Genghua, Jia, Fan, Liu, Yingfei, Wang, Tiancai, Li, Ying, Hao, Xiaoshuai, Yang, Yifan, Zhang, Hui, Wei, Mengchuan, Zhou, Yi, Zhao, Haimei, Zhang, Jing, Li, Jinke, He, Xiao, Cheng, Xiaoqiang, Zhang, Bingyang, Zhao, Lirong, Ding, Dianlei, Liu, Fangsheng, Yan, Yixiang, Wang, Hongming, Ye, Nanfei, Luo, Lun, Tian, Yubo, Zuo, Yiwei, Cao, Zhe, Ren, Yi, Li, Yunfan, Liu, Wenjie, Wu, Xun, Mao, Yifan, Li, Ming, Liu, Jian, Liu, Jiayang, Qin, Zihan, Chu, Cunxi, Xu, Jialei, Zhao, Wenbo, Jiang, Junjun, Liu, Xianming, Wang, Ziyan, Li, Chiwei, Li, Shilong, Yuan, Chendong, Yang, Songyue, Liu, Wentao, Chen, Peng, Zhou, Bin, Wang, Yubo, Zhang, Chi, Sun, Jianhang, Chen, Hai, Yang, Xiao, Wang, Lizhong, Fu, Dongyi, Lin, Yongchun, Yang, Huitong, Li, Haoang, Luo, Yadan, Cheng, Xianjing, Xu, Yong
In the realm of autonomous driving, robust perception under out-of-distribution conditions is paramount for the safe deployment of vehicles. Challenges such as adverse weather, sensor malfunctions, and environmental unpredictability can severely impact the performance of autonomous systems. The 2024 RoboDrive Challenge was crafted to propel the development of driving perception technologies that can withstand and adapt to these real-world variabilities. Focusing on four pivotal tasks -- BEV detection, map segmentation, semantic occupancy prediction, and multi-view depth estimation -- the competition challenged participants to innovate and to enhance system resilience against both typical and atypical disturbances. This year's challenge consisted of five distinct tracks and attracted 140 registered teams from 93 institutes across 11 countries, resulting in nearly one thousand submissions evaluated on our servers. The competition culminated in 15 top-performing solutions, which introduced a range of innovative approaches including advanced data augmentation, multi-sensor fusion, self-supervised learning for error correction, and new algorithmic strategies to enhance sensor robustness. These contributions significantly advanced the state of the art, particularly in handling sensor inconsistencies and environmental variability. Through collaborative efforts, participants pushed the boundaries of current technologies and showcased their potential in real-world scenarios. Extensive evaluations and analyses provided insights into the effectiveness of these solutions, highlighting key trends and successful strategies for improving the resilience of driving perception systems. This challenge has set a new benchmark, providing a rich repository of techniques expected to guide future research in the field.
NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds
Yang, Chen, Li, Peihao, Zhou, Zanwei, Yuan, Shanxin, Liu, Bingbing, Yang, Xiaokang, Qiu, Weichao, Shen, Wei
We present NeRFVS, a novel neural radiance fields (NeRF) based method that enables free navigation in a room. NeRF achieves impressive performance when rendering novel views similar to the input views, but degrades for novel views that differ significantly from the training views. To address this issue, we utilize holistic priors from neural reconstruction, including pseudo depth maps and view coverage information, to guide the learning of implicit neural representations of 3D indoor scenes. Concretely, an off-the-shelf neural reconstruction method is leveraged to generate a geometry scaffold. Then, two loss functions based on these holistic priors are proposed to improve the learning of NeRF: 1) a robust depth loss that tolerates errors in the pseudo depth map to guide the geometry learning of NeRF; 2) a variance loss that regularizes the variance of the implicit neural representation to reduce geometry and color ambiguity during learning. These two loss functions are modulated during NeRF optimization according to the view coverage information to reduce the negative influence of view coverage imbalance. Extensive results demonstrate that NeRFVS outperforms state-of-the-art view synthesis methods quantitatively and qualitatively on indoor scenes, achieving high-fidelity free navigation results.
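To make the two priors more concrete, here is a rough PyTorch sketch of a robust (Huber-style) depth term and a ray-weight variance term, scaled by a per-view coverage weight. This is an assumption-laden illustration rather than the paper's exact losses; the function names and the coverage weighting scheme are hypothetical.

# Rough sketch of the two priors described above; not the paper's exact losses.
import torch

def robust_depth_loss(pred_depth, pseudo_depth, delta=0.2):
    """Huber-style penalty so large errors in the pseudo depth map are tolerated."""
    err = torch.abs(pred_depth - pseudo_depth)
    quadratic = 0.5 * err ** 2 / delta
    linear = err - 0.5 * delta
    return torch.where(err <= delta, quadratic, linear).mean()

def variance_loss(weights, t_vals, pred_depth):
    """Penalize the spread of the rendering weights along each ray."""
    # weights, t_vals: (rays, samples); pred_depth: (rays,)
    var = (weights * (t_vals - pred_depth.unsqueeze(-1)) ** 2).sum(dim=-1)
    return var.mean()

def prior_loss(pred_depth, pseudo_depth, weights, t_vals, coverage):
    # coverage in [0, 1]; one plausible choice is to rely on the priors more
    # heavily where view coverage is low.
    w = 1.0 - coverage
    return w * (robust_depth_loss(pred_depth, pseudo_depth)
                + variance_loss(weights, t_vals, pred_depth))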
Simulated Adversarial Testing of Face Recognition Models
Ruiz, Nataniel, Kortylewski, Adam, Qiu, Weichao, Xie, Cihang, Bargal, Sarah Adel, Yuille, Alan, Sclaroff, Stan
Most machine learning models are validated and tested on fixed datasets. This can give an incomplete picture of the capabilities and weaknesses of the model, weaknesses that may only be revealed at test time in the real world. The risks involved in such failures range from loss of profits and loss of time to loss of life in certain critical applications. To alleviate this issue, simulators can be controlled in a fine-grained manner using interpretable parameters to explore the semantic image manifold. In this work, we propose a framework for learning how to test machine learning algorithms using simulators in an adversarial manner, in order to find weaknesses in a model before deploying it in critical scenarios. We apply this framework in a face recognition scenario. We are the first to show that weaknesses of models trained on real data can be discovered using simulated samples. Using our proposed method, we find adversarial synthetic faces that fool contemporary face recognition models, demonstrating that these models have weaknesses that are not measured by commonly used validation datasets. We hypothesize that such adversarial examples are not isolated, but usually lie in connected components of the simulator's latent space. We present a method to find these adversarial regions, as opposed to the isolated adversarial points typically found in the adversarial example literature.
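The sketch below illustrates, under stated assumptions rather than as the authors' method, how a simulator's interpretable parameters (pose, lighting, expression, and so on) could be searched for failure cases. render_face and recognizer are placeholder callables standing in for the simulator and the face recognition model, and the simple hill-climbing loop stands in for the learned search procedure described in the paper.

# Illustrative sketch only: hill-climbing over simulator parameters to find
# renders that a trained recognizer scores poorly. All names are placeholders.
import numpy as np

def adversarial_search(render_face, recognizer, identity, init_params,
                       steps=200, sigma=0.05, seed=0):
    """Perturb simulator parameters to minimize the recognizer's match score."""
    rng = np.random.default_rng(seed)
    params = np.asarray(init_params, dtype=float)
    best_score = recognizer(render_face(params), identity)
    for _ in range(steps):
        candidate = params + rng.normal(scale=sigma, size=params.shape)
        score = recognizer(render_face(candidate), identity)
        if score < best_score:          # lower score = closer to a failure case
            params, best_score = candidate, score
    return params, best_score

A parameter vector returned with a low score is a candidate seed for one of the connected adversarial regions hypothesized above.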