Cheng, Kai
EfficientEQA: An Efficient Approach for Open Vocabulary Embodied Question Answering
Cheng, Kai, Li, Zhengyuan, Sun, Xingpeng, Min, Byung-Cheol, Bedi, Amrit Singh, Bera, Aniket
Embodied Question Answering (EQA) is an essential yet challenging task for robotic home assistants. Recent studies have shown that large vision-language models (VLMs) can be effectively utilized for EQA, but existing works either focus on video-based question answering without embodied exploration or rely on closed-form choice sets. In real-world scenarios, a robotic agent must efficiently explore and accurately answer questions in open-vocabulary settings. To address these challenges, we propose a novel framework called EfficientEQA for open-vocabulary EQA, which enables efficient exploration and accurate answering. In EfficientEQA, the robot actively explores unknown environments using Semantic-Value-Weighted Frontier Exploration, a strategy that prioritizes exploration based on semantic importance provided by calibrated confidence from black-box VLMs to quickly gather relevant information. To generate accurate answers, we employ Retrieval-Augmented Generation (RAG), which utilizes BLIP to retrieve useful images from accumulated observations and VLM reasoning to produce responses without relying on predefined answer choices. Additionally, we detect observations that are highly relevant to the question as outliers, allowing the robot to determine when it has sufficient information to stop exploring and provide an answer. Experimental results demonstrate the effectiveness of our approach, showing an improvement in answering accuracy by over 15% and efficiency, measured in running steps, by over 20% compared to state-of-the-art methods.
Sliced gradient-enhanced Kriging for high-dimensional function approximation
Cheng, Kai, Zimmermann, Ralf
Gradient-enhanced Kriging (GE-Kriging) is a well-established surrogate modelling technique for approximating expensive computational models. However, it tends to get impractical for high-dimensional problems due to the size of the inherent correlation matrix and the associated high-dimensional hyper-parameter tuning problem. To address these issues, a new method, called sliced GE-Kriging (SGE-Kriging), is developed in this paper for reducing both the size of the correlation matrix and the number of hyper-parameters. We first split the training sample set into multiple slices, and invoke Bayes' theorem to approximate the full likelihood function via a sliced likelihood function, in which multiple small correlation matrices are utilized to describe the correlation of the sample set rather than one large one. Then, we replace the original high-dimensional hyper-parameter tuning problem with a low-dimensional counterpart by learning the relationship between the hyper-parameters and the derivative-based global sensitivity indices. The performance of SGE-Kriging is finally validated by means of numerical experiments with several benchmarks and a high-dimensional aerodynamic modeling problem. The results show that the SGE-Kriging model features an accuracy and robustness that is comparable to the standard one but comes at much less training costs. The benefits are most evident for high-dimensional problems with tens of variables.
Learning Environment-Aware Affordance for 3D Articulated Object Manipulation under Occlusions
Cheng, Kai, Wu, Ruihai, Shen, Yan, Ning, Chuanruo, Zhan, Guanqi, Dong, Hao
Perceiving and manipulating 3D articulated objects in diverse environments is essential for home-assistant robots. Recent studies have shown that point-level affordance provides actionable priors for downstream manipulation tasks. However, existing works primarily focus on single-object scenarios with homogeneous agents, overlooking the realistic constraints imposed by the environment and the agent's morphology, e.g., occlusions and physical limitations. In this paper, we propose an environment-aware affordance framework that incorporates both object-level actionable priors and environment constraints. Unlike object-centric affordance approaches, learning environment-aware affordance faces the challenge of combinatorial explosion due to the complexity of various occlusions, characterized by their quantities, geometries, positions and poses. To address this and enhance data efficiency, we introduce a novel contrastive affordance learning framework capable of training on scenes containing a single occluder and generalizing to scenes with complex occluder combinations. Experiments demonstrate the effectiveness of our proposed approach in learning affordance considering environment constraints.
The Second Monocular Depth Estimation Challenge
Spencer, Jaime, Qian, C. Stella, Trescakova, Michaela, Russell, Chris, Hadfield, Simon, Graf, Erich W., Adams, Wendy J., Schofield, Andrew J., Elder, James, Bowden, Richard, Anwar, Ali, Chen, Hao, Chen, Xiaozhi, Cheng, Kai, Dai, Yuchao, Hoa, Huynh Thai, Hossain, Sadat, Huang, Jianmian, Jing, Mohan, Li, Bo, Li, Chao, Li, Baojun, Liu, Zhiwen, Mattoccia, Stefano, Mercelis, Siegfried, Nam, Myungwoo, Poggi, Matteo, Qi, Xiaohua, Ren, Jiahui, Tang, Yang, Tosi, Fabio, Trinh, Linh, Uddin, S. M. Nadim, Umair, Khan Muhammad, Wang, Kaixuan, Wang, Yufei, Wang, Yixing, Xiang, Mochu, Xu, Guangkai, Yin, Wei, Yu, Jun, Zhang, Qi, Zhao, Chaoqiang
This paper discusses the results for the second edition of the Monocular Depth Estimation Challenge (MDEC). This edition was open to methods using any form of supervision, including fully-supervised, self-supervised, multi-task or proxy depth. The challenge was based around the SYNS-Patches dataset, which features a wide diversity of environments with high-quality dense ground-truth. This includes complex natural environments, e.g. forests or fields, which are greatly underrepresented in current benchmarks. The challenge received eight unique submissions that outperformed the provided SotA baseline on any of the pointcloud- or image-based metrics. The top supervised submission improved relative F-Score by 27.62%, while the top self-supervised improved it by 16.61%. Supervised submissions generally leveraged large collections of datasets to improve data diversity. Self-supervised submissions instead updated the network architecture and pretrained backbones. These results represent a significant progress in the field, while highlighting avenues for future research, such as reducing interpolation artifacts at depth boundaries, improving self-supervised indoor performance and overall natural image accuracy.