Wang, Kaixuan
Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter
Xu, Kechun, Xia, Xunlong, Wang, Kaixuan, Yang, Yifei, Mao, Yunxuan, Deng, Bing, Xiong, Rong, Wang, Yue
We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. Others combine foundation models in a zero-shot setting, suffering from cascading errors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. The alignment formulation enables our policy to train with less data and preserve zero-shot generalization capabilities. We show that a shared policy for both pick and place actions enhances the performance of each task, and introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, effectively generalizing to unseen objects and language instructions. Videos and code are available at the project page.
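A minimal numpy sketch of the shared "score a location against the instruction" idea: a language embedding is matched against per-pixel visual features to produce an affordance map, and the same scorer can be queried for both pick and place locations. The feature extractors, shapes, and function names are illustrative stand-ins, not the authors' architecture.

```python
# Illustrative alignment sketch (not the paper's policy): rank pixels by how
# well their visual features align with a language instruction embedding.
import numpy as np

def affordance_map(pixel_feats, text_embed):
    """Cosine similarity between each pixel feature (H, W, D) and the instruction embedding (D,)."""
    pf = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + 1e-8)
    te = text_embed / (np.linalg.norm(text_embed) + 1e-8)
    return pf @ te

def best_action(pixel_feats, text_embed):
    """Return the (row, col) of the highest-scoring pixel; reused for pick or place queries."""
    scores = affordance_map(pixel_feats, text_embed)
    return np.unravel_index(np.argmax(scores), scores.shape)

# Dummy example: a 64x64 feature map with 512-dim features and a 512-dim instruction embedding.
feats = np.random.randn(64, 64, 512)
instr = np.random.randn(512)
print(best_action(feats, instr))
```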
Artwork Protection Against Neural Style Transfer Using Locally Adaptive Adversarial Color Attack
Guo, Zhongliang, Wang, Kaixuan, Li, Weiye, Qian, Yifei, Arandjelović, Ognjen, Fang, Lei
Neural style transfer (NST) is widely adopted in computer vision to generate new images with arbitrary styles. This process leverages neural networks to merge the aesthetic elements of a style image with the structural aspects of a content image into a harmoniously integrated visual result. However, unauthorized NST can exploit artwork. Such misuse raises socio-technical concerns regarding artists' rights and motivates the development of technical approaches for the proactive protection of original creations. Adversarial attack is a concept primarily explored in machine learning security; our work introduces this technique to protect artists' intellectual property. In this paper, we present the Locally Adaptive Adversarial Color Attack (LAACA), a method for altering images in a manner imperceptible to the human eye but disruptive to NST. Specifically, we design perturbations targeting image areas rich in high-frequency content, generated by disrupting intermediate features. Our experiments and a user study confirm that attacking NST with the proposed method results in visually inferior style transfer outputs, making it an effective solution for protecting visual artwork.
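A hedged sketch of the general recipe the abstract describes: a perturbation confined to high-frequency regions and optimized to disrupt intermediate encoder features. The choice of VGG layer, the Laplacian-based mask, and all hyperparameters are assumptions, not the authors' exact LAACA design.

```python
# Minimal feature-disruption attack restricted to high-frequency image regions
# (illustrative only; inputs assumed to be (1, 3, H, W) tensors in [0, 1]).
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = vgg19(weights=VGG19_Weights.DEFAULT).features[:21].to(device).eval()  # up to relu4_1

def high_freq_mask(img, thresh=0.05):
    """Binary mask of pixels with strong Laplacian response (rich high-frequency content)."""
    gray = img.mean(dim=1, keepdim=True)
    lap_kernel = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                              device=img.device).view(1, 1, 3, 3)
    response = F.conv2d(gray, lap_kernel, padding=1).abs()
    return (response > thresh).float()

def feature_disruption_attack(content, eps=8 / 255, alpha=2 / 255, steps=30):
    mask = high_freq_mask(content)                     # perturb only textured areas
    with torch.no_grad():
        clean_feat = encoder(content)
    delta = torch.zeros_like(content, requires_grad=True)
    for _ in range(steps):
        feat = encoder(content + delta * mask)
        loss = F.mse_loss(feat, clean_feat)            # distance from the clean features
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()         # gradient ascent: push features away
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (content + delta * mask).clamp(0, 1).detach()
```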
Retrieval-augmented GPT-3.5-based Text-to-SQL Framework with Sample-aware Prompting and Dynamic Revision Chain
Guo, Chunxi, Tian, Zhiliang, Tang, Jintao, Li, Shasha, Wen, Zhihua, Wang, Kaixuan, Wang, Ting
Text-to-SQL aims at generating SQL queries for given natural language questions, thus helping users to query databases. Prompt learning with large language models (LLMs) has emerged as a recent approach, which designs prompts to lead LLMs to understand the input question and generate the corresponding SQL. However, it faces challenges with strict SQL syntax requirements. Existing work prompts the LLMs with a list of demonstration examples (i.e., question-SQL pairs) to generate SQL, but such fixed prompts can hardly handle scenarios where the semantic gap between the retrieved demonstrations and the input question is large. In this paper, we propose a retrieval-augmented prompting method for an LLM-based Text-to-SQL framework, involving sample-aware prompting and a dynamic revision chain. Our approach incorporates sample-aware demonstrations, which include the composition of SQL operators and fine-grained information related to the given question. To retrieve questions sharing similar intents with the input question, we propose two strategies for assisting retrieval. Firstly, we leverage LLMs to simplify the original questions, unifying the syntax and thereby clarifying the users' intentions. To generate executable and accurate SQL without human intervention, we design a dynamic revision chain which iteratively adapts fine-grained feedback from the previously generated SQL. Experimental results on three Text-to-SQL benchmarks demonstrate the superiority of our method over strong baseline models.
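A compact sketch of the retrieve-then-revise loop described above, assuming a generic `llm(prompt) -> str` callable and an SQLite database for execution feedback; the naive similarity-based retrieval stub and the prompt templates are placeholders, not the authors' implementation.

```python
# Illustrative retrieval-augmented Text-to-SQL loop with a dynamic revision chain.
import sqlite3
from difflib import SequenceMatcher

def retrieve_demonstrations(question, pool, k=3):
    """Pick the k question-SQL pairs from `pool` most similar to the input question."""
    scored = sorted(pool,
                    key=lambda ex: SequenceMatcher(None, question, ex["question"]).ratio(),
                    reverse=True)
    return scored[:k]

def generate_sql(llm, question, schema, pool, db_path, max_revisions=3):
    demos = retrieve_demonstrations(question, pool)
    demo_text = "\n\n".join(f"Q: {d['question']}\nSQL: {d['sql']}" for d in demos)
    prompt = f"{schema}\n\n{demo_text}\n\nQ: {question}\nSQL:"
    sql = llm(prompt)
    # Dynamic revision chain: execute the candidate SQL and feed errors back to the LLM.
    for _ in range(max_revisions):
        try:
            with sqlite3.connect(db_path) as conn:
                conn.execute(sql)
            return sql                              # executable -> accept
        except sqlite3.Error as err:
            prompt = (f"{schema}\n\nQ: {question}\nPrevious SQL: {sql}\n"
                      f"Execution error: {err}\nRevised SQL:")
            sql = llm(prompt)
    return sql
```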
Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image
Yin, Wei, Zhang, Chi, Chen, Hao, Cai, Zhipeng, Yu, Gang, Wang, Kaixuan, Chen, Xiaozhi, Shen, Chunhua
Reconstructing accurate 3D scenes from images is a long-standing vision task. Due to the ill-posedness of the single-image reconstruction problem, most well-established methods are built upon multi-view geometry. State-of-the-art (SOTA) monocular metric depth estimation methods can only handle a single camera model and are unable to perform mixed-data training due to the metric ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. In this work, we show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models. Equipped with our module, monocular models can be stably trained on over 8 million images spanning thousands of camera models, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Experiments demonstrate the SOTA performance of our method on 7 zero-shot benchmarks. Notably, our method won the championship in the 2nd Monocular Depth Estimation Challenge. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale drift issues of monocular SLAM, leading to high-quality metric-scale dense mapping. The code is available at https://github.com/YvanYin/Metric3D.
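A hedged sketch of what a canonical-camera-space transform can look like: depths are expressed with respect to a fixed canonical focal length during training, and rescaled with the real focal length of the test camera at inference. The canonical value and function names are illustrative assumptions, not the paper's exact module.

```python
# Illustrative focal-length disambiguation for metric depth across camera models.
import numpy as np

F_CANONICAL = 1000.0  # assumed canonical focal length in pixels

def to_canonical(depth_m, focal_px, f_canonical=F_CANONICAL):
    """Map metric depth labels into the canonical camera space for mixed-data training."""
    return depth_m * f_canonical / focal_px

def from_canonical(depth_canonical, focal_px, f_canonical=F_CANONICAL):
    """Recover metric depth for the actual camera from a canonical-space prediction."""
    return depth_canonical * focal_px / f_canonical

# Example: a canonical-space prediction of 5.0 from a camera with f = 1400 px.
print(from_canonical(np.array([5.0]), focal_px=1400.0))  # -> [7.0]
```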
The Second Monocular Depth Estimation Challenge
Spencer, Jaime, Qian, C. Stella, Trescakova, Michaela, Russell, Chris, Hadfield, Simon, Graf, Erich W., Adams, Wendy J., Schofield, Andrew J., Elder, James, Bowden, Richard, Anwar, Ali, Chen, Hao, Chen, Xiaozhi, Cheng, Kai, Dai, Yuchao, Hoa, Huynh Thai, Hossain, Sadat, Huang, Jianmian, Jing, Mohan, Li, Bo, Li, Chao, Li, Baojun, Liu, Zhiwen, Mattoccia, Stefano, Mercelis, Siegfried, Nam, Myungwoo, Poggi, Matteo, Qi, Xiaohua, Ren, Jiahui, Tang, Yang, Tosi, Fabio, Trinh, Linh, Uddin, S. M. Nadim, Umair, Khan Muhammad, Wang, Kaixuan, Wang, Yufei, Wang, Yixing, Xiang, Mochu, Xu, Guangkai, Yin, Wei, Yu, Jun, Zhang, Qi, Zhao, Chaoqiang
This paper discusses the results for the second edition of the Monocular Depth Estimation Challenge (MDEC). This edition was open to methods using any form of supervision, including fully-supervised, self-supervised, multi-task or proxy depth. The challenge was based around the SYNS-Patches dataset, which features a wide diversity of environments with high-quality dense ground-truth. This includes complex natural environments, e.g. forests or fields, which are greatly underrepresented in current benchmarks. The challenge received eight unique submissions that outperformed the provided SotA baseline on either the pointcloud- or image-based metrics. The top supervised submission improved relative F-Score by 27.62%, while the top self-supervised submission improved it by 16.61%. Supervised submissions generally leveraged large collections of datasets to improve data diversity. Self-supervised submissions instead updated the network architecture and pretrained backbones. These results represent significant progress in the field, while highlighting avenues for future research, such as reducing interpolation artifacts at depth boundaries, improving self-supervised indoor performance, and improving overall natural image accuracy.
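A minimal sketch of a point-cloud F-Score of the kind used to rank submissions: the harmonic mean of precision (predicted points near ground truth) and recall (ground-truth points near predictions) at a distance threshold. The threshold and implementation details are assumptions, not the official challenge evaluation code.

```python
# Illustrative point-cloud F-Score; pred_pts and gt_pts are (N, 3) arrays in metres.
import numpy as np
from scipy.spatial import cKDTree

def f_score(pred_pts, gt_pts, threshold=0.1):
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # nearest GT point for each prediction
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # nearest prediction for each GT point
    precision = np.mean(d_pred_to_gt < threshold)
    recall = np.mean(d_gt_to_pred < threshold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```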
An Efficient B-spline-Based Kinodynamic Replanning Framework for Quadrotors
Ding, Wenchao, Gao, Wenliang, Wang, Kaixuan, Shen, Shaojie
Trajectory replanning for quadrotors is essential to enable fully autonomous flight in unknown environments. Hierarchical motion planning frameworks, which combine path planning with path parameterization, are popular due to their time efficiency. However, the path planning cannot properly deal with non-static initial states of the quadrotor, which may result in non-smooth or even dynamically infeasible trajectories. In this paper, we present an efficient kinodynamic replanning framework by exploiting the advantageous properties of the B-spline, which facilitates dealing with the non-static state and guarantees safety and dynamical feasibility. Our framework starts with an efficient B-spline-based kinodynamic (EBK) search algorithm which finds a feasible trajectory with minimum control effort and time. To compensate for the discretization induced by the EBK search, an elastic optimization (EO) approach is proposed to refine the control point placement to the optimal location. Systematic comparisons against the state-of-the-art are conducted to validate the performance. Comprehensive onboard experiments using two different vision-based quadrotors are carried out showing the general applicability of the framework.
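A short sketch of one property that makes uniform B-splines convenient for this kind of replanning: the derivative of a uniform B-spline is again a B-spline whose control points are finite differences of the original ones, so bounding those differences conservatively bounds velocity and acceleration everywhere via the convex-hull property. The numbers below are illustrative and are not the paper's EBK/EO implementation.

```python
# Conservative dynamical-feasibility check for a uniform B-spline trajectory.
import numpy as np

def derivative_control_points(ctrl_pts, dt):
    """Control points of the derivative spline (knot spacing dt, degree drops by one)."""
    return np.diff(ctrl_pts, axis=0) / dt

def dynamically_feasible(ctrl_pts, dt, v_max, a_max):
    """True if velocity/acceleration control points stay within the given limits."""
    v_pts = derivative_control_points(ctrl_pts, dt)
    a_pts = derivative_control_points(v_pts, dt)
    return bool((np.abs(v_pts) <= v_max).all() and (np.abs(a_pts) <= a_max).all())

# Example with 2D control points and a 0.5 s knot spacing.
ctrl = np.array([[0.0, 0.0], [0.5, 0.2], [1.2, 0.5], [2.0, 0.6], [2.5, 0.4]])
print(dynamically_feasible(ctrl, dt=0.5, v_max=2.0, a_max=3.0))  # -> True
```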