Collaborating Authors

 Chen, Yuxing


RoboHanger: Learning Generalizable Robotic Hanger Insertion for Diverse Garments

arXiv.org Artificial Intelligence

For the task of hanging clothes, learning how to insert a hanger into a garment is crucial but has seldom been explored in robotics. In this work, we address the problem of inserting a hanger into various unseen garments that are initially laid out flat on a table. This task is challenging due to its long-horizon nature, the high degrees of freedom of the garments, and the lack of data. To simplify the learning process, we first propose breaking the task into several stages. Then, we formulate each stage as a policy learning problem and propose a low-dimensional action parameterization. To overcome the challenge of limited data, we build our own simulator and create 144 synthetic clothing assets to effectively collect high-quality training data. Our approach uses single-view depth images and object masks as input, which mitigates the Sim2Real appearance gap and achieves high generalization capability for new garments. Extensive experiments in both simulation and the real world validate our proposed method. By training on various garments in the simulator, our method achieves a 75% success rate on 8 different unseen garments in the real world.
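
To make the stage decomposition and low-dimensional action parameterization concrete, here is a minimal sketch of what such an interface could look like. The stage names, the (u, v, dz) parameterization, and the centroid-based placeholder policy are our own illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a staged policy with a low-dimensional action space.
from dataclasses import dataclass
import numpy as np

STAGES = ["press_garment", "insert_left", "insert_right"]  # illustrative stage split

@dataclass
class LowDimAction:
    """Low-dimensional action: a 2D keypoint in image space plus a depth
    offset, instead of a full end-effector trajectory."""
    u: float   # pixel column of the action point
    v: float   # pixel row of the action point
    dz: float  # insertion depth offset (meters)

def stage_policy(depth: np.ndarray, mask: np.ndarray, stage: str) -> LowDimAction:
    """Placeholder per-stage policy: picks the garment-mask centroid; a
    learned model would regress these parameters from depth + mask."""
    vs, us = np.nonzero(mask)
    return LowDimAction(u=float(us.mean()), v=float(vs.mean()), dz=0.02)

depth = np.random.rand(240, 320).astype(np.float32)        # single-view depth image
mask = np.zeros((240, 320), dtype=bool)
mask[80:160, 100:220] = True                               # object mask
for stage in STAGES:
    print(stage, stage_policy(depth, mask, stage))
```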


An LLM Agent for Automatic Geospatial Data Analysis

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly used for data science code generation, but they often struggle with complex sequential tasks, leading to logical errors. Their application to geospatial data processing is particularly challenging due to difficulties in incorporating complex data structures and spatial constraints, effectively utilizing diverse function calls, and a tendency to hallucinate rarely used geospatial libraries. To tackle these problems, we introduce GeoAgent, a new interactive framework designed to help LLMs handle geospatial data processing more effectively. GeoAgent pioneers the integration of a code interpreter, static analysis, and Retrieval-Augmented Generation (RAG) within a Monte Carlo Tree Search (MCTS) algorithm, offering a novel approach to geospatial data processing. In addition, we contribute a new benchmark specifically designed to evaluate LLM-based approaches on geospatial tasks. This benchmark leverages a variety of Python libraries and includes both single-turn and multi-turn tasks such as data acquisition, data analysis, and visualization. By offering a comprehensive evaluation across diverse geospatial contexts, this benchmark sets a new standard for developing LLM-based approaches to geospatial data analysis. Our findings suggest that relying solely on the knowledge encoded in an LLM is insufficient for accurate geospatial task programming, which requires coherent multi-step reasoning and multiple function calls. Compared to baseline LLMs, the proposed GeoAgent demonstrates superior performance, yielding notable improvements in function calling and task completion. These results also offer valuable insights for the future development of LLM agents for automatic geospatial data analysis.
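
The core loop the abstract describes, executing candidate programs in an interpreter and using the outcome as a reward for tree search, can be sketched in a few lines. This is a heavily simplified stand-in: candidate generation is stubbed (a real system would query an LLM), and full MCTS over a tree is collapsed to one UCB-style selection step.

```python
# Simplified interpreter-in-the-loop search (not GeoAgent's actual code).
import math

def run_candidate(code: str) -> float:
    """Execute a candidate program; reward 1.0 if it runs without raising,
    else 0.0. A real reward would also check task-specific outputs."""
    try:
        exec(code, {})
        return 1.0
    except Exception:
        return 0.0

def ucb(mean: float, n: int, total: int, c: float = 1.4) -> float:
    return mean + c * math.sqrt(math.log(total + 1) / (n + 1e-9))

candidates = {                         # stubbed "LLM" proposals: [visits, value sum]
    "import math\nprint(math.pi)": [0, 0.0],
    "print(undefined_name)": [0, 0.0],
}
for t in range(20):                    # bandit selection, the core step of MCTS
    code = max(candidates, key=lambda c: ucb(
        candidates[c][1] / (candidates[c][0] or 1), candidates[c][0], t))
    candidates[code][0] += 1
    candidates[code][1] += run_candidate(code)

best = max(candidates, key=lambda c: candidates[c][1])
print("best candidate:\n", best)
```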


ScissorBot: Learning Generalizable Scissor Skill for Paper Cutting via Simulation, Imitation, and Sim2Real

arXiv.org Artificial Intelligence

This paper tackles the challenging robotic task of generalizable paper cutting with scissors. In this task, scissors attached to a robot arm are driven to accurately cut curves drawn on a sheet of paper that hangs with its top edge fixed. Due to frequent paper-scissors contact and the resulting fracture, the paper undergoes continual deformation and topology changes, which makes accurate modeling difficult. To ensure effective execution, we customize an action primitive sequence for imitation learning that constrains the action space, thus alleviating potential compounding errors. Finally, by integrating sim-to-real techniques to bridge the gap between simulation and reality, our policy can be effectively deployed on a real robot. Experimental results demonstrate that our method surpasses all baselines in both simulation and real-world benchmarks and achieves performance comparable to single-handed human operation under the same conditions.
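
A brief sketch of what "constraining the action space to a primitive sequence" could mean in code: the policy outputs only the parameters of a fixed cycle of primitives rather than arbitrary 6-DoF motions. The primitive names and parameters below are our own guesses, not the paper's exact definitions.

```python
# Hypothetical action-primitive parameterization for cutting.
from dataclasses import dataclass
from typing import List

@dataclass
class Primitive:
    name: str            # e.g. "rotate", "translate", "close_blades"
    params: List[float]  # low-dimensional parameters of the primitive

def rollout_step(policy_outputs: List[List[float]]) -> List[Primitive]:
    """Map raw per-step policy outputs onto a fixed primitive cycle, so the
    policy chooses parameters only, not free-form end-effector motion."""
    sequence = ["rotate", "translate", "close_blades"]  # fixed cutting cycle
    return [Primitive(n, p) for n, p in zip(sequence, policy_outputs)]

print(rollout_step([[0.1], [0.0, 0.02], [1.0]]))
```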


Combining Optimal Transport and Embedding-Based Approaches for More Expressiveness in Unsupervised Graph Alignment

arXiv.org Artificial Intelligence

Unsupervised graph alignment finds the one-to-one node correspondence between a pair of attributed graphs by exploiting only graph structure and node features. One category of existing works first computes node representations and then matches nodes with close embeddings, which is intuitive but lacks a clear objective tailored for graph alignment in the unsupervised setting. The other category reduces the problem to optimal transport (OT) via Gromov-Wasserstein (GW) learning, with a well-defined objective but considerable room left for exploring the design of the transport cost. We propose a principled approach that combines their advantages, motivated by a theoretical analysis of model expressiveness. Noting the limited discriminative power in separating matched and unmatched node pairs, we improve the cost design of GW learning with feature transformation, which enables feature interaction across dimensions. Besides, we propose a simple yet effective embedding-based heuristic inspired by the Weisfeiler-Lehman test and add its prior knowledge to OT for more expressiveness when handling non-Euclidean data. Moreover, we are the first to guarantee the one-to-one matching constraint by reducing the problem to maximum weight matching. The algorithm design effectively combines our OT- and embedding-based predictions via stacking, an ensemble learning strategy. We propose a model framework named CombAlign that integrates all the above modules to refine node alignment progressively. Through extensive experiments, we demonstrate significant improvements in alignment accuracy compared to state-of-the-art approaches and validate the effectiveness of the proposed modules.
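
The reduction to maximum weight matching has a standard, exact solver: the assignment problem. As an illustration (our sketch, not the paper's implementation), a soft OT transport plan can be hardened into a one-to-one alignment with scipy's linear_sum_assignment:

```python
# Turn a soft transport plan into a hard one-to-one node alignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
P = rng.random((5, 5))     # soft OT plan / matching scores, P[i, j]
P /= P.sum()               # normalize like a transport plan

# Maximum-weight matching: exact solution of the assignment problem.
rows, cols = linear_sum_assignment(P, maximize=True)
alignment = dict(zip(rows.tolist(), cols.tolist()))  # one-to-one node map
print(alignment)
```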


Task-Oriented Dexterous Grasp Synthesis via Differentiable Grasp Wrench Boundary Estimator

arXiv.org Artificial Intelligence

Analytical dexterous grasp synthesis is often driven by grasp quality metrics. However, existing metrics suffer from many problems: they are computationally expensive, physically inaccurate, or non-differentiable. Moreover, none of them can facilitate the synthesis of non-force-closure grasps, which account for a significant portion of task-oriented grasping such as lid screwing and button pushing. The main challenge behind all these drawbacks is the difficulty of modeling the complex Grasp Wrench Space (GWS). In this work, we overcome this challenge by proposing a novel GWS estimator, enabling gradient-based task-oriented dexterous grasp synthesis for the first time. Our key contribution is a fast, accurate, and differentiable technique that estimates the GWS boundary with good physical interpretability by parallel sampling and mapping, without requiring iterative optimization. Second, based on our differentiable GWS estimator, we derive a task-oriented energy function to enable gradient-based grasp synthesis and a metric to evaluate non-force-closure grasps. Finally, we improve the previous dexterous grasp synthesis pipeline, mainly through a novel technique that makes nearest-point calculation differentiable, even on mesh edges and vertices. Extensive experiments verify the efficiency and effectiveness of our methods. Our GWS estimator runs in a few milliseconds on GPUs with minimal memory cost, more than three orders of magnitude faster than the classic discretization-based method. Using this GWS estimator, we synthesize 0.1 million dexterous grasps to show that our pipeline significantly outperforms the SOTA method, even in task-unaware force-closure grasp synthesis. For task-oriented grasp synthesis, we provide qualitative results. Our project page is https://pku-epic.github.io/TaskDexGrasp/.
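
A back-of-the-envelope version of "sampling and mapping" the GWS: sample forces inside each contact's friction cone, map them to 6-D wrenches w = (f, r x f), and read the boundary off as a support function along sampled directions. This is our own illustration under a unit-force-per-contact assumption, not the paper's estimator.

```python
# Sampled support-function estimate of a Grasp Wrench Space boundary.
import numpy as np

rng = np.random.default_rng(0)
mu = 0.5                                         # friction coefficient
contacts = rng.normal(size=(4, 3))               # contact positions r_i
normals = rng.normal(size=(4, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)

def sample_cone_forces(n, k=256):
    """Sample unit forces inside the friction cone around normal n."""
    t = rng.normal(size=(k, 3))
    t -= (t @ n)[:, None] * n                    # project onto tangent plane
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    f = n + mu * rng.random((k, 1)) * t          # normal + tangential part
    return f / np.linalg.norm(f, axis=1, keepdims=True)

dirs = rng.normal(size=(32, 6))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# The GWS is a Minkowski sum over contacts; the support function of a
# Minkowski sum is the sum of per-contact supports h_i(d) = max_w <w, d>.
support = np.zeros(32)
for r, n in zip(contacts, normals):
    f = sample_cone_forces(n)
    w = np.concatenate([f, np.cross(r, f)], axis=1)  # 6-D wrench (f, r x f)
    support += (w @ dirs.T).max(axis=0)
print("sampled GWS support values:", np.round(support[:5], 3))
```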


Exploring Global Diversity and Local Context for Video Summarization

arXiv.org Artificial Intelligence

Video summarization aims to automatically generate a diverse and concise summary, which is useful in large-scale video processing. Most methods adopt a self-attention mechanism across video frames, which fails to model the diversity of video frames. To alleviate this problem, we revisit the pairwise similarity measurement in the self-attention mechanism and find that the existing inner-product affinity leads to discriminative rather than diversified features. In light of this phenomenon, we propose global diverse attention, which instead uses the squared Euclidean distance to compute the affinities. Moreover, we model local contextual information by proposing local contextual attention to remove redundancy in the video. By combining these two attention mechanisms, a video SUMmarization model with a Diversified Contextual Attention scheme, named SUM-DCA, is developed. Extensive experiments on benchmark datasets verify the effectiveness and superiority of SUM-DCA in terms of F-score and rank-based evaluation, without any bells and whistles.
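
The affinity swap at the heart of the abstract is compact enough to show directly: replace the inner-product affinity F F^T with the negative squared Euclidean distance before the softmax. Shapes and scaling below are illustrative; the paper's exact formulation may differ.

```python
# Inner-product vs. squared-Euclidean-distance attention affinities.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
F = rng.normal(size=(10, 16))                 # 10 frame features, dim 16

dot_affinity = F @ F.T                        # standard inner-product affinity
sq_dist = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
dist_affinity = -sq_dist                      # diverse attention: closer = higher

A_dot = softmax(dot_affinity / np.sqrt(16))
A_dist = softmax(dist_affinity / np.sqrt(16))
print(A_dot.shape, A_dist.shape)              # both (10, 10) attention maps
```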