Goto

Collaborating Authors

 Zhan, Yibing


An Atomic Skill Library Construction Method for Data-Efficient Embodied Manipulation

arXiv.org Artificial Intelligence

Embodied manipulation is a fundamental ability in the realm of embodied artificial intelligence. Although current embodied manipulation models show certain generalizations in specific settings, they struggle in new environments and tasks due to the complexity and diversity of real-world scenarios. The traditional end-to-end data collection and training manner leads to significant data demands. Decomposing end-to-end tasks into atomic skills helps reduce data requirements and improves the task success rate. However, existing methods are limited by predefined skill sets that cannot be dynamically updated. To address the issue, we introduce a three-wheeled data-driven method to build an atomic skill library. We divide tasks into subtasks using the Vision-Language-Planning (VLP). Then, atomic skill definitions are formed by abstracting the subtasks. Finally, an atomic skill library is constructed via data collection and Vision-Language-Action (VLA) fine-tuning. As the atomic skill library expands dynamically with the three-wheel update strategy, the range of tasks it can cover grows naturally. In this way, our method shifts focus from end-to-end tasks to atomic skills, significantly reducing data costs while maintaining high performance and enabling efficient adaptation to new tasks. Extensive experiments in real-world settings demonstrate the effectiveness and efficiency of our approach.


Leveraging Metamemory Mechanisms for Enhanced Data-Free Code Generation in LLMs

arXiv.org Artificial Intelligence

Automated code generation using large language models (LLMs) has gained attention due to its efficiency and adaptability. However, real-world coding tasks or benchmarks like HumanEval and StudentEval often lack dedicated training datasets, challenging existing few-shot prompting approaches that rely on reference examples. Inspired by human metamemory-a cognitive process involving recall and evaluation-we present a novel framework (namely M^2WF) for improving LLMs' one-time code generation. This approach enables LLMs to autonomously generate, evaluate, and utilize synthetic examples to enhance reliability and performance. Unlike prior methods, it minimizes dependency on curated data and adapts flexibly to various coding scenarios. Our experiments demonstrate significant improvements in coding benchmarks, offering a scalable and robust solution for data-free environments. The code and framework will be publicly available on GitHub and HuggingFace.


Modeling All Response Surfaces in One for Conditional Search Spaces

arXiv.org Artificial Intelligence

Bayesian Optimization (BO) is a sample-efficient black-box optimizer commonly used in search spaces where hyperparameters are independent. However, in many practical AutoML scenarios, there will be dependencies among hyperparameters, forming a conditional search space, which can be partitioned into structurally distinct subspaces. The structure and dimensionality of hyperparameter configurations vary across these subspaces, challenging the application of BO. Some previous BO works have proposed solutions to develop multiple Gaussian Process models in these subspaces. However, these approaches tend to be inefficient as they require a substantial number of observations to guarantee each GP's performance and cannot capture relationships between hyperparameters across different subspaces. To address these issues, this paper proposes a novel approach to model the response surfaces of all subspaces in one, which can model the relationships between hyperparameters elegantly via a self-attention mechanism. Concretely, we design a structure-aware hyperparameter embedding to preserve the structural information. Then, we introduce an attention-based deep feature extractor, capable of projecting configurations with different structures from various subspaces into a unified feature space, where the response surfaces can be formulated using a single standard Gaussian Process. The empirical results on a simulation function, various real-world tasks, and HPO-B benchmark demonstrate that our proposed approach improves the efficacy and efficiency of BO within conditional search spaces.


Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution

arXiv.org Artificial Intelligence

Human preference alignment can greatly enhance Multimodal Large Language Models (MLLMs), but collecting high-quality preference data is costly. A promising solution is the self-evolution strategy, where models are iteratively trained on data they generate. However, current techniques still rely on human- or GPT-annotated data and sometimes require additional models or ground truth answers. To address these issues, we propose a novel multimodal self-evolution framework that enables the model to autonomously generate high-quality questions and answers using only unannotated images. First, we implement an image-driven self-questioning mechanism, allowing the model to create and evaluate questions based on image content, regenerating them if they are irrelevant or unanswerable. This sets a strong foundation for answer generation. Second, we introduce an answer self-enhancement technique, starting with image captioning to improve answer quality. We also use corrupted images to generate rejected answers, forming distinct preference pairs for optimization. Finally, we incorporate an image content alignment loss function alongside Direct Preference Optimization (DPO) loss to reduce hallucinations, ensuring the model focuses on image content. Experiments show that our framework performs competitively with methods using external information, offering a more efficient and scalable approach to MLLMs.


AGMixup: Adaptive Graph Mixup for Semi-supervised Node Classification

arXiv.org Artificial Intelligence

Mixup is a data augmentation technique that enhances model generalization by interpolating between data points using a mixing ratio $\lambda$ in the image domain. Recently, the concept of mixup has been adapted to the graph domain through node-centric interpolations. However, these approaches often fail to address the complexity of interconnected relationships, potentially damaging the graph's natural topology and undermining node interactions. Furthermore, current graph mixup methods employ a one-size-fits-all strategy with a randomly sampled $\lambda$ for all mixup pairs, ignoring the diverse needs of different pairs. This paper proposes an Adaptive Graph Mixup (AGMixup) framework for semi-supervised node classification. AGMixup introduces a subgraph-centric approach, which treats each subgraph similarly to how images are handled in Euclidean domains, thus facilitating a more natural integration of mixup into graph-based learning. We also propose an adaptive mechanism to tune the mixing ratio $\lambda$ for diverse mixup pairs, guided by the contextual similarity and uncertainty of the involved subgraphs. Extensive experiments across seven datasets on semi-supervised node classification benchmarks demonstrate AGMixup's superiority over state-of-the-art graph mixup methods. Source codes are available at \url{https://github.com/WeigangLu/AGMixup}.


DA-MoE: Addressing Depth-Sensitivity in Graph-Level Analysis through Mixture of Experts

arXiv.org Artificial Intelligence

Graph neural networks (GNNs) are gaining popularity for processing graph-structured data. In real-world scenarios, graph data within the same dataset can vary significantly in scale. This variability leads to depth-sensitivity, where the optimal depth of GNN layers depends on the scale of the graph data. Empirically, fewer layers are sufficient for message passing in smaller graphs, while larger graphs typically require deeper networks to capture long-range dependencies and global features. However, existing methods generally use a fixed number of GNN layers to generate representations for all graphs, overlooking the depth-sensitivity issue in graph structure data. To address this challenge, we propose the depth adaptive mixture of expert (DA-MoE) method, which incorporates two main improvements to GNN backbone: \textbf{1)} DA-MoE employs different GNN layers, each considered an expert with its own parameters. Such a design allows the model to flexibly aggregate information at different scales, effectively addressing the depth-sensitivity issue in graph data. \textbf{2)} DA-MoE utilizes GNN to capture the structural information instead of the linear projections in the gating network. Thus, the gating network enables the model to capture complex patterns and dependencies within the data. By leveraging these improvements, each expert in DA-MoE specifically learns distinct graph patterns at different scales. Furthermore, comprehensive experiments on the TU dataset and open graph benchmark (OGB) have shown that DA-MoE consistently surpasses existing baselines on various tasks, including graph, node, and link-level analyses. The code are available at \url{https://github.com/Celin-Yao/DA-MoE}.


NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models

arXiv.org Artificial Intelligence

Hallucinations in Large Language Models (LLMs) remain a major obstacle, particularly in high-stakes applications where factual accuracy is critical. While representation editing and reading methods have made strides in reducing hallucinations, their heavy reliance on specialised tools and training on in-domain samples, makes them difficult to scale and prone to overfitting. This limits their accuracy gains and generalizability to diverse datasets. This paper presents a lightweight method, Norm Voting (NoVo), which harnesses the untapped potential of attention head norms to dramatically enhance factual accuracy in zero-shot multiple-choice questions (MCQs). NoVo begins by automatically selecting truth-correlated head norms with an efficient, inference-only algorithm using only 30 random samples, allowing NoVo to effortlessly scale to diverse datasets. Afterwards, selected head norms are employed in a simple voting algorithm, which yields significant gains in prediction accuracy. NoVo demonstrates exceptional generalization to 20 diverse datasets, with significant gains in over 90% of them, far exceeding all current representation editing and reading methods. NoVo also reveals promising gains to finetuning strategies and building textual adversarial defence. NoVo's effectiveness with head norms opens new frontiers in LLM interpretability, robustness and reliability. One of the most significant challenges facing Large Language Models (LLMs) today is their tendency to hallucinate--outputs that are factually incorrect or entirely fabricated (Zhang et al., 2023b). This flaw is particularly serious in high-stakes applications like finance and healthcare, where even small errors can lead to huge losses and compromised patient safety (Kang & Liu, 2023; Pal et al., 2023). Reducing factual hallucinations is a critical research area with major practical benefits, essential for realising the full potential of LLMs to revolutionise these industries by enhancing efficiency and decision-making, and safeguarding against costly and harmful errors (Kaddour et al., 2023). Given these serious risks and the high cost of retraining LLMs, it is crucial to find affordable techniques to reduce factual hallucinations. Although inference techniques such as retrieval augmentation and prompt engineering work well, they come with significant limitations: latency and external dependencies, and the need for user expertise, respectively (Zhao et al., 2024; Sahoo et al., 2024). In response, we turn to representation editing and reading methods (REAR) (Zou et al., 2023), which operate within the model, ensuring rapid response times and eliminating the need for external data or user interaction.


Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks

arXiv.org Artificial Intelligence

The advancements in embodied AI are increasingly enabling robots to tackle complex real-world tasks, such as household manipulation. However, the deployment of robots in these environments remains constrained by the lack of comprehensive bimanual-mobile robot manipulation data that can be learned. Existing datasets predominantly focus on single-arm manipulation tasks, while the few dual-arm datasets available often lack mobility features, task diversity, comprehensive sensor data, and robust evaluation metrics; they fail to capture the intricate and dynamic nature of household manipulation tasks that bimanual-mobile robots are expected to perform. To overcome these limitations, we propose BRMData, a Bimanual-mobile Robot Manipulation Dataset specifically designed for household applications. BR-MData encompasses 10 diverse household tasks, including single-arm and dual-arm tasks, as well as both tabletop and mobile manipulations, utilizing multi-view and depth-sensing data information. Moreover, BRMData features tasks of increasing difficulty, ranging from single-object to multi-object grasping, non-interactive to human-robot interactive scenarios, and rigid-object to flexible-object manipulation, closely simulating real-world household applications. Additionally, we introduce a novel Manipulation Efficiency Score (MES) metric to evaluate both the precision and efficiency of robot manipulation methods in household tasks. We thoroughly evaluate and analyze the performance of advanced robot manipulation learning methods using our BRMData, aiming to drive the development of bimanual-mobile robot manipulation technologies. The dataset is now open-sourced and available at https://embodiedrobot.github.io/.


Hi-GMAE: Hierarchical Graph Masked Autoencoders

arXiv.org Artificial Intelligence

Graph Masked Autoencoders (GMAEs) have emerged as a notable self-supervised learning approach for graph-structured data. Existing GMAE models primarily focus on reconstructing node-level information, categorizing them as single-scale GMAEs. This methodology, while effective in certain contexts, tends to overlook the complex hierarchical structures inherent in many real-world graphs. For instance, molecular graphs exhibit a clear hierarchical organization in the form of the atoms-functional groups-molecules structure. Hence, the inability of single-scale GMAE models to incorporate these hierarchical relationships often leads to their inadequate capture of crucial high-level graph information, resulting in a noticeable decline in performance. To address this limitation, we propose Hierarchical Graph Masked AutoEncoders (Hi-GMAE), a novel multi-scale GMAE framework designed to handle the hierarchical structures within graphs. First, Hi-GMAE constructs a multi-scale graph hierarchy through graph pooling, enabling the exploration of graph structures across different granularity levels. To ensure masking uniformity of subgraphs across these scales, we propose a novel coarse-to-fine strategy that initiates masking at the coarsest scale and progressively back-projects the mask to the finer scales. Furthermore, we integrate a gradual recovery strategy with the masking process to mitigate the learning challenges posed by completely masked subgraphs. Diverging from the standard graph neural network (GNN) used in GMAE models, Hi-GMAE modifies its encoder and decoder into hierarchical structures. This entails using GNN at the finer scales for detailed local graph analysis and employing a graph transformer at coarser scales to capture global information. Our experiments on 15 graph datasets consistently demonstrate that Hi-GMAE outperforms 17 state-of-the-art self-supervised competitors.


Improving Complex Reasoning over Knowledge Graph with Logic-Aware Curriculum Tuning

arXiv.org Artificial Intelligence

Answering complex queries over incomplete knowledge graphs (KGs) is a challenging job. Most previous works have focused on learning entity/relation embeddings and simulating first-order logic operators with various neural networks. However, they are bottlenecked by the inability to share world knowledge to improve logical reasoning, thus resulting in suboptimal performance. In this paper, we propose a complex reasoning schema over KG upon large language models (LLMs), containing a curriculum-based logical-aware instruction tuning framework, named LACT. Specifically, we augment the arbitrary first-order logical queries via binary tree decomposition, to stimulate the reasoning capability of LLMs. To address the difficulty gap among different types of complex queries, we design a simple and flexible logic-aware curriculum learning framework. Experiments across widely used datasets demonstrate that LACT has substantial improvements~(brings an average +5.5% MRR score) over advanced methods, achieving the new state-of-the-art. Our code and model will be released at GitHub and huggingface soon.