Yu, Runpeng
Multi-Level Collaboration in Model Merging
Li, Qi, Yu, Runpeng, Wang, Xinchao
Parameter-level model merging is an emerging paradigm in multi-task learning with significant promise. Previous research has explored its connections with prediction-level model ensembling, commonly viewed as the upper bound for merging, to reveal the potential of achieving performance consistency between the two. However, this observation relies on certain preconditions, such as being limited to two models, using ViT-based architectures, and requiring all models to be fine-tuned from the same pre-trained checkpoint. To further understand the intrinsic connections between model merging and model ensembling, this paper explores an interesting possibility: if these restrictions are removed, can performance consistency still be achieved between merging and ensembling? To answer this question, we first theoretically establish a performance correlation between merging and ensembling. We find that even when the previous restrictions are not met, model merging can still attain performance nearly identical, or even superior, to that of ensembling. To verify whether our findings are practical, we introduce a validation framework termed Neural Ligand (NeuLig). The learning process of NeuLig is carefully designed around a specialized loss function with theoretical foundations. Experimental results demonstrate the robust resilience of NeuLig with respect to both model scale and the number of collaborating models. For instance, in the case involving 5 CLIP-ViT-B/32 models, parameter-level merging achieves the same performance as prediction-level ensembling (merging: 95.44% vs. ensembling: 95.46%).
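A minimal sketch of the contrast the abstract draws, assuming five fine-tuned checkpoints with identical architectures; the merging coefficients below are fixed by hand purely for illustration, whereas NeuLig learns them with its specialized loss:

```python
import torch

def merge_parameters(state_dicts, alphas):
    """Parameter-level merging: one model whose weights are a
    convex combination of the collaborators' weights."""
    return {key: sum(a * sd[key] for a, sd in zip(alphas, state_dicts))
            for key in state_dicts[0]}

def ensemble_predictions(models, x):
    """Prediction-level ensembling: average the collaborators' logits."""
    with torch.no_grad():
        return torch.stack([m(x) for m in models]).mean(dim=0)

# Toy usage with identically shaped linear "models".
models = [torch.nn.Linear(16, 10) for _ in range(5)]
alphas = [1.0 / len(models)] * len(models)  # uniform coefficients, illustration only

merged_model = torch.nn.Linear(16, 10)
merged_model.load_state_dict(merge_parameters([m.state_dict() for m in models], alphas))

x = torch.randn(4, 16)
print(merged_model(x).shape)                  # one forward pass after merging
print(ensemble_predictions(models, x).shape)  # N forward passes for ensembling
```

Note that merging yields a single model (one forward pass at inference), while ensembling keeps all five models and averages their predictions.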
Introducing Visual Perception Token into Multimodal Large Language Model
Yu, Runpeng, Ma, Xinyin, Wang, Xinchao
To utilize visual information, a Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLMs still lack the autonomous capability to control their own visual perception processes, for example, selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual perception actions. The Region Selection Token explicitly identifies specific regions in an image that require further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception processes. Extensive experiments demonstrate the advantages of these tokens in handling spatial reasoning, improving fine-grained understanding, and other tasks. On average, the introduction of Visual Perception Tokens improves the performance of a 2B model by 23.6%, increasing its score from 0.572 to 0.708, and even outperforms a 7B model by 13.4% (from 0.624). Please check out our repo: https://github.com/yu-rp/VisualPerceptionToken
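As a rough illustration of how a Visual Perception Token might steer decoding, here is a hedged sketch; the token id and the step_fn/encode_fn/crop_fn interfaces are assumptions for exposition, not the repo's actual API:

```python
import torch

# Hypothetical id for the Region Selection Token; the real vocabulary
# entries live in the paper's repo and may differ.
REGION_SELECT_ID = 32001

def generate_with_perception(step_fn, encode_fn, crop_fn, ids, image,
                             max_steps=256, eos_id=2):
    """Decode loop in which the model may emit a Region Selection Token to
    trigger an extra perception pass. Assumed interfaces:
      step_fn(ids, feats) -> next-token logits over the vocabulary
      encode_fn(image)    -> vision features, shape (1, n_patches, dim)
      crop_fn(ids, image) -> image crop parsed from the decoded coordinates"""
    feats = encode_fn(image)
    for _ in range(max_steps):
        next_id = step_fn(ids, feats).argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == REGION_SELECT_ID:
            # Re-encode the requested region and append its features so the
            # model "sees" the area it asked to review.
            feats = torch.cat([feats, encode_fn(crop_fn(ids, image))], dim=1)
        elif next_id.item() == eos_id:
            break
    return ids
```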
CoT-Valve: Length-Compressible Chain-of-Thought Tuning
Ma, Xinyin, Wan, Guangnian, Yu, Runpeng, Fang, Gongfan, Wang, Xinchao
Chain-of-Thought significantly enhances a model's reasoning capability, but it also brings a considerable increase in inference cost due to long chains. Observing that the reasoning path can be easily compressed for easy tasks but resists compression for hard ones, we explore the feasibility of elastically controlling the length of reasoning paths with only one model, thereby dynamically reducing the inference overhead of reasoning models based on task difficulty. We introduce a new tuning and inference strategy named CoT-Valve, designed to allow models to generate reasoning chains of varying lengths. To achieve this, we propose to identify a direction in the parameter space that, when manipulated, effectively controls the length of the generated CoT. Moreover, we show that this property is valuable for compressing the reasoning chain. We construct datasets with chains from long to short for the same questions and explore two enhanced strategies for CoT-Valve: (1) a precise length-compressible CoT tuning method, and (2) a progressive chain length compression approach. Our experiments show that CoT-Valve successfully enables controllability and compressibility of the chain and outperforms prompt-based control. We apply this method to QwQ-32B-Preview, reducing reasoning chains on GSM8K from 741 to 225 tokens with a minor performance drop (95.07% to 94.92%) and on AIME from 6827 to 4629 tokens, with only one additional incorrect answer.
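A minimal sketch of the core mechanism, controlling CoT length by moving along one direction in parameter space; the paper identifies this direction via tuning, whereas here we assume two hypothetical checkpoints (one long-chain, one short-chain) purely for illustration:

```python
import torch

def apply_cot_valve(base_state, direction, alpha):
    """Shift the model's weights along a length-controlling direction.
    alpha = 0 keeps the long-chain behavior; larger alpha moves toward
    shorter reasoning chains in this sketch."""
    return {k: base_state[k] + alpha * direction[k] for k in base_state}

# Assumed setup: two checkpoints of the same architecture, one tuned to
# produce long chains and one short chains (hypothetical file names).
long_state = torch.load("model_long_cot.pt")
short_state = torch.load("model_short_cot.pt")
direction = {k: short_state[k] - long_state[k] for k in long_state}

for alpha in (0.0, 0.5, 1.0):   # elastically interpolate chain length
    state = apply_cot_valve(long_state, direction, alpha)
    # model.load_state_dict(state); generate, then measure CoT token count
```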
Revisiting Self-Supervised Heterogeneous Graph Learning from Spectral Clustering Perspective
Mo, Yujie, Lu, Zhihe, Yu, Runpeng, Zhu, Xiaofeng, Wang, Xinchao
Self-supervised heterogeneous graph learning (SHGL) has shown promising potential in diverse scenarios. However, while existing SHGL methods share an essential similarity with clustering approaches, they encounter two significant limitations: (i) noise in graph structures is often introduced during the message-passing process, weakening node representations; and (ii) cluster-level information may be inadequately captured and leveraged, diminishing performance in downstream tasks. In this paper, we address these limitations by theoretically revisiting SHGL from the spectral clustering perspective and introducing a novel framework enhanced by rank and dual consistency constraints. Specifically, our framework incorporates a rank-constrained spectral clustering method that refines the affinity matrix to effectively exclude noise. Additionally, we integrate node-level and cluster-level consistency constraints that concurrently capture invariant and clustering information to facilitate learning in downstream tasks. We theoretically demonstrate that the learned representations are divided into distinct partitions based on the number of classes and exhibit enhanced generalization ability across tasks. Experimental results affirm the superiority of our method, showcasing remarkable improvements in several downstream tasks compared with existing methods.
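To make the rank constraint concrete, here is a generic sketch of rank-constrained affinity refinement in the spirit the abstract describes (shrinking affinities so the Laplacian approaches rank n-k, i.e., k connected components); the update rule and the lam penalty are illustrative, not the paper's exact algorithm:

```python
import numpy as np

def rank_constrained_refine(A, k, n_iters=10, lam=0.5):
    """Refine an affinity matrix so that its Laplacian approaches rank n-k,
    i.e., the graph splits into k connected components (clusters),
    suppressing noisy edges along the way."""
    S = A.copy()
    for _ in range(n_iters):
        L = np.diag(S.sum(1)) - S                        # unnormalized Laplacian
        _, vecs = np.linalg.eigh(L)
        Fk = vecs[:, :k]                                 # k smallest eigenvectors
        # Spectral-embedding distances penalize affinities that keep the
        # Laplacian's k smallest eigenvalues away from zero.
        D = ((Fk[:, None, :] - Fk[None, :, :]) ** 2).sum(-1)
        S = np.maximum(A - lam * D, 0.0)                 # shrink rank-violating edges
        S = (S + S.T) / 2                                # keep the affinity symmetric
    return S

A = np.random.rand(30, 30)
A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)
S_refined = rank_constrained_refine(A, k=3)
```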
HG-Adapter: Improving Pre-Trained Heterogeneous Graph Neural Networks with Dual Adapters
Mo, Yujie, Yu, Runpeng, Zhu, Xiaofeng, Wang, Xinchao
The "pre-train, prompt-tuning" paradigm has demonstrated impressive performance for tuning pre-trained heterogeneous graph neural networks (HGNNs) by mitigating the gap between pre-trained models and downstream tasks. However, most prompt-tuning-based works may face at least two limitations: (i) the model may be insufficient to fit the graph structures well as they are generally ignored in the prompt-tuning stage, increasing the training error to decrease the generalization ability; and (ii) the model may suffer from the limited labeled data during the prompt-tuning stage, leading to a large generalization gap between the training error and the test error to further affect the model generalization. To alleviate the above limitations, we first derive the generalization error bound for existing prompttuning-based methods, and then propose a unified framework that combines two new adapters with potential labeled data extension to improve the generalization of pre-trained HGNN models. Specifically, we design dual structure-aware adapters to adaptively fit task-related homogeneous and heterogeneous structural information. We further design a label-propagated contrastive loss and two self-supervised losses to optimize dual adapters and incorporate unlabeled nodes as potential labeled data. Theoretical analysis indicates that the proposed method achieves a lower generalization error bound than existing methods, thus obtaining superior generalization ability. Comprehensive experiments demonstrate the effectiveness and generalization of the proposed method on different downstream tasks. Pre-trained heterogeneous graph neural networks (HGNNs) are designed to pre-train models on the heterogeneous graph data and then effectively generalize to diverse tasks (Fan et al., 2019; Jiang et al., 2021). To achieve this, current pre-trained HGNNs typically utilize unsupervised techniques during pre-training to learn fundamental properties, thereby enhancing the generalization ability of models (Yang et al., 2022; Fan et al., 2024). Consequently, pre-trained HGNNs have demonstrated promising potential in real applications such as recommendation systems, social network analysis, and molecular design (Shi et al., 2016; Tian et al., 2023; Wu et al., 2024). Existing pre-trained HGNNs generally follow two paradigms, i.e., "pre-train, fine-tuning" and "pretrain, prompt-tuning". The "pre-train, fine-tuning" paradigm typically first trains the model with unlabeled data in the pre-training stage, and then updates the pre-trained model with task-related labels in the fine-tuning stage to adapt it to downstream tasks (Wang et al., 2021; Tian et al., 2023).
Attention Prompting on Image for Large Vision-Language Models
Yu, Runpeng, Yu, Weihao, Wang, Xinchao
Compared with Large Language Models (LLMs), Large Vision-Language Models (LVLMs) can also accept images as input, showcasing more interesting emergent capabilities and demonstrating impressive performance on various vision-language tasks. Motivated by text prompting in LLMs, visual prompting has been explored to enhance LVLMs' capability to perceive visual information. However, previous visual prompting techniques process visual inputs alone, without considering text queries, limiting the models' ability to follow text instructions to complete tasks. To fill this gap, we propose a new prompting technique named Attention Prompting on Image, which simply overlays a text-query-guided attention heatmap on the original input image and effectively enhances LVLMs on various tasks. Specifically, we generate an attention heatmap for the input image, dependent on the text query, with an auxiliary model such as CLIP. The heatmap is then multiplied element-wise with the pixel values of the original image to obtain the actual input image for the LVLM. Extensive experiments on various vision-language benchmarks verify the effectiveness of our technique. For example, Attention Prompting on Image improves LLaVA-1.5 by 3.8% and 2.9% on the MM-Vet and LLaVA-Wild benchmarks, respectively.
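The mechanism is simple enough to sketch directly; the tensors below stand in for CLIP patch and text embeddings (assumptions, not real CLIP calls), and the exact similarity and normalization used in the paper may differ:

```python
import torch
import torch.nn.functional as F

def attention_prompt(image, patch_embeds, text_embed, grid=(16, 16)):
    """Build a text-query-guided heatmap from patch-text similarities and
    multiply it into the pixel values of the image."""
    # Cosine similarity between each image patch and the text query.
    sim = F.cosine_similarity(patch_embeds, text_embed[None, :], dim=-1)  # (P,)
    heat = sim.reshape(1, 1, *grid)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)         # rescale to [0, 1]
    heat = F.interpolate(heat, size=image.shape[-2:], mode="bilinear",
                         align_corners=False)
    return image * heat[0]                                                # element-wise re-weighting

# Toy tensors standing in for CLIP outputs.
image = torch.rand(3, 224, 224)
patch_embeds = torch.randn(256, 512)   # 16x16 patches, 512-d embeddings
text_embed = torch.randn(512)
prompted = attention_prompt(image, patch_embeds, text_embed)  # feed to the LVLM
```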
Through the Dual-Prism: A Spectral Perspective on Graph Data Augmentation for Graph Classification
Xia, Yutong, Yu, Runpeng, Liang, Yuxuan, Bresson, Xavier, Wang, Xinchao, Zimmermann, Roger
Graph Neural Networks (GNNs) have become the preferred tool for processing graph data, with their efficacy boosted through graph data augmentation techniques. Despite the evolution of augmentation methods, issues like graph property distortions and restricted structural changes persist. This leads to the question: is it possible to develop more property-conserving and structure-sensitive augmentation methods? Through a spectral lens, we investigate the interplay between graph properties, their augmentation, and their spectral behavior, and find that keeping the low-frequency eigenvalues unchanged largely preserves critical graph properties when generating augmented graphs. These observations inform our introduction of the Dual-Prism (DP) augmentation method, comprising DP-Noise and DP-Mask, which adeptly retains essential graph properties while diversifying the augmented graphs.

Graph structures, modeling complex systems through nodes and edges, are ubiquitous across various domains, including social networks (Newman et al., 2002), bioinformatics (Yi et al., 2022), and transportation systems (Jin et al., 2023a). Graph Neural Networks (GNNs) (Kipf & Welling, 2016a) elegantly handle this relational information, paving the way for tasks such as accurate prediction. Their capabilities are further enhanced by graph data augmentation techniques, which artificially diversify the dataset through strategic manipulations, thereby bolstering the performance and generalization of GNNs (Rong et al., 2019; Feng et al., 2020; You et al., 2020). Graph data augmentation has progressed from early random topological modifications, exemplified by DropEdge (Rong et al., 2019) and DropNode (Feng et al., 2020), to sophisticated learning-centric approaches like InfoMin (Suresh et al., 2021). Furthermore, techniques inspired by image augmentation's mixup principle (Zhang et al., 2017) have emerged as prominent contenders in this domain (Verma et al., 2019; Wang et al., 2021; Guo & Mao, 2021). Though promising, these augmentation methods are challenged by three key issues, as follows. Before the era of deep learning, graph properties, e.g., graph connectivity and diameter, served as vital classification features for decades (Childs et al., 2009). While these properties now seem to be ignored, many of the aforementioned contemporary augmentation methods sidestep this tradition and overlook them.
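A minimal sketch of the spectral recipe behind DP-Noise as the abstract describes it: eigendecompose the graph Laplacian, leave the low-frequency eigenvalues untouched, perturb only the high-frequency ones, and reconstruct; the cutoff keep_low and the Gaussian noise model are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def dp_noise_augment(A, keep_low=10, sigma=0.1, seed=0):
    """Spectral augmentation: keep the low-frequency eigenvalues of the
    Laplacian fixed (preserving global graph properties) and add noise
    only to high-frequency ones."""
    rng = np.random.default_rng(seed)
    L = np.diag(A.sum(1)) - A
    vals, vecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    vals_aug = vals.copy()
    vals_aug[keep_low:] += sigma * rng.standard_normal(len(vals) - keep_low)
    L_aug = vecs @ np.diag(vals_aug) @ vecs.T      # reconstruct the Laplacian
    A_aug = np.diag(np.diag(L_aug)) - L_aug        # recover a (weighted) adjacency
    return A_aug

A = (np.random.rand(20, 20) > 0.7).astype(float)
A = np.triu(A, 1)
A = A + A.T
print(dp_noise_augment(A).shape)
```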
Generator Born from Classifier
Yu, Runpeng, Wang, Xinchao
In this paper, we make a bold attempt toward an ambitious task: given a pre-trained classifier, we aim to reconstruct an image generator without relying on any data samples. From a black-box perspective, this challenge seems intractable, since it inevitably involves inverting the classifier, which is by nature an information-extraction process. We therefore resort to leveraging the knowledge encapsulated within the parameters of the neural network. Grounded in the theory of the Maximum-Margin Bias of gradient descent, we propose a novel learning paradigm in which the generator is trained to ensure that the convergence conditions of the network parameters are satisfied over the generated distribution of samples.
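A heavily simplified, assumption-laden sketch of the idea: under the maximum-margin bias, a converged classifier's parameter gradient vanishes over the training distribution, so a generator can be trained to produce class-conditional samples on which a proxy of that stationarity condition holds. The loss combination below is illustrative, not the paper's exact objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a pre-trained classifier; only the generator is optimized.
classifier = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 3))
generator = nn.Sequential(nn.Linear(8 + 3, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
params = list(classifier.parameters())  # kept differentiable, never updated

for step in range(200):
    z = torch.randn(64, 8)
    y = torch.randint(0, 3, (64,))
    x = generator(torch.cat([z, F.one_hot(y, 3).float()], dim=1))
    loss_cls = F.cross_entropy(classifier(x), y)
    # Stationarity proxy: at convergence, the gradient of the training loss
    # w.r.t. the classifier's parameters vanishes over the data distribution;
    # push the generator so this also holds over generated samples.
    grads = torch.autograd.grad(loss_cls, params, create_graph=True)
    stationarity = sum(g.pow(2).sum() for g in grads)
    loss = stationarity + loss_cls  # illustrative combination
    opt.zero_grad()
    loss.backward()
    opt.step()
```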
Distribution Shift Inversion for Out-of-Distribution Prediction
Yu, Runpeng, Liu, Songhua, Yang, Xingyi, Wang, Xinchao
The machine learning community has witnessed the emergence of a myriad of Out-of-Distribution (OoD) algorithms, which address the distribution shift between the training and testing distributions by searching for a unified predictor or an invariant feature representation. However, the task of directly mitigating the distribution shift in the unseen testing set is rarely investigated, because the testing distribution is unavailable during the training phase, making it impossible to train a distribution translator that maps between the training and testing distributions. In this paper, we explore how to bypass the requirement of the testing distribution for distribution translator training and make distribution translation useful for OoD prediction. We propose a portable Distribution Shift Inversion algorithm, in which, before being fed into the prediction model, the OoD testing samples are first linearly combined with additional Gaussian noise and then transferred back toward the training distribution using a diffusion model trained only on the source distribution. Theoretical analysis reveals the feasibility of our method. Experimental results, on both multiple-domain and single-domain generalization datasets, show that our method provides a general performance gain when plugged into a wide range of commonly used OoD algorithms.
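A hedged sketch of the inference-time pipeline the abstract describes; diffusion.denoise_from is a hypothetical interface for a reverse process started at step t_mix, and the mixing weight eta is illustrative:

```python
import torch

@torch.no_grad()
def shift_inversion(x_ood, diffusion, t_mix=400, eta=0.5):
    """Mix the OoD test sample with Gaussian noise, then run the
    source-trained diffusion model's reverse process from the mixing step
    to pull the sample back toward the training distribution."""
    noise = torch.randn_like(x_ood)
    x_mixed = (1 - eta) * x_ood + eta * noise   # linear combination with noise
    return diffusion.denoise_from(x_mixed, t_start=t_mix)  # hypothetical API

# Usage (hypothetical objects): translate OoD inputs before prediction.
# x_src_like = shift_inversion(x_ood, diffusion_model)
# logits = predictor(x_src_like)
```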