Goto

Collaborating Authors

 Wang, Xingwei


Efficient and Effective Prompt Tuning via Prompt Decomposition and Compressed Outer Product

arXiv.org Artificial Intelligence

Prompt tuning (PT) offers a cost-effective alternative to fine-tuning large-scale pre-trained language models (PLMs), requiring only a few parameters in soft prompt tokens added before the input text. However, existing PT approaches face two significant issues: (i) They overlook intrinsic semantic associations between soft prompt tokens, leading to high discreteness and limited interactions, thus reducing the model's comprehension and effectiveness in complex tasks. (ii) Due to the complexity of downstream tasks, long soft prompt is necessitated to improve performance, but prompt length correlates positively with memory usage and computational costs. Achieving high efficiency and performance remains an ongoing challenge. To address these issues, we propose a novel Low-parameters prompt tuning (LAMP) method, which leverages prompt decomposition and compressed outer product. Specifically, the prompt decomposition module employs Truncated SVD to reduce training parameters and significantly lower the dimensionality of the soft prompt parameter space. It then utilizes a compressed outer product module to facilitate multiple interactions among prompt tokens, exploring their intrinsic associations to enhance knowledge representation. Finally, LAMP uses average pooling to reduce memory usage and training/inference time. Extensive experiments across six architectures and eight datasets demonstrate that LAMP outperforms state-of-the-art PT-based and LoRA-based methods in performance and efficiency.


Distributionally Robust Graph Out-of-Distribution Recommendation via Diffusion Model

arXiv.org Machine Learning

The distributionally robust optimization (DRO)-based graph neural network methods improve recommendation systems' out-of-distribution (OOD) generalization by optimizing the model's worst-case performance. However, these studies fail to consider the impact of noisy samples in the training data, which results in diminished generalization capabilities and lower accuracy. Through experimental and theoretical analysis, this paper reveals that current DRO-based graph recommendation methods assign greater weight to noise distribution, leading to model parameter learning being dominated by it. When the model overly focuses on fitting noise samples in the training data, it may learn irrelevant or meaningless features that cannot be generalized to OOD data. To address this challenge, we design a Distributionally Robust Graph model for OOD recommendation (DRGO). Specifically, our method first employs a simple and effective diffusion paradigm to alleviate the noisy effect in the latent space. Additionally, an entropy regularization term is introduced in the DRO objective function to avoid extreme sample weights in the worst-case distribution. Finally, we provide a theoretical proof of the generalization error bound of DRGO as well as a theoretical analysis of how our approach mitigates noisy sample effects, which helps to better understand the proposed framework from a theoretical perspective. We conduct extensive experiments on four datasets to evaluate the effectiveness of our framework against three typical distribution shifts, and the results demonstrate its superiority in both independently and identically distributed distributions (IID) and OOD.


SurgeryV2: Bridging the Gap Between Model Merging and Multi-Task Learning with Deep Representation Surgery

arXiv.org Artificial Intelligence

Model merging-based multitask learning (MTL) offers a promising approach for performing MTL by merging multiple expert models without requiring access to raw training data. However, in this paper, we examine the merged model's representation distribution and uncover a critical issue of "representation bias". This bias arises from a significant distribution gap between the representations of the merged and expert models, leading to the suboptimal performance of the merged MTL model. To address this challenge, we first propose a representation surgery solution called Surgery. Surgery is a lightweight, task-specific module that aligns the final layer representations of the merged model with those of the expert models, effectively alleviating bias and improving the merged model's performance. Despite these improvements, a performance gap remains compared to the traditional MTL method. Further analysis reveals that representation bias phenomena exist at each layer of the merged model, and aligning representations only in the last layer is insufficient for fully reducing systemic bias because biases introduced at each layer can accumulate and interact in complex ways. To tackle this, we then propose a more comprehensive solution, deep representation surgery (also called SurgeryV2), which mitigates representation bias across all layers, and thus bridges the performance gap between model merging-based MTL and traditional MTL. Finally, we design an unsupervised optimization objective to optimize both the Surgery and SurgeryV2 modules. Our experimental results show that incorporating these modules into state-of-the-art (SOTA) model merging schemes leads to significant performance gains. Notably, our SurgeryV2 scheme reaches almost the same level as individual expert models or the traditional MTL model. The code is available at \url{https://github.com/EnnengYang/SurgeryV2}.


Efficient Prompt Tuning by Multi-Space Projection and Prompt Fusion

arXiv.org Artificial Intelligence

Prompt tuning is a promising method to fine-tune a pre-trained language model without retraining its large-scale parameters. Instead, it attaches a soft prompt to the input text, whereby downstream tasks can be well adapted by merely learning the embeddings of prompt tokens. Nevertheless, existing methods still suffer from two challenges: (i) they are hard to balance accuracy and efficiency. A longer (shorter) soft prompt generally leads to a better(worse) accuracy but at the cost of more (less) training time. (ii)The performance may not be consistent when adapting to different downstream tasks. We attribute it to the same embedding space but responsible for different requirements of downstream tasks. To address these issues, we propose an Efficient Prompt Tuning method (EPT) by multi-space projection and prompt fusion. Specifically, it decomposes a given soft prompt into a shorter prompt and two low-rank matrices, significantly reducing the training time. Accuracy is also enhanced by leveraging low-rank matrices and the short prompt as additional knowledge sources to enrich the semantics of the original short prompt. In addition, we project the soft prompt into multiple subspaces to improve the performance consistency, and then adaptively learn the combination weights of different spaces through a gating network. Experiments on 13 natural language processing downstream tasks show that our method significantly and consistently outperforms 11 comparison methods with the relative percentage of improvements up to 12.9%, and training time decreased by 14%.


Towards Unified Modeling for Positive and Negative Preferences in Sign-Aware Recommendation

arXiv.org Artificial Intelligence

Recently, sign-aware graph recommendation has drawn much attention as it will learn users' negative preferences besides positive ones from both positive and negative interactions (i.e., links in a graph) with items. To accommodate the different semantics of negative and positive links, existing works utilize two independent encoders to model users' positive and negative preferences, respectively. However, these approaches cannot learn the negative preferences from high-order heterogeneous interactions between users and items formed by multiple links with different signs, resulting in inaccurate and incomplete negative user preferences. To cope with these intractable issues, we propose a novel \textbf{L}ight \textbf{S}igned \textbf{G}raph Convolution Network specifically for \textbf{Rec}ommendation (\textbf{LSGRec}), which adopts a unified modeling approach to simultaneously model high-order users' positive and negative preferences on a signed user-item interaction graph. Specifically, for the negative preferences within high-order heterogeneous interactions, first-order negative preferences are captured by the negative links, while high-order negative preferences are propagated along positive edges. Then, recommendation results are generated based on positive preferences and optimized with negative ones. Finally, we train representations of users and items through different auxiliary tasks. Extensive experiments on three real-world datasets demonstrate that our method outperforms existing baselines regarding performance and computational efficiency. Our code is available at \url{https://anonymous.4open.science/r/LSGRec-BB95}.


Pareto-wise Ranking Classifier for Multi-objective Evolutionary Neural Architecture Search

arXiv.org Artificial Intelligence

In the deployment of deep neural models, how to effectively and automatically find feasible deep models under diverse design objectives is fundamental. Most existing neural architecture search (NAS) methods utilize surrogates to predict the detailed performance (e.g., accuracy and model size) of a candidate architecture during the search, which however is complicated and inefficient. In contrast, we aim to learn an efficient Pareto classifier to simplify the search process of NAS by transforming the complex multi-objective NAS task into a simple Pareto-dominance classification task. To this end, we propose a classification-wise Pareto evolution approach for one-shot NAS, where an online classifier is trained to predict the dominance relationship between the candidate and constructed reference architectures, instead of using surrogates to fit the objective functions. The main contribution of this study is to change supernet adaption into a Pareto classifier. Besides, we design two adaptive schemes to select the reference set of architectures for constructing classification boundary and regulate the rate of positive samples over negative ones, respectively. We compare the proposed evolution approach with state-of-the-art approaches on widely-used benchmark datasets, and experimental results indicate that the proposed approach outperforms other approaches and have found a number of neural architectures with different model sizes ranging from 2M to 6M under diverse objectives and constraints.


Representation Surgery for Multi-Task Model Merging

arXiv.org Artificial Intelligence

Multi-task learning (MTL) compresses the information from multiple tasks into a unified backbone to improve computational efficiency and generalization. Recent work directly merges multiple independently trained models to perform MTL instead of collecting their raw data for joint training, greatly expanding the application scenarios of MTL. However, by visualizing the representation distribution of existing model merging schemes, we find that the merged model often suffers from the dilemma of representation bias. That is, there is a significant discrepancy in the representation distribution between the merged and individual models, resulting in poor performance of merged MTL. In this paper, we propose a representation surgery solution called "Surgery" to reduce representation bias in the merged model. Specifically, Surgery is a lightweight task-specific module that takes the representation of the merged model as input and attempts to output the biases contained in the representation from the merged model. We then designed an unsupervised optimization objective that updates the Surgery module by minimizing the distance between the merged model's representation and the individual model's representation. Extensive experiments demonstrate significant MTL performance improvements when our Surgery module is applied to state-of-the-art (SOTA) model merging schemes.


Uniform Sequence Better: Time Interval Aware Data Augmentation for Sequential Recommendation

arXiv.org Artificial Intelligence

Sequential recommendation is an important task to predict the next-item to access based on a sequence of interacted items. Most existing works learn user preference as the transition pattern from the previous item to the next one, ignoring the time interval between these two items. However, we observe that the time interval in a sequence may vary significantly different, and thus result in the ineffectiveness of user modeling due to the issue of \emph{preference drift}. In fact, we conducted an empirical study to validate this observation, and found that a sequence with uniformly distributed time interval (denoted as uniform sequence) is more beneficial for performance improvement than that with greatly varying time interval. Therefore, we propose to augment sequence data from the perspective of time interval, which is not studied in the literature. Specifically, we design five operators (Ti-Crop, Ti-Reorder, Ti-Mask, Ti-Substitute, Ti-Insert) to transform the original non-uniform sequence to uniform sequence with the consideration of variance of time intervals. Then, we devise a control strategy to execute data augmentation on item sequences in different lengths. Finally, we implement these improvements on a state-of-the-art model CoSeRec and validate our approach on four real datasets. The experimental results show that our approach reaches significantly better performance than the other 11 competing methods. Our implementation is available: https://github.com/KingGugu/TiCoSeRec.


ID Embedding as Subtle Features of Content and Structure for Multimodal Recommendation

arXiv.org Artificial Intelligence

Multimodal recommendation aims to model user and item representations comprehensively with the involvement of multimedia content for effective recommendations. Existing research has shown that it is beneficial for recommendation performance to combine (user- and item-) ID embeddings with multimodal salient features, indicating the value of IDs. However, there is a lack of a thorough analysis of the ID embeddings in terms of feature semantics in the literature. In this paper, we revisit the value of ID embeddings for multimodal recommendation and conduct a thorough study regarding its semantics, which we recognize as subtle features of content and structures. Then, we propose a novel recommendation model by incorporating ID embeddings to enhance the semantic features of both content and structures. Specifically, we put forward a hierarchical attention mechanism to incorporate ID embeddings in modality fusing, coupled with contrastive learning, to enhance content representations. Meanwhile, we propose a lightweight graph convolutional network for each modality to amalgamate neighborhood and ID embeddings for improving structural representations. Finally, the content and structure representations are combined to form the ultimate item embedding for recommendation. Extensive experiments on three real-world datasets (Baby, Sports, and Clothing) demonstrate the superiority of our method over state-of-the-art multimodal recommendation methods and the effectiveness of fine-grained ID embeddings.


AdaMerging: Adaptive Model Merging for Multi-Task Learning

arXiv.org Artificial Intelligence

Multi-task learning (MTL) aims to empower a model to tackle multiple tasks simultaneously. A recent development known as task arithmetic has revealed that several models, each fine-tuned for distinct tasks, can be directly merged into a single model to execute MTL without necessitating a retraining process using the initial training data. Nevertheless, this direct addition of models often leads to a significant deterioration in the overall performance of the merged model. This decline occurs due to potential conflicts and intricate correlations among the multiple tasks. Consequently, the challenge emerges of how to merge pre-trained models more effectively without using their original training data. This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging). This approach aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data. Specifically, our AdaMerging method operates as an automatic, unsupervised task arithmetic scheme. It leverages entropy minimization on unlabeled test samples from the multi-task setup as a surrogate objective function to iteratively refine the merging coefficients of the multiple models. Our experimental findings across eight tasks demonstrate the efficacy of the AdaMerging scheme we put forth. Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11\% improvement in performance. Notably, AdaMerging also exhibits superior generalization capabilities when applied to unseen downstream tasks. Furthermore, it displays a significantly enhanced robustness to data distribution shifts that may occur during the testing phase.