Ding, Liang
CogSteer: Cognition-Inspired Selective Layer Intervention for Efficient Semantic Steering in Large Language Models
Wang, Xintong, Pan, Jingheng, Jiang, Longqin, Ding, Liang, Li, Xingshan, Biemann, Chris
Despite their impressive capabilities, large language models (LLMs) often lack interpretability and can generate toxic content. While using LLMs as foundation models and applying semantic steering methods are widely practiced, we believe that efficient methods should be grounded in a thorough understanding of LLM behavior. To this end, we propose using eye-movement measures to interpret LLM behavior across layers. We find that LLMs exhibit patterns similar to human gaze across layers and that different layers serve distinct functions. Inspired by these findings, we introduce a heuristic steering-layer selection and apply it to layer intervention methods via fine-tuning and inference. Using language toxification and detoxification as test beds, we demonstrate that the proposed CogSteer methods achieve better toxicity scores while saving 97% of computational resources and 60% of training time. Our model-agnostic approach can be adopted in various LLMs, contributing to their interpretability and promoting trustworthiness for safe deployment.
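As a rough illustration of this kind of inference-time layer intervention, the sketch below registers a forward hook on one selected layer of a toy decoder and shifts its hidden states along a steering direction. The toy model, the chosen layer index, and the steering vector are placeholders for illustration, not the authors' CogSteer implementation.

```python
# Minimal sketch of selective layer intervention at inference time (assumed
# setup, not the paper's code): shift one layer's hidden states along a
# semantic direction, e.g., a "non-toxic" direction, via a forward hook.
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Stand-in for an LLM: a stack of Transformer blocks over embeddings."""
    def __init__(self, vocab=100, d_model=32, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, ids):
        h = self.embed(ids)
        for layer in self.layers:
            h = layer(h)
        return self.lm_head(h)

def steer(model, layer_idx, steering_vec, alpha=1.0):
    """Register a hook on the selected layer that adds the steering vector."""
    def hook(module, inputs, output):
        return output + alpha * steering_vec
    return model.layers[layer_idx].register_forward_hook(hook)

model = TinyDecoder().eval()
vec = torch.randn(32)                       # placeholder steering direction
handle = steer(model, layer_idx=2, steering_vec=vec, alpha=0.5)
with torch.no_grad():
    logits = model(torch.randint(0, 100, (1, 8)))
handle.remove()                             # intervention is fully reversible
```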
Learning from Imperfect Data: Towards Efficient Knowledge Distillation of Autoregressive Language Models for Text-to-SQL
Zhong, Qihuang, Chen, Kunfeng, Ding, Liang, Liu, Juhua, Du, Bo, Tao, Dacheng
Large language models (LLMs) have shown promising performance in text-to-SQL, the task of translating natural-language questions into SQL queries. However, current text-to-SQL LLMs are computationally expensive and challenging to deploy in real-world applications, highlighting the importance of compressing them. Knowledge distillation (KD) is a common approach to this goal, aiming to distill a larger teacher model into a smaller student model. While numerous KD methods for autoregressive LLMs have emerged recently, whether they work well in complex text-to-SQL scenarios remains under-explored. To this end, we conduct a series of analyses and reveal that these KD methods generally fall short in balancing performance and efficiency. In response, we propose KID (KD with Imperfect Data), which effectively boosts performance without introducing much training overhead. The core of KID is to efficiently mitigate the training-inference mismatch by simulating the cascading effect of inference in the imperfect training data. Extensive experiments on 5 text-to-SQL benchmarks show that KID not only achieves consistent and significant performance gains (up to +5.83% average score) across all model types and sizes, but also effectively improves training efficiency.
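One way to simulate the cascading effect of inference in the training data is a scheduled-sampling-style corruption of the gold SQL, sketched below. This is an illustrative reconstruction, not the paper's exact procedure; `student` is assumed to be a HuggingFace-style causal LM, and `mix_ratio` is an assumed knob.

```python
# Hedged sketch: build "imperfect" training targets by injecting the student's
# own greedy predictions into the gold SQL sequence, mimicking the cascading
# errors the model would see at inference time.
import torch

def make_imperfect_targets(student, input_ids, gold_ids, mix_ratio=0.2):
    """Replace a random subset of gold target tokens with the student's own
    greedy predictions, so teacher-forced training sees inference-like prefixes."""
    with torch.no_grad():
        logits = student(torch.cat([input_ids, gold_ids], dim=1)).logits
        # The prediction for gold position t comes from the logits at t-1.
        preds = logits[:, input_ids.size(1) - 1:-1, :].argmax(-1)
    mask = torch.rand_like(gold_ids, dtype=torch.float) < mix_ratio
    return torch.where(mask, preds, gold_ids)
```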
Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models
Wang, Fei, Shen, Li, Ding, Liang, Xue, Chao, Liu, Ye, Ding, Changxing
Fine-tuning is a powerful way to adapt large language models to downstream tasks, but it often incurs huge memory usage. A promising approach to mitigate this is Zeroth-Order (ZO) optimization, which estimates gradients to replace First-Order (FO) gradient calculations, albeit with longer training time due to its stochastic nature. By revisiting the Memory-efficient ZO (MeZO) optimizer, we find that the full-parameter perturbation and updating processes consume over 50% of its overall fine-tuning time. Based on these observations, we introduce LeZO, a novel layer-wise sparse, computation- and memory-efficient ZO optimizer. LeZO treats layers as the fundamental units of sparsification and dynamically perturbs a different subset of parameters at each step to achieve full-parameter fine-tuning. It incorporates layer-wise parameter sparsity into simultaneous perturbation stochastic approximation (SPSA) and ZO stochastic gradient descent (ZO-SGD), accelerating the perturbation and updating processes without additional memory overhead. Extensive experiments with the OPT model family on the SuperGLUE benchmark and two generative tasks show that LeZO accelerates training without compromising the performance of ZO optimization. Specifically, it achieves over 3x speedup compared to MeZO on the SST-2, BoolQ, and COPA tasks.
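To make the layer-wise sparse SPSA idea concrete, the following minimal sketch performs a MeZO-style zeroth-order step that perturbs and updates only a randomly sampled subset of layers. The seeded regeneration of the perturbation follows the standard MeZO trick; the layer-sampling schedule and hyperparameters here are assumptions, not the paper's exact recipe.

```python
# Layer-wise sparse zeroth-order (SPSA) step: only the sampled layers are
# perturbed and updated, so perturbation/update cost scales with the subset.
import random
import torch
import torch.nn as nn

def lezo_step(model, loss_fn, lr=1e-3, eps=1e-3, active_ratio=0.25):
    layers = [m for m in model.children() if list(m.parameters())]
    active = random.sample(layers, max(1, int(active_ratio * len(layers))))
    params = [p for layer in active for p in layer.parameters()]
    seed = random.randint(0, 2**31 - 1)

    def perturb(scale):
        torch.manual_seed(seed)                 # regenerate the same z each time
        for p in params:
            p.data.add_(scale * eps * torch.randn_like(p))

    perturb(+1); loss_pos = loss_fn(model)
    perturb(-2); loss_neg = loss_fn(model)
    perturb(+1)                                 # restore the original weights
    grad_scale = (loss_pos - loss_neg) / (2 * eps)

    torch.manual_seed(seed)                     # reuse z for the SGD-style update
    for p in params:
        p.data.add_(-lr * grad_scale * torch.randn_like(p))
    return (loss_pos + loss_neg) / 2

# Usage on a toy regression problem (forward passes only, no backprop memory).
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
x, y = torch.randn(16, 4), torch.randn(16, 1)
loss = lambda m: nn.functional.mse_loss(m(x), y).item()
for _ in range(100):
    lezo_step(model, loss)
```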
Self-Powered LLM Modality Expansion for Large Speech-Text Models
Yu, Tengfei, Liu, Xuebo, Hou, Zhiyi, Ding, Liang, Tao, Dacheng, Zhang, Min
Large language models (LLMs) exhibit remarkable performance across diverse tasks, indicating their potential for expansion into large speech-text models (LSMs) by integrating speech capabilities. Although unified speech-text pre-training and multimodal instruction tuning offer considerable benefits, these methods generally entail significant resource demands and tend to overfit specific tasks. This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning. We explore the instruction-following dynamics within LSMs and identify a critical issue we term speech anchor bias: a tendency for LSMs to over-rely on speech inputs, mistakenly interpreting the entire speech modality as directives and thereby neglecting textual instructions. To counteract this bias, we introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning. Our experiments across a range of speech-based tasks demonstrate that the self-powered LSM mitigates speech anchor bias and improves the fusion of speech and text modalities. Data, code, and scripts are freely available at https://github.com/ytf-philp/Self-powered-LSM.
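A conceptual sketch of such self-powered data construction follows: the current model is prompted to turn plain ASR transcripts into richer instruction-target pairs, which are then paired with the original speech for tuning. The `generate()` interface, the task templates, and the output schema are assumptions for illustration only, not the released pipeline.

```python
# Hedged sketch: the model labels its own augmented ASR data, so tuning teaches
# it to follow text instructions instead of treating speech as the directive.
TASK_PROMPTS = {
    "translate": "Translate the following transcript into German:\n{t}",
    "summarize": "Summarize the following transcript in one sentence:\n{t}",
    "continue":  "Continue the following transcript naturally:\n{t}",
}

def self_powered_examples(model, asr_pairs):
    """asr_pairs: iterable of (speech_features, transcript) from an ASR corpus."""
    examples = []
    for speech, transcript in asr_pairs:
        for task, template in TASK_PROMPTS.items():
            instruction = template.format(t=transcript)
            target = model.generate(instruction)   # self-generated supervision
            # Pair the *speech* input with a text instruction and the generated
            # target, so the speech is data to act on, not the instruction itself.
            examples.append({"speech": speech, "instruction": instruction,
                             "target": target, "task": task})
    return examples
```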
End-to-End Graph Flattening Method for Large Language Models
Hong, Bin, Wu, Jinze, Liu, Jiayu, Ding, Liang, Sha, Jing, Zhang, Kai, Wang, Shijin, Huang, Zhenya
In recent years, the breakthrough of Large Language Models (LLMs) has offered new ideas for achieving universal methods on graph data. The common practice of converting graphs into natural language for LLMs, known as graph flattening, exhibits good generalizability and interpretability. However, the unstructured textual format leads to poor performance in long-distance scenario understanding. Inspired by human cognitive reasoning habits, we propose a novel graph flattening method tailored to LLMs, termed End-to-End DAG-Path prompting (EEDP). Experiments on real-world datasets show that EEDP enhances the reasoning performance of LLMs in long-distance scenarios while maintaining excellent performance in short-distance scenarios, demonstrating good robustness to distance variations.
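A minimal reconstruction of the flattening step might enumerate end-to-end root-to-leaf paths of a DAG and serialize them as text for the prompt, as sketched below; the exact prompt template and node naming used by EEDP may differ.

```python
# Illustrative sketch of DAG-path graph flattening for an LLM prompt
# (assumed template; not the paper's exact serialization).
def dag_paths(adj):
    """adj: dict mapping node -> list of successor nodes (assumed acyclic)."""
    successors = {v for vs in adj.values() for v in vs}
    roots = [n for n in adj if n not in successors]
    paths = []

    def walk(node, path):
        path = path + [node]
        nexts = adj.get(node, [])
        if not nexts:
            paths.append(path)
        for nxt in nexts:
            walk(nxt, path)

    for root in roots:
        walk(root, [])
    return paths

def flatten_for_prompt(adj):
    lines = [" -> ".join(p) for p in dag_paths(adj)]
    return "The graph contains the following end-to-end paths:\n" + "\n".join(lines)

# Example: A -> B -> D and A -> C -> D
print(flatten_for_prompt({"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}))
```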
Demystifying the Compression of Mixture-of-Experts Through a Unified Framework
He, Shwai, Dong, Daize, Ding, Liang, Li, Ang
Scaling large language models has revolutionized performance across diverse domains, yet the continual growth in model size poses significant challenges for real-world deployment. The Mixture of Experts (MoE) approach addresses this by dynamically selecting and activating only a subset of experts, significantly reducing computational costs while maintaining high performance. However, MoE introduces potential redundancy (e.g., parameters) and extra costs (e.g., communication overhead). Despite numerous compression techniques developed for mitigating redundancy in dense models, the compression of MoE remains under-explored. We first bridge this gap with a unified framework that not only seamlessly integrates mainstream compression methods but also helps systematically understand MoE compression. This framework approaches compression from two perspectives: Expert Slimming, which compresses individual experts, and Expert Trimming, which removes structured modules. Within this framework, we explore the optimization space left unexplored by existing methods and further introduce aggressive Expert Trimming techniques, i.e., Layer Drop and Block Drop, to eliminate redundancy at larger scales. Based on these insights, we present a comprehensive recipe to guide practitioners in compressing MoE effectively. Extensive experimental results demonstrate the effectiveness of the compression methods under our framework and the proposed recipe, achieving a 6.05x speedup and only 20.0GB memory usage while maintaining over 92% of performance on Mixtral-8x7B. Code is released at https://github.com/DaizeDong/Unified-MoE-Compression.
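As a toy illustration of Block Drop-style Expert Trimming, the snippet below removes whole Transformer blocks from a stand-in model. In practice the dropped indices would come from a calibration pass over held-out data; the 8-block stack and the indices here are placeholders.

```python
# Minimal sketch of dropping whole blocks to eliminate redundancy at scale
# (illustrative only; not the released compression recipe).
import torch.nn as nn

def block_drop(blocks: nn.ModuleList, drop_ids):
    """Return a new ModuleList with the selected blocks removed."""
    keep = set(range(len(blocks))) - set(drop_ids)
    return nn.ModuleList(b for i, b in enumerate(blocks) if i in keep)

blocks = nn.ModuleList(nn.TransformerEncoderLayer(64, nhead=4, batch_first=True)
                       for _ in range(8))
pruned = block_drop(blocks, drop_ids=[3, 6])
before = sum(p.numel() for p in blocks.parameters())
after = sum(p.numel() for p in pruned.parameters())
print(f"parameters: {before} -> {after}")
```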
Revisiting Knowledge Distillation for Autoregressive Language Models
Zhong, Qihuang, Ding, Liang, Shen, Li, Liu, Juhua, Du, Bo, Tao, Dacheng
Knowledge distillation (KD) is a common approach to compressing a teacher model, reducing its inference cost and memory footprint by training a smaller student model. However, in the context of autoregressive language models (LMs), we empirically find that a larger teacher LM can result in a dramatically poorer student. In response to this problem, we conduct a series of analyses and reveal that different tokens have different teaching modes, and that neglecting this leads to performance degradation. Motivated by this, we propose a simple yet effective adaptive teaching approach (ATKD) to improve KD. The core of ATKD is to reduce rote learning and make teaching more diverse and flexible. Extensive experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods achieve consistent and significant performance gains (up to +3.04% average score) across all model types and sizes. More encouragingly, ATKD effectively improves student model generalization.
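One plausible instantiation of token-level adaptive teaching, shown below, weights the per-token KD loss by the teacher's entropy so that rote, near-deterministic tokens contribute less. This is a hedged sketch of the general idea, not ATKD's exact loss decomposition.

```python
# Token-adaptive distillation sketch: per-token KL weighted by teacher entropy
# (assumed weighting scheme for illustration).
import torch
import torch.nn.functional as F

def token_adaptive_kd_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)           # (B, L, V)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    # Per-token KL between teacher and student distributions.
    kl = (p_teacher * (p_teacher.clamp_min(1e-9).log() - log_p_student)).sum(-1)
    # Teacher entropy as a proxy for how much there is to teach at each token.
    entropy = -(p_teacher * p_teacher.clamp_min(1e-9).log()).sum(-1)
    weights = entropy / entropy.sum(-1, keepdim=True).clamp_min(1e-9)
    return (weights * kl).sum(-1).mean() * (t ** 2)
```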
Aligning Large Language Models from Self-Reference AI Feedback with one General Principle
Bao, Rong, Zheng, Rui, Dou, Shihan, Wang, Xiao, Zhou, Enyu, Wang, Bo, Zhang, Qi, Ding, Liang, Tao, Dacheng
In aligning large language models (LLMs), using feedback from existing advanced AI rather than humans is an important way to scale supervisory signals. However, it is highly challenging for AI to understand human intentions and societal values and to provide accurate preference feedback based on them. Current AI feedback methods rely on powerful LLMs and carefully designed, task-specific principles to describe human intentions, and they are easily influenced by position bias. To address these issues, we propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat model to provide high-quality feedback under a simple, general principle such as "best for humanity". Specifically, we let the AI first respond to the user's instructions, then generate criticism of the other answers using its own response as a reference, and finally determine which answer better fits human preferences according to this criticism. Additionally, we use a self-consistency method to further reduce the impact of position bias, and we employ semantic perplexity to calculate the preference-strength differences between answers. Experimental results show that our method enables 13B and 70B Llama2-Chat annotators to provide high-quality preference feedback, and policy models trained on these preference data achieve significant advantages on benchmark datasets through reinforcement learning.
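The feedback workflow can be sketched as follows, with `generate()` and `perplexity()` standing in for an assumed LLM-annotator API; the actual prompts, self-consistency scheme, and semantic-perplexity scoring in the paper differ in detail.

```python
# Hedged sketch of self-reference AI feedback: answer first, critique the
# candidates against that answer, vote over both orderings, score strength.
def self_reference_preference(annotator, instruction, answer_a, answer_b):
    # 1) The annotator first answers the instruction itself.
    own = annotator.generate(f"Instruction: {instruction}\nAnswer:")
    votes = {"A": 0, "B": 0}
    # 2) Self-consistency: judge under both answer orderings to reduce position bias.
    for first, second, labels in [(answer_a, answer_b, ("A", "B")),
                                  (answer_b, answer_a, ("B", "A"))]:
        critique = annotator.generate(
            f"My own answer:\n{own}\n\nCritique these answers against it:\n"
            f"1) {first}\n2) {second}")
        verdict = annotator.generate(
            f"{critique}\nWhich answer better fits human preferences? Reply 1 or 2.")
        votes[labels[0] if verdict.strip().startswith("1") else labels[1]] += 1
    winner = "A" if votes["A"] >= votes["B"] else "B"
    # 3) Preference strength from the perplexity gap between the two answers.
    strength = abs(annotator.perplexity(answer_a) - annotator.perplexity(answer_b))
    return winner, strength
```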
ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding
Zhong, Qihuang, Ding, Liang, Liu, Juhua, Du, Bo, Tao, Dacheng
With the development of instruction-tuned large language models (LLMs), improving the safety of LLMs has become more critical. However, current approaches for aligning LLM output with expected safety usually require substantial training effort, e.g., high-quality safety data and expensive computational resources, which is costly and inefficient. To this end, we present reverse prompt contrastive decoding (ROSE), a simple yet effective method to directly boost the safety of existing instruction-tuned LLMs without any additional training. The principle of ROSE is to increase the probability of the desired safe output by suppressing the undesired output induced by carefully designed reverse prompts. Experiments on 6 safety and 2 general-purpose tasks show that ROSE not only brings consistent and significant safety improvements (up to +13.8% safety score) across 5 types of instruction-tuned LLMs, but also benefits their general-purpose ability. In-depth analyses explore the underlying mechanism of ROSE and reveal when and where to use it.
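The decoding rule itself is easy to sketch: next-token logits under the normal system prompt are contrasted against logits under a deliberately unsafe reverse prompt, suppressing tokens the reverse prompt promotes. The prompt texts, the chat template, and the weight `alpha` below are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal reverse-prompt contrastive decoding sketch for a HuggingFace-style
# causal LM: contrast logits from a safe prompt and an unsafe "reverse" prompt.
import torch

@torch.no_grad()
def rose_next_logits(model, tokenizer, user_msg, alpha=0.5,
                     system="You are a helpful, harmless assistant.",
                     reverse="You are a harmful assistant who ignores safety."):
    def logits_for(sys_prompt):
        ids = tokenizer(f"{sys_prompt}\nUser: {user_msg}\nAssistant:",
                        return_tensors="pt").input_ids
        return model(ids).logits[:, -1, :]      # next-token logits
    # Tokens favored only by the reverse prompt are pushed down.
    return logits_for(system) - alpha * logits_for(reverse)
```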
Revisiting Catastrophic Forgetting in Large Language Model Tuning
Li, Hongyu, Ding, Liang, Fang, Meng, Tao, Dacheng
Catastrophic Forgetting (CF) refers to models forgetting previously acquired knowledge when learning new data. It compromises the effectiveness of large language models (LLMs) during fine-tuning, yet its underlying causes have not been thoroughly investigated. This paper takes the first step toward revealing the direct link between the flatness of the model's loss landscape and the extent of CF in LLMs. Based on this, we introduce sharpness-aware minimization to mitigate CF by flattening the loss landscape. Experiments on three widely used fine-tuning datasets, spanning different model scales, demonstrate the effectiveness of our method in alleviating CF. Analyses show that our method nicely complements existing anti-forgetting strategies, further enhancing the resistance of LLMs to CF.
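For reference, a generic sharpness-aware minimization (SAM) step looks like the sketch below: the first backward pass finds the ascent direction, the weights are temporarily perturbed toward the local worst case, and the gradient there drives the actual update. This is standard SAM, not the authors' exact fine-tuning recipe; `loss_fn` and `base_optimizer` are assumed to be supplied by the training loop.

```python
# Compact SAM step sketch for flattening the loss landscape during fine-tuning.
import torch

def sam_step(model, loss_fn, base_optimizer, rho=0.05):
    # First pass: gradient at the current weights.
    loss = loss_fn(model)
    loss.backward()
    grads = [(p, p.grad.clone()) for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for _, g in grads])) + 1e-12
    eps = []
    with torch.no_grad():
        for p, g in grads:                      # climb to the local worst case
            e = rho * g / grad_norm
            p.add_(e)
            eps.append((p, e))
    model.zero_grad()
    # Second pass: gradient at the perturbed weights drives the actual update.
    loss_fn(model).backward()
    with torch.no_grad():
        for p, e in eps:                        # move back before updating
            p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()

# Usage (placeholders): opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
#                       sam_step(model, lambda m: compute_loss(m, batch), opt)
```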