Hong, Yihuai
The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction
Hong, Yihuai, Zhou, Dian, Cao, Meng, Yu, Lei, Jin, Zhijing
Large language models (LLMs) excel on a variety of reasoning benchmarks, but previous studies suggest they sometimes struggle to generalize to unseen questions, potentially due to over-reliance on memorized training examples. However, the precise conditions under which LLMs switch between reasoning and memorization during text generation remain unclear. In this work, we provide a mechanistic understanding of LLMs' reasoning-memorization dynamics by identifying a set of linear features in the model's residual stream that govern the balance between genuine reasoning and memory recall. These features not only distinguish reasoning tasks from memory-intensive ones but can also be manipulated to causally influence model performance on reasoning tasks. Additionally, we show that intervening in these reasoning features helps the model more accurately activate the most relevant problem-solving capabilities during answer generation. Our findings offer new insights into the underlying mechanisms of reasoning and memory in LLMs and pave the way for the development of more robust and interpretable generative AI systems.
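For intuition, here is a minimal sketch of the kind of residual-stream intervention the abstract describes: adding a scaled direction to one layer's hidden states through a forward hook. The model (TinyLlama as a light stand-in), the layer index, the steering strength, and the randomly initialized `reasoning_direction` are all illustrative assumptions; the paper's actual direction would be estimated from model activations, not drawn at random.

```python
# Sketch: steer generation along a single residual-stream direction by adding
# a scaled vector to one decoder layer's output via a forward hook.
# Model, layer index, strength, and the random direction are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # light stand-in for a larger LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical unit-norm direction in the residual stream, e.g. one that could be
# estimated by contrasting activations on reasoning vs. memory-recall prompts.
d_model = model.config.hidden_size
reasoning_direction = torch.randn(d_model)
reasoning_direction /= reasoning_direction.norm()

alpha = 4.0       # steering strength; flipping the sign pushes the other way
layer_idx = 12    # which decoder layer's output to modify

def steer_hook(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the hidden states.
    if isinstance(output, tuple):
        steered = output[0] + alpha * reasoning_direction.to(output[0].dtype)
        return (steered,) + output[1:]
    return output + alpha * reasoning_direction.to(output.dtype)

handle = model.model.layers[layer_idx].register_forward_hook(steer_hook)
try:
    prompt = "If every widget costs 3 dollars, how much do 7 widgets cost?"
    ids = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations run unmodified
```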
Dissecting Fine-Tuning Unlearning in Large Language Models
Hong, Yihuai, Zou, Yuelin, Hu, Lijie, Zeng, Ziqian, Wang, Di, Yang, Haiqin
Consequently, recent research has focused on developing efficient unlearning methods as a post-training technique to selectively unlearn the specific knowledge (Blanco-Justicia et al., 2024; Liu et al., 2024). … 2023; Jang et al., 2023; Yao et al., 2024; Rafailov et al., 2023), with corresponding adjustments and designs in the loss function to facilitate … Although earlier investigations (Hong et al., 2024; Lee et al., 2024a) have … of these fine-tuning-based unlearning methods on LLaMA2-7B-chat (Touvron et al., 2023) and OLMo-7B (Groeneveld et al., 2024) by implementing them on the respective pretraining datasets of … We discover that while these methods appear to effectively unlearn target knowledge, they also inevitably affect the output and behavior related to unrelated knowledge.
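The excerpt above concerns fine-tuning-based unlearning objectives that pair an adjusted loss on the knowledge to be forgotten with ordinary training on everything else. A minimal sketch of one common design, gradient ascent on a forget set combined with a standard language-modeling loss on a retain set, is shown below; the model (gpt2 as a light stand-in), the loss weighting, and the placeholder texts are illustrative assumptions, not the paper's setup.

```python
# Sketch of a common fine-tuning-based unlearning objective:
# maximize loss on "forget" examples (gradient ascent) while keeping the
# ordinary language-modeling loss on "retain" examples.
# Model choice, weighting, and placeholder texts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies LLaMA2-7B-chat and OLMo-7B
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def lm_loss(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # labels == input_ids gives the standard next-token cross-entropy loss
    return model(**batch, labels=batch["input_ids"]).loss

forget_texts = ["<text containing the knowledge to unlearn>"]
retain_texts = ["<unrelated text whose behavior should be preserved>"]

retain_weight = 1.0
for step in range(10):  # a few illustrative update steps
    loss = -lm_loss(forget_texts) + retain_weight * lm_loss(retain_texts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```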
Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
Hong, Yihuai, Yu, Lei, Ravfogel, Shauli, Yang, Haiqin, Geva, Mor
The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance for mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors, while directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parametric-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference
Zeng, Ziqian, Hong, Yihuai, Dai, Hongliang, Zhuang, Huiping, Chen, Cen
Early Exiting is one of the most popular methods to achieve efficient inference. Current early exiting methods adopt the (weighted) sum of the cross-entropy losses of all internal classifiers during training, requiring all of these classifiers to predict every instance correctly. During inference, however, as long as one internal classifier predicts an instance correctly, inference can be accelerated without losing accuracy, so there is a notable gap between training and inference. We propose ConsistentEE, an early exiting method that is consistent between training and inference. ConsistentEE formulates the early exiting process as a reinforcement learning problem: a policy network is added to decide whether an instance should exit or continue. The training objective of ConsistentEE only requires each instance to be predicted correctly by one internal classifier. Additionally, we introduce the concept of the Memorized Layer to measure the hardness of an instance and incorporate it into the reward function design, which allows "easy" instances to focus more on acceleration while "hard" instances focus more on accuracy. Experimental results show that our method outperforms other baselines on various natural language understanding and generation tasks.
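For intuition, a minimal sketch of the inference-time exit decision that such a policy network makes is given below; the layer count, hidden size, first-token pooling, and 0.5 threshold are illustrative assumptions rather than the paper's exact architecture, and training of the policy (the reinforcement-learning part) is omitted.

```python
# Sketch of policy-driven early exiting at inference time: after each layer,
# a small policy head looks at the first-token hidden state and decides whether
# to stop and let that layer's classifier produce the prediction.
# Layer count, hidden size, and the 0.5 threshold are illustrative assumptions.
import torch
import torch.nn as nn

class EarlyExitModel(nn.Module):
    def __init__(self, num_layers=12, hidden=768, num_classes=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
            for _ in range(num_layers)
        )
        self.classifiers = nn.ModuleList(nn.Linear(hidden, num_classes) for _ in range(num_layers))
        self.policies = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(num_layers))

    @torch.no_grad()
    def infer(self, x, threshold=0.5):
        # x: (batch=1, seq_len, hidden) token embeddings
        for i, layer in enumerate(self.layers):
            x = layer(x)
            pooled = x[:, 0]                                  # first-token representation
            exit_prob = torch.sigmoid(self.policies[i](pooled))
            if exit_prob.item() > threshold or i == len(self.layers) - 1:
                return self.classifiers[i](pooled), i         # prediction and exit layer

model = EarlyExitModel()
logits, exit_layer = model.infer(torch.randn(1, 16, 768))
print(f"exited at layer {exit_layer}, logits shape {tuple(logits.shape)}")
```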