unintervened
Risk-Aware Distributional Intervention Policies for Language Models
Nguyen, Bao; Nguyen, Binh; Nguyen, Duy; Nguyen, Viet Anh
Language models occasionally produce undesirable generations, such as harmful or toxic content, despite their impressive ability to produce text that appears accurate and coherent. This paper presents a new two-stage approach that detects and mitigates undesirable generations by rectifying activations. First, we train an ensemble of layerwise classifiers to detect undesirable content from activations by minimizing a smooth surrogate of the risk-aware score. Then, for content detected as undesirable, we propose layerwise distributional intervention policies that minimally perturb the attention heads while probabilistically guaranteeing the effectiveness of the intervention. Benchmarks on several language models and datasets show that our method outperforms baselines in reducing the generation of undesirable output.
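To make the "perturb minimally" idea concrete, here is a minimal sketch, not the paper's method. It assumes a single linear probe (`w`, `b`) that flags an activation when its score is positive, and it replaces the distributional policy with a simple deterministic projection. The function name `minimal_shift`, the `margin` parameter, and the synthetic vectors are all hypothetical.

```python
import numpy as np

def minimal_shift(activation, w, b, margin=0.0):
    """Smallest L2 change that brings the probe score down to -margin.

    Assumption: score = w @ activation + b, and score > 0 means "flagged".
    This is a plain projection onto a half-space, used here only as a
    stand-in for the distributional intervention described in the abstract.
    """
    score = float(w @ activation + b)
    if score <= -margin:
        return activation.copy()  # already on the desirable side; leave as-is
    step = (score + margin) / float(w @ w)
    return activation - step * w  # move along -w, the shortest path to the boundary

rng = np.random.default_rng(0)
w = rng.normal(size=8)   # hypothetical probe weights
b = 0.1                  # hypothetical probe bias
a = w.copy()             # an activation deliberately placed on the flagged side
shifted = minimal_shift(a, w, b, margin=0.5)
new_score = float(w @ shifted + b)
```

Moving along `-w` is what makes the change minimal in the L2 sense: any other direction that reaches the same score requires a longer step.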
Language Models Represent Beliefs of Self and Others
Zhu, Wentao; Zhang, Zhining; Wang, Yizhou
Understanding and attributing mental states, known as Theory of Mind (ToM), is a fundamental capability for human social reasoning. While Large Language Models (LLMs) appear to possess certain ToM abilities, the mechanisms underlying these capabilities remain elusive. In this study, we find that the belief status from the perspectives of various agents can be linearly decoded from the neural activations of language models, indicating the existence of internal representations of self and others' beliefs. By manipulating these representations, we observe dramatic changes in the models' ToM performance, underscoring their pivotal role in the social reasoning process. Additionally, our findings extend to diverse social reasoning tasks that involve different causal inference patterns, suggesting the potential generalizability of these representations.
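"Linearly decode" here usually means fitting a linear probe on activations. The sketch below illustrates that general technique on synthetic data, under loud assumptions: the "activations" are random vectors with a planted direction, the labels stand in for a binary belief status, and the probe is a plain logistic regression trained by gradient descent. None of the names or numbers come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 16

# Synthetic stand-in for activations: Gaussian noise plus a planted
# "belief direction" whose sign depends on the (hypothetical) label.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + 3.0 * (2 * labels - 1)[:, None] * direction

train, test = slice(0, 300), slice(300, n)

# Linear probe: logistic regression fitted with batch gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(300):
    logits = acts[train] @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    err = probs - labels[train]
    w -= 0.5 * acts[train].T @ err / 300
    b -= 0.5 * err.mean()

pred = (acts[test] @ w + b > 0).astype(int)
accuracy = float((pred == labels[test]).mean())
# Cosine similarity between the learned probe and the planted direction.
alignment = float(w @ direction / np.linalg.norm(w))
```

A high held-out accuracy from such a probe shows only that the label is linearly readable from the vectors. Showing that the model uses the representation requires intervening on it, which is the purpose of the manipulation experiments mentioned in the abstract.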