suppression
Improved Representation Steering for Language Models
Steering methods for language models (LMs) seek to provide fine-grained and interpretable control over model generations by variously changing model inputs, weights, or representations to adjust behavior. Recent work has shown that adjusting weights or representations is often less effective than steering by prompting, for instance when wanting to introduce or suppress a particular concept. We demonstrate how to improve representation steering via our new Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that jointly does concept steering and suppression. We train three parameterizations of RePS and evaluate them on AXBENCH, a large-scale model steering benchmark. On Gemmamodels with sizes ranging from 2Bto 27B, RePS outperforms all existing steering methods trained with a language modeling objective and substantially narrows the gap with prompting - while promoting interpretability and minimizing parameter count. In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants while remaining resilient to prompt-based jailbreaking attacks that defeat prompting. Overall, our results suggest that RePS provides an interpretable and robust alternative to prompting for both steering and suppression.
ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation
Large language models (LLMs) integrated with retrieval-augmented generation (RAG) have improved factuality by grounding outputs in external evidence. However, they remain susceptible to unfaithful generation, where outputs contradict retrieved context despite its relevance and accuracy. Existing approaches aiming to improve faithfulness primarily focus on enhancing the utilization of external context, but often overlook the persistent influence of internal parametric knowledge during generation. In this work, we investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases. Building on this insight, we propose Parametric Knowledge Muting through FFNSuppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs and calibrating the model toward retrieved knowledge. To evaluate our approach, we introduce CoFaithfulQA, a benchmark specifically designed to evaluate faithfulness in scenarios where internal knowledge conflicts with accurate external evidence. Experimental results show that ParamMute significantly enhances faithfulness across both CoFaithfulQA and the established ConFiQA benchmark, achieving substantial reductions in reliance on parametric memory. These findings underscore the importance of mitigating internal knowledge dominance and provide a new direction for improving LLM trustworthiness in RAG.
Wisdom is Knowing What not to Say Hallucination Free LLMs Unlearning via Attention Shifting
The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of Large Language Models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs' reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning.
Improved Representation Steering for Language Models
Steering methods for language models (LMs) seek to provide fine-grained and interpretable control over model generations by variously changing model inputs, weights, or representations to adjust behavior. Recent work has shown that adjusting weights or representations is often less effective than steering by prompting, for instance when wanting to introduce or suppress a particular concept. We demonstrate how to improve representation steering via our new Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that jointly does concept steering and suppression. We train three parameterizations of RePS and evaluate them on AxBench, a large-scale model steering benchmark. On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective and substantially narrows the gap with prompting -- while promoting interpretability and minimizing parameter count. In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants while remaining resilient to prompt-based jailbreaking attacks that defeat prompting. Overall, our results suggest that RePS provides an interpretable and robust alternative to prompting for both steering and suppression.
Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting
The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs' reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning.
Context Dependence and Reliability in Autoregressive Language Models
Sengupta, Poushali, Pandey, Shashi Raj, Maharjan, Sabita, Eliassen, Frank
Large language models (LLMs) generate outputs by utilizing extensive context, which often includes redundant information from prompts, retrieved passages, and interaction history. In critical applications, it is vital to identify which context elements actually influence the output, as standard explanation methods struggle with redundancy and overlapping context. Minor changes in input can lead to unpredictable shifts in attribution scores, undermining interpretability and raising concerns about risks like prompt injection. This work addresses the challenge of distinguishing essential context elements from correlated ones. We introduce RISE (Redundancy-Insensitive Scoring of Explanation), a method that quantifies the unique influence of each input relative to others, minimizing the impact of redundancies and providing clearer, stable attributions. Experiments demonstrate that RISE offers more robust explanations than traditional methods, emphasizing the importance of conditional information for trustworthy LLM explanations and monitoring.
MUT3R: Motion-aware Updating Transformer for Dynamic 3D Reconstruction
Shen, Guole, Deng, Tianchen, Qin, Xingrui, Wang, Nailin, Wang, Jianyu, Wang, Yanbo, Chen, Yongtao, Wang, Hesheng, Wang, Jingchuan
Recent stateful recurrent neural networks have achieved remarkable progress on static 3D reconstruction but remain vulnerable to motion-induced artifacts, where non-rigid regions corrupt attention propagation between the spatial memory and image feature. By analyzing the internal behaviors of the state and image token updating mechanism, we find that aggregating self-attention maps across layers reveals a consistent pattern: dynamic regions are naturally down-weighted, exposing an implicit motion cue that the pretrained transformer already encodes but never explicitly uses. Motivated by this observation, we introduce MUT3R, a training-free framework that applies the attention-derived motion cue to suppress dynamic content in the early layers of the transformer during inference. Our attention-level gating module suppresses the influence of dynamic regions before their artifacts propagate through the feature hierarchy. Notably, we do not retrain or fine-tune the model; we let the pretrained transformer diagnose its own motion cues and correct itself. This early regulation stabilizes geometric reasoning in streaming scenarios and leads to improvements in temporal consistency and camera pose robustness across multiple dynamic benchmarks, offering a simple and training-free pathway toward motion-aware streaming reconstruction.
Defending the Edge: Representative-Attention Defense against Backdoor Attacks in Federated Learning
Obioma, Chibueze Peace, Sun, Youcheng, Mustafa, Mustafa A.
Federated learning (FL) remains highly vulnerable to adaptive backdoor attacks that preserve stealth by closely imitating benign update statistics. Existing defenses predominantly rely on anomaly detection in parameter or gradient space, overlooking behavioral constraints that backdoor attacks must satisfy to ensure reliable trigger activation. These anomaly-centric methods fail against adaptive attacks that normalize update magnitudes and mimic benign statistical patterns while preserving backdoor functionality, creating a fundamental detection gap. To address this limitation, this paper introduces FeRA (Federated Representative Attention) -- a novel attention-driven defense that shifts the detection paradigm from anomaly-centric to consistency-centric analysis. FeRA exploits the intrinsic need for backdoor persistence across training rounds, identifying malicious clients through suppressed representation-space variance, an orthogonal property to traditional magnitude-based statistics. The framework conducts multi-dimensional behavioral analysis combining spectral and spatial attention, directional alignment, mutual similarity, and norm inflation across two complementary detection mechanisms: consistency analysis and norm-inflation detection. Through this mechanism, FeRA isolates malicious clients that exhibit low-variance consistency or magnitude amplification. Extensive evaluation across six datasets, nine attacks, and three model architectures under both Independent and Identically Distributed (IID) and non-IID settings confirm FeRA achieves superior backdoor mitigation. Under different non-IID settings, FeRA achieved the lowest average Backdoor Accuracy (BA), about 1.67% while maintaining high clean accuracy compared to other state-of-the-art defenses. The code is available at https://github.com/Peatech/FeRA_defense.git.