Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone

Bărbălau, Antonio, Păduraru, Cristian Daniel, Poncu, Teodor, Tifrea, Alexandru, Burceanu, Elena

Dec-8-2025–arXiv.org Artificial Intelligence

Sparse Autoencoders (SAEs) are widely employed for mechanistic interpretability and model steering. Within this context, steering is by design performed by decoding altered SAE intermediate representations. In contrast to the existing literature, we put forward an encoder-centric alternative to model steering that demonstrates stronger cross-modal performance. We introduce S&P Top-K, a retraining-free and computationally lightweight Selection and Projection framework that identifies the Top-K encoder features aligned with a sensitive attribute or behavior, optionally aggregates them into a single control axis, and computes an orthogonal projection that is then applied directly in the model's native embedding space. In vision-language models, it improves fairness metrics on CelebA and FairFace by up to 3.2 times over conventional SAE usage, and in large language models, it substantially reduces aggressiveness and sycophancy in Llama-3 8B Instruct, achieving up to 3.6 times gains over masked reconstruction. These findings suggest that encoder-centric interventions provide a general, efficient, and more effective mechanism for shaping model behavior at inference time than the traditional decoder-centric use of SAEs.

Figure 1: Sample generation demonstrating behavioral steering interventions on Llama 3 8B Instruct prompted to produce a sycophantic opinion. We apply two Sparse Autoencoder (SAE)-based methods to remove sycophancy: the conventional decoder-centric Masked Reconstruction approach and our proposed encoder-centric S&P Top-K protocol. Lower LLM-as-a-judge sycophancy scores indicate superior mitigation of the targeted behavioral pattern. The results illustrate that conventional Masked Reconstruction fails to suppress sycophantic behavior, while our S&P Top-K intervention successfully redirects the model's output, eliminating direct praise, repeatedly deferring endorsement, and leading the model to ultimately employ laudatory language in a sarcastic manner that subverts the original sycophantic intent. The main steps of our approach are highlighted in green. We first employ a selection mechanism to identify relevant SAE features.
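The select-and-project idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the scoring rule (absolute Pearson correlation between feature activations and a binary attribute label), the ReLU encoder form, the sign-weighted sum used for aggregation, and all function and variable names (`select_top_k`, `projection_from_features`, `W_enc`, `b_enc`) are assumptions made for illustration, and the data is synthetic.

```python
import numpy as np


def select_top_k(X, y, W_enc, b_enc, k):
    """Rank SAE encoder features by |correlation| with attribute y (assumed criterion)."""
    acts = np.maximum(X @ W_enc.T + b_enc, 0.0)      # (n, n_features), ReLU encoder assumed
    acts_c = acts - acts.mean(axis=0)
    y_c = y - y.mean()
    denom = np.linalg.norm(acts_c, axis=0) * np.linalg.norm(y_c) + 1e-12
    corr = (acts_c.T @ y_c) / denom                   # dead features get correlation 0
    idx = np.argsort(-np.abs(corr))[:k]
    return idx, corr[idx]


def projection_from_features(W_enc, idx, signs=None, aggregate=True):
    """Build an orthogonal projector that removes the selected encoder directions."""
    rows = W_enc[idx]                                 # (k, d) encoder weight vectors
    if aggregate:
        # Collapse the k directions into one control axis (sign-weighted sum, assumed).
        if signs is None:
            signs = np.ones(len(idx))
        axis = (signs[:, None] * rows).sum(axis=0)
        basis = (axis / np.linalg.norm(axis))[:, None]  # (d, 1)
    else:
        # Orthonormal basis of the span of all k directions.
        basis, _ = np.linalg.qr(rows.T)               # (d, k)
    d = W_enc.shape[1]
    return np.eye(d) - basis @ basis.T


# Synthetic demo: embeddings with a planted attribute direction.
rng = np.random.default_rng(0)
n, d, n_features, k = 500, 16, 64, 5
W_enc = rng.normal(size=(n_features, d))
b_enc = np.zeros(n_features)
y = rng.integers(0, 2, size=n).astype(float)
direction = W_enc[0] / np.linalg.norm(W_enc[0])
X = rng.normal(size=(n, d)) + 3.0 * y[:, None] * direction

idx, corr = select_top_k(X, y, W_enc, b_enc, k)
P_axis = projection_from_features(W_enc, idx, signs=np.sign(corr), aggregate=True)
P_sub = projection_from_features(W_enc, idx, aggregate=False)

# The projection is applied directly to embeddings; no SAE decoding is involved.
X_clean = X @ P_axis
```

Both projectors are symmetric, so `X @ P` applies the same map as projecting each embedding vector individually. The single-axis variant removes one dimension from the embedding space, while the subspace variant removes up to `k`, which trades stronger removal against more distortion of unrelated content.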

large language model, machine learning, natural language, (15 more...)

Country:
- Europe > Switzerland (0.28)

Genre:
- Research Report > New Finding (0.66)

Industry:
- Health & Medicine > Therapeutic Area (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Neural Networks > Deep Learning (0.69)
    - Statistical Learning > Regression (0.46)
