Causal Interpretation of Sparse Autoencoder Features in Vision

Han, Sangyu, Kim, Yearim, Kwak, Nojun

arXiv.org Artificial Intelligence 

Understanding what sparse autoencoder (SAE) features in vision transformers truly represent is usually done by inspecting the patches where a feature's activation is highest. However, self-attention mixes information across the entire image, so an activated patch often co-occurs with--but does not cause--the feature's firing. We propose Causal Feature Explanation (CaFE), which leverages the Effective Receptive Field (ERF). We treat each activation of an SAE feature as a target and apply input-attribution methods to identify the image patches that causally drive that activation. Across CLIP-ViT features, ERF maps frequently diverge from naive activation maps, revealing hidden context dependencies (e.g., a "roaring face" feature that requires the co-occurrence of eyes and nose rather than merely an open mouth). Patch insertion tests confirm that CaFE recovers or suppresses feature activations more effectively than activation-ranked patches. Our results show that CaFE yields more faithful and semantically precise explanations of vision-SAE features, highlighting the risk of misinterpretation when relying solely on activation location.
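The core step, treating a single SAE feature activation as an attribution target, can be illustrated with a minimal sketch. The abstract does not fix a particular attribution method, so gradient-times-input is used here only as a stand-in. All names (`hook_module` for the module whose output the SAE was trained on, `sae_encode` for the SAE's encoder) are hypothetical assumptions, not the authors' API:

```python
import torch
import torch.nn as nn


@torch.enable_grad()
def erf_attribution(vit: nn.Module, hook_module: nn.Module, sae_encode,
                    image: torch.Tensor, token_idx: int, feature_idx: int) -> torch.Tensor:
    """Gradient-times-input attribution of one SAE feature activation w.r.t. the input image.

    Assumes `hook_module`'s forward output is the (plain tensor) representation the SAE
    encodes, and `sae_encode` maps it to [batch, n_tokens, n_features] feature activations.
    """
    image = image.clone().requires_grad_(True)   # [1, 3, H, W]
    cache = {}

    def hook(module, inputs, output):
        cache["h"] = output                      # activations the SAE was trained on

    handle = hook_module.register_forward_hook(hook)
    try:
        vit(image)                               # forward pass populates cache["h"]
    finally:
        handle.remove()

    z = sae_encode(cache["h"])                   # [1, n_tokens, n_features]
    target = z[0, token_idx, feature_idx]        # one SAE feature activation = the target
    target.backward()

    # Collapse gradient-times-input over colour channels into a coarse ERF-style saliency map.
    return (image.grad * image).sum(dim=1).abs().squeeze(0)   # [H, W]
```

A patch insertion test in the spirit of the abstract would then rank image patches by this map, insert the top-ranked patches into a baseline image, and check how much of the feature's activation is recovered compared with ranking patches by raw activation.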