AITopics

Neural Information Processing SystemsNov-15-2025, 07:16:57 GMT

Transcoders Find Interpretable LLM Feature Circuits

A key goal in mechanistic interpretability is circuit analysis: finding sparse sub-graphs of models corresponding to specific behaviors or capabilities.

attribution, transcoder, transcoder feature, (16 more...)

Country:

North America > United States > Connecticut > New Haven County > New Haven (0.04)
Europe > Monaco (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Ye, Mengyu, Suzuki, Jun, Inaba, Tatsuro, Kuribayashi, Tatsuki

Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders

arXiv.org Machine LearningOct-28-2025

Recent interpretability work on large language models (LLMs) has been increasingly dominated by a feature-discovery approach with the help of proxy modules. Then, the quality of features learned by, e.g., sparse auto-encoders (SAEs), is evaluated. This paradigm naturally raises a critical question: do such learned features have better properties than those already represented within the original model parameters, and unfortunately, only a few studies have made such comparisons systematically so far. In this work, we revisit the interpretability of feature vectors stored in feed-forward (FF) layers, given the perspective of FF as key-value memories, with modern interpretability benchmarks. Our extensive evaluation revealed that SAE and FFs exhibits a similar range of interpretability, although SAEs displayed an observable but minimal improvement in some aspects. Furthermore, in certain aspects, surprisingly, even vanilla FFs yielded better interpretability than the SAEs, and features discovered in SAEs and FFs diverged. These bring questions about the advantage of SAEs from both perspectives of feature quality and faithfulness, compared to directly interpreting FF feature vectors, and FF key-value parameters serve as a strong baseline in modern interpretability research.

large language model, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

2510.22332

Country: Asia > Japan (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Golimblevskaia, Elena, Jain, Aakriti, Puri, Bruno, Ibrahim, Ammar, Samek, Wojciech, Lapuschkin, Sebastian

Circuit Insights: Towards Interpretability Beyond Activations

arXiv.org Artificial IntelligenceOct-17-2025

The fields of explainable AI and mechanistic interpretability aim to uncover the internal structure of neural networks, with circuit discovery as a central tool for understanding model computations. Existing approaches, however, rely on manual inspection and remain limited to toy tasks. Automated interpretability offers scalability by analyzing isolated features and their activations, but it often misses interactions between features and depends strongly on external LLMs and dataset quality. Transcoders have recently made it possible to separate feature attributions into input-dependent and input-invariant components, providing a foundation for more systematic circuit analysis. Building on this, we propose WeightLens and CircuitLens, two complementary methods that go beyond activation-based analysis. WeightLens interprets features directly from their learned weights, removing the need for explainer models or datasets while matching or exceeding the performance of existing methods on context-independent features. CircuitLens captures how feature activations arise from interactions between components, revealing circuit-level dynamics that activation-only approaches cannot identify. Together, these methods increase interpretability robustness and enhance scalable mechanistic analysis of circuits while maintaining efficiency and quality.

large language model, machine learning, natural language, (20 more...)

2510.14936

Country: Europe > Austria (0.28)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Zhao, Zheng, Koishekenov, Yeskendir, Yang, Xianjun, Murray, Naila, Cancedda, Nicola

Verifying Chain-of-Thought Reasoning via Its Computational Graph

arXiv.org Artificial IntelligenceOct-13-2025

Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as execution traces of the model's latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a powerful signal of reasoning errors. Our white-box approach yields novel scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational; by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model's faulty reasoning. Our work shows that, by scrutinizing a model's computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.

large language model, machine learning, natural language, (18 more...)

2510.09312

Country: North America > United States (0.93)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.68)

Industry: Transportation (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Neural Information Processing SystemsOct-9-2025, 21:50:04 GMT

2b8f4db0464cc5b6e9d5e6bea4b9f308-Paper-Conference.pdf

attribution, transcoder, transcoder feature, (16 more...)

Country:

North America > United States > Connecticut > New Haven County > New Haven (0.04)
Europe > Monaco (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Quirke, Lucia, Shabalin, Stepan, Belrose, Nora

Binary Sparse Coding for Interpretability

arXiv.org Artificial IntelligenceOct-1-2025

Sparse autoencoders (SAEs) are used to decompose neural network activations into sparsely activating features, but many SAE features are only interpretable at high activation strengths. To address this issue we propose to use binary sparse autoencoders (BAEs) and binary transcoders (BTCs), which constrain all activations to be zero or one. We find that binarisation significantly improves the interpretability and monosemanticity of the discovered features, while increasing reconstruction error. By eliminating the distinction between high and low activation strengths, we prevent uninterpretable information from being smuggled in through the continuous variation in feature activations. However, we also find that binarisation increases the number of uninterpretable ultra-high frequency features, and when interpretability scores are frequency-adjusted, the scores for continuous sparse coders are slightly better than those of binary ones. This suggests that polysemanticity may be an ineliminable property of neural activations.

artificial intelligence, machine learning, natural language, (15 more...)

2509.25596

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Hosokawa, Sosuke, Kawakami, Toshiharu, Kodera, Satoshi, Ito, Masamichi, Takeda, Norihiko

Transcoder-based Circuit Analysis for Interpretable Single-Cell Foundation Models

arXiv.org Artificial IntelligenceSep-19-2025

Single-cell foundation models (scFMs) have demonstrated state-of-the-art performance on various tasks, such as cell-type annotation and perturbation response prediction, by learning gene regulatory networks from large-scale transcriptome data. However, a significant challenge remains: the decision-making processes of these models are less interpretable compared to traditional methods like differential gene expression analysis. Recently, transcoders have emerged as a promising approach for extracting interpretable decision circuits from large language models (LLMs). In this work, we train a transcoder on the cell2sentence (C2S) model, a state-of-the-art scFM. By leveraging the trained transcoder, we extract internal decision-making circuits from the C2S model. We demonstrate that the discovered circuits correspond to real-world biological mechanisms, confirming the potential of transcoders to uncover biologically plausible pathways within complex single-cell models.

large language model, machine learning, natural language, (18 more...)

2509.14723

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.18)

Genre: Research Report > Promising Solution (0.49)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

Neural Information Processing SystemsAug-17-2025, 04:06:05 GMT

A Data and

We tried removing and keeping the comments in the code from our training data. As shown in Table 6, keeping the comments gives better results overall. Detailed statistics of the resulting dataset can be found in Table 3. We give the size in GigaBytes, the number of files and functions, and the number of tokens. We show two versions of the same Python function and their common tokenization.

machine learning, programming language, python, (17 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.67)
Information Technology > Software > Programming Languages (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.47)

Neural Information Processing SystemsAug-17-2025, 04:05:58 GMT

Unsupervised Translation of Programming Languages

A transcompiler, transpiler, or source-to-source compiler, is a translator which converts between programming languages that operate at a similar level of abstraction.

machine learning, programming language, translation, (18 more...)

Country:

Oceania > Australia (0.04)
North America > Canada (0.04)
Europe > France (0.04)

Industry: Banking & Finance (0.68)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)