Collaborating Authors: Janiak, Jett


Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

arXiv.org Artificial Intelligence

Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful, i.e., it does not always reflect how models arrive at their conclusions. So far, most of these studies have focused on unfaithfulness in unnatural contexts where an explicit bias has been introduced. In contrast, we show that unfaithful CoT can occur on realistic prompts with no artificial bias. Our results reveal non-negligible rates of several forms of unfaithful reasoning in frontier models: Sonnet 3.7 (16.3%), DeepSeek R1 (5.3%), and ChatGPT-4o (7.0%) all answer a notable proportion of question pairs unfaithfully. Specifically, we find that models rationalize their implicit biases in answers to binary questions ("implicit post-hoc rationalization"). For example, when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We also investigate restoration errors (Dziri et al., 2023), where models make and then silently correct errors in their reasoning, and unfaithful shortcuts, where models use clearly illogical reasoning to simplify solving Putnam problems (a hard mathematics benchmark). Our findings raise challenges for AI safety work that relies on monitoring CoT to detect undesired behavior.
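
As a rough illustration of the consistency check this setup implies (not the paper's code), the sketch below compares a model's final answers to a reversed comparative question pair; `ask_model` is a hypothetical helper that returns the model's final "Yes"/"No" answer.

```python
# Minimal sketch: detecting the Yes/Yes or No/No contradictions described above.
# `ask_model` is a hypothetical callable mapping a prompt to a final "Yes"/"No".

def is_pair_consistent(ask_model, x: str, y: str) -> bool:
    """Return True iff the answers to the two reversed questions are logically compatible."""
    a1 = ask_model(f"Is {x} bigger than {y}? Answer Yes or No.")
    a2 = ask_model(f"Is {y} bigger than {x}? Answer Yes or No.")
    # Yes/Yes is always contradictory; No/No is only coherent if X and Y are
    # exactly equal, which comparative question pairs are chosen to rule out.
    return {a1, a2} == {"Yes", "No"}

# Example usage with a stubbed model that always rationalizes a "Yes":
always_yes = lambda prompt: "Yes"
assert not is_pair_consistent(always_yes, "the Nile", "the Amazon")
```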


Evaluating Synthetic Activations composed of SAE Latents in GPT-2

arXiv.org Artificial Intelligence

Sparse Auto-Encoders (SAEs) are commonly employed in mechanistic interpretability to decompose the residual stream into monosemantic SAE latents. Recent work demonstrates that perturbing a model's activations at an early layer results in a step-function-like change in the model's final-layer activations. Furthermore, the model's sensitivity to this perturbation differs between model-generated (real) activations and random activations. In our study, we use this sensitivity measure to compare real activations to synthetic activations composed of SAE latents. Our findings indicate that synthetic activations closely resemble real activations when we control for the sparsity and cosine similarity of the constituent SAE latents. This suggests that real activations cannot be explained by a simple "bag of SAE latents" lacking internal structure, and that the geometric and statistical properties of SAE latents carry significant information. Notably, our synthetic activations exhibit less pronounced activation plateaus than those typically surrounding real activations.
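
A minimal sketch of what a naive "bag of SAE latents" construction could look like, assuming an SAE decoder matrix `W_dec` of shape [n_latents, d_model]; the function name and sampling scheme are illustrative, not the paper's construction.

```python
# Illustrative sketch: building a synthetic activation as a sparse, non-negative
# combination of randomly chosen SAE decoder directions.
import torch

def make_synthetic_activation(W_dec: torch.Tensor,
                              n_active: int,
                              coeff_scale: float = 1.0) -> torch.Tensor:
    """Sample `n_active` SAE latents and combine their decoder directions."""
    n_latents, d_model = W_dec.shape
    idx = torch.randperm(n_latents)[:n_active]    # which latents are "on" (controls sparsity)
    coeffs = coeff_scale * torch.rand(n_active)   # non-negative latent activations
    return coeffs @ W_dec[idx]                    # [d_model] synthetic residual-stream vector

# To control for cosine similarity of the constituent latents, one could restrict
# `idx` to sets of latents whose decoder directions exceed a chosen similarity
# threshold instead of sampling uniformly at random.
```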


Characterizing stable regions in the residual stream of LLMs

arXiv.org Artificial Intelligence

We identify stable regions in the residual stream of Transformers, where the model's output remains insensitive to small activation changes but becomes highly sensitive at region boundaries. These regions emerge during training and become more defined as training progresses or model size increases. The regions appear to be much larger than previously studied polytopes. Our analysis suggests that these stable regions align with semantic distinctions: similar prompts cluster within regions, and activations from the same region lead to similar next-token predictions. This work offers a promising direction for understanding the complexity of neural networks, shedding light on training dynamics, and advancing interpretability.
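
One way to probe for such regions, sketched below under assumed interfaces rather than the paper's actual code, is to interpolate between two residual-stream activations at a fixed layer and position and track how far the next-token distribution moves: flat stretches suggest a stable region, sharp jumps suggest a boundary. `forward_from_activation` is a hypothetical helper that patches an activation into the model and returns final log-probabilities.

```python
# Illustrative sketch: sensitivity along an interpolation path between activations a and b.
import torch
import torch.nn.functional as F

def sensitivity_curve(forward_from_activation, a: torch.Tensor, b: torch.Tensor,
                      steps: int = 50) -> list[float]:
    base_logprobs = forward_from_activation(a)
    curve = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        patched = (1 - alpha) * a + alpha * b
        logprobs = forward_from_activation(patched)
        # Divergence from the unperturbed output distribution; near-zero values
        # indicate a plateau, abrupt increases indicate a region boundary.
        kl = F.kl_div(logprobs, base_logprobs, log_target=True, reduction="sum")
        curve.append(kl.item())
    return curve
```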


An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l

arXiv.org Artificial Intelligence

In recent years, large language models (LLMs) have made impressive gains in capability (Vaswani et al. 2017; Devlin et al. 2019; OpenAI 2023; Radford et al. 2019; Brown et al. 2020), often surpassing expectations (Wei et al. 2022). However, these models remain poorly understood, with their successes and failures largely unexplained. Understanding what LLMs learn and how they generate predictions is therefore an increasingly urgent scientific and practical challenge. Mechanistic interpretability (MI) aims to reverse-engineer models into human-understandable algorithms or circuits (Geiger et al. 2021; Olah 2022; Wang et al. 2022), attempting to avoid pitfalls such as illusory understanding. With MI, we can identify and fix model errors (Vig et al. 2020; Hernandez et al. 2022; Meng et al. 2023; Hase et al. 2023), steer model outputs (Li et al. 2023), and explain emergent behaviors (Nanda et al. 2023; Barak et al. 2023). The central goals of MI are (a) localization: identifying the specific model components (attention heads, MLP layers) that a circuit is composed of; and (b) explaining the behavior of these components.
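
For context, a bare-bones sketch of direct logit attribution (DLA) itself, with illustrative names and shapes and the final LayerNorm ignored: each component's additive write to the residual stream is projected onto the unembedding direction of a target token. The paper's title points to memory management, where components write and later erase directions, as a setting in which such per-component attributions can mislead.

```python
# Minimal sketch of direct logit attribution (simplified; not the paper's code).
# `component_outputs`: each component's additive contribution to the final residual
# stream at one position, shape [n_components, d_model].
# `W_U`: the unembedding matrix, shape [d_model, d_vocab].
import torch

def direct_logit_attribution(component_outputs: torch.Tensor,
                             W_U: torch.Tensor,
                             token_id: int) -> torch.Tensor:
    """Per-component direct contribution to the logit of `token_id`."""
    logit_direction = W_U[:, token_id]            # [d_model]
    return component_outputs @ logit_direction    # [n_components]
```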