Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders
O'Neill, Charles, Jayasekara, Mudith, Kirkby, Max
Sparse autoencoders (SAEs) decompose large language model (LLM) activations into latent features that reveal mechanistic structure. Conventional SAEs train on broad data distributions, forcing a fixed latent budget to capture only high-frequency, generic patterns. This often results in significant linear ``dark matter'' in reconstruction error and produces latents that fragment or absorb each other, complicating interpretation. We show that restricting SAE training to a well-defined domain (medical text) reallocates capacity to domain-specific features, improving both reconstruction fidelity and interpretability. Training JumpReLU SAEs on layer-20 activations of Gemma-2 models using 195k clinical QA examples, we find that domain-confined SAEs explain up to 20\% more variance, achieve higher loss recovery, and reduce linear residual error compared to broad-domain SAEs. Automated and human evaluations confirm that learned features align with clinically meaningful concepts (e.g., ``taste sensations'' or ``infectious mononucleosis''), rather than frequent but uninformative tokens. These domain-specific SAEs capture relevant linear structure, leaving a smaller, more purely nonlinear residual. We conclude that domain-confinement mitigates key limitations of broad-domain SAEs, enabling more complete and interpretable latent decompositions, and suggesting the field may need to question ``foundation-model'' scaling for general-purpose SAEs.
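The JumpReLU SAE described in the abstract can be summarized in a minimal NumPy sketch (illustrative only, not the authors' code; sizes, the threshold value `theta`, and the random weights are all assumptions — real SAEs are trained, far wider, and use a sparsity penalty):

```python
import numpy as np

def jumprelu(z, theta):
    # JumpReLU: keep a pre-activation only where it exceeds a (normally
    # learned) threshold theta; zero it elsewhere, which enforces sparsity.
    return np.where(z > theta, z, 0.0)

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, theta):
    # Encode LLM activations into sparse latent features, then reconstruct.
    f = jumprelu(x @ W_enc + b_enc, theta)   # sparse latent features
    x_hat = f @ W_dec + b_dec                # linear reconstruction
    return f, x_hat

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64                      # toy sizes for illustration
x = rng.normal(size=(8, d_model))            # stand-in for layer activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
f, x_hat = sae_forward(x, W_enc, np.zeros(d_sae),
                       W_dec, np.zeros(d_model), theta=0.05)
print(f.shape, x_hat.shape)                  # latent and reconstruction shapes
```

The reconstruction error `x - x_hat` is the residual the abstract calls linear "dark matter" when it still contains linearly decodable structure.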
Resurrecting saturated LLM benchmarks with adversarial encoding
Multiple-choice benchmarks show that Large Language Models (LLMs) excel in many knowledge domains. While LLMs often surpass human performance, recent studies reveal important limitations. For example, the GSM-Symbolic benchmark (Mirzadeh et al., 2024) shows that minor changes in mathematical questions significantly worsen model performance. This suggests LLMs rely on pattern-matching rather than formal reasoning, making them struggle with unfamiliar problem formats. LLMs may also show inconsistent factual recall, performing better under some conditions than others. For example, they often perform worse when presented with multiple tasks simultaneously (Wang, Kodner, & Rambow, 2024). We examine LLM knowledge robustness by testing how well models answer paired questions from multiple-choice benchmarks, and use the identified weaknesses to create a more challenging version of the MMLU benchmark.
Reviews: Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice
The article focuses on understanding the learning dynamics of deep neural networks as a function of both the activation functions used at the different layers and the way the weights are initialized. It is mainly a theoretical paper, with experiments that confirm the theoretical study. The core contribution builds on random matrix theory. In the first section, the paper describes the setup -- a deep neural network as a sequence of layers -- and the tools that will be used to study its dynamics. The analysis relies mainly on the singular value density of the input-output Jacobian, this density being computed by a four-step method proposed in the article.
Resurrecting loved ones as AI 'ghosts' could harm your mental health
Could your loved one be reborn as an AI? Resurrecting deceased loved ones using artificial intelligence could harm mental health, create dependence on the technology and even spur a new religion, researchers have warned. AI chatbots trained on text from the internet have become ever more capable and convincing in recent years.
Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice
Pennington, Jeffrey, Schoenholz, Samuel, Ganguli, Surya
It is well known that weight initialization in deep networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is O(1) is essential for avoiding exponentially vanishing or exploding gradients. Moreover, in deep linear networks, ensuring that all singular values of the Jacobian are concentrated near 1 can yield a dramatic additional speed-up in learning; this is a property known as dynamical isometry. However, it is unclear how to achieve dynamical isometry in nonlinear deep networks. We address this question by employing powerful tools from free probability theory to analytically compute the {\it entire} singular value distribution of a deep network's input-output Jacobian. We explore the dependence of the singular value distribution on the depth of the network, the weight initialization, and the choice of nonlinearity.
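For a deep *linear* network the input-output Jacobian is simply the product of the weight matrices, so dynamical isometry can be checked directly. A minimal NumPy sketch (illustrative; depth, width, and scalings are assumptions, not the paper's experiments) contrasts orthogonal and Gaussian initialization:

```python
import numpy as np

def jacobian_singular_values(depth, width, init, rng):
    # In a deep linear network the input-output Jacobian is the product
    # of the layer weight matrices, so we can form it explicitly.
    J = np.eye(width)
    for _ in range(depth):
        if init == "orthogonal":
            # QR decomposition of a Gaussian matrix gives a random
            # orthogonal matrix: all singular values are exactly 1.
            W, _ = np.linalg.qr(rng.normal(size=(width, width)))
        else:
            # Gaussian init scaled so the *mean squared* singular value
            # is O(1), which alone does not prevent spreading with depth.
            W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)

rng = np.random.default_rng(0)
s_orth = jacobian_singular_values(depth=50, width=128, init="orthogonal", rng=rng)
s_gauss = jacobian_singular_values(depth=50, width=128, init="gaussian", rng=rng)
print("orthogonal:", s_orth.min(), s_orth.max())   # all concentrated at 1
print("gaussian:  ", s_gauss.min(), s_gauss.max()) # spread grows with depth
```

Products of orthogonal matrices keep every Jacobian singular value at 1 (dynamical isometry), while Gaussian products spread their spectrum with depth even though the mean squared singular value stays O(1); the paper's contribution is extending this analysis to nonlinear networks via free probability.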