The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Sutter, Denis, Minder, Julian, Hofmann, Thomas, Pimentel, Tiago
The concept of causal abstraction was recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that, under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., in an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed on alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps' complexity and accuracy. Together, these results suggest an answer to our title's question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.
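For readers unfamiliar with the interchange-intervention accuracy (IIA) metric the abstract refers to, the sketch below shows how such a score is typically computed. It is a minimal illustration, not the authors' code: the model, algorithm, and alignment-map interfaces (align, unalign, intervene, hidden_override) are hypothetical placeholders.

```python
# Minimal sketch of interchange-intervention accuracy (IIA). All interfaces
# below (align, unalign, intervene=..., hidden_override=...) are hypothetical
# placeholders for whatever model/algorithm wrappers are actually used.

def interchange_intervention_accuracy(model, algorithm, align, unalign, pairs):
    """Fraction of (base, source) pairs for which patching the aligned
    representation from the source run into the base run makes the model's
    output match the algorithm's counterfactual output.

    model(x)            -> (hidden, output)   # the neural network
    algorithm(x, ...)   -> (value, output)    # the high-level causal model
    align(hidden)       -> value              # alignment map (possibly non-linear)
    unalign(hidden, v)  -> hidden'            # write a value back into the state
    """
    hits = 0
    for base, source in pairs:
        # Counterfactual behaviour of the algorithm: run it on the base input
        # with the high-level variable overwritten by its value on the source.
        source_value, _ = algorithm(source)
        _, expected = algorithm(base, intervene=source_value)

        # Mirror the intervention inside the network via the alignment map.
        source_hidden, _ = model(source)
        base_hidden, _ = model(base)
        patched = unalign(base_hidden, align(source_hidden))
        _, actual = model(base, hidden_override=patched)

        hits += int(actual == expected)
    return hits / len(pairs)
```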
On the Effectiveness and Generalization of Race Representations for Debiasing High-Stakes Decisions
Understanding and mitigating biases is critical for the adoption of large language models (LLMs) in high-stakes decision-making. We introduce Admissions and Hiring, decision tasks with hypothetical applicant profiles where a person's race can be inferred from their name, as simplified test beds for racial bias. We show that Gemma 2B Instruct and LLaMA 3.2 3B Instruct exhibit strong biases. Gemma grants admission to 26% more White than Black applicants, and LLaMA hires 60% more Asian than White applicants. We demonstrate that these biases are resistant to prompt engineering: multiple prompting strategies all fail to promote fairness. In contrast, using distributed alignment search, we can identify "race subspaces" within model activations and intervene on them to debias model decisions. Averaging the representation across all races within the subspaces reduces Gemma's bias by 37-57%. Finally, we examine the generalizability of Gemma's race subspaces and find limited evidence for generalization: changing the prompt format can affect the race representation. Our work suggests that mechanistic approaches may provide a promising avenue for improving the fairness of LLMs, but a universal race representation remains elusive.
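As a rough illustration of the subspace intervention the abstract describes (averaging representations across races inside an identified "race subspace"), here is a small numpy sketch. The basis here is random, purely to show the shapes; in the paper it would come from something like distributed alignment search, and this is not the authors' code.

```python
# Illustrative sketch of averaging activations within a low-dimensional
# "race subspace"; the basis below is made up for demonstration only.
import numpy as np

def debias_in_subspace(activations: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """activations: (n_profiles, d) hidden vectors for race-swapped applicant profiles.
    basis: (k, d) orthonormal rows spanning the identified race subspace."""
    coords = activations @ basis.T                      # (n, k) subspace coordinates
    mean_coords = coords.mean(axis=0, keepdims=True)    # average across races
    # Remove each profile's own component in the subspace and insert the shared mean.
    return activations - coords @ basis + mean_coords @ basis

# Toy usage with random data, purely to show the shapes involved.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 16))                         # e.g. 4 race-swapped profiles
subspace, _ = np.linalg.qr(rng.normal(size=(16, 2)))    # a made-up 2-D subspace
debiased = debias_in_subspace(acts, subspace.T)
```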
Punctuation and Predicates in Language Models
Chauhan, Sonakshi, Chaudhary, Maheep, Choy, Koby, Nellessen, Samuel, Schoots, Nandi
In this paper we explore where information is collected and how it is propagated throughout layers in large language models (LLMs). We begin by examining the surprising computational importance of punctuation tokens, which previous work has identified as attention sinks and memory aids. Using intervention-based techniques, we evaluate the necessity and sufficiency (for preserving model performance) of punctuation tokens across layers in GPT-2, DeepSeek, and Gemma. Our results show stark model-specific differences: for GPT-2, punctuation is both necessary and sufficient in multiple layers, while this holds far less in DeepSeek and not at all in Gemma. Extending beyond punctuation, we ask whether LLMs process different components of input (e.g., subjects, adjectives, punctuation, full sentences) by forming early static summaries reused across the network, or whether the model remains sensitive to changes in these components across layers. We further investigate whether different reasoning rules are processed differently by LLMs. In particular, through interchange intervention and layer-swapping experiments, we find that conditional statements (if, then) and universal quantification (for all) are processed very differently. Our findings offer new insight into the internal mechanisms of punctuation usage and reasoning in LLMs and have implications for interpretability.
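The necessity and sufficiency tests mentioned above amount to ablating either the punctuation positions or everything except them at a given layer and measuring the change in performance. Below is a schematic, framework-free sketch of that masking step; mean-ablation is one common choice, and the paper's exact ablation and hook machinery may differ.

```python
# Schematic sketch of necessity/sufficiency ablations on punctuation positions.
import numpy as np

def mean_ablate(hidden: np.ndarray, keep_mask: np.ndarray) -> np.ndarray:
    """hidden: (seq_len, d) activations at one layer; keep_mask: (seq_len,) booleans.
    Positions outside keep_mask are overwritten with the mean activation,
    removing their token-specific information while keeping the layer's scale."""
    out = hidden.copy()
    out[~keep_mask] = hidden.mean(axis=0)
    return out

def necessity_and_sufficiency_inputs(hidden: np.ndarray, punct_mask: np.ndarray):
    # Necessity test: remove punctuation information and measure the performance drop.
    without_punctuation = mean_ablate(hidden, keep_mask=~punct_mask)
    # Sufficiency test: keep only punctuation information and measure what remains.
    only_punctuation = mean_ablate(hidden, keep_mask=punct_mask)
    return without_punctuation, only_punctuation
```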
Emergent Symbol-like Number Variables in Artificial Neural Networks
Grant, Satchel, Goodman, Noah D., McClelland, James L.
What types of numeric representations emerge in Neural Networks (NNs)? To what degree do NNs induce abstract, mutable, slot-like numeric variables, and in what situations do these representations emerge? How do these representations change over learning, and how can we understand the neural implementations in ways that are unified across different NNs? In this work, we approach these questions by first training sequence-based neural systems using Next Token Prediction (NTP) objectives on numeric tasks. We then seek to understand the neural solutions through the lens of causal abstractions or symbolic algorithms. We use a combination of causal interventions and visualization methods to find that artificial neural models do indeed develop analogs of interchangeable, mutable, latent number variables purely from the NTP objective. We then ask how variations on the tasks and model architectures affect the models' learned solutions, finding that these symbol-like numeric representations do not form for every variant of the task, and that transformers solve the problem in a notably different way than their recurrent counterparts. We then show how the symbol-like variables change over the course of training, finding a strong correlation between the models' task performance and the alignment of their symbol-like representations. Lastly, we show that in all cases, some degree of gradience exists in these neural symbols, highlighting the difficulty of finding simple, interpretable symbolic stories of how neural networks perform numeric tasks. Taken together, our results are consistent with the view that neural networks can approximate interpretable symbolic programs of number cognition, but the particular program they approximate and the extent to which they approximate it can vary widely, depending on the network architecture, training data, extent of training, and network size.
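To make "numeric tasks under a next-token-prediction objective" concrete, here is a toy example of our own construction (not one of the paper's actual tasks): sequences in which a latent count must be tracked and then emitted as the next token, so solving the task requires something like a mutable number variable.

```python
# Toy numeric NTP task (our own invented example, not the paper's datasets):
# the model sees n copies of "item" and must predict the count token after "=".
import random

DIGITS = [str(i) for i in range(10)]
VOCAB = ["<bos>", "item", "=", "<eos>"] + DIGITS

def make_example(max_items: int = 9):
    n = random.randint(1, max_items)
    tokens = ["<bos>"] + ["item"] * n + ["=", str(n), "<eos>"]
    # Next-token prediction: at each position t the training target is
    # tokens[t + 1]; predicting correctly after "=" requires tracking the count.
    inputs, targets = tokens[:-1], tokens[1:]
    return inputs, targets

print(make_example())
```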
RLHF and IIA: Perverse Incentives
Xu, Wanqiao, Dong, Shi, Lu, Xiuyuan, Lam, Grace, Wen, Zheng, Van Roy, Benjamin
Modern generative AIs ingest trillions of data bytes from the World Wide Web to produce a large pretrained model. Trained to imitate what is observed, this model represents an agglomeration of behaviors, some of which are more or less desirable to mimic. Further training through human interaction, even on fewer than a hundred thousand bits of data, has proven to greatly enhance usefulness and safety, enabling the remarkable AIs we have today. This process of reinforcement learning from human feedback (RLHF) steers AIs toward the more desirable among behaviors observed during pretraining. While AIs now routinely generate drawings, music, speech, and computer code, the text-based chatbot remains an emblematic artifact.
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
Wu, Zhengxuan, Geiger, Atticus, Huang, Jing, Arora, Aryaman, Icard, Thomas, Potts, Christopher, Goodman, Noah D.
We respond to the recent paper by Makelov et al. (2023), which reviews subspace interchange intervention methods like distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods potentially cause "interpretability illusions". We first review Makelov et al. (2023)'s technical notion of what an "interpretability illusion" is, and then we show that even intuitive and desirable explanations can qualify as illusions in this sense. As a result, their method of discovering "illusions" can reject explanations they consider "non-illusory". We then argue that the illusions Makelov et al. (2023) see in practice are artifacts of their training and evaluation paradigms. We close by emphasizing that, though we disagree with their core characterization, Makelov et al. (2023)'s examples and discussion have undoubtedly pushed the field of interpretability forward.
Visual Explanations via Iterated Integrated Attributions
Barkan, Oren, Elisha, Yehonatan, Asher, Yuval, Eshel, Amit, Koenigstein, Noam
We introduce Iterated Integrated Attributions (IIA) - a generic method for explaining the predictions of vision models. IIA employs iterative integration across the input image, the internal representations generated by the model, and their gradients, yielding precise and focused explanation maps. We demonstrate the effectiveness of IIA through comprehensive evaluations across various tasks, datasets, and network architectures. Our results show that IIA produces accurate explanation maps, outperforming other state-of-the-art explanation techniques.
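For context, the sketch below shows plain integrated gradients, the attribution scheme that IIA generalises by iterating the integration over internal representations and their gradients; that iterated part is not reproduced here, and grad_fn is a placeholder for the model's gradient with respect to its input.

```python
# Plain integrated gradients (Riemann-sum approximation), shown for context;
# IIA extends this by iterating over internal representations and gradients.
import numpy as np

def integrated_gradients(grad_fn, x, baseline=None, steps=50):
    """grad_fn(z) -> dF/dz for the model output being explained (placeholder);
    x and baseline are images (arrays of the same shape)."""
    baseline = np.zeros_like(x) if baseline is None else baseline
    total = np.zeros_like(x)
    for alpha in np.linspace(0.0, 1.0, steps):
        total += grad_fn(baseline + alpha * (x - baseline))
    # Attribution = (x - baseline) * average gradient along the straight-line path.
    return (x - baseline) * total / steps
```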
Axioms for Defeat in Democratic Elections
Holliday, Wesley H., Pacuit, Eric
We propose six axioms concerning when one candidate should defeat another in a democratic election involving two or more candidates. Five of the axioms are widely satisfied by known voting procedures. The sixth axiom is a weakening of Kenneth Arrow's famous condition of the Independence of Irrelevant Alternatives (IIA). We call this weakening Coherent IIA. We prove that the five axioms plus Coherent IIA single out a method of determining defeats studied in our recent work: Split Cycle. In particular, Split Cycle provides the most resolute definition of defeat among all methods satisfying the six axioms for democratic defeat. In addition, we analyze how Split Cycle escapes Arrow's Impossibility Theorem and related impossibility results.
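A compact way to compute Split Cycle defeats uses the path-based characterisation from the authors' earlier work: a defeats b iff margin(a, b) > 0 and there is no chain of majority wins from b back to a in which every margin is at least margin(a, b) (equivalently, a-to-b is never a weakest edge of a majority cycle). The sketch below is our own simplification under that reading, not the authors' reference implementation.

```python
# Hedged sketch of the Split Cycle defeat relation via the path-based test.
from itertools import permutations

def margins(profile, candidates):
    """profile: list of strict rankings (each a list of candidates, best first)."""
    m = {(a, b): 0 for a, b in permutations(candidates, 2)}
    for ranking in profile:
        pos = {c: i for i, c in enumerate(ranking)}
        for a, b in permutations(candidates, 2):
            if pos[a] < pos[b]:
                m[(a, b)] += 1
                m[(b, a)] -= 1
    return m

def split_cycle_defeats(profile, candidates):
    m = margins(profile, candidates)
    defeats = set()
    for a, b in permutations(candidates, 2):
        if m[(a, b)] <= 0:
            continue
        # Search from b for a path back to a whose edges all have margin at
        # least margin(a, b); if one exists, the edge a -> b is split, not a defeat.
        stack, seen, found = [b], {b}, False
        while stack:
            x = stack.pop()
            if x == a:
                found = True
                break
            for y in candidates:
                if y not in seen and m[(x, y)] > 0 and m[(x, y)] >= m[(a, b)]:
                    seen.add(y)
                    stack.append(y)
        if not found:
            defeats.add((a, b))
    return defeats

# Toy example: a majority cycle a > b > c > a with unequal margins; the weakest
# win (c over a, margin 1) is split, leaving a defeats b and b defeats c.
profile = [["a", "b", "c"]] * 3 + [["b", "c", "a"]] * 2 + [["c", "a", "b"]] * 2
print(split_cycle_defeats(profile, ["a", "b", "c"]))
```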