The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Sutter, Denis, Minder, Julian, Hofmann, Thomas, Pimentel, Tiago
The concept of causal abstraction was recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that, under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., in an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed on alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps' complexity and accuracy. Together, these results suggest an answer to our title's question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.
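For readers unfamiliar with the interchange-intervention accuracy (IIA) metric the abstract refers to, the sketch below shows how such a score is typically computed. It is a minimal illustration, not the authors' code: the model, algorithm, and alignment-map interfaces (align, unalign, intervene, hidden_override) are hypothetical placeholders.

```python
# Minimal sketch of interchange-intervention accuracy (IIA). All interfaces
# below (align, unalign, intervene=..., hidden_override=...) are hypothetical
# placeholders for whatever model/algorithm wrappers are actually used.

def interchange_intervention_accuracy(model, algorithm, align, unalign, pairs):
    """Fraction of (base, source) pairs for which patching the aligned
    representation from the source run into the base run makes the model's
    output match the algorithm's counterfactual output.

    model(x)            -> (hidden, output)   # the neural network
    algorithm(x, ...)   -> (value, output)    # the high-level causal model
    align(hidden)       -> value              # alignment map (possibly non-linear)
    unalign(hidden, v)  -> hidden'            # write a value back into the state
    """
    hits = 0
    for base, source in pairs:
        # Counterfactual behaviour of the algorithm: run it on the base input
        # with the high-level variable overwritten by its value on the source.
        source_value, _ = algorithm(source)
        _, expected = algorithm(base, intervene=source_value)

        # Mirror the intervention inside the network via the alignment map.
        source_hidden, _ = model(source)
        base_hidden, _ = model(base)
        patched = unalign(base_hidden, align(source_hidden))
        _, actual = model(base, hidden_override=patched)

        hits += int(actual == expected)
    return hits / len(pairs)
```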
On the Effectiveness and Generalization of Race Representations for Debiasing High-Stakes Decisions
Understanding and mitigating biases is critical for the adoption of large language models (LLMs) in high-stakes decision-making. We introduce Admissions and Hiring, decision tasks with hypothetical applicant profiles where a person's race can be inferred from their name, as simplified test beds for racial bias. We show that Gemma 2B Instruct and LLaMA 3.2 3B Instruct exhibit strong biases. Gemma grants admission to 26% more White than Black applicants, and LLaMA hires 60% more Asian than White applicants. We demonstrate that these biases are resistant to prompt engineering: multiple prompting strategies all fail to promote fairness. In contrast, using distributed alignment search, we can identify "race subspaces" within model activations and intervene on them to debias model decisions. Averaging the representation across all races within the subspaces reduces Gemma's bias by 37-57%. Finally, we examine the generalizability of Gemma's race subspaces and find limited evidence for generalization: changing the prompt format can affect the race representation. Our work suggests that mechanistic approaches may provide a promising avenue for improving the fairness of LLMs, but a universal race representation remains elusive.
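As a rough illustration of the subspace intervention the abstract describes (averaging representations across races inside an identified "race subspace"), here is a small numpy sketch. The basis here is random, purely to show the shapes; in the paper it would come from something like distributed alignment search, and this is not the authors' code.

```python
# Illustrative sketch of averaging activations within a low-dimensional
# "race subspace"; the basis below is made up for demonstration only.
import numpy as np

def debias_in_subspace(activations: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """activations: (n_profiles, d) hidden vectors for race-swapped applicant profiles.
    basis: (k, d) orthonormal rows spanning the identified race subspace."""
    coords = activations @ basis.T                      # (n, k) subspace coordinates
    mean_coords = coords.mean(axis=0, keepdims=True)    # average across races
    # Remove each profile's own component in the subspace and insert the shared mean.
    return activations - coords @ basis + mean_coords @ basis

# Toy usage with random data, purely to show the shapes involved.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 16))                         # e.g. 4 race-swapped profiles
subspace, _ = np.linalg.qr(rng.normal(size=(16, 2)))    # a made-up 2-D subspace
debiased = debias_in_subspace(acts, subspace.T)
```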
Punctuation and Predicates in Language Models
Chauhan, Sonakshi, Chaudhary, Maheep, Choy, Koby, Nellessen, Samuel, Schoots, Nandi
In this paper we explore where information is collected and how it is propagated throughout layers in large language models (LLMs). We begin by examining the surprising computational importance of punctuation tokens, which previous work has identified as attention sinks and memory aids. Using intervention-based techniques, we evaluate the necessity and sufficiency (for preserving model performance) of punctuation tokens across layers in GPT-2, DeepSeek, and Gemma. Our results show stark model-specific differences: for GPT-2, punctuation is both necessary and sufficient in multiple layers, while this holds far less in DeepSeek and not at all in Gemma. Extending beyond punctuation, we ask whether LLMs process different components of input (e.g., subjects, adjectives, punctuation, full sentences) by forming early static summaries reused across the network, or whether the model remains sensitive to changes in these components across layers. We further investigate whether different reasoning rules are processed differently by LLMs. In particular, through interchange intervention and layer-swapping experiments, we find that conditional statements (if, then) and universal quantification (for all) are processed very differently. Our findings offer new insight into the internal mechanisms of punctuation usage and reasoning in LLMs and have implications for interpretability.
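The necessity and sufficiency tests mentioned above amount to ablating either the punctuation positions or everything except them at a given layer and measuring the change in performance. Below is a schematic, framework-free sketch of that masking step; mean-ablation is one common choice, and the paper's exact ablation and hook machinery may differ.

```python
# Schematic sketch of necessity/sufficiency ablations on punctuation positions.
import numpy as np

def mean_ablate(hidden: np.ndarray, keep_mask: np.ndarray) -> np.ndarray:
    """hidden: (seq_len, d) activations at one layer; keep_mask: (seq_len,) booleans.
    Positions outside keep_mask are overwritten with the mean activation,
    removing their token-specific information while keeping the layer's scale."""
    out = hidden.copy()
    out[~keep_mask] = hidden.mean(axis=0)
    return out

def necessity_and_sufficiency_inputs(hidden: np.ndarray, punct_mask: np.ndarray):
    # Necessity test: remove punctuation information and measure the performance drop.
    without_punctuation = mean_ablate(hidden, keep_mask=~punct_mask)
    # Sufficiency test: keep only punctuation information and measure what remains.
    only_punctuation = mean_ablate(hidden, keep_mask=punct_mask)
    return without_punctuation, only_punctuation
```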
Emergent Symbol-like Number Variables in Artificial Neural Networks
Grant, Satchel, Goodman, Noah D., McClelland, James L.
What types of numeric representations emerge in Neural Networks (NNs)? To what degree do NNs induce abstract, mutable, slot-like numeric variables, and in what situations do these representations emerge? How do these representations change over learning, and how can we understand the neural implementations in ways that are unified across different NNs? In this work, we approach these questions by first training sequence-based neural systems using Next Token Prediction (NTP) objectives on numeric tasks. We then seek to understand the neural solutions through the lens of causal abstractions or symbolic algorithms. We use a combination of causal interventions and visualization methods to find that artificial neural models do indeed develop analogs of interchangeable, mutable, latent number variables purely from the NTP objective. We then ask how variations on the tasks and model architectures affect the models' learned solutions, finding that these symbol-like numeric representations do not form for every variant of the task, and that transformers solve the problem in a notably different way than their recurrent counterparts. We then show how the symbol-like variables change over the course of training, finding a strong correlation between the models' task performance and the alignment of their symbol-like representations. Lastly, we show that in all cases, some degree of gradience exists in these neural symbols, highlighting the difficulty of finding simple, interpretable symbolic stories of how neural networks perform numeric tasks. Taken together, our results are consistent with the view that neural networks can approximate interpretable symbolic programs of number cognition, but the particular program they approximate and the extent to which they approximate it can vary widely, depending on the network architecture, training data, extent of training, and network size.
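To make "numeric tasks under a next-token-prediction objective" concrete, here is a toy example of our own construction (not one of the paper's actual tasks): sequences in which a latent count must be tracked and then emitted as the next token, so solving the task requires something like a mutable number variable.

```python
# Toy numeric NTP task (our own invented example, not the paper's datasets):
# the model sees n copies of "item" and must predict the count token after "=".
import random

DIGITS = [str(i) for i in range(10)]
VOCAB = ["<bos>", "item", "=", "<eos>"] + DIGITS

def make_example(max_items: int = 9):
    n = random.randint(1, max_items)
    tokens = ["<bos>"] + ["item"] * n + ["=", str(n), "<eos>"]
    # Next-token prediction: at each position t the training target is
    # tokens[t + 1]; predicting correctly after "=" requires tracking the count.
    inputs, targets = tokens[:-1], tokens[1:]
    return inputs, targets

print(make_example())
```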
RLHF and IIA: Perverse Incentives
Xu, Wanqiao, Dong, Shi, Lu, Xiuyuan, Lam, Grace, Wen, Zheng, Van Roy, Benjamin
Modern generative AIs ingest trillions of data bytes from the World Wide Web to produce a large pretrained model. Trained to imitate what is observed, this model represents an agglomeration of behaviors, some of which are more or less desirable to mimic. Further training through human interaction, even on fewer than a hundred thousand bits of data, has proven to greatly enhance usefulness and safety, enabling the remarkable AIs we have today. This process of reinforcement learning from human feedback (RLHF) steers AIs toward the more desirable among behaviors observed during pretraining. While AIs now routinely generate drawings, music, speech, and computer code, the text-based chatbot remains an emblematic artifact.
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
Wu, Zhengxuan, Geiger, Atticus, Huang, Jing, Arora, Aryaman, Icard, Thomas, Potts, Christopher, Goodman, Noah D.
We respond to the recent paper by Makelov et al. (2023), which reviews subspace interchange intervention methods like distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods potentially cause "interpretability illusions". We first review Makelov et al. (2023)'s technical notion of what an "interpretability illusion" is, and then we show that even intuitive and desirable explanations can qualify as illusions in this sense. As a result, their method of discovering "illusions" can reject explanations they consider "non-illusory". We then argue that the illusions Makelov et al. (2023) see in practice are artifacts of their training and evaluation paradigms. We close by emphasizing that, though we disagree with their core characterization, Makelov et al. (2023)'s examples and discussion have undoubtedly pushed the field of interpretability forward.
Visual Explanations via Iterated Integrated Attributions
Barkan, Oren, Elisha, Yehonatan, Asher, Yuval, Eshel, Amit, Koenigstein, Noam
We introduce Iterated Integrated Attributions (IIA) - a generic method for explaining the predictions of vision models. IIA employs iterative integration across the input image, the internal representations generated by the model, and their gradients, yielding precise and focused explanation maps. We demonstrate the effectiveness of IIA through comprehensive evaluations across various tasks, datasets, and network architectures. Our results show that IIA produces accurate explanation maps, outperforming other state-of-the-art explanation techniques.
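For context, the sketch below shows plain integrated gradients, the attribution scheme that IIA generalises by iterating the integration over internal representations and their gradients; that iterated part is not reproduced here, and grad_fn is a placeholder for the model's gradient with respect to its input.

```python
# Plain integrated gradients (Riemann-sum approximation), shown for context;
# IIA extends this by iterating over internal representations and gradients.
import numpy as np

def integrated_gradients(grad_fn, x, baseline=None, steps=50):
    """grad_fn(z) -> dF/dz for the model output being explained (placeholder);
    x and baseline are images (arrays of the same shape)."""
    baseline = np.zeros_like(x) if baseline is None else baseline
    total = np.zeros_like(x)
    for alpha in np.linspace(0.0, 1.0, steps):
        total += grad_fn(baseline + alpha * (x - baseline))
    # Attribution = (x - baseline) * average gradient along the straight-line path.
    return (x - baseline) * total / steps
```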
Axioms for Defeat in Democratic Elections
Holliday, Wesley H., Pacuit, Eric
We propose six axioms concerning when one candidate should defeat another in a democratic election involving two or more candidates. Five of the axioms are widely satisfied by known voting procedures. The sixth axiom is a weakening of Kenneth Arrow's famous condition of the Independence of Irrelevant Alternatives (IIA). We call this weakening Coherent IIA. We prove that the five axioms plus Coherent IIA single out a method of determining defeats studied in our recent work: Split Cycle. In particular, Split Cycle provides the most resolute definition of defeat among all methods satisfying the six axioms for democratic defeat. In addition, we analyze how Split Cycle escapes Arrow's Impossibility Theorem and related impossibility results.
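A compact way to compute Split Cycle defeats uses the path-based characterisation from the authors' earlier work: a defeats b iff margin(a, b) > 0 and there is no chain of majority wins from b back to a in which every margin is at least margin(a, b) (equivalently, a-to-b is never a weakest edge of a majority cycle). The sketch below is our own simplification under that reading, not the authors' reference implementation.

```python
# Hedged sketch of the Split Cycle defeat relation via the path-based test.
from itertools import permutations

def margins(profile, candidates):
    """profile: list of strict rankings (each a list of candidates, best first)."""
    m = {(a, b): 0 for a, b in permutations(candidates, 2)}
    for ranking in profile:
        pos = {c: i for i, c in enumerate(ranking)}
        for a, b in permutations(candidates, 2):
            if pos[a] < pos[b]:
                m[(a, b)] += 1
                m[(b, a)] -= 1
    return m

def split_cycle_defeats(profile, candidates):
    m = margins(profile, candidates)
    defeats = set()
    for a, b in permutations(candidates, 2):
        if m[(a, b)] <= 0:
            continue
        # Search from b for a path back to a whose edges all have margin at
        # least margin(a, b); if one exists, the edge a -> b is split, not a defeat.
        stack, seen, found = [b], {b}, False
        while stack:
            x = stack.pop()
            if x == a:
                found = True
                break
            for y in candidates:
                if y not in seen and m[(x, y)] > 0 and m[(x, y)] >= m[(a, b)]:
                    seen.add(y)
                    stack.append(y)
        if not found:
            defeats.add((a, b))
    return defeats

# Toy example: a majority cycle a > b > c > a with unequal margins; the weakest
# win (c over a, margin 1) is split, leaving a defeats b and b defeats c.
profile = [["a", "b", "c"]] * 3 + [["b", "c", "a"]] * 2 + [["c", "a", "b"]] * 2
print(split_cycle_defeats(profile, ["a", "b", "c"]))
```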