Mechanistic Interpretability



Scale Alone Does not Improve Mechanistic Interpretability in Vision Models

Neural Information Processing Systems

In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress by scaling neural networks to unprecedented levels in dataset and model size. Here, we ask whether this extraordinary increase in scale also positively impacts the field of mechanistic interpretability. In other words, has our understanding of the inner workings of scaled neural networks improved as well? We use a psychophysical paradigm to quantify one form of mechanistic interpretability for a diverse suite of nine models and find no scaling effect for interpretability, neither for model nor for dataset size. Specifically, none of the investigated state-of-the-art models are easier to interpret than the GoogLeNet model from almost a decade ago.


Compact Proofs of Model Performance via Mechanistic Interpretability

Neural Information Processing Systems

We propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving accuracy lower bounds for a small transformer trained on Max-of-$K$, validating proof transferability across 151 random seeds and four values of $K$. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless errors as a key challenge for using mechanistic interpretability to generate compact proofs of model performance.


Towards Automated Circuit Discovery for Mechanistic Interpretability

Neural Information Processing Systems

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process's steps: finding the connections between the abstract neural network units that form a circuit. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work.
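The activation patching step this abstract describes can be illustrated on a toy network: run a clean and a corrupted input, splice one clean hidden activation into the corrupted run, and measure how much of the clean output it restores. The sketch below is a minimal numpy illustration with purely hypothetical weights and inputs, not any of the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network with hypothetical random weights (illustration only).
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4,))

def forward(x, patch=None):
    """Run the toy network; optionally overwrite one hidden activation."""
    h = np.tanh(x @ W1)              # hidden units -- the components we patch
    if patch is not None:
        i, value = patch
        h = h.copy()
        h[i] = value
    return float(h @ W2)

x_clean = np.ones(4)                 # input that elicits the behavior
x_corrupt = -np.ones(4)              # counterfactual input

h_clean = np.tanh(x_clean @ W1)
y_clean, y_corrupt = forward(x_clean), forward(x_corrupt)

# Patch each hidden unit's clean activation into the corrupted run; units
# whose patch moves the output back toward y_clean belong to the circuit.
effects = [forward(x_corrupt, patch=(i, h_clean[i])) - y_corrupt
           for i in range(4)]
```

Automating which patches to test, and extending the unit of patching from single neurons to the edges between components, is the part that circuit-discovery algorithms like ACDC take over.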


Mechanistic Interpretability of Antibody Language Models Using SAEs

Haque, Rebonto, Turnbull, Oliver M., Parsan, Anisha, Parsan, Nithin, Yang, John J., Deane, Charlotte M.

arXiv.org Artificial Intelligence

Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate an autoregressive antibody language model, p-IgGen, and steer its generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs are sufficient for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
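The TopK mechanism mentioned above is simple to sketch: encode an activation vector into an overcomplete latent space, then zero every latent except the k largest. A minimal numpy illustration with made-up dimensions (not those of p-IgGen or the paper's SAEs):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent, k = 8, 32, 4       # hypothetical sizes for illustration

W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))

def encode(x):
    """ReLU encoder followed by a hard TopK: only the k largest latents survive."""
    z = np.maximum(x @ W_enc, 0.0)
    sparse = np.zeros_like(z)
    top = np.argsort(z)[-k:]          # indices of the k largest activations
    sparse[top] = z[top]
    return sparse

def decode(z):
    return z @ W_dec

x = rng.normal(size=d_model)          # stand-in for a model activation vector
z = encode(x)
x_hat = decode(z)
n_active = int((z > 0).sum())         # at most k latents fire per input
```

Because at most k latents are nonzero for any input, each latent can be inspected, and steered, individually; the paper's finding is that this inspectability alone does not guarantee causal control over generation.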


Sparse Attention Post-Training for Mechanistic Interpretability

Draye, Florent, Lei, Anson, Posner, Ingmar, Schölkopf, Bernhard

arXiv.org Artificial Intelligence

We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.3 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.
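To see concretely what attention connectivity at about 0.3% of its edges means, the sketch below prunes a dense attention map post hoc to that edge fraction by magnitude and renormalises each row. This is only an illustration of the target sparsity level; the paper reaches it by training with a sparsity regulariser under a constrained-loss objective, not by pruning:

```python
import numpy as np

def sparsify_attention(attn, keep_frac=0.003):
    """Keep only the largest keep_frac of attention edges; renormalise rows.

    Post-hoc magnitude pruning, used here purely to visualise the target
    edge fraction -- not the paper's training-based method."""
    flat = attn.ravel()
    k = max(1, int(round(keep_frac * flat.size)))
    thresh = np.sort(flat)[-k]                    # k-th largest edge weight
    pruned = np.where(attn >= thresh, attn, 0.0)
    row_sums = pruned.sum(axis=-1, keepdims=True)
    return pruned / np.maximum(row_sums, 1e-12)   # empty rows stay all-zero

# Example: a random 64x64 attention map reduced to roughly 0.3% of its edges.
rng = np.random.default_rng(3)
logits = rng.normal(size=(64, 64))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
sparse = sparsify_attention(attn)
```

With only a dozen surviving edges out of 4,096, the remaining connectivity pattern is small enough to read off directly, which is the interpretability payoff the abstract describes.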


Towards Ethical Multi-Agent Systems of Large Language Models: A Mechanistic Interpretability Perspective

Lee, Jae Hee, Lauscher, Anne, Albrecht, Stefano V.

arXiv.org Artificial Intelligence

Large language models (LLMs) have been widely deployed in various applications, often functioning as autonomous agents that interact with each other in multi-agent systems. While these systems have shown promise in enhancing capabilities and enabling complex tasks, they also pose significant ethical challenges. This position paper outlines a research agenda aimed at ensuring the ethical behavior of multi-agent systems of LLMs (MALMs) from the perspective of mechanistic interpretability. We identify three key research challenges: (i) developing comprehensive evaluation frameworks to assess ethical behavior at individual, interactional, and systemic levels; (ii) elucidating the internal mechanisms that give rise to emergent behaviors through mechanistic interpretability; and (iii) implementing targeted parameter-efficient alignment techniques to steer MALMs towards ethical behaviors without compromising their performance.


Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations

Mitra, Chancharik, Luo, Yusen, Saravanan, Raj, Niu, Dantong, Pai, Anirudh, Thomason, Jesse, Darrell, Trevor, Anwar, Abrar, Ramanan, Deva, Herzig, Roei

arXiv.org Artificial Intelligence

Vision-Language-Action (VLA) models promise to extend the remarkable success of vision-language models (VLMs) to robotics. Yet, unlike VLMs in the vision-language domain, VLAs for robotics require finetuning to contend with varying physical factors such as robot embodiment, environment characteristics, and the spatial relationships of each task. Existing finetuning methods lack specificity, adapting the same set of parameters regardless of a task's visual, linguistic, and physical characteristics. Inspired by functional specificity in neuroscience, we hypothesize that it is more effective to finetune sparse model representations specific to a given task. In this work, we introduce Robotic Steering, a finetuning approach grounded in mechanistic interpretability that leverages few-shot demonstrations to identify and selectively finetune task-specific attention heads aligned with the physical, visual, and linguistic requirements of robotic tasks. Through comprehensive on-robot evaluations with a Franka Emika robot arm, we demonstrate that Robotic Steering outperforms LoRA while achieving superior robustness under task variation, reduced computational cost, and enhanced interpretability when adapting VLAs to diverse robotic tasks.
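The head-selection idea can be sketched as scoring each attention head by how much its activity shifts on the task demonstrations relative to a generic baseline, then unfreezing only the top-scoring heads for finetuning. The shapes, numbers, and scoring rule below are hypothetical stand-ins for illustration, not the paper's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(2)
n_layers, n_heads, k = 4, 8, 5        # hypothetical model shape and budget

# Mean per-head activation norms, as if recorded while replaying a few task
# demonstrations and a generic baseline set of prompts (synthetic numbers).
task_acts = rng.normal(loc=1.0, size=(n_layers, n_heads))
base_acts = rng.normal(loc=0.0, size=(n_layers, n_heads))

# Score each head by how much its activity shifts on the demonstrations;
# only the k highest-scoring heads would then be unfrozen for finetuning.
scores = np.abs(task_acts - base_acts)
top = np.argsort(scores.ravel())[-k:]
selected = [divmod(int(i), n_heads) for i in top]   # (layer, head) pairs
```

Restricting the trainable parameters to a handful of task-relevant heads is what gives the approach its lower compute cost relative to adapting a fixed parameter set such as LoRA matrices.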


Mechanistic Interpretability for Transformer-based Time Series Classification

Kalnāre, Matīss, Kitharidis, Sofoklis, Bäck, Thomas, van Stein, Niki

arXiv.org Artificial Intelligence

Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing explainability methods often focus on input-output attributions, leaving the internal mechanisms largely opaque. This paper addresses this gap by adapting several mechanistic interpretability techniques (activation patching, attention saliency, and sparse autoencoders) from NLP to transformer architectures designed explicitly for time series classification. We systematically probe the internal causal roles of individual attention heads and timesteps, revealing causal structures within these models. Through experimentation on a benchmark time series dataset, we construct causal graphs illustrating how information propagates internally, highlighting key attention heads and temporal positions driving correct classifications. Additionally, we demonstrate the potential of sparse autoencoders for uncovering interpretable latent features. Our findings provide both methodological contributions to transformer interpretability and novel insights into the functional mechanics underlying transformer performance in time series classification tasks.


Unboxing the Black Box: Mechanistic Interpretability for Algorithmic Understanding of Neural Networks

Kowalska, Bianka, Kwaśnicka, Halina

arXiv.org Artificial Intelligence

Artificial intelligence (AI) is increasingly assisting us in a wide range of tasks, from everyday applications like recommendation systems to high-risk domains such as biometric recognition, autonomous vehicles, and medical diagnosis [1]. In particular, the rise of transformer-based models, such as those used in natural language processing (NLP), has significantly accelerated AI's adoption and visibility in society, enabling breakthroughs in fields like text generation, translation, and image understanding [2]. The size, complexity, and opacity of deep learning models are growing exponentially, further outpacing the ability of researchers to understand the black box. As deep neural networks are increasingly deployed in real-world applications with more advanced use cases, the impact of AI continues to grow. This growing influence, coupled with the often opaque, black-box nature of most AI systems, has led to a heightened demand for AI models that are both faithful and explainable. The validation of AI's decisions is especially critical in high-risk areas such as law or medicine [3, 4]. As a result, Explainable AI (XAI) emerged as a direct response to companies' and researchers' demands to interpret, explain, and validate neural networks in order to make AI systems trustworthy. XAI encompasses all methods, approaches, and efforts to uncover the reasoning and behavior of artificial intelligence systems [1]. Thus, it is important to establish an understanding of common terms used in the XAI literature, despite the lack of universally accepted definitions.