AITopics | automated circuit discovery

automated circuit discovery

Towards Automated Circuit Discovery for Mechanistic Interpretability

Neural Information Processing Systems Dec-24-2025, 13:46:28 GMT

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process' steps: finding the connections between the abstract neural network units that form a circuit. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work.
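The patching-and-pruning loop described in this abstract can be illustrated with a minimal sketch. It is not the paper's ACDC implementation: the toy graph, the additive metric, the names `EDGES`, `run_metric`, and `greedy_edge_pruning`, and the threshold `tau` are all assumptions made for illustration.

```python
# Hypothetical toy graph: each edge carries a value on the clean input
# and a different value on a corrupted input. All numbers are made up.
EDGES = {
    "a->out": (2.0, 0.0),
    "b->out": (0.01, 0.0),
    "c->out": (1.5, -0.5),
    "d->out": (0.02, 0.03),
}


def run_metric(edges, pruned):
    """Toy metric: sum of edge values, where pruned edges carry their
    corrupted value instead of their clean value (activation patching)."""
    return sum(
        corrupt if name in pruned else clean
        for name, (clean, corrupt) in edges.items()
    )


def greedy_edge_pruning(edges, tau):
    """Try patching each edge in turn; keep it patched (pruned) when the
    metric moves by less than tau relative to the current subgraph."""
    pruned = set()
    current = run_metric(edges, pruned)
    for name in edges:
        candidate = pruned | {name}
        new = run_metric(edges, candidate)
        if abs(new - current) < tau:
            pruned = candidate
            current = new
    # Edges that survived pruning form the candidate circuit.
    return [name for name in edges if name not in pruned]


circuit = greedy_edge_pruning(EDGES, tau=0.1)
print(circuit)
```

In this toy setting a larger `tau` prunes more aggressively, so the threshold trades circuit size against how closely the pruned graph matches the original metric.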

automated circuit discovery, mechanistic interpretability, name change, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework

Gu, Hao, Nair, Vibhas, Kumar, Amrithaa Ashok, Sharma, Jayvart, Lagasse, Ryan

arXiv.org Artificial Intelligence Oct-7-2025

Interpreting language models often involves circuit analysis, which aims to identify sparse subnetworks, or circuits, that accomplish specific tasks. Existing circuit discovery algorithms face a fundamental trade-off: attribution patching is fast but unfaithful to the full model, while edge pruning is faithful but computationally expensive. This research proposes a hybrid attribution and pruning (HAP) framework that uses attribution patching to identify a high-potential subgraph, then applies edge pruning to extract a faithful circuit from it. We show that HAP is 46% faster than baseline algorithms without sacrificing circuit faithfulness. Furthermore, we present a case study on the Indirect Object Identification task, showing that our method preserves cooperative circuit components (e.g. S-inhibition heads) that attribution patching methods prune at high sparsity. Our results show that HAP could be an effective approach for improving the scalability of mechanistic interpretability research to larger models. Our code is available at https://anonymous.4open.science/r/HAP-circuit-discovery.
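The two-stage idea in this abstract (a cheap score to shortlist edges, then a costlier check on the shortlist only) might be sketched as follows. This is an assumption-laden illustration, not the HAP code: a simple threshold test stands in for the paper's edge-pruning stage, and `hybrid_discovery`, `cheap_score`, `exact_effect`, and all numbers are hypothetical.

```python
def hybrid_discovery(edges, cheap_score, exact_effect, shortlist_size, tau):
    """Stage 1: rank every edge by a cheap approximate score and keep a
    shortlist. Stage 2: run the expensive exact check on the shortlist only."""
    ranked = sorted(edges, key=lambda e: abs(cheap_score(e)), reverse=True)
    shortlist = ranked[:shortlist_size]
    return [e for e in shortlist if abs(exact_effect(e)) >= tau]


# Hypothetical scores: "e5" looks only moderately important under the cheap
# score but has a large exact effect, so it must reach stage 2 to be kept.
CHEAP = {"e1": 0.9, "e2": 0.05, "e3": 0.4, "e4": 0.01, "e5": 0.3, "e6": 0.02}
EXACT = {"e1": 1.0, "e2": 0.0, "e3": 0.05, "e4": 0.0, "e5": 0.8, "e6": 0.0}

exact_calls = []  # records which edges received the expensive evaluation


def exact_effect(edge):
    exact_calls.append(edge)
    return EXACT[edge]


circuit = hybrid_discovery(
    list(CHEAP), CHEAP.get, exact_effect, shortlist_size=3, tau=0.1
)
print(circuit, len(exact_calls))
```

Here the expensive check runs on three edges instead of six; whether that saving carries over to real models depends on how well the cheap score ranks the edges that matter.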

arxiv preprint arxiv, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2510.03282

Country: Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Towards Automated Circuit Discovery for Mechanistic Interpretability

Neural Information Processing Systems Oct-11-2024, 02:59:06 GMT

automated circuit discovery, mechanistic interpretability, neural network unit, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.75)

Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability

Nainani, Jatin

arXiv.org Artificial Intelligence Jan-7-2024

Large Language Models (LLMs) have experienced a rapid rise in AI, changing a wide range of applications with their advanced capabilities. As these models become increasingly integral to decision-making, the need for thorough interpretability has never been more critical. Mechanistic Interpretability offers a pathway to this understanding by identifying and analyzing specific sub-networks or 'circuits' within these complex systems. A crucial aspect of this approach is Automated Circuit Discovery, which facilitates the study of large models like GPT4 or LLAMA in a feasible manner. In this context, our research evaluates a recent method, Brain-Inspired Modular Training (BIMT), designed to enhance the interpretability of neural networks. We demonstrate how BIMT significantly improves the efficiency and quality of Automated Circuit Discovery, overcoming the limitations of manual methods. Our comparative analysis further reveals that BIMT outperforms existing models in terms of circuit quality, discovery time, and sparsity. Additionally, we provide a comprehensive computational analysis of BIMT, including aspects such as training duration, memory allocation requirements, and inference speed. This study advances the larger objective of creating trustworthy and transparent AI systems in addition to demonstrating how well BIMT works to make neural networks easier to understand.

activation, discovery, interpretability, (14 more...)

arXiv.org Artificial Intelligence

2401.03646

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Europe > Switzerland (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Attribution Patching Outperforms Automated Circuit Discovery

Syed, Aaquib, Rager, Can, Conmy, Arthur

arXiv.org Artificial Intelligence Nov-20-2023

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that averaged over all tasks our method has greater AUC from circuit recovery than other methods.
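The linear approximation mentioned in this abstract can be made concrete on a toy model. The sketch below is an illustration under stated assumptions, not the authors' code: the model `out = sum_i V[i] * tanh(W[i] * x)`, its weights, and every function name are hypothetical, and the gradient is written out by hand in place of an autograd backward pass. It scores units rather than edges to stay short.

```python
import math

# Hypothetical toy model: out = sum_i V[i] * tanh(W[i] * x)
W = [0.5, -1.0, 2.0, 0.1]
V = [1.0, 0.5, -0.3, 2.0]


def hidden(x):
    """Forward pass up to the hidden pre-activations."""
    return [w * x for w in W]


def readout(h):
    """Metric computed from the hidden pre-activations."""
    return sum(v * math.tanh(hi) for v, hi in zip(V, h))


def grad_readout(h):
    """Analytic gradient of the metric w.r.t. each pre-activation."""
    return [v * (1.0 - math.tanh(hi) ** 2) for v, hi in zip(V, h)]


def attribution_scores(x_clean, x_corrupt):
    """First-order estimate of each unit's patching effect."""
    h_clean = hidden(x_clean)      # forward pass 1 (clean input)
    h_corrupt = hidden(x_corrupt)  # forward pass 2 (corrupted input)
    g = grad_readout(h_clean)      # single backward pass on the clean run
    return [(hc - hl) * gi for hc, hl, gi in zip(h_corrupt, h_clean, g)]


def patching_effects(x_clean, x_corrupt):
    """Exact activation patching: one extra evaluation per unit."""
    h_clean = hidden(x_clean)
    h_corrupt = hidden(x_corrupt)
    base = readout(h_clean)
    effects = []
    for i in range(len(h_clean)):
        patched = list(h_clean)
        patched[i] = h_corrupt[i]
        effects.append(readout(patched) - base)
    return effects


def keep_top_k(scores, k):
    """Prune all but the k units with the largest absolute score."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return sorted(ranked[:k])


scores = attribution_scores(1.0, 1.1)
exact = patching_effects(1.0, 1.1)
kept = keep_top_k(scores, 2)
```

The approximation is only first order, so it is closest to exact patching when the clean and corrupted activations are near each other; larger gaps or strongly saturated units can make the two disagree.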

activation, attribution patching, attribution score, (12 more...)

arXiv.org Artificial Intelligence

2310.10348

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Maryland > Prince George's County > College Park (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.72)