Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
Oikarinen, Tuomas, Yan, Ge, Kulkarni, Akshay, Weng, Tsui-Wei
arXiv.org Artificial Intelligence
Interpreting individual neurons or directions in activation space is an important topic in mechanistic interpretability. Numerous automated interpretability methods have been proposed to generate such explanations, but it remains unclear how reliable these explanations are and which methods produce the most accurate descriptions. While crowdsourced evaluations are commonly used, existing pipelines are noisy, costly, and typically assess only the highest-activating inputs, leading to unreliable results. In this paper, we introduce two techniques that enable cost-effective and accurate crowdsourced evaluation of automated interpretability methods beyond top-activating inputs. First, we propose Model-Guided Importance Sampling (MG-IS) to select the most informative inputs to show human raters. In our experiments, we show this reduces the number of inputs needed to reach the same evaluation accuracy by ~13x. Second, we address label noise in crowdsourced ratings through Bayesian Rating Aggregation (BRAgg), which reduces the number of ratings per input required to overcome noise by ~3x. Together, these techniques reduce evaluation cost by ~40x, making large-scale evaluation feasible. Finally, we use our methods to conduct a large-scale crowdsourced study comparing recent automated interpretability methods for vision networks.
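The abstract names two techniques without specifying their mechanics. As a rough illustration only, the sketch below pairs two standard building blocks that such a pipeline could use: importance sampling of inputs weighted by a model-derived score (a generic stand-in for MG-IS; the paper's actual criterion is not given in the abstract) and a Beta-Binomial posterior for aggregating noisy binary crowd ratings (one simple form of Bayesian aggregation; a placeholder for BRAgg). All function names and scoring choices here are assumptions, not the authors' implementation.

```python
import numpy as np

def select_informative_inputs(activations, k, rng=None):
    """Importance-sample k inputs with probability proportional to
    activation magnitude. NOTE: using |activation| as the importance
    score is an illustrative assumption, not the paper's MG-IS score."""
    rng = np.random.default_rng(rng)
    scores = np.abs(np.asarray(activations, dtype=float))
    probs = scores / scores.sum()
    idx = rng.choice(len(scores), size=k, replace=False, p=probs)
    # Inverse-propensity weights so downstream averages stay unbiased.
    weights = 1.0 / (len(scores) * probs[idx])
    return idx, weights

def aggregate_ratings(ratings, prior_a=1.0, prior_b=1.0):
    """Posterior mean of the probability that raters judge an
    explanation as matching an input, under a Beta(prior_a, prior_b)
    prior and binary (0/1) ratings. A minimal Bayesian aggregator;
    the paper's BRAgg may differ."""
    ratings = np.asarray(ratings)
    a = prior_a + ratings.sum()
    b = prior_b + len(ratings) - ratings.sum()
    return a / (a + b)
```

With a uniform prior, three "match" votes and one "no match" vote give a posterior mean of 4/6 ≈ 0.667, shrunk toward 0.5 relative to the raw 0.75 majority fraction; the prior dampens the influence of a few noisy raters.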
Dec-4-2025