Compact Proofs of Model Performance via Mechanistic Interpretability

Neural Information Processing Systems

We propose using mechanistic interpretability - techniques for reverse engineering model weights into human-interpretable algorithms - to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving accuracy lower bounds for a small transformer trained on Max-of-K, validating proof transferability across 151 random seeds and four values of K. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless errors as a key challenge for using mechanistic interpretability to generate compact proofs of model performance.
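As a point of reference for what the least compact proof strategy looks like, the sketch below exhaustively evaluates a Max-of-K model on every possible input sequence; the resulting accuracy is an exact bound but conveys no mechanistic understanding. The `model` interface (a PyTorch module mapping a batch of K integer tokens to per-position logits) and the default sizes are assumptions for illustration, not details from the paper.

```python
import itertools
import torch

def exhaustive_accuracy_bound(model, vocab_size=64, seq_len=4):
    """Exact accuracy of a Max-of-K model by checking every length-K sequence.

    This corresponds to the trivial, least compact proof strategy: the bound is
    tight, but it costs vocab_size ** seq_len forward passes and conveys no
    mechanistic understanding of how the model computes the maximum.
    """
    correct, total = 0, 0
    with torch.no_grad():
        for seq in itertools.product(range(vocab_size), repeat=seq_len):
            x = torch.tensor(seq).unsqueeze(0)        # shape (1, K)
            pred = model(x)[0, -1].argmax().item()    # prediction at the final position
            correct += int(pred == max(seq))
            total += 1
    return correct / total
```

A compact proof would replace this enumeration with an argument about how the model's components implement the max operation, trading some tightness of the bound for a far shorter certificate.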


A. Feature Importance Explanation Methods

Neural Information Processing Systems

We briefly review several FI explanation methods and explain how they are used in this paper. These methods can be classified as gradient-based (1-2), attention-based (3), and perturbation-based (4-7). Note that when computing derivatives of model outputs for explanation methods, we use the logit of the predicted class rather than the predicted probability for numerical stability. One gradient-based method estimates the integral in Integrated Gradients [54] by Monte Carlo sampling in order to speed up computation, drawing its baseline inputs from the data distribution via the training dataset D rather than using a single fixed baseline. The attention-based approach treats attention weights in a model as an explanation of model feature importance. For the Up-Down model [2], we use its sole set of top-down attention weights, but early experiments suggest this is not an effective method and we do not explore it further.
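To make the Monte Carlo estimator above concrete, here is a minimal sketch that approximates the Integrated Gradients integral with baselines drawn from the training data and gradients taken with respect to the target-class logit, as described in the text. The tensor shapes, the `model` interface, and the function name are assumptions for illustration.

```python
import torch

def mc_integrated_gradients(model, x, train_data, target_class, n_samples=50):
    """Monte Carlo estimate of Integrated Gradients with data-distribution baselines.

    Each sample draws a baseline x' from the training set D and an interpolation
    coefficient alpha ~ U(0, 1), replacing the path integral with an expectation.
    Gradients are taken w.r.t. the target-class logit for numerical stability.
    """
    attributions = torch.zeros_like(x)
    for _ in range(n_samples):
        x_baseline = train_data[torch.randint(len(train_data), (1,)).item()]
        alpha = torch.rand(1)
        point = (x_baseline + alpha * (x - x_baseline)).detach().requires_grad_(True)
        logit = model(point.unsqueeze(0))[0, target_class]
        grad, = torch.autograd.grad(logit, point)
        attributions += (x - x_baseline) * grad
    return attributions / n_samples
```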


VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives

Neural Information Processing Systems

Many past works aim to improve visual reasoning in models by supervising feature importance (estimated by model explanation techniques) with human annotations such as highlights of important image regions. However, recent work has shown that performance gains from feature importance (FI) supervision for Visual Question Answering (VQA) tasks persist even with random supervision, suggesting that these methods do not meaningfully align model FI with human FI. In this paper, we show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason (RRR) metrics by optimizing for four key model objectives: (1) accurate predictions given limited but sufficient information (Sufficiency); (2) max-entropy predictions given no important information (Uncertainty); (3) invariance of predictions to changes in unimportant features (Invariance); and (4) alignment between model FI explanations and human FI explanations (Plausibility).
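A rough sketch of how these four objectives could be combined into one training loss is given below. The interfaces are assumptions for illustration (`model(x)` returns class logits, `model.explain(x)` stands in for any differentiable FI method, and `human_mask` is a 0/1 tensor of human-annotated importance); this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fi_supervision_loss(model, x, y, human_mask, weights=(1.0, 1.0, 1.0, 1.0)):
    """Illustrative combination of the four FI-supervision objectives."""
    w_suf, w_unc, w_inv, w_pla = weights

    # (1) Sufficiency: predict correctly from the important features alone.
    sufficiency = F.cross_entropy(model(x * human_mask), y)

    # (2) Uncertainty: maximize prediction entropy when important features are removed
    #     (equivalently, minimize negative entropy).
    probs = F.softmax(model(x * (1.0 - human_mask)), dim=-1)
    uncertainty = (probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()

    # (3) Invariance: predictions should not change when unimportant features are perturbed.
    noise = torch.randn_like(x) * (1.0 - human_mask)
    invariance = F.kl_div(F.log_softmax(model(x + noise), dim=-1),
                          F.softmax(model(x), dim=-1), reduction="batchmean")

    # (4) Plausibility: model FI explanations should align with human FI annotations.
    plausibility = F.binary_cross_entropy_with_logits(model.explain(x), human_mask)

    return (w_suf * sufficiency + w_unc * uncertainty
            + w_inv * invariance + w_pla * plausibility)
```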


MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making

Neural Information Processing Systems

Foundation models are becoming valuable tools in medicine. Yet despite their promise, the best way to leverage Large Language Models (LLMs) in complex medical tasks remains an open question. We introduce a novel multi-agent framework, named Medical Decision-making Agents (MDAgents), that helps to address this gap by automatically assigning a collaboration structure to a team of LLMs. The assigned solo or group collaboration structure is tailored to the medical task at hand, emulating the way real-world medical decision-making processes are adapted to tasks of different complexities. We evaluate our framework and baseline methods using state-of-the-art LLMs across a suite of real-world medical knowledge and medical diagnosis benchmarks, including a comparison of LLMs' medical complexity classification against that of human physicians.
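The adaptive assignment can be pictured as a small routing step: classify the query's complexity, then run either a solo LLM or a group of collaborating agents. The sketch below is a simplified illustration under assumed interfaces (an `llm` callable that returns text and a list of `specialists` callables); it is not the MDAgents implementation.

```python
def route_medical_query(query, llm, specialists):
    """Assign a solo or group collaboration structure based on assessed complexity."""
    complexity = llm(
        "Classify the complexity of this medical question as "
        f"'low', 'moderate', or 'high':\n{query}"
    ).strip().lower()

    if complexity == "low":
        # Low-complexity queries: a single LLM answers directly (solo structure).
        return llm(f"Answer the following medical question concisely:\n{query}")

    # Higher-complexity queries: collect specialist opinions, then synthesize (group structure).
    opinions = [agent(f"As a medical specialist, analyse this case:\n{query}")
                for agent in specialists]
    joined = "\n---\n".join(opinions)
    return llm(f"Synthesize a final answer to '{query}' from these opinions:\n{joined}")
```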


Distributed $k$-Clustering for Data with Heavy Noise

Neural Information Processing Systems

In this paper, we consider the k-center/median/means clustering with outliers problems (or the (k, z)-center/median/means problems) in the distributed setting. Most previous distributed algorithms have communication costs that depend linearly on z, the number of outliers. Recently, Guha et al. [10] overcame this dependence by considering bi-criteria approximation algorithms that output solutions with 2z outliers. When z is large, the extra z outliers discarded by these algorithms may be too many, considering that the data gathering process might be costly. In this paper, we improve the number of outliers to the best possible (1 + ε)z, while maintaining the O(1)-approximation ratio and keeping the communication cost independent of z. The problems we consider include the (k, z)-center problem and the (k, z)-median/means problems in Euclidean metrics. An implementation of our algorithm for (k, z)-center shows that it outperforms many previous algorithms, both in terms of communication cost and the quality of the output solution.
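For concreteness, the objective being approximated can be evaluated as follows: given candidate centers, discard the (1 + ε)z farthest points as outliers and report the maximum remaining distance to the nearest center. This is a minimal NumPy sketch of the (k, z)-center cost, not the distributed algorithm itself; the function name and array shapes are assumptions.

```python
import numpy as np

def kz_center_cost(points, centers, z, eps=0.0):
    """(k, z)-center cost: discard the floor((1 + eps) * z) farthest points as
    outliers, then return the maximum distance of any remaining point to its
    nearest center. `points` has shape (n, d) and `centers` has shape (k, d)."""
    # Distance of each point to its nearest center.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1).min(axis=1)
    n_outliers = int(np.floor((1 + eps) * z))
    kept = np.sort(dists)[: max(len(points) - n_outliers, 0)]
    return kept.max() if kept.size else 0.0
```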


Extracting Relationships by Multi-Domain Matching

Neural Information Processing Systems

In many biological and medical contexts, we construct a large labeled corpus by aggregating many sources to use in target prediction tasks. Unfortunately, many of the sources may be irrelevant to our target task, so ignoring the structure of the dataset is detrimental. This work proposes a novel approach, the Multiple Domain Matching Network (MDMN), to exploit this structure. MDMN embeds all data into a shared feature space while learning which domains share strong statistical relationships. These relationships are often insightful in their own right, and they allow domains to share strength without interference from irrelevant data. This methodology builds on existing distribution-matching approaches by assuming that source domains are varied and outcomes multi-factorial. Therefore, each domain should only match a relevant subset. Theoretical analysis shows that the proposed approach can have a tighter generalization bound than existing multiple-domain adaptation approaches. Empirically, we show that the proposed methodology handles higher numbers of source domains (up to 21 empirically), and provides state-of-the-art performance on image, text, and multi-channel time series classification, including clinical outcome data in an open label trial evaluating a novel treatment for Autism Spectrum Disorder.
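The core idea, matching each source domain only to the domains it is statistically related to, can be sketched as a learnable nonnegative relationship matrix that gates a simple distribution-matching penalty in the shared feature space. The module below is a toy illustration under assumed shapes and losses, not the MDMN architecture from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationalDomainMatcher(nn.Module):
    """Toy sketch: a shared encoder plus a learnable nonnegative matrix W whose
    entry W[i, j] controls how strongly domains i and j are matched."""

    def __init__(self, in_dim, emb_dim, n_domains):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU(),
                                     nn.Linear(emb_dim, emb_dim))
        self.rel_logits = nn.Parameter(torch.zeros(n_domains, n_domains))

    def matching_loss(self, batches):
        """Mean-embedding mismatch penalty, weighted by the learned relationships.

        `batches` is a list of tensors, one (n_i, in_dim) batch per domain.
        """
        W = F.softplus(self.rel_logits)               # nonnegative relationship weights
        means = [self.encoder(x).mean(dim=0) for x in batches]
        loss = 0.0
        for i, mu_i in enumerate(means):
            for j, mu_j in enumerate(means):
                if i != j:
                    loss = loss + W[i, j] * (mu_i - mu_j).pow(2).sum()
        return loss
```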



688f3fe72241429902623b790f15a774-AuthorFeedback.pdf

Neural Information Processing Systems

Furthermore, the algorithm is scalable and offers competitive experimental results (R2). We hope our detailed response below will further highlight the paper's quality and originality and persuade the reviewers to increase their scores. Question 5: Improvements suggested by the reviewers that may yield a score increase. We thank the reviewer for suggesting testing our algorithm on higher-dimensional data; we will add this comparison in the additional page of the final version. Eq. 5 (after including the augmented GP prior) is analytically intractable.