Performance Analysis
Refining Norms: A Post-hoc Framework for OOD Detection in Graph Neural Networks
Graph Neural Networks (GNNs) are increasingly deployed in mission-critical tasks, yet they often encounter inputs that lie outside their training distribution, leading to unreliable or overconfident predictions. To address this limitation, we present RAGNOR (Robust Aggregation Graph Norm for Outlier Recognition), a post-hoc approach that leverages embedding norms for robust out-of-distribution (OOD) detection on both node-level and graph-level tasks. Unlike previous methods designed primarily for image domains, RAGNOR directly tackles the relational challenges intrinsic to graphs: local contamination by anomalous neighbors, disparate norm scales across classes or roles, and insufficient references for boundary or low-degree nodes.
CausalDynamics: A large-scale benchmark for structural discovery of dynamical causal models
Causal discovery for dynamical systems poses a major challenge in fields where active interventions are infeasible. Most methods used to investigate these systems and their associated benchmarks are tailored to deterministic, low-dimensional and weakly nonlinear time-series data. To address these limitations, we present CausalDynamics, a large-scale benchmark and extensible data generation framework to advance the structural discovery of dynamical causal models. Our benchmark consists of true causal graphs derived from thousands of both linearly and nonlinearly coupled ordinary and stochastic differential equations as well as two idealized climate models. We perform a comprehensive evaluation of state-of-the-art causal discovery algorithms for graph reconstruction on systems with noisy, confounded, and lagged dynamics. CausalDynamics consists of a plug-and-play, build-your-own coupling workflow that enables the construction of a hierarchy of physical systems. We anticipate that our framework will facilitate the development of robust causal discovery algorithms that are broadly applicable across domains while addressing their unique challenges. We provide a user-friendly implementation and documentation at https://kausable.github.io/CausalDynamics.
SAFE: Multitask Failure Detection for Vision-Language-Action Models
While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out of the box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while generalist VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks.
Selective Omniprediction and Fair Abstention
We propose new learning algorithms for building selective classifiers, which are predictors that are allowed to abstain on some fraction of the domain. We study the model where a classifier may abstain from predicting at a fixed cost. Building on the recent framework on multigroup fairness and omniprediction, given a prespecified class of loss functions, we provide an algorithm for building a single classifier that learns abstentions and predictions optimally for every loss in the entire class, where the abstentions are decided efficiently for each specific loss function by applying a fixed post-processing function. Our algorithm and theoretical guarantees generalize the previously-known algorithms for learning selective classifiers in formal learning-theoretic models. We then extend the traditional multigroup fairness algorithms to the selective classification setting and show that we can use a calibrated and multiaccurate predictor to efficiently build selective classifiers that abstain optimally not only globally but also locally within each of the groups in any pre-specified collection of possibly intersecting subgroups of the domain, and are also accurate when they do not abstain. We show how our abstention algorithms can be used as conformal prediction methods in the binary classification setting to achieve both marginal and group-conditional coverage guarantees for an intersecting collection of groups. We provide empirical evaluations for all of our theoretical results, demonstrating the practicality of our learning algorithms for abstaining optimally and fairly.
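The idea of deciding abstentions by a fixed, loss-specific post-processing of a single predictor can be illustrated with a minimal sketch. It assumes a calibrated probability p = P(y = 1 | x) is already available and applies the classical expected-loss comparison (predict the action with the lowest expected loss, or abstain if even that exceeds the abstention cost). The function names, the example losses, and the cost value are hypothetical and chosen only for illustration; this is not the algorithm from the paper.

```python
ABSTAIN = "abstain"


def post_process(p, loss, abstain_cost, actions=(0, 1)):
    """Map a predicted probability p = P(y=1|x) to an action or an abstention.

    loss(action, label) is the loss function of interest; abstaining always
    costs abstain_cost. The same p can be post-processed for any loss.
    """
    def expected_loss(action):
        # Expected loss of `action` if the label is 1 with probability p.
        return p * loss(action, 1) + (1 - p) * loss(action, 0)

    best = min(actions, key=expected_loss)
    return best if expected_loss(best) <= abstain_cost else ABSTAIN


def zero_one(action, label):
    """Standard 0-1 loss."""
    return 0.0 if action == label else 1.0


def asymmetric(action, label):
    """Hypothetical loss where a missed positive costs five times more."""
    if action == label:
        return 0.0
    return 5.0 if label == 1 else 1.0


# The same prediction leads to different decisions under different losses.
decision_01 = post_process(0.1, zero_one, abstain_cost=0.2)
decision_asym = post_process(0.1, asymmetric, abstain_cost=0.2)
```

With p = 0.1 and an abstention cost of 0.2, the 0-1 loss yields a confident prediction of 0, while the asymmetric loss makes the same prediction too risky and triggers an abstention.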
Majority of the Bests: Improving Best-of-N via Bootstrapping
Sampling multiple outputs from a Large Language Model (LLM) and selecting the most frequent (Self-consistency) or highest-scoring (Best-of-N) candidate is a popular approach to achieve higher accuracy in tasks with discrete final answers. Best-of-N (BoN) selects the output with the highest reward, and with perfect rewards, it often achieves near-perfect accuracy. With imperfect rewards from reward models, however, BoN fails to reliably find the correct answer and its performance degrades drastically. We consider the distribution of BoN's outputs and highlight that, although the correct answer does not usually have a probability close to one under imperfect rewards, it is often the most likely outcome. This suggests that the mode of this distribution can be more reliably correct than a sample from it. Based on this idea, we propose Majority-of-the-Bests (MoB), a novel selection mechanism that estimates the output distribution of BoN via bootstrapping and selects its mode. Experimental results across five benchmarks, three different base LLMs, and two reward models demonstrate consistent improvements over BoN in 25 out of 30 setups. We also provide theoretical results for the consistency of the bootstrapping.
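The selection mechanism described above can be sketched as follows, under stated assumptions: each candidate is an (answer, reward) pair, each bootstrap round resamples `subset_size` candidates with replacement and records the Best-of-N winner, and the most frequent winner is returned. The parameter names, the subset size, and the toy data are hypothetical; the paper's actual estimator and hyperparameter choices may differ.

```python
import random
from collections import Counter


def best_of_n(candidates):
    """Return the answer of the highest-reward (answer, reward) pair."""
    return max(candidates, key=lambda pair: pair[1])[0]


def majority_of_the_bests(candidates, subset_size, num_bootstrap=1000, seed=0):
    """Estimate the distribution of Best-of-N winners and return its mode."""
    rng = random.Random(seed)
    winners = Counter()
    for _ in range(num_bootstrap):
        # Resample with replacement, then apply Best-of-N to the resample.
        resample = rng.choices(candidates, k=subset_size)
        winners[best_of_n(resample)] += 1
    return winners.most_common(1)[0][0]


# Toy data: "42" is frequent with decent rewards, while a single wrong
# answer "17" receives the top score from a noisy reward model.
candidates = [
    ("42", 0.60), ("42", 0.65), ("42", 0.70),
    ("42", 0.72), ("42", 0.75), ("42", 0.80),
    ("13", 0.10), ("13", 0.20), ("13", 0.30),
    ("17", 0.95),
]

bon_answer = best_of_n(candidates)
mob_answer = majority_of_the_bests(candidates, subset_size=3)
```

In this toy setting plain Best-of-N returns the over-rewarded outlier, whereas the outlier is absent from most resamples, so the mode of the bootstrapped winners is the frequent answer.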
Exploiting LLMs for Automatic Hypothesis Assessment via a Based Calibrated Prior
As hypothesis generation becomes increasingly automated, a new bottleneck has emerged: hypothesis assessment. Modern systems can surface thousands of statistical relationships (correlations, trends, causal links) but offer little guidance on which ones are novel, non-trivial, or worthy of expert attention. In this work, we study the complementary problem to hypothesis generation: automatic hypothesis assessment. Specifically, we ask: given a large set of statistical relationships, can we automatically assess which ones are novel and worth further exploration? We focus on correlations as they are a common entry point in exploratory data analysis that often serve as the basis for forming deeper scientific or causal hypotheses.
Correcting misinterpretations of additive models
Correct model interpretation in high-stakes settings is critical, yet both post-hoc feature attribution methods and so-called intrinsically interpretable models can systematically attribute false-positive importance to non-informative features such as suppressor variables. Specifically, both linear models and their powerful nonlinear generalisations, such as Generalised Additive Models (GAMs), are susceptible to spurious attributions to suppressors. We present a principled generalisation of activation patterns, originally developed to make linear models interpretable, to additive models, correctly rejecting suppressor effects for non-linear features. This yields PatternGAM, an importance attribution method based on univariate generative surrogate models for the broad family of additive models, and PatternQLR for polynomial models. Empirical evaluations on the XAI-TRIS benchmark with a novel false-negative invariant formulation of the earth mover's distance accuracy metric demonstrate significant improvements over popular feature attribution methods and the traditional interpretation of additive models. Finally, real-world case studies on the COMPAS and MIMIC-IV datasets provide new insights into the role of specific features by disentangling genuine target-related information from suppression effects that would mislead conventional GAM interpretations.
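The suppressor problem and the linear-model activation pattern that this work generalises can be illustrated with a small sketch. It uses the standard linear formulation, in which the pattern for feature j is Cov(x_j, ŷ) / Var(ŷ), on a hypothetical two-feature toy dataset; it does not implement PatternGAM or PatternQLR, whose details are not given here.

```python
def mean(values):
    return sum(values) / len(values)


def cov(u, v):
    """Population covariance of two equally long sequences."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def activation_pattern(features, weights):
    """Linear activation pattern: Cov(x_j, y_hat) / Var(y_hat) per feature."""
    n = len(features[0])
    y_hat = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(n)]
    var_y_hat = cov(y_hat, y_hat)
    return [cov(f, y_hat) / var_y_hat for f in features]


# Toy data: the target equals the signal; the distractor is uncorrelated with it.
signal = [1.0, -1.0, 1.0, -1.0]
distractor = [1.0, 1.0, -1.0, -1.0]

x1 = [s + d for s, d in zip(signal, distractor)]  # informative but contaminated
x2 = list(distractor)                             # suppressor: no target information

# These weights recover the signal exactly (x1 - x2 = signal), so the model
# assigns a non-zero weight to x2 even though x2 carries no target information.
weights = [1.0, -1.0]
pattern = activation_pattern([x1, x2], weights)
```

Reading importance off the weights would credit the suppressor `x2`, while the pattern assigns it zero and attributes everything to `x1`.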
When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product
State-of-the-art embeddings often capture distinct yet complementary discriminative features: For instance, one image embedding model may excel at distinguishing fine-grained textures, while another focuses on object-level structure. Motivated by this observation, we propose a principled approach to fuse such complementary representations through kernel multiplication. Multiplying the kernel similarity functions of two embeddings allows their discriminative structures to interact, producing a fused representation whose kernel encodes the union of the clusters identified by each parent embedding. This formulation also provides a natural way to construct joint kernels for paired multi-modal data (e.g., image-text tuples), where the product of modality-specific kernels inherits structure from both domains. We highlight that this kernel product is mathematically realized via the Kronecker product of the embedding feature maps, yielding our proposed KrossFuse framework for embedding fusion. To address the computational cost of the resulting high-dimensional Kronecker space, we further develop RP KrossFuse, a scalable variant that leverages random projections for efficient approximation. As a key application, we use this framework to bridge the performance gap between cross-modal embeddings (e.g., CLIP, BLIP) and unimodal experts (e.g., DINOv2, E5). Experiments show that RP KrossFuse effectively integrates these models, enhancing modality-specific performance while preserving cross-modal alignment.
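The link between kernel multiplication and the Kronecker product can be checked numerically. The sketch below assumes linear (inner-product) kernels on two hypothetical embeddings of the same pair of inputs and verifies that the inner product of the Kronecker-fused vectors equals the product of the individual kernels. The vectors and function names are illustrative assumptions, and the random-projection variant is not shown.

```python
def dot(u, v):
    """Inner product, i.e. the linear kernel between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def kron(u, v):
    """Kronecker product of two vectors, flattened to length len(u) * len(v)."""
    return [a * b for a in u for b in v]


# Hypothetical embeddings of inputs x and y from two different models.
x_a, y_a = [1.0, 2.0], [3.0, -1.0]            # embedding model A (dim 2)
x_b, y_b = [0.5, -1.0, 3.0], [2.0, 0.0, 1.0]  # embedding model B (dim 3)

fused_x = kron(x_a, x_b)
fused_y = kron(y_a, y_b)

product_of_kernels = dot(x_a, y_a) * dot(x_b, y_b)
kernel_of_fused = dot(fused_x, fused_y)
```

The fused dimension is the product of the two embedding dimensions (here 2 × 3 = 6), which is the computational cost that motivates an approximate variant for high-dimensional embeddings.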
Practical Bayes-Optimal Membership Inference Attacks
We develop practical and theoretically grounded membership inference attacks (MIAs) against both independent and identically distributed (i.i.d.) data and graph-structured data. Building on the Bayesian decision-theoretic framework of [1], we derive the Bayes-optimal membership inference rule for node-level MIAs against graph neural networks, addressing key open questions about optimal query strategies in the graph setting. We introduce BASE and G-BASE, tractable approximations of the Bayes-optimal membership inference rule. G-BASE achieves superior performance compared to previously proposed classifier-based node-level MIAs. BASE, which is also applicable to non-graph data, matches or exceeds the performance of prior state-of-the-art MIAs, such as LiRA and RMIA, at a significantly lower computational cost. Finally, we show that BASE and RMIA are equivalent under a specific hyperparameter setting, providing a principled, Bayes-optimal justification for the RMIA attack.