Establishing Construct Validity in LLM Capability Benchmarks Requires Nomological Networks
Recent work in machine learning increasingly attributes human-like capabilities such as reasoning or theory of mind to large language models (LLMs) on the basis of benchmark performance. This paper examines this practice through the lens of construct validity, understood as the problem of linking theoretical capabilities to their empirical measurements. It contrasts three influential frameworks: the nomological account developed by Cronbach and Meehl, the inferential account proposed by Messick and refined by Kane, and Borsboom's causal account. I argue that the nomological account provides the most suitable foundation for current LLM capability research. It avoids the strong ontological commitments of the causal account while offering a more substantive framework for articulating construct meaning than the inferential account. I explore the conceptual implications of adopting the nomological account for LLM research through a concrete case: the assessment of reasoning capabilities in LLMs.
Fast Uncertainty Quantification for Kernel-Based Estimators in Large-Scale Causal Inference
Kosko, Matthew J., Bargagli-Stoffi, Falco, Wang, Lin, Santacatterina, Michele
Kernel methods are widely used in causal inference for tasks such as treatment effect estimation, policy evaluation, and policy learning. The bootstrap is a standard tool for uncertainty quantification because of its broad applicability. As increasingly large datasets become available, such as the 2023 U.S. Natality data from the National Vital Statistics System (NVSS), which includes 3,596,017 registered births, the computational demands of these methods increase substantially. Kernel methods are known to scale poorly with sample size, and this limitation is further exacerbated by the repeated re-fitting required by the bootstrap. As a result, bootstrap-based inference for kernel-based estimators can become computationally infeasible in large-scale settings. In this paper, we address these challenges by extending the causal Bag of Little Bootstraps (cBLB) algorithm to kernel methods. Our approach achieves computational scalability by combining subsampling and resampling while preserving first-order uncertainty quantification and asymptotically correct coverage. We evaluate the method across three representative implementations: kernelized augmented outcome-weighted learning, kernel-based minimax weighting, and double machine learning with kernel support vector machines. We show in simulations that our method yields confidence intervals with nominal coverage at a fraction of the computational cost. We further demonstrate its utility in a real-world application by estimating the effect of any amount of smoking on birth weight, as well as the optimal treatment regime, using the NVSS dataset, where the standard bootstrap is computationally infeasible.
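To make the "subsampling plus resampling" idea concrete, here is a minimal sketch of the generic Bag of Little Bootstraps for a simple weighted-mean estimator. It is not the paper's cBLB extension to kernel methods: the function names (`blb_ci`, `weighted_mean`), the subset-size exponent, and the choice of percentile intervals averaged across subsets are all illustrative assumptions.

```python
import numpy as np

def weighted_mean(values, counts):
    """Toy estimator: mean of `values` where each point is repeated `counts` times."""
    return float(np.average(values, weights=counts))

def blb_ci(data, estimator, n_subsets=10, subset_exp=0.7,
           n_resamples=100, alpha=0.05, seed=0):
    """Generic Bag of Little Bootstraps percentile interval (illustrative sketch).

    Each subset holds b = n**subset_exp distinct points. Every resample draws
    multinomial counts that sum to the full sample size n, so the estimator
    only ever touches b distinct points while mimicking a size-n bootstrap.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = len(data)
    b = int(n ** subset_exp)
    lowers, uppers = [], []
    for _ in range(n_subsets):
        # Subsampling step: b points drawn without replacement.
        subset = rng.choice(data, size=b, replace=False)
        estimates = np.empty(n_resamples)
        for r in range(n_resamples):
            # Resampling step: size-n bootstrap expressed as counts over b points.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates[r] = estimator(subset, counts)
        lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
        lowers.append(lo)
        uppers.append(hi)
    # Average the per-subset interval endpoints.
    return float(np.mean(lowers)), float(np.mean(uppers))

# Example usage on synthetic data (assumed, not from the paper).
example_rng = np.random.default_rng(1)
example_data = example_rng.normal(loc=2.0, scale=1.0, size=20_000)
ci_low, ci_high = blb_ci(example_data, weighted_mean)
```

The saving comes from the estimator being fitted on `b` distinct points rather than `n`, which matters most for estimators whose cost grows quickly with sample size.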
Scalable Simulation-Based Model Inference with Test-Time Complexity Control
Gloeckler, Manuel, Manzano-Patrón, J. P., Sotiropoulos, Stamatios N., Schröder, Cornelius, Macke, Jakob H.
Simulation plays a central role in scientific discovery. In many applications, the bottleneck is no longer running a simulator; it is choosing among large families of plausible simulators, each corresponding to different forward models/hypotheses consistent with observations. Over large model families, classical Bayesian workflows for model selection are impractical. Furthermore, amortized model selection methods typically hard-code a fixed model prior or complexity penalty at training time, requiring users to commit to a particular parsimony assumption before seeing the data. We introduce PRISM, a simulation-based encoder-decoder that infers a joint posterior over both discrete model structures and associated continuous parameters, while enabling test-time control of model complexity via a tunable model prior that the network is conditioned on. We show that PRISM scales to families with combinatorially many (up to billions of) model instantiations on a synthetic symbolic regression task. As a scientific application, we evaluate PRISM on biophysical modeling for diffusion MRI data, showing the ability to perform model selection across several multi-compartment models, on both synthetic and in vivo neuroimaging data.
EmDT: Embedding Diffusion Transformer for Tabular Data Generation in Fraud Detection
Imbalanced datasets pose a difficulty in fraud detection, as classifiers are often biased toward the majority class and perform poorly on rare fraudulent transactions. Synthetic data generation is therefore commonly used to mitigate this problem. In this work, we propose the Clustered Embedding Diffusion-Transformer (EmDT), a diffusion model designed to generate fraudulent samples. Our key innovation is to leverage UMAP clustering to identify distinct fraudulent patterns, and train a Transformer denoising network with sinusoidal positional embeddings to capture feature relationships throughout the diffusion process. Once the synthetic data has been generated, we employ a standard decision-tree-based classifier (e.g., XGBoost) for classification, as this type of model remains better suited to tabular datasets. Experiments on a credit card fraud detection dataset demonstrate that EmDT significantly improves downstream classification performance compared to existing oversampling and generative methods, while maintaining comparable privacy protection and preserving feature correlations present in the original data.
The Sampling Complexity of Condorcet Winner Identification in Dueling Bandits
Saad, El Mehdi, Thuot, Victor, Verzelen, Nicolas
We study best-arm identification in stochastic dueling bandits under the sole assumption that a Condorcet winner exists, i.e., an arm that wins each noisy pairwise comparison with probability at least $1/2$. We introduce a new identification procedure that exploits the full gap matrix $\Delta_{i,j}=q_{i,j}-\tfrac12$ (where $q_{i,j}$ is the probability that arm $i$ beats arm $j$), rather than only the gaps between the Condorcet winner and the other arms. We derive high-probability, instance-dependent sample-complexity guarantees that (up to logarithmic factors) improve the best known ones by leveraging informative comparisons beyond those involving the winner. We complement these results with new lower bounds which, to our knowledge, are the first for Condorcet-winner identification in stochastic dueling bandits. Our lower-bound analysis isolates the intrinsic cost of locating informative entries in the gap matrix and estimating them to the required confidence, establishing the optimality of our non-asymptotic bounds. Overall, our results reveal new regimes and trade-offs in the sample complexity that are not captured by asymptotic analyses based only on the expected budget.
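As a small illustration of the definitions above, the following sketch builds the gap matrix from a known preference matrix and checks for a Condorcet winner. It assumes the true probabilities $q_{i,j}$ are given, which is exactly what a bandit algorithm does not have, so this is only a reading aid for the notation and not the identification procedure from the paper. The function name `condorcet_winner` is hypothetical.

```python
import numpy as np

def condorcet_winner(q):
    """Return the index of a Condorcet winner of preference matrix q, or None.

    q[i][j] is the probability that arm i beats arm j. Following the wording
    of the abstract, an arm qualifies if its gap q[i][j] - 1/2 is non-negative
    against every other arm; the first qualifying arm is returned.
    """
    q = np.asarray(q, dtype=float)
    gaps = q - 0.5                      # gap matrix Delta_{i,j}
    for i in range(gaps.shape[0]):
        others = np.delete(gaps[i], i)  # ignore the diagonal entry
        if np.all(others >= 0):
            return i
    return None

# Arm 0 beats both other arms with probability above 1/2.
with_winner = [[0.5, 0.6, 0.7],
               [0.4, 0.5, 0.8],
               [0.3, 0.2, 0.5]]

# A cyclic instance (0 beats 1, 1 beats 2, 2 beats 0) has no Condorcet winner.
cyclic = [[0.5, 0.6, 0.4],
          [0.4, 0.5, 0.6],
          [0.6, 0.4, 0.5]]
```

Note that the off-winner entries (for example arm 1 versus arm 2 in `with_winner`) are the comparisons the abstract describes as informative beyond those involving the winner.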
Efficient Morphology-Control Co-Design via Stackelberg Proximal Policy Optimization
Dai, Yanning, Wang, Yuhui, Ashley, Dylan R., Schmidhuber, Jürgen
Morphology-control co-design concerns the coupled optimization of an agent's body structure and control policy. This problem exhibits a bi-level structure, where the control dynamically adapts to the morphology to maximize performance. Existing methods typically neglect the control's adaptation dynamics by adopting a single-level formulation that treats the control policy as fixed when optimizing morphology. This can lead to inefficient optimization, as morphology updates may be misaligned with control adaptation. In this paper, we revisit the co-design problem from a game-theoretic perspective, modeling the intrinsic coupling between morphology and control as a novel variant of a Stackelberg game. We propose Stackelberg Proximal Policy Optimization (Stackelberg PPO), which explicitly incorporates the control's adaptation dynamics into morphology optimization. By modeling this intrinsic coupling, our method aligns morphology updates with control adaptation, thereby stabilizing training and improving learning efficiency. Experiments across diverse co-design tasks demonstrate that Stackelberg PPO outperforms standard PPO in both stability and final performance, opening the way for dramatically more efficient robotics designs.
High-Probability Bounds for SGD under the Polyak-Lojasiewicz Condition with Markovian Noise
Kar, Avik, Chandak, Siddharth, Singh, Rahul, Moulines, Eric, Bhatnagar, Shalabh, Bambos, Nicholas
We present the first uniform-in-time high-probability bound for SGD under the PL condition, where the gradient noise contains both Markovian and martingale difference components. This significantly broadens the scope of finite-time guarantees, as the PL condition arises in many machine learning and deep learning models while Markovian noise naturally arises in decentralized optimization and online system identification problems. We further allow the magnitude of noise to grow with the function value, enabling the analysis of many practical sampling strategies. In addition to the high-probability guarantee, we establish a matching $1/k$ decay rate for the expected suboptimality. Our proof technique relies on the Poisson equation to handle the Markovian noise and a probabilistic induction argument to address the lack of almost-sure bounds on the objective. Finally, we demonstrate the applicability of our framework by analyzing three practical optimization problems: token-based decentralized linear regression, supervised learning with subsampling for privacy amplification, and online system identification.
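For readers unfamiliar with the assumption, the PL condition in its standard textbook form is stated below. The symbols $f$ (objective), $f^\star$ (its minimum value), and $\mu > 0$ (the PL constant) are notation assumed here and do not appear in the abstract; the paper may use a different but equivalent formulation.

$$\tfrac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu\,\bigl(f(x) - f^\star\bigr) \quad \text{for all } x.$$

The condition does not require convexity. It only requires that the gradient be large wherever the suboptimality $f(x) - f^\star$ is large, and the $1/k$ rate mentioned above refers to the expectation of that suboptimality.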
When Should Humans Step In? Optimal Human Dispatching in AI-Assisted Decisions
Tan, Lezhi, Sagan, Naomi, Lei, Lihua, Blanchet, Jose
AI systems increasingly assist human decision making by producing preliminary assessments of complex inputs. However, such AI-generated assessments can often be noisy or systematically biased, raising a central question: how should costly human effort be allocated to correct AI outputs where it matters the most for the final decision? We propose a general decision-theoretic framework for human-AI collaboration in which AI assessments are treated as factor-level signals and human judgments as costly information that can be selectively acquired. We consider cases where the optimal selection problem reduces to maximizing a reward associated with each candidate subset of factors, and turn policy design into reward estimation. We develop estimation procedures under both nonparametric and linear models, covering contextual and non-contextual selection rules. In the linear setting, the optimal rule admits a closed-form expression with a clear interpretation in terms of factor importance and residual variance. We apply our framework to AI-assisted peer review. Our approach substantially outperforms LLM-only predictions and achieves performance comparable to full human review while using only 20-30% of the human information. Across different selection rules, we find that simpler rules derived under linear models can significantly reduce computational cost without harming final prediction performance. Our results highlight both the value of human intervention and the efficiency of principled dispatching.
Child abuse material 'systemic' on Elon Musk's X amid Grok scandal, Australian online safety regulator warned
Australia's eSafety commissioner wrote to X in January after its AI chatbot Grok was used to generate sexualised images of women and children online. The Australian online safety regulator warned Elon Musk's X amid the Grok sexualised image generation scandal that it found child abuse material was "particularly systemic" on X and more accessible than on "any other mainstream service", correspondence obtained by Guardian Australia reveals. The eSafety commissioner wrote to X in January after its chatbot, Grok, was used to generate sexualised images of women and children online, which the prime minister, Anthony Albanese, described as "abhorrent". In the letter, obtained by Guardian Australia under freedom of information laws, eSafety's general manager of regulatory operations, Heidi Snell, pointed to Musk's promise when taking over the platform in 2022 that "removing child exploitation is priority #1", but said "the availability of CSEM [child sexual exploitation material] continues to appear particularly systemic on X".