Goto

Collaborating Authors

 Performance Analysis



Efficient Prediction of Pass@k Scaling in Large Language Models

arXiv.org Machine Learning

Assessing the capabilities and risks of frontier AI systems is a critical area of research, and recent work has shown that repeated sampling from models can dramatically increase both. For instance, repeated sampling has been shown to increase their capabilities, such as solving difficult math and coding problems, but it has also been shown to increase their potential for harm, such as being jailbroken. Such results raise a crucial question for both capability and safety forecasting: how can one accurately predict a model's behavior when scaled to a massive number of attempts, given a vastly smaller sampling budget? This question is directly relevant to model providers, who serve hundreds of millions of users daily, and to governmental regulators, who seek to prevent harms. To answer this questions, we make three contributions. First, we find that standard methods for fitting these laws suffer from statistical shortcomings that hinder predictive accuracy, especially in data-limited scenarios. Second, we remedy these shortcomings by introducing a robust estimation framework, which uses a beta-binomial distribution to generate more accurate predictions from limited data. Third, we propose a dynamic sampling strategy that allocates a greater budget to harder problems. Combined, these innovations enable more reliable prediction of rare risks and capabilities at a fraction of the computational cost.


Object Centric Concept Bottlenecks

arXiv.org Artificial Intelligence

Developing high-performing, yet interpretable models remains a critical challenge in modern AI. Concept-based models (CBMs) attempt to address this by extracting human-understandable concepts from a global encoding (e.g., image encoding) and then applying a linear classifier on the resulting concept activations, enabling transparent decision-making. However, their reliance on holistic image encodings limits their expressiveness in object-centric real-world settings and thus hinders their ability to solve complex vision tasks beyond single-label classification. To tackle these challenges, we introduce Object-Centric Concept Bottlenecks (OCB), a framework that combines the strengths of CBMs and pre-trained object-centric foundation models, boosting performance and interpretability. We evaluate OCB on complex image datasets and conduct a comprehensive ablation study to analyze key components of the framework, such as strategies for aggregating object-concept encodings. The results show that OCB outperforms traditional CBMs and allows one to make interpretable decisions for complex visual tasks.


From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning

arXiv.org Artificial Intelligence

Trustworthy verifiers are essential for the success of reinforcement learning with verifiable reward (RLVR), which is the core methodology behind various large reasoning models such as DeepSeek-R1. In complex domains like mathematical reasoning, rule-based verifiers have been widely adopted in previous works to train strong reasoning models. However, the reliability of these verifiers and their impact on the RL training process remain poorly understood. In this work, we take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across multiple commonly used mathematical datasets, resulting in non-negligible false negative rates. This limitation adversely affects RL training performance and becomes more pronounced as the policy model gets stronger. Subsequently, we investigate model-based verifiers as a potential solution to address these limitations. While the static evaluation shows that model-based verifiers achieve significantly higher verification accuracy, further analysis and RL results imply that they are highly susceptible to hacking, where they misclassify certain patterns in responses as correct, particularly after fine-tuning. This vulnerability is exploited during policy model optimization, leading to artificially inflated rewards. Our findings underscore the unique challenges inherent to both rule-based and model-based verifiers and provide insights toward developing more accurate and robust reward systems for reinforcement learning.


Tracing Multilingual Factual Knowledge Acquisition in Pretraining

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B as a case study. We find that both accuracy and consistency improve over time for most languages. We show that this improvement is primarily driven by the fact frequency in the pretraining corpus: more frequent facts are more likely to be recalled correctly, regardless of language. Yet, some low-frequency facts in non-English languages can still be correctly recalled. Our analysis reveals that these instances largely benefit from crosslingual transfer of their English counterparts -- an effect that emerges predominantly in the early stages of pretraining. We pinpoint two distinct pathways through which multilingual factual knowledge acquisition occurs: (1) frequency-driven learning, which is dominant and language-agnostic, and (2) crosslingual transfer, which is limited in scale and typically constrained to relation types involving named entities. We release our code and data to facilitate further research at https://github.com/cisnlp/multilingual-fact-tracing.


Downsized and Compromised?: Assessing the Faithfulness of Model Compression

arXiv.org Artificial Intelligence

In real-world applications, computational constraints often require transforming large models into smaller, more efficient versions through model compression. While these techniques aim to reduce size and computational cost without sacrificing performance, their evaluations have traditionally focused on the trade-off between size and accuracy, overlooking the aspect of model faithfulness. This limited view is insufficient for high-stakes domains like healthcare, finance, and criminal justice, where compressed models must remain faithful to the behavior of their original counterparts. This paper presents a novel approach to evaluating faithfulness in compressed models, moving beyond standard metrics. We introduce and demonstrate a set of faithfulness metrics that capture how model behavior changes post-compression. Our contributions include introducing techniques to assess predictive consistency between the original and compressed models using model agreement, and applying chi-squared tests to detect statistically significant changes in predictive patterns across both the overall dataset and demographic subgroups, thereby exposing shifts that aggregate fairness metrics may obscure. We demonstrate our approaches by applying quantization and pruning to artificial neural networks (ANNs) trained on three diverse and socially meaningful datasets. Our findings show that high accuracy does not guarantee faithfulness, and our statistical tests detect subtle yet significant shifts that are missed by standard metrics, such as Accuracy and Equalized Odds. The proposed metrics provide a practical and more direct method for ensuring that efficiency gains through compression do not compromise the fairness or faithfulness essential for trustworthy AI.


LLM-FS-Agent: A Deliberative Role-based Large Language Model Architecture for Transparent Feature Selection

arXiv.org Artificial Intelligence

High-dimensional data remains a pervasive challenge in machine learning, often undermining model interpretability and computational efficiency. While Large Language Models (LLMs) have shown promise for dimensionality reduction through feature selection, existing LLM-based approaches frequently lack structured reasoning and transparent justification for their decisions. This paper introduces LLM-FS-Agent, a novel multi-agent architecture designed for interpretable and robust feature selection. The system orchestrates a deliberative "debate" among multiple LLM agents, each assigned a specific role, enabling collective evaluation of feature relevance and generation of detailed justifications. We evaluate LLM-FS-Agent in the cybersecurity domain using the CIC-DIAD 2024 IoT intrusion detection dataset and compare its performance against strong baselines, including LLM-Select and traditional methods such as PCA. Experimental results demonstrate that LLM-FS-Agent consistently achieves superior or comparable classification performance while reducing downstream training time by an average of 46% (statistically significant improvement, p = 0.028 for XGBoost). These findings highlight that the proposed deliberative architecture enhances both decision transparency and computational efficiency, establishing LLM-FS-Agent as a practical and reliable solution for real-world applications.


An Attention-Augmented VAE-BiLSTM Framework for Anomaly Detection in 12-Lead ECG Signals

arXiv.org Artificial Intelligence

Anomaly detection in 12-lead electrocardiograms (ECGs) is critical for identifying deviations associated with cardiovascular disease. This work presents a comparative analysis of three autoencoder-based architectures: con-volutional autoencoder (CAE), variational autoencoder with bidirectional long short-term memory (VAE-BiLSTM), and VAE-BiLSTM with multi-head attention (VAE-BiLSTM-MHA), for unsupervised anomaly detection in ECGs. T o the best of our knowledge, this study reports the first application of a VAE-BiLSTM-MHA architecture to ECG anomaly detection. All models are trained on normal ECG samples to reconstruct non-anomalous cardiac morphology and detect deviations indicative of disease. Using a unified preprocessing and evaluation pipeline on the public China Physiological Signal Challenge (CPSC) dataset, the attention-augmented VAE achieves the best performance, with an AUPRC of 0.81 and a recall of 0.85 on the held-out test set, outperforming the other architectures. T o support clinical triage, this model is further integrated into an interactive dashboard that visualizes anomaly localization. In addition, a performance comparison with baseline models from the literature is provided.


Kaputt: A Large-Scale Dataset for Visual Defect Detection

arXiv.org Artificial Intelligence

We present a novel large-scale dataset for defect detection in a logistics setting. Recent work on industrial anomaly detection has primarily focused on manufacturing scenarios with highly controlled poses and a limited number of object categories. Existing benchmarks like MVTec-AD [6] and VisA [33] have reached saturation, with state-of-the-art methods achieving up to 99.9% AUROC scores. In contrast to manufacturing, anomaly detection in retail logistics faces new challenges, particularly in the diversity and variability of object pose and appearance. Leading anomaly detection methods fall short when applied to this new setting. To bridge this gap, we introduce a new benchmark that overcomes the current limitations of existing datasets. With over 230,000 images (and more than 29,000 defective instances), it is 40 times larger than MVTec-AD and contains more than 48,000 distinct objects. To validate the difficulty of the problem, we conduct an extensive evaluation of multiple state-of-the-art anomaly detection methods, demonstrating that they do not surpass 56.96% AUROC on our dataset. Further qualitative analysis confirms that existing methods struggle to leverage normal samples under heavy pose and appearance variation. With our large-scale dataset, we set a new benchmark and encourage future research towards solving this challenging problem in retail logistics anomaly detection. The dataset is available for download under https://www.kaputt-dataset.com.


DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities

arXiv.org Artificial Intelligence

Unsupervised representation learning, particularly sequential disentanglement, aims to separate static and dynamic factors of variation in data without relying on labels. This remains a challenging problem, as existing approaches based on variational autoencoders and generative adversarial networks often rely on multiple loss terms, complicating the optimization process. Furthermore, sequential disentanglement methods face challenges when applied to real-world data, and there is currently no established evaluation protocol for assessing their performance in such settings. Recently, diffusion models have emerged as state-of-the-art generative models, but no theoretical formalization exists for their application to sequential disentanglement. In this work, we introduce the Diffusion Sequential Disentanglement Autoencoder (DiffSDA), a novel, modal-agnostic framework effective across diverse real-world data modalities, including time series, video, and audio. DiffSDA leverages a new probabilistic modeling, latent diffusion, and efficient samplers, while incorporating a challenging evaluation protocol for rigorous testing. Our experiments on diverse real-world benchmarks demonstrate that DiffSDA outperforms recent state-of-the-art methods in sequential disentanglement.