Performance Analysis
Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
Zhong, Ziqian, Raghunathan, Aditi
The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution. In this work, we introduce a new method for understanding, monitoring and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby side stepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision. For backdoored models that bypasses safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover "unlearned" information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focus including marketing strategies and Midjourney prompt generation. Our implementation can be found at https://github.com/fjzzq2002/WeightWatch.
Unsupervised anomaly detection using Bayesian flow networks: application to brain FDG PET in the context of Alzheimer's disease
Roy, Hugues, Dorent, Reuben, Burgos, Ninon
Unsupervised anomaly detection (UAD) plays a crucial role in neuroimaging for identifying deviations from healthy subject data and thus facilitating the diagnosis of neurological disorders. In this work, we focus on Bayesian flow networks (BFNs), a novel class of generative models, which have not yet been applied to medical imaging or anomaly detection. BFNs combine the strength of diffusion frameworks and Bayesian inference. We introduce AnoBFN, an extension of BFNs for UAD, designed to: i) perform conditional image generation under high levels of spatially correlated noise, and ii) preserve subject specificity by incorporating a recursive feedback from the input image throughout the generative process. We evaluate AnoBFN on the challenging task of Alzheimer's disease-related anomaly detection in FDG PET images. Our approach outperforms other state-of-the-art methods based on V AEs ( ฮฒ -VAE), GANs (f-AnoGAN), and diffusion models (AnoDDPM), demonstrating its effectiveness at detecting anomalies while reducing false positive rates.
From Perception Logs to Failure Modes: Language-Driven Semantic Clustering of Failures for Robot Safety
Gupta, Aryaman, Ciftci, Yusuf Umut, Bansal, Somil
As robotic systems become increasingly integrated into real-world environments -- ranging from autonomous vehicles to household assistants -- they inevitably encounter diverse and unstructured scenarios that lead to failures. While such failures pose safety and reliability challenges, they also provide rich perceptual data for improving future performance. However, manually analyzing large-scale failure datasets is impractical. In this work, we present a method for automatically organizing large-scale robotic failure data into semantically meaningful failure clusters, enabling scalable learning from failure without human supervision. Our approach leverages the reasoning capabilities of Multimodal Large Language Models (MLLMs), trained on internet-scale data, to infer high-level failure causes from raw perceptual trajectories and discover interpretable structure within uncurated failure logs. These semantic clusters reveal patterns and hypothesized causes of failure, enabling scalable learning from experience. We demonstrate that the discovered failure modes can guide targeted data collection for policy refinement, accelerating iterative improvement in agent policies and overall safety. Additionally, we show that these semantic clusters can benefit online failure monitoring systems, offering a lightweight yet powerful safeguard for real-time operation. We demonstrate that this framework enhances robot learning and robustness by transforming real-world failures into actionable and interpretable signals for adaptation.
Navigating the Latent Space Dynamics of Neural Models
Fumero, Marco, Moschella, Luca, Rodolร , Emanuele, Locatello, Francesco
Neural networks transform high-dimensional data into compact, structured representations, often modeled as elements of a lower dimensional latent space. In this paper, we present an alternative interpretation of neural models as dynamical systems acting on the latent manifold. Specifically, we show that autoencoder models implicitly define a latent vector field on the manifold, derived by iteratively applying the encoding-decoding map, without any additional training. We observe that standard training procedures introduce inductive biases that lead to the emergence of attractor points within this vector field. Drawing on this insight, we propose to leverage the vector field as a representation for the network, providing a novel tool to analyze the properties of the model and the data. This representation enables to: (i) analyze the generalization and memorization regimes of neural models, even throughout training; (ii) extract prior knowledge encoded in the network's parameters from the attractors, without requiring any input data; (iii) identify out-of-distribution samples from their trajectories in the vector field. We further validate our approach on vision foundation models, showcasing the applicability and effectiveness of our method in real-world scenarios.
ToolCritic: Detecting and Correcting Tool-Use Errors in Dialogue Systems
Hamad, Hassan, Xu, Yingru, Zhao, Liang, Yan, Wenbo, Gyanchandani, Narendra
Tool-augmented large language models (LLMs) are increasingly employed in real-world applications, but tool usage errors still hinder their reliability. We introduce ToolCritic, a diagnostic framework that evaluates and improves LLM behavior in multi-turn, tool-augmented dialogues. ToolCritic detects eight distinct error types specific to tool-calling (e.g., premature invocation, argument misalignment, and misinterpretation of tool outputs) and provides targeted feedback to the main LLM. The main LLM, assumed to have strong reasoning, task understanding and orchestration capabilities, then revises its response based on ToolCritic's feedback. We systematically define these error categories and construct a synthetic dataset to train ToolCritic. Experimental results on the Schema-Guided Dialogue (SGD) dataset demonstrate that ToolCritic improves tool-calling accuracy by up to 13% over baselines, including zero-shot prompting and self-correction techniques. This represents a promising step toward more robust LLM integration with external tools in real-world dialogue applications.
Leave It to the Experts: Detecting Knowledge Distillation via MoE Expert Signatures
Li, Pingzhi, Huang, Morris Yu-Chao, Tan, Zhen, Song, Qingquan, Peng, Jie, Zou, Kai, Cheng, Yu, Xu, Kaidi, Chen, Tianlong
Knowledge Distillation (KD) accelerates training of large language models (LLMs) but poses intellectual property protection and LLM diversity risks. Existing KD detection methods based on self-identity or output similarity can be easily evaded through prompt engineering. We present a KD detection framework effective in both white-box and black-box settings by exploiting an overlooked signal: the transfer of MoE "structural habits", especially internal routing patterns. Our approach analyzes how different experts specialize and collaborate across various inputs, creating distinctive fingerprints that persist through the distillation process. To extend beyond the white-box setup and MoE architectures, we further propose Shadow-MoE, a black-box method that constructs proxy MoE representations via auxiliary distillation to compare these patterns between arbitrary model pairs. We establish a comprehensive, reproducible benchmark that offers diverse distilled checkpoints and an extensible framework to facilitate future research. Extensive experiments demonstrate >94% detection accuracy across various scenarios and strong robustness to prompt-based evasion, outperforming existing baselines while highlighting the structural habits transfer in LLMs.
ProtoMol: Enhancing Molecular Property Prediction via Prototype-Guided Multimodal Learning
Wang, Yingxu, Zhang, Kunyu, Huang, Jiaxin, Yin, Nan, Liu, Siwei, Segal, Eran
Multimodal molecular representation learning, which jointly models molecular graphs and their textual descriptions, enhances predictive accuracy and interpretability by enabling more robust and reliable predictions of drug toxicity, bioactivity, and physicochemical properties through the integration of structural and semantic information. However, existing multimodal methods suffer from two key limitations: (1) they typically perform cross-modal interaction only at the final encoder layer, thus overlooking hierarchical semantic dependencies; (2) they lack a unified prototype space for robust alignment between modalities. To address these limitations, we propose ProtoMol, a prototype-guided multimodal framework that enables fine-grained integration and consistent semantic alignment between molecular graphs and textual descriptions. ProtoMol incorporates dual-branch hierarchical encoders, utilizing Graph Neural Networks to process structured molecular graphs and Transformers to encode unstructured texts, resulting in comprehensive layer-wise representations. Then, ProtoMol introduces a layer-wise bidirectional cross-modal attention mechanism that progressively aligns semantic features across layers. Furthermore, a shared prototype space with learnable, class-specific anchors is constructed to guide both modalities toward coherent and discriminative representations. Extensive experiments on multiple benchmark datasets demonstrate that ProtoMol consistently outperforms state-of-the-art baselines across a variety of molecular property prediction tasks.
A three-step machine learning approach to predict market bubbles with financial news
This study presents a three-step machine learning framework to predict bubbles in the S&P 500 stock market by combining financial news sentiment with macroeconomic indicators. Building on traditional econometric approaches, the proposed approach predicts bubble formation by integrating textual and quantitative data sources. In the first step, bubble periods in the S&P 500 index are identified using a right-tailed unit root test, a widely recognized real-time bubble detection method. The second step extracts sentiment features from large-scale financial news articles using natural language processing (NLP) techniques, which capture investor's expectations and behavioral patterns. In the final step, ensemble learning methods are applied to predict bubble occurrences based on both sentiment-based and macroeconomic predictors. Model performance is evaluated through k-fold cross-validation and compared against benchmark machine learning algorithms. Empirical results indicate that the proposed three-step ensemble approach significantly improves predictive accuracy and robustness, providing valuable early warning insights for investors, regulators, and policymakers in mitigating systemic financial risks.
Predicting life satisfaction using machine learning and explainable AI
Khan, Alif Elham, Hasan, Mohammad Junayed, Anjum, Humayra, Mohammed, Nabeel, Momen, Sifat
Life satisfaction is a crucial facet of human well-being. Hence, research on life satisfaction is incumbent for understanding how individuals experience their lives and influencing interventions targeted at enhancing mental health and well-being. Life satisfaction has traditionally been measured using analog, complicated, and frequently error-prone methods. These methods raise questions concerning validation and propagation. However, this study demonstrates the potential for machine learning algorithms to predict life satisfaction with a high accuracy of 93.80% and a 73.00% macro F1-score. The dataset comes from a government survey of 19000 people aged 16-64 years in Denmark. Using feature learning techniques, 27 significant questions for assessing contentment were extracted, making the study highly reproducible, simple, and easily interpretable. Furthermore, clinical and biomedical large language models (LLMs) were explored for predicting life satisfaction by converting tabular data into natural language sentences through mapping and adding meaningful counterparts, achieving an accuracy of 93.74% and macro F1-score of 73.21%. It was found that life satisfaction prediction is more closely related to the biomedical domain than the clinical domain. Ablation studies were also conducted to understand the impact of data resampling and feature selection techniques on model performance. Moreover, the correlation between primary determinants with different age brackets was analyzed, and it was found that health condition is the most important determinant across all ages. This study demonstrates how machine learning, large language models and XAI can jointly contribute to building trust and understanding in using AI to investigate human behavior, with significant ramifications for academics and professionals working to quantify and comprehend subjective well-being.
Detecting Adversarial Fine-tuning with Auditing Agents
Egler, Sarah, Schulman, John, Carlini, Nicholas
Large Language Model (LLM) providers expose fine-tuning APIs that let end users fine-tune their frontier LLMs. Unfortunately, it has been shown that an adversary with fine-tuning access to an LLM can bypass safeguards. Particularly concerning, such attacks may avoid detection with datasets that are only implicitly harmful. Our work studies robust detection mechanisms for adversarial use of fine-tuning APIs. We introduce the concept of a fine-tuning auditing agent and show it can detect harmful fine-tuning prior to model deployment. We provide our auditing agent with access to the fine-tuning dataset, as well as the fine-tuned and pre-fine-tuned models, and request the agent assigns a risk score for the fine-tuning job. We evaluate our detection approach on a diverse set of eight strong fine-tuning attacks from the literature, along with five benign fine-tuned models, totaling over 1400 independent audits. These attacks are undetectable with basic content moderation on the dataset, highlighting the challenge of the task. With the best set of affordances, our auditing agent achieves a 56.2% detection rate of adversarial fine-tuning at a 1% false positive rate. Most promising, the auditor is able to detect covert cipher attacks that evade safety evaluations and content moderation of the dataset. While benign fine-tuning with unintentional subtle safety degradation remains a challenge, we establish a baseline configuration for further work in this area. We release our auditing agent at https://github.com/safety-research/finetuning-auditor.