Performance Analysis
Beyond Membership: Limitations of Add/Remove Adjacency in Differential Privacy
Pradhan, Gauri, Jälkö, Joonas, Zanella-Bèguelin, Santiago, Honkela, Antti
Training machine learning models with differential privacy (DP) limits an adversary's ability to infer sensitive information about the training data. It can be interpreted as a bound on adversary's capability to distinguish two adjacent datasets according to chosen adjacency relation. In practice, most DP implementations use the add/remove adjacency relation, where two datasets are adjacent if one can be obtained from the other by adding or removing a single record, thereby protecting membership. In many ML applications, however, the goal is to protect attributes of individual records (e.g., labels used in supervised fine-tuning). We show that privacy accounting under add/remove overstates attribute privacy compared to accounting under the substitute adjacency relation, which permits substituting one record. To demonstrate this gap, we develop novel attacks to audit DP under substitute adjacency, and show empirically that audit results are inconsistent with DP guarantees reported under add/remove, yet remain consistent with the budget accounted under the substitute adjacency relation. Our results highlight that the choice of adjacency when reporting DP guarantees is critical when the protection target is per-record attributes rather than membership.
Multiclass threshold-based classification and model evaluation
Legnaro, Edoardo, Guastavino, Sabrina, Marchetti, Francesco
In this paper, we introduce a threshold-based framework for multiclass classification that generalizes the standard argmax rule. This is done by replacing the probabilistic interpretation of softmax outputs with a geometric one on the multidimensional simplex, where the classification depends on a multidimensional threshold. This change of perspective enables for any trained classification network an \textit{a posteriori} optimization of the classification score by means of threshold tuning, as usually carried out in the binary setting, thus allowing for a further refinement of the prediction capability of any network. Our experiments show indeed that multidimensional threshold tuning yields performance improvements across various networks and datasets. Moreover, we derive a multiclass ROC analysis based on \emph{ROC clouds} -- the attainable (FPR,TPR) operating points induced by a single multiclass threshold -- and summarize them via a \emph{Distance From Point} (DFP) score to $(0,1)$. This yields a coherent alternative to standard One-vs-Rest (OvR) curves and aligns with the observed tuning gains.
The Rapid Growth of AI Foundation Model Usage in Science
Trišović, Ana, Fogelson, Alex, Sivaloganathan, Janakan, Thompson, Neil
We present the first large-scale analysis of AI foundation model usage in science - not just citations or keywords. We find that adoption has grown rapidly, at nearly-exponential rates, with the highest uptake in Linguistics, Computer Science, and Engineering. Vision models are the most used foundation models in science, although language models' share is growing. Open-weight models dominate. As AI builders increase the parameter counts of their models, scientists have followed suit but at a much slower rate: in 2013, the median foundation model built was 7.7x larger than the median one adopted in science, by 2024 this had jumped to 26x. We also present suggestive evidence that scientists' use of these smaller models may be limiting them from getting the full benefits of AI-enabled science, as papers that use larger models appear in higher-impact journals and accrue more citations.
An Optimized Machine Learning Classifier for Detecting Fake Reviews Using Extracted Features
Anees, Shabbir, Anshuman, null, Chaurasia, Ayush, Bogar, Prathmesh
It is well known that fraudulent reviews cast doubt on the legitimacy and dependability of online purchases. The most recent development that leads customers towards darkness is the appearance of human reviews in computer-generated (CG) ones. In this work, we present an advanced machine-learning-based system that analyses these reviews produced by AI with remarkable precision. Our method integrates advanced text preprocessing, multi-modal feature extraction, Harris Hawks Optimization (HHO) for feature selection, and a stacking ensemble classifier. We implemented this methodology on a public dataset of 40,432 Original (OR) and Computer-Generated (CG) reviews. From an initial set of 13,539 features, HHO selected the most applicable 1,368 features, achieving an 89.9% dimensionality reduction. Our final stacking model achieved 95.40% accuracy, 92.81% precision, 95.01% recall, and a 93.90% F1-Score, which demonstrates that the combination of ensemble learning and bio-inspired optimisation is an effective method for machine-generated text recognition. Because large-scale review analytics commonly run on cloud platforms, privacy-preserving techniques such as differential approaches and secure outsourcing are essential to protect user data in these systems.
Manifold-Aware Diffusion-Augmented Contrastive Learning for Noise-Robust Biosignal Representation
Learning robust representations for physiological time-series signals continues to pose a substantial challenge in developing efficient few-shot learning applications. This difficulty is largely due to the complex pathological variations in biosignals. In this context, this paper introduces a manifold-aware Diffusion-Augmented Contrastive Learning (DACL) framework, which efficiently leverages the generative structure of latent diffusion models with the discriminative power of supervised contrastive learning. The proposed framework operates within a contextualized scattering latent space derived from Scattering Transformer (ST) features. Within a contrastive learning framework, we employ a forward diffusion process in the scattering latent space as a structured manifold-aware feature augmentation technique. We assessed the proposed framework using the PhysioNet 2017 ECG benchmark dataset. The proposed method achieved a competitive AUROC of 0.9741 in the task of detecting atrial fibrillation from a single-lead ECG signal. The proposed framework achieved performance on par with relevant state-of-the-art related works. In-depth evaluation findings suggest that early-stage diffusion serves as an ideal "local manifold explorer," producing embeddings with greater precision than typical augmentation methods while preserving inference efficiency.
ChronoGraph: A Real-World Graph-Based Multivariate Time Series Dataset
Lutu, Adrian Catalin, Pintilie, Ioana, Burceanu, Elena, Manolache, Andrei
We present ChronoGraph, a graph-structured multivariate time series forecasting dataset built from real-world production microservices. Each node is a service that emits a multivariate stream of system-level performance metrics, capturing CPU, memory, and network usage patterns, while directed edges encode dependencies between services. The primary task is forecasting future values of these signals at the service level. In addition, ChronoGraph provides expert-annotated incident windows as anomaly labels, enabling evaluation of anomaly detection methods and assessment of forecast robustness during operational disruptions. Compared to existing benchmarks from industrial control systems or traffic and air-quality domains, ChronoGraph uniquely combines (i) multivariate time series, (ii) an explicit, machine-readable dependency graph, and (iii) anomaly labels aligned with real incidents. We report baseline results spanning forecasting models, pretrained time-series foundation models, and standard anomaly detectors. ChronoGraph offers a realistic benchmark for studying structure-aware forecasting and incident-aware evaluation in microservice systems.
Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs
Hong, Yunqi, An, Sohyun, Bai, Andrew, Lin, Neil Y. C., Hsieh, Cho-Jui
Despite Multimodal Large Language Models (MLLMs) showing promising results on general zero-shot image classification tasks, fine-grained image classification remains challenging. It demands precise attention to subtle visual details to distinguish between visually similar subcategories--details that MLLMs may easily overlook without explicit guidance. To address this, we introduce AutoSEP, an iterative self-supervised prompt learning framework designed to enhance MLLM fine-grained classification capabilities in a fully unsupervised manner. Our core idea is to leverage unlabeled data to learn a description prompt that guides MLLMs in identifying crucial discriminative features within an image, and boosts classification accuracy. We developed an automatic self-enhancing prompt learning framework called AutoSEP to iteratively improve the description prompt using unlabeled data, based on instance-level classification scoring function. AutoSEP only requires black-box access to MLLMs, eliminating the need for any training or fine-tuning. We evaluate our approach on multiple fine-grained classification datasets. It consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP on average improves 13 percent over standard zero-shot classification and 5 percent over the best-performing baselines. Code is available at: https://github.com/yq-hong/AutoSEP
CDR-Agent: Intelligent Selection and Execution of Clinical Decision Rules Using Large Language Model Agents
Xiang, Zhen, Hsu, Aliyah R., Zane, Austin V., Kornblith, Aaron E., Lin-Martore, Margaret J., Kaur, Jasmanpreet C., Dokiparthi, Vasuda M., Li, Bo, Yu, Bin
Clinical decision-making is inherently complex and fast-paced, particularly in emergency departments (EDs) where critical, rapid and high-stakes decisions are made. Clinical Decision Rules (CDRs) are standardized evidence-based tools that combine signs, symptoms, and clinical variables into decision trees to make consistent and accurate diagnoses. CDR usage is often hindered by the clinician's cognitive load, limiting their ability to quickly recall and apply the appropriate rules. We introduce CDR-Agent, a novel LLM-based system designed to enhance ED decision-making by autonomously identifying and applying the most appropriate CDRs based on unstructured clinical notes. To validate CDR-Agent, we curated two novel ED datasets: synthetic and CDR-Bench, although CDR-Agent is applicable to non ED clinics. CDR-Agent achieves a 56.3\% (synthetic) and 8.7\% (CDR-Bench) accuracy gain relative to the standalone LLM baseline in CDR selection. Moreover, CDR-Agent significantly reduces computational overhead. Using these datasets, we demonstrated that CDR-Agent not only selects relevant CDRs efficiently, but makes cautious yet effective imaging decisions by minimizing unnecessary interventions while successfully identifying most positively diagnosed cases, outperforming traditional LLM prompting approaches. Code for our work can be found at: https://github.com/zhenxianglance/medagent-cdr-agent
Estimation in high-dimensional linear regression: Post-Double-Autometrics as an alternative to Post-Double-Lasso
Hué, Sullivan, Laurent, Sébastien, Aiounou, Ulrich, Flachaire, Emmanuel
Post-Double-Lasso is becoming the most popular method for estimating linear regression models with many covariates when the purpose is to obtain an accurate estimate of a parameter of interest, such as an average treatment effect. However, this method can suffer from substantial omitted variable bias in finite sample. We propose a new method called Post-Double-Autometrics, which is based on Autometrics, and show that this method outperforms Post-Double-Lasso.
How to Correctly Report LLM-as-a-Judge Evaluations
Lee, Chungpa, Zeng, Thomas, Jeong, Jongwon, Sohn, Jy-yong, Lee, Kangwook
Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to imperfect specificity and sensitivity of LLMs, leading to biased accuracy estimates. Although bias-correction methods exist, they are underutilized in LLM research and typically assume exact knowledge of the model's specificity and sensitivity. Furthermore, in general we only have estimates of these values and it is not well known how to properly construct confidence intervals using only estimates. This work presents a simple plug-in framework that corrects such bias and constructs confidence intervals reflecting uncertainty from both test and calibration dataset, enabling practical and statistically sound LLM-based evaluation. Additionally, to reduce uncertainty in the accuracy estimate, we introduce an adaptive algorithm that efficiently allocates calibration sample sizes.