Performance Analysis
CodeMerge: Codebook-Guided Model Merging for Robust Test-Time Adaptation in Autonomous Driving
Yang, Huitong, Chen, Zhuoxiao, Zhang, Fengyi, Huang, Zi, Luo, Yadan
Maintaining robust 3D perception under dynamic and unpredictable test-time conditions remains a critical challenge for autonomous driving systems. Existing test-time adaptation (TTA) methods often fail in high-variance tasks like 3D object detection due to unstable optimization and sharp minima. While recent model merging strategies based on linear mode connectivity (LMC) offer improved stability by interpolating between fine-tuned checkpoints, they are computationally expensive, requiring repeated checkpoint access and multiple forward passes. In this paper, we introduce CodeMerge, a lightweight and scalable model merging framework that bypasses these limitations by operating in a compact latent space. Instead of loading full models, CodeMerge represents each checkpoint with a low-dimensional fingerprint derived from the source model's penultimate features and constructs a key-value codebook. We compute merging coefficients using ridge leverage scores on these fingerprints, enabling efficient model composition without compromising adaptation quality. Our method achieves strong performance across challenging benchmarks, improving end-to-end 3D detection 14.9% NDS on nuScenes-C and LiDAR-based detection by over 7.6% mAP on nuScenes-to-KITTI, while benefiting downstream tasks such as online mapping, motion prediction and planning even without training. Code and pretrained models are released in the supplementary material.
Accuracy vs. Accuracy: Computational Tradeoffs Between Classification Rates and Utility
Amit, Noga, Reingold, Omer, Rothblum, Guy N.
We revisit the foundations of fairness and its interplay with utility and efficiency in settings where the training data contain richer labels, such as individual types, rankings, or risk estimates, rather than just binary outcomes. In this context, we propose algorithms that achieve stronger notions of evidence-based fairness than are possible in standard supervised learning. Our methods support classification and ranking techniques that preserve accurate subpopulation classification rates, as suggested by the underlying data distributions, across a broad class of classification rules and downstream applications. Furthermore, our predictors enable loss minimization, whether aimed at maximizing utility or in the service of fair treatment. Complementing our algorithmic contributions, we present impossibility results demonstrating that simultaneously achieving accurate classification rates and optimal loss minimization is, in some cases, computationally infeasible. Unlike prior impossibility results, our notions are not inherently in conflict and are simultaneously satisfied by the Bayes-optimal predictor. Furthermore, we show that each notion can be satisfied individually via efficient learning. Our separation thus stems from the computational hardness of learning a sufficiently good approximation of the Bayes-optimal predictor. These computational impossibilities present a choice between two natural and attainable notions of accuracy that could both be motivated by fairness.
Towards Trustworthy Keylogger detection: A Comprehensive Analysis of Ensemble Techniques and Feature Selections through Explainable AI
Keylogger detection involves monitoring for unusual system behaviors such as delays between typing and character display, analyzing network traffic patterns for data exfiltration. In this study, we provide a comprehensive analysis for keylogger detection with traditional machine learning models - SVC, Random Forest, Decision Tree, XGBoost, AdaBoost, Logistic Regression and Naive Bayes and advanced ensemble methods including Stacking, Blending and Voting. Moreover, feature selection approaches such as Information gain, Lasso L1 and Fisher Score are thoroughly assessed to improve predictive performance and lower computational complexity. The Keylogger Detection dataset from publicly available Kaggle website is used in this project. In addition to accuracy-based classification, this study implements the approach for model interpretation using Explainable AI (XAI) techniques namely SHAP (Global) and LIME (Local) to deliver finer explanations for how much each feature contributes in assisting or hindering the detection process. To evaluate the models result, we have used AUC score, sensitivity, Specificity, Accuracy and F1 score. The best performance was achieved by AdaBoost with 99.76% accuracy, F1 score of 0.99, 100% precision, 98.6% recall, 1.0 specificity and 0.99 of AUC that is near-perfect classification with Fisher Score.
BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research
Wang, Zifeng, Danek, Benjamin, Sun, Jimeng
Validating scientific hypotheses is a central challenge in biomedical research, and remains difficult for artificial intelligence (AI) agents due to the complexity of real-world data analysis and evidence interpretation. In this work, we present BioDSA-1K, a benchmark designed to evaluate AI agents on realistic, data-driven biomedical hypothesis validation tasks. BioDSA-1K consists of 1,029 hypothesis-centric tasks paired with 1,177 analysis plans, curated from over 300 published biomedical studies to reflect the structure and reasoning found in authentic research workflows. Each task includes a structured hypothesis derived from the original study's conclusions, expressed in the affirmative to reflect the language of scientific reporting, and one or more pieces of supporting evidence grounded in empirical data tables. While these hypotheses mirror published claims, they remain testable using standard statistical or machine learning methods. The benchmark enables evaluation along four axes: (1) hypothesis decision accuracy, (2) alignment between evidence and conclusion, (3) correctness of the reasoning process, and (4) executability of the AI-generated analysis code. Importantly, BioDSA-1K includes non-verifiable hypotheses: cases where the available data are insufficient to support or refute a claim, reflecting a common yet underexplored scenario in real-world science. We propose BioDSA-1K as a foundation for building and evaluating generalizable, trustworthy AI agents for biomedical discovery.
TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning
Xu, Zhangchen, Li, Yuetai, Jiang, Fengqing, Ramasubramanian, Bhaskar, Niu, Luyao, Lin, Bill Yuchen, Poovendran, Radha
Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet, RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem--false negatives--where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose tinyV, a lightweight LLM-based verifier that augments existing rule-based methods, which dynamically identifies potential false negatives and recovers valid responses to produce more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improve RL-based fine-tuning of LLMs. Our code is available at https://github.com/uw-nsl/TinyV.
Benchmarking Ophthalmology Foundation Models for Clinically Significant Age Macular Degeneration Detection
Cohen, Benjamin A., Fhima, Jonathan, Meisel, Meishar, Meital, Baskin, Nakayama, Luis Filipe, Berkowitz, Eran, Behar, Joachim A.
Self-supervised learning (SSL) has enabled Vision Transformers (ViTs) to learn robust representations from large-scale natural image datasets, enhancing their generalization across domains. In retinal imaging, foundation models pretrained on either natural or ophthalmic data have shown promise, but the benefits of in-domain pretraining remain uncertain. To investigate this, we benchmark six SSL-pretrained ViTs on seven digital fundus image (DFI) datasets totaling 70,000 expert-annotated images for the task of moderate-to-late age-related macular degeneration (AMD) identification. Our results show that iBOT pretrained on natural images achieves the highest out-of-distribution generalization, with AUROCs of 0.80-0.97, outperforming domain-specific models, which achieved AUROCs of 0.78-0.96 and a baseline ViT-L with no pretraining, which achieved AUROCs of 0.68-0.91. These findings highlight the value of foundation models in improving AMD identification and challenge the assumption that in-domain pretraining is necessary. Furthermore, we release BRAMD, an open-access dataset (n=587) of DFIs with AMD labels from Brazil.
An Inclusive Foundation Model for Generalizable Cytogenetics in Precision Oncology
Yang, Changchun, Dai, Weiqian, Zhang, Yilan, Chen, Siyuan, Hu, Jingdong, Su, Junkai, Chen, Yuxuan, Xu, Ao, Li, Na, Gao, Xin, Yu, Yongguo
Chromosome analysis is vital for diagnosing genetic disorders and guiding cancer therapy decisions through the identification of somatic clonal aberrations. However, developing an AI model are hindered by the overwhelming complexity and diversity of chromosomal abnormalities, requiring extensive annotation efforts, while automated methods remain task-specific and lack generalizability due to the scarcity of comprehensive datasets spanning diverse resource conditions. Here, we introduce CHROMA, a foundation model for cytogenomics, designed to overcome these challenges by learning generalizable representations of chromosomal abnormalities. Pre -trained on over 84,000 specimens (~4 million chromosomal images) via self -supervised learning, CHROMA outperforms other methods across all types of abnormalities, even when trained on fewer labelled data and more imbalanced datasets. By facilitating comprehensive mapping of instability and clonal leisons across various aberration types, CHROMA offers a scalable and generalizable solution for reliable and automated clinical analysis, reducing the annotation workload for experts and advancing precision oncology through the early detection of rare genomic abnormalities, enabling broad clinical AI applications and making advanced genomic analysis more accessible.
Advancing Tabular Stroke Modelling Through a Novel Hybrid Architecture and Feature-Selection Synergy
Islam, Yousuf, Chowdhury, Md. Jalal Uddin, Das, Sumon Chandra
Brain stroke remains one of the principal causes of death and disability worldwide, yet most tabular-data prediction models still hover below the 95% accuracy threshold, limiting real-world utility. Addressing this gap, the present work develops and validates a completely data-driven and interpretable machine-learning framework designed to predict strokes using ten routinely gathered demographic, lifestyle, and clinical variables sourced from a public cohort of 4,981 records. We employ a detailed exploratory data analysis (EDA) to understand the dataset's structure and distribution, followed by rigorous data preprocessing, including handling missing values, outlier removal, and class imbalance correction using Synthetic Minority Over-sampling Technique (SMOTE). To streamline feature selection, point-biserial correlation and random-forest Gini importance were utilized, and ten varied algorithms-encompassing tree ensembles, boosting, kernel methods, and a multilayer neural network-were optimized using stratified five-fold cross-validation. Their predictions based on probabilities helped us build the proposed model, which included Random Forest, XGBoost, LightGBM, and a support-vector classifier, with logistic regression acting as a meta-learner. The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%, indicating a significant enhancement compared to the leading individual model, LightGBM, which had an accuracy of 91.4%. Our study's findings indicate that rigorous preprocessing, coupled with a diverse hybrid model, can convert low-cost tabular data into a nearly clinical-grade stroke-risk assessment tool.
Anomaly Detection Based on Critical Paths for Deep Neural Networks
Zhao, Fangzhen, Zhang, Chenyi, Dong, Naipeng, Li, Ming, Shan, Jinxiao
Deep neural networks (DNNs) are notoriously hard to understand and difficult to defend. Extracting representative paths (including the neuron activation values and the connections between neurons) from DNNs using software engineering approaches has recently shown to be a promising approach in interpreting the decision making process of blackbox DNNs, as the extracted paths are often effective in capturing essential features. With this in mind, this work investigates a novel approach that extracts critical paths from DNNs and subsequently applies the extracted paths for the anomaly detection task, based on the observation that outliers and adversarial inputs do not usually induce the same activation pattern on those paths as normal (in-distribution) inputs. In our approach, we first identify critical detection paths via genetic evolution and mutation. Since different paths in a DNN often capture different features for the same target class, we ensemble detection results from multiple paths by integrating random subspace sampling and a voting mechanism. Compared with state-of-the-art methods, our experimental results suggest that our method not only outperforms them, but it is also suitable for the detection of a broad range of anomaly types with high accuracy.
Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications
Megahed, Fadel M., Chen, Ying-Ju, Jones-Farmer, L. Allision, Lee, Younghwa, Wang, Jiawei Brooke, Zwetsloot, Inez M.
LLM-based annotation has become something of an academic Wild West: the lack of established practices and standards has led to concerns about the quality and validity of research. Researchers have warned that the ostensible simplicity of LLMs can be misleading, as they are prone to bias, misunderstandings, and unreliable results [1, p.1]. LLMs outperform typical human annotators. The evidence is consistent across different types of texts and time periods. It strongly suggests that ChatGPT may already be a superior approach compared to crowd annotations on platforms such as MTurk. At the very least, the findings demonstrate the importance of studying the text-annotation properties and capabilities of LLMs more in depth [2, p.2]. Together, these contrasting perspectives highlight the need to critically examine large language models (LLMs) for text annotation and classification. Although human annotation remains widespread, it poses considerable challenges. It is time-consuming and costly--up to $5 per annotation and $50 per hour for annotators [3]--and often suffers from inconsistencies stemming from the intricacies of language and the subjectivity of annotators [4].