Performance Analysis
ROBoto2: An Interactive System and Dataset for LLM-assisted Clinical Trial Risk of Bias Assessment
Hevia, Anthony, Chintalapati, Sanjana, Lai, Veronica Ka Wai, Nguyen, Thanh Tam, Wong, Wai-Tat, Klassen, Terry, Wang, Lucy Lu
We present ROBOTO2, an open-source, web-based platform for large language model (LLM)-assisted risk of bias (ROB) assessment of clinical trials. ROBOTO2 streamlines the traditionally labor-intensive ROB v2 (ROB2) annotation process via an interactive interface that combines PDF parsing, retrieval-augmented LLM prompting, and human-in-the-loop review. Users can upload clinical trial reports, receive preliminary answers and supporting evidence for ROB2 signaling questions, and provide real-time feedback or corrections to system suggestions. ROBOTO2 is publicly available at https://roboto2.vercel.app/, with code and data released to foster reproducibility and adoption. We construct and release a dataset of 521 pediatric clinical trial reports (8954 signaling questions with 1202 evidence passages), annotated using both manually and LLM-assisted methods, serving as a benchmark and enabling future research. Using this dataset, we benchmark ROB2 performance for 4 LLMs and provide an analysis into current model capabilities and ongoing challenges in automating this critical aspect of systematic review.
Evaluating Control Protocols for Untrusted AI Agents
Kutasov, Jon, Loughridge, Chloe, Sun, Yuqi, Sleight, Henry, Shlegeris, Buck, Tracy, Tyler, Benton, Joe
As AI systems become more capable and widely deployed as agents, ensuring their safe operation becomes critical. AI control offers one approach to mitigating the risk from untrusted AI agents by monitoring their actions and intervening or auditing when necessary. Evaluating the safety of these protocols requires understanding both their effectiveness against current attacks and their robustness to adaptive adversaries. In this work, we systematically evaluate a range of control protocols in SHADE-Arena, a dataset of diverse agentic environments. First, we evaluate blue team protocols, including deferral to trusted models, resampling, and deferring on critical actions, against a default attack policy. We find that resampling for incrimination and deferring on critical actions perform best, increasing safety from 50% to 96%. We then iterate on red team strategies against these protocols and find that attack policies with additional affordances, such as knowledge of when resampling occurs or the ability to simulate monitors, can substantially improve attack success rates against our resampling strategy, decreasing safety to 17%. However, deferring on critical actions is highly robust to even our strongest red team strategies, demonstrating the importance of denying attack policies access to protocol internals.
NABench: Large-Scale Benchmarks of Nucleotide Foundation Models for Fitness Prediction
Li, Zhongmin, Ma, Runze, Tan, Jiahao, Tan, Chengzi, Zheng, Shuangjia
Nucleotide sequence variation can induce significant shifts in functional fitness. Recent nucleotide foundation models promise to predict such fitness effects directly from sequence, yet heterogeneous datasets and inconsistent preprocessing make it difficult to compare methods fairly across DNA and RNA families. Here we introduce NABench, a large-scale, systematic benchmark for nucleic acid fitness prediction. NABench aggregates 162 high-throughput assays and curates 2.6 million mutated sequences spanning diverse DNA and RNA families, with standardized splits and rich metadata. We show that NABench surpasses prior nucleotide fitness benchmarks in scale, diversity, and data quality. Under a unified evaluation suite, we rigorously assess 29 representative foundation models across zero-shot, few-shot prediction, transfer learning, and supervised settings. The results quantify performance heterogeneity across tasks and nucleic-acid types, demonstrating clear strengths and failure modes for different modeling choices and establishing strong, reproducible baselines. We release NABench to advance nucleic acid modeling, supporting downstream applications in RNA/DNA design, synthetic biology, and biochemistry. Our code is available at https://github.com/mrzzmrzz/NABench.
Consciousness-ECG Transformer for Conscious State Estimation System with Real-Time Monitoring
Kweon, Young-Seok, Shin, Gi-Hwan, Kim, Ji-Yong, Ryu, Bokyeong, Lee, Seong-Whan
Conscious state estimation is important in various medical settings, including sleep staging and anesthesia management, to ensure patient safety and optimize health outcomes. Traditional methods predominantly utilize electroencephalography (EEG), which faces challenges such as high sensitivity to noise and the requirement for controlled environments. In this study, we propose the consciousness-ECG transformer that leverages electrocardiography (ECG) signals for non-invasive and reliable conscious state estimation. Our approach employs a transformer with decoupled query attention to effectively capture heart rate variability features that distinguish between conscious and unconscious states. We implemented the conscious state estimation system with real-time monitoring and validated our system on datasets involving sleep staging and anesthesia level monitoring during surgeries. Experimental results demonstrate that our model outperforms baseline models, achieving accuracies of 0.877 on sleep staging and 0.880 on anesthesia level monitoring. Moreover, our model achieves the highest area under curve values of 0.786 and 0.895 on sleep staging and anesthesia level monitoring, respectively. The proposed system offers a practical and robust alternative to EEG-based methods, particularly suited for dynamic clinical environments. Our results highlight the potential of ECG-based consciousness monitoring to enhance patient safety and advance our understanding of conscious states.
Aegis: A Correlation-Based Data Masking Advisor for Data Sharing Ecosystems
Laskar, Omar Islam, Khozestani, Fatemeh Ramezani, Nankani, Ishika, Nia, Sohrab Namazi, Roy, Senjuti Basu, Beedkar, Kaustubh
Data sharing ecosystems connect providers, consumers, and intermediaries to facilitate the exchange and use of data for a wide range of downstream tasks. In sensitive domains such as healthcare, privacy is enforced as a hard constraint, any shared data must satisfy a minimum privacy threshold. However, among all masking configurations that meet this requirement, the utility of the masked data can vary significantly, posing a key challenge: how to efficiently select the optimal configuration that preserves maximum utility. This paper presents Aegis, a middleware framework that selects optimal masking configurations for machine learning datasets with features and class labels. Aegis incorporates a utility optimizer that minimizes predictive utility deviation, quantifying shifts in feature label correlations due to masking. Our framework leverages limited data summaries (such as 1D histograms) or none to estimate the feature label joint distribution, making it suitable for scenarios where raw data is inaccessible due to privacy restrictions. To achieve this, we propose a joint distribution estimator based on iterative proportional fitting, which allows supporting various feature label correlation quantification methods such as mutual information, chi square, or g3. Our experimental evaluation of real world datasets shows that Aegis identifies optimal masking configurations over an order of magnitude faster, while the resulting masked datasets achieve predictive performance on downstream ML tasks on par with baseline approaches and complements privacy anonymization data masking techniques.
Probability Density from Latent Diffusion Models for Out-of-Distribution Detection
Jรคrve, Joonas, Haavel, Karl Kaspar, Kull, Meelis
Despite rapid advances in AI, safety remains the main bottleneck to deploying machine-learning systems. A critical safety component is out-of-distribution detection: given an input, decide whether it comes from the same distribution as the training data. In generative models, the most natural OOD score is the data likelihood. Actually, under the assumption of uniformly distributed OOD data, the likelihood is even the optimal OOD detector, as we show in this work. However, earlier work reported that likelihood often fails in practice, raising doubts about its usefulness. We explore whether, in practice, the representation space also suffers from the inability to learn good density estimation for OOD detection, or if it is merely a problem of the pixel space typically used in generative models. To test this, we trained a Variational Diffusion Model not on images, but on the representation space of a pre-trained ResNet-18 to assess the performance of our likelihood-based detector in comparison to state-of-the-art methods from the OpenOOD suite.
ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs
Zhang, Ben, Yu, LuLu, Gao, Lei, Guo, QuanJiang, Liu, Jing, Gao, Hui
Abstract--During reasoning in vision-language models (VLMs), false positive (FP) reasoning occurs when a model produces the correct answer but follows an incorrect reasoning path, resulting in undermined reasoning reliability. Existing approaches mainly rely on prompt engineering, knowledge distillation or reinforcement learning to improve reasoning reliability, both of which require large amounts of high-quality data and thus limit practical applicability. Few approaches have focused on directly detecting and correcting FPs. T o address these issues, we propose ViFP, a framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs. ViFP builds effective reasoning paths through multi-turn QA and dynamically analyzes the consistency of the reasoning path to identify potential FPs. It also introduces a targeted reasoning chain correction mechanism to modify FP reasoning, thereby improving logical consistency and accuracy. Finally, we introduce a reliability evaluation metric--V oC, which integrates answer accuracy and the FP rate, providing a quantitative tool to assess whether a VLM not only answers correctly but also reasons reliably. Our experiments on closed-source VLMs show that ViFP consistently improves performance across three datasets: A-OKVQA, OK-VQA, and FVQA. On A-OKVQA, ViFP improves accuracy by up to 5.4%, surpassing the previous state-of-the-art by 4.3%, and significantly reduces the number of FPs, validating its benefits in enhancing reasoning reliability.
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Gu, Zijin, Likhomanenko, Tatiana, Jaitly, Navdeep
Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
Colorectal Cancer Histopathological Grading using Multi-Scale Federated Learning
Arafath, Md Ahasanul, Ghosh, Abhijit Kumar, Ahmed, Md Rony, Afroz, Sabrin, Hosen, Minhazul, Moon, Md Hasan, Reza, Md Tanzim, Alam, Md Ashad
Colorectal cancer (CRC) grading is a critical prognostic factor but remains hampered by inter-observer variability and the privacy constraints of multi-institutional data sharing. While deep learning offers a path to automation, centralized training models conflict with data governance regulations and neglect the diagnostic importance of multi-scale analysis. In this work, we propose a scalable, privacy-preserving federated learning (FL) framework for CRC histopathological grading that integrates multi-scale feature learning within a distributed training paradigm. Our approach employs a dual-stream ResNetRS50 backbone to concurrently capture fine-grained nuclear detail and broader tissue-level context. This architecture is integrated into a robust FL system stabilized using FedProx to mitigate client drift across heterogeneous data distributions from multiple hospitals. Extensive evaluation on the CRC-HGD dataset demonstrates that our framework achieves an overall accuracy of 83.5%, outperforming a comparable centralized model (81.6%). Crucially, the system excels in identifying the most aggressive Grade III tumors with a high recall of 87.5%, a key clinical priority to prevent dangerous false negatives. Performance further improves with higher magnification, reaching 88.0% accuracy at 40x. These results validate that our federated multi-scale approach not only preserves patient privacy but also enhances model performance and generalization. The proposed modular pipeline, with built-in preprocessing, checkpointing, and error handling, establishes a foundational step toward deployable, privacy-aware clinical AI for digital pathology.
The Riddle of Reflection: Evaluating Reasoning and Self-Awareness in Multilingual LLMs using Indian Riddles
M, Abhinav P, Saxena, Ojasva, C, Oswald, Krishnamurthy, Parameswari
The extent to which large language models (LLMs) can perform culturally grounded reasoning across non-English languages remains underexplored. This paper examines the reasoning and self-assessment abilities of LLMs across seven major Indian languages-Bengali, Gujarati, Hindi, Kannada, Malayalam, Tamil, and Telugu. We introduce a multilingual riddle dataset combining traditional riddles with context-reconstructed variants and evaluate five LLMs-Gemini 2.5 Pro, Gemini 2.5 Flash, Mistral-Saba, LLaMA 4 Scout, and LLaMA 4 Maverick-under seven prompting strategies. In the first stage, we assess riddle-solving performance and find that while Gemini 2.5 Pro performs best overall, few-shot methods yield only marginal gains, and accuracy varies notably across languages. In the second stage, we conduct a self-evaluation experiment to measure reasoning consistency. The results reveal a key finding: a model's initial accuracy is inversely correlated with its ability to identify its own mistakes. Top-performing models such as Gemini 2.5 Pro are overconfident (4.34% True Negative Rate), whereas lower-performing models like LLaMA 4 Scout are substantially more self-aware (42.09% True Negative Rate). These results point to clear gaps in multilingual reasoning and highlight the need for models that not only reason effectively but also recognize their own limitations.