Goto

Collaborating Authors

 Performance Analysis


On the Limits of Selective AI Prediction: A Case Study in Clinical Decision Making

arXiv.org Artificial Intelligence

AI has the potential to augment human decision making. However, even high-performing models can produce inaccurate predictions when deployed. These inaccuracies, combined with automation bias, where humans overrely on AI predictions, can result in worse decisions. Selective prediction, in which potentially unreliable model predictions are hidden from users, has been proposed as a solution. This approach assumes that when AI abstains and informs the user so, humans make decisions as they would without AI involvement. To test this assumption, we study the effects of selective prediction on human decisions in a clinical context. We conducted a user study of 259 clinicians tasked with diagnosing and treating hospitalized patients. We compared their baseline performance without any AI involvement to their AI-assisted accuracy with and without selective prediction. Our findings indicate that selective prediction mitigates the negative effects of inaccurate AI in terms of decision accuracy. Compared to no AI assistance, clinician accuracy declined when shown inaccurate AI predictions (66% [95% CI: 56%-75%] vs. 56% [95% CI: 46%-66%]), but recovered under selective prediction (64% [95% CI: 54%-73%]). However, while selective prediction nearly maintains overall accuracy, our results suggest that it alters patterns of mistakes: when informed the AI abstains, clinicians underdiagnose (18% increase in missed diagnoses) and undertreat (35% increase in missed treatments) compared to no AI input at all. Our findings underscore the importance of empirically validating assumptions about how humans engage with AI within human-AI systems.


Physics-Informed Multimodal Bearing Fault Classification under Variable Operating Conditions using Transfer Learning

arXiv.org Artificial Intelligence

Accurate and interpretable bearing fault classification is critical for ensuring the reliability of rotating machinery, particularly under variable operating conditions where domain shifts can significantly degrade model performance. This study proposes a physics-informed multimodal convolutional neural network (CNN) with a late fusion architecture, integrating vibration and motor current signals alongside a dedicated physics-based feature extraction branch. The model incorporates a novel physics-informed loss function that penalizes physically implausible predictions based on characteristic bearing fault frequencies - Ball Pass Frequency Outer (BPFO) and Ball Pass Frequency Inner (BPFI) - derived from bearing geometry and shaft speed. Comprehensive experiments on the Paderborn University dataset demonstrate that the proposed physics-informed approach consistently outperforms a non-physics-informed baseline, achieving higher accuracy, reduced false classifications, and improved robustness across multiple data splits. To address performance degradation under unseen operating conditions, three transfer learning (TL) strategies - Target-Specific Fine-Tuning (TSFT), Layer-Wise Adaptation Strategy (LAS), and Hybrid Feature Reuse (HFR) - are evaluated. Results show that LAS yields the best generalization, with additional performance gains when combined with physics-informed modeling. Validation on the KAIST bearing dataset confirms the framework's cross-dataset applicability, achieving up to 98 percent accuracy. Statistical hypothesis testing further verifies significant improvements (p < 0.01) in classification performance. The proposed framework demonstrates the potential of integrating domain knowledge with data-driven learning to achieve robust, interpretable, and generalizable fault diagnosis for real-world industrial applications.


ProteoKnight: Convolution-based phage virion protein classification and uncertainty analysis

arXiv.org Artificial Intelligence

Proteins are essential macromolecules responsible for the structural and functional mechanisms of living things. Phage structural proteins are a subclass of the protein family corresponding to the bacteriophages, which are the most prevalent kind of biological creatures [1]. The PVPs are mainly associated with building structural constituents of the bacteriophage [2], such as the baseplate and head as shown in Figure 1, enabling efficient host-phage binding for genome transfer. The amino acid sequences, particularly those involved in structure synthesis, exhibit significant diversity among phages and their respective groups [3]. Consequently, characterizing these sequences becomes a challenging yet crucial task, as the shortage of annotations for phage proteins has emerged as a hindrance in numerous research studies focused on phage genomics [4]. With the advent of high-throughput sequencing technology, rapid additions of sequences are being observed in standard biological databases [6] as shown in Figure 1, giving rise to the need for proper annotation of the sequences. Enhancement of annotations for the phage family is vital for exploring effective anti-bacterial drug synthesis [7, 8], disease diagnosis [9], food production [10], bacterial genome remodeling [11], etc. Traditionally, mass spectrometry and protein array techniques [12, 13] were utilized for characterizing these proteins. However, these methods are associated with significant time, labor, and computational expenses [1]. Similarly, alignment-based methods do not always work well with PVP categorization due to the lack of collinearity [14] in viral genomes, resulting from factors such as horizontal gene transfer, high mutation rates, etc., leading to dissimilar sequences 2 Figure 1 Structural constituents of Phage


Fine-Tuning Large Language Models Using EEG Microstate Features for Mental Workload Assessment

arXiv.org Artificial Intelligence

This study explores the intersection of electroencephalography (EEG) microstates and Large Language Models (LLMs) to enhance the assessment of cognitive load states. By utilizing EEG microstate features, the research aims to fine-tune LLMs for improved predictions of distinct cognitive states, specifically 'Rest' and 'Load'. The experimental design is delineated in four comprehensive stages: dataset collection and preprocessing, microstate segmentation and EEG backfitting, feature extraction paired with prompt engineering, and meticulous LLM model selection and refinement. Employing a supervised learning paradigm, the LLM is trained to identify cognitive load states based on EEG microstate features integrated into prompts, producing accurate discrimination of cognitive load. A curated dataset, linking EEG features to specified cognitive load conditions, underpins the experimental framework. The results indicate a significant improvement in model performance following the proposed fine-tuning, showcasing the potential of EEG-informed LLMs in cognitive neuroscience and cognitive AI applications. This approach not only contributes to the understanding of brain dynamics but also paves the way for advancements in machine learning techniques applicable to cognitive load and cognitive AI research.


PySeizure: A single machine learning classifier framework to detect seizures in diverse datasets

arXiv.org Artificial Intelligence

Reliable seizure detection is critical for diagnosing and managing epilepsy, yet clinical workflows remain dependent on time-consuming manual EEG interpretation. While machine learning has shown promise, existing approaches often rely on dataset-specific optimisations, limiting their real-world applicability and reproducibility. Here, we introduce an innovative, open-source machine-learning framework that enables robust and generalisable seizure detection across varied clinical datasets. We evaluate our approach on two publicly available EEG datasets that differ in patient populations and electrode configurations. To enhance robustness, the framework incorporates an automated pre-processing pipeline to standardise data and a majority voting mechanism, in which multiple models independently assess each second of EEG before reaching a final decision. We train, tune, and evaluate models within each dataset, assessing their cross-dataset transferability. Our models achieve high within-dataset performance (AUC 0.904+/-0.059 for CHB-MIT and 0.864+/-0.060 for TUSZ) and demonstrate strong generalisation across datasets despite differences in EEG setups and populations (AUC 0.615+/-0.039 for models trained on CHB-MIT and tested on TUSZ and 0.762+/-0.175 in the reverse case) without any post-processing. Furthermore, a mild post-processing improved the within-dataset results to 0.913+/-0.064 and 0.867+/-0.058 and cross-dataset results to 0.619+/-0.036 and 0.768+/-0.172. These results underscore the potential of, and essential considerations for, deploying our framework in diverse clinical settings. By making our methodology fully reproducible, we provide a foundation for advancing clinically viable, dataset-agnostic seizure detection systems. This approach has the potential for widespread adoption, complementing rather than replacing expert interpretation, and accelerating clinical integration.


Causal Negative Sampling via Diffusion Model for Out-of-Distribution Recommendation

arXiv.org Artificial Intelligence

Heuristic negative sampling enhances recommendation performance by selecting negative samples of varying hardness levels from predefined candidate pools to guide the model toward learning more accurate decision boundaries. However, our empirical and theoretical analyses reveal that unobserved environmental confounders (e.g., exposure or popularity biases) in candidate pools may cause heuristic sampling methods to introduce false hard negatives (FHNS). These misleading samples can encourage the model to learn spurious correlations induced by such confounders, ultimately compromising its generalization ability under distribution shifts. To address this issue, we propose a novel method named Causal Negative Sampling via Diffusion (CNSDiff). By synthesizing negative samples in the latent space via a conditional diffusion process, CNSDiff avoids the bias introduced by predefined candidate pools and thus reduces the likelihood of generating FHNS. Moreover, it incorporates a causal regularization term to explicitly mitigate the influence of environmental confounders during the negative sampling process, leading to robust negatives that promote out-of-distribution (OOD) generalization. Comprehensive experiments under four representative distribution shift scenarios demonstrate that CNSDiff achieves an average improvement of 13.96% across all evaluation metrics compared to state-of-the-art baselines, verifying its effectiveness and robustness in OOD recommendation tasks.


Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays

arXiv.org Artificial Intelligence

Generative image models have achieved remarkable progress in both natural and medical imaging. In the medical context, these techniques offer a potential solution to data scarcity--especially for low-prevalence anomalies that impair the performance of AI-driven diagnostic and segmentation tools. However, questions remain regarding the fidelity and clinical utility of synthetic images, since poor generation quality can undermine model generalizability and trust. In this study, we evaluate the effectiveness of state-of-the-art generative models--Generative Adversarial Networks (GANs) and Diffusion Models (DMs)--for synthesizing chest X-rays conditioned on four abnormalities: Atelectasis (AT), Lung Opacity (LO), Pleural Effusion (PE), and Enlarged Cardiac Silhouette (ECS). Using a benchmark composed of real images from the MIMIC-CXR dataset and synthetic images from both GANs and DMs, we conducted a reader study with three radiologists of varied experience. Participants were asked to distinguish real from synthetic images and assess the consistency between visual features and the target abnormality. Our results show that while DMs generate more visually realistic images overall, GANs can report better accuracy for specific conditions, such as absence of ECS. We further identify visual cues radiologists use to detect synthetic images, offering insights into the perceptual gaps in current models. These findings underscore the complementary strengths of GANs and DMs and point to the need for further refinement to ensure generative models can reliably augment training datasets for AI diagnostic systems.


Improving Real-Time Concept Drift Detection using a Hybrid Transformer-Autoencoder Framework

arXiv.org Artificial Intelligence

In applied machine learning, concept drift, which is either gradual or abrupt changes in data distribution, can significantly reduce model performance. Typical detection methods,such as statistical tests or reconstruction-based models,are generally reactive and not very sensitive to early detection. Our study proposes a hybrid framework consisting of Transformers and Autoencoders to model complex temporal dynamics and provide online drift detection. We create a distinct Trust Score methodology, which includes signals on (1) statistical and reconstruction-based drift metrics, more specifically, PSI, JSD, Transformer-AE error, (2) prediction uncertainty, (3) rules violations, and (4) trend of classifier error aligned with the combined metrics defined by the Trust Score. Using a time sequenced airline passenger data set with synthetic drift, our proposed model allows for a better detection of drift using as a whole and at different detection thresholds for both sensitivity and interpretability compared to baseline methods and provides a strong pipeline for drift detection in real time for applied machine learning. We evaluated performance using a time-sequenced airline passenger dataset having the gradually injected stimulus of drift in expectations,e.g. permuted ticket prices in later batches, broken into 10 time segments [1].In the data, our results support that the Transformation-Autoencoder detected drift earlier and with more sensitivity than the autoencoders commonly used in the literature, and provided improved modeling over more error rates and logical violations. Therefore, a robust framework was developed to reliably monitor concept drift.


A New Lens on Homelessness: Daily Tent Monitoring with 311 Calls and Street Images

arXiv.org Artificial Intelligence

Homelessness in the United States has surged to levels unseen since the Great Depression. However, existing methods for monitoring it, such as point-in-time (PIT) counts, have limitations in terms of frequency, consistency, and spatial detail. This study proposes a new approach using publicly available, crowdsourced data, specifically 311 Service Calls and street-level imagery, to track and forecast homeless tent trends in San Francisco. Our predictive model captures fine-grained daily and neighborhood-level variations, uncovering patterns that traditional counts often overlook, such as rapid fluctuations during the COVID-19 pandemic and spatial shifts in tent locations over time. By providing more timely, localized, and cost-effective information, this approach serves as a valuable tool for guiding policy responses and evaluating interventions aimed at reducing unsheltered homelessness.


Robust Anomaly Detection in Network Traffic: Evaluating Machine Learning Models on CICIDS2017

arXiv.org Artificial Intelligence

Identifying suitable machine learning paradigms for intrusion detection remains critical for building effective and generalizable security solutions. In this study, we present a controlled comparison of four representative models - Multi-Layer Perceptron (MLP), 1D Convolutional Neural Network (CNN), One-Class Support Vector Machine (OCSVM) and Local Outlier Factor (LOF) - on the CICIDS2017 dataset under two scenarios: detecting known attack types and generalizing to previously unseen threats. Our results show that supervised MLP and CNN achieve near-perfect accuracy on familiar attacks but suffer drastic recall drops on novel attacks. Unsupervised LOF attains moderate overall accuracy and high recall on unknown threats at the cost of elevated false alarms, while boundary-based OCSVM balances precision and recall best, demonstrating robust detection across both scenarios. These findings offer practical guidance for selecting IDS models in dynamic network environments.