 pneumonia


Misaligned by Design: Incentive Failures in Machine Learning

Autor, David, Caplin, Andrew, Martin, Daniel, Marx, Philip

arXiv.org Artificial Intelligence

The cost of error in many high-stakes settings is asymmetric: misdiagnosing pneumonia when absent is an inconvenience, but failing to detect it when present can be life-threatening. Because of this, artificial intelligence (AI) models used to assist such decisions are frequently trained with asymmetric loss functions that incorporate human decision-makers' trade-offs between false positives and false negatives. In two focal applications, we show that this standard alignment practice can backfire. In both cases, it would be better to train the machine learning model with a loss function that ignores the human's objective and then adjust predictions ex post according to that objective. We rationalize this result using an economic model of incentive design with endogenous information acquisition. The key insight from our theoretical framework is that machine classifiers perform not one but two incentivized tasks: choosing how to classify and learning how to classify. We show that while the adjustments engineers use correctly incentivize choosing, they can simultaneously reduce the incentives to learn. Our formal treatment of the problem reveals that methods embraced for their intuitive appeal can in fact misalign human and machine objectives in predictable ways.
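The ex post adjustment described in this abstract can be illustrated with a standard decision-theoretic threshold. The sketch below is not the authors' procedure. It assumes the model outputs calibrated probabilities, and the function names and cost values are hypothetical.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which predicting 'positive' minimizes expected cost.

    Predict positive when p * cost_fn > (1 - p) * cost_fp,
    which rearranges to p > cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def classify(prob_positive: float, cost_fp: float, cost_fn: float) -> bool:
    """Apply the cost-based threshold to a probability from any trained model."""
    return prob_positive > cost_threshold(cost_fp, cost_fn)


# Hypothetical costs: a missed case is 9 times as costly as a false alarm.
threshold = cost_threshold(cost_fp=1.0, cost_fn=9.0)

# Hypothetical probabilities from a model trained with a symmetric loss.
predicted_probs = [0.05, 0.15, 0.60]
decisions = [classify(p, cost_fp=1.0, cost_fn=9.0) for p in predicted_probs]
```

The asymmetry enters only at decision time, so the training loss can stay symmetric, which is the arrangement the abstract reports as performing better in its two applications.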


What Causes Postoperative Aspiration?

Nagesh, Supriya, Covarrubias, Karina, El-Kareh, Robert, Kasiviswanathan, Shiva Prasad, Mishra, Nina

arXiv.org Artificial Intelligence

Background: Aspiration, the inhalation of foreign material into the lungs, significantly impacts surgical patient morbidity and mortality. This study develops a machine learning (ML) model to predict postoperative aspiration, enabling timely preventative interventions. Methods: From the MIMIC-IV database of over 400,000 hospital admissions, we identified 826 surgical patients (mean age: 62, 55.7% male) who experienced aspiration within seven days post-surgery, along with a matched non-aspiration cohort. Three ML models (XGBoost, Multilayer Perceptron, and Random Forest) were trained using pre-surgical hospitalization data to predict postoperative aspiration. To investigate causation, we estimated Average Treatment Effects (ATE) using Augmented Inverse Probability Weighting. Results: Our ML model achieved an AUROC of 0.86 and 77.3% sensitivity on a held-out test set. Maximum daily opioid dose, length of stay, and patient age emerged as the most important predictors. ATE analysis identified significant causative factors: opioids (0.25 +/- 0.06) and operative site (neck: 0.20 +/- 0.13, head: 0.19 +/- 0.13). Despite equal surgery rates across genders, men were 1.5 times more likely to aspirate and received 27% higher maximum daily opioid dosages compared to women. Conclusion: ML models can effectively predict postoperative aspiration risk, enabling targeted preventative measures. Maximum daily opioid dosage and operative site significantly influence aspiration risk. The gender disparity in both opioid administration and aspiration rates warrants further investigation. These findings have important implications for improving postoperative care protocols and aspiration prevention strategies.
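For readers unfamiliar with Augmented Inverse Probability Weighting, the textbook form of the estimator can be sketched as follows. This is not the study's code. It assumes the propensity scores and the two outcome-model predictions have already been fitted elsewhere, and all names and numbers are illustrative.

```python
from typing import Sequence


def aipw_ate(
    treated: Sequence[int],       # 1 if the unit received the treatment, else 0
    outcome: Sequence[float],     # observed outcome for each unit
    propensity: Sequence[float],  # estimated P(treated = 1 | covariates)
    mu1: Sequence[float],         # predicted outcome under treatment
    mu0: Sequence[float],         # predicted outcome under control
) -> float:
    """Average treatment effect via the standard AIPW (doubly robust) formula."""
    total = 0.0
    for t, y, e, m1, m0 in zip(treated, outcome, propensity, mu1, mu0):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        # Outcome-model difference plus inverse-probability-weighted residuals.
        total += (m1 - m0) + t * (y - m1) / e - (1 - t) * (y - m0) / (1.0 - e)
    return total / len(treated)


# Made-up toy inputs, purely to show the call shape.
ate = aipw_ate(
    treated=[1, 0, 1, 0],
    outcome=[1.0, 0.0, 1.0, 1.0],
    propensity=[0.5, 0.5, 0.5, 0.5],
    mu1=[0.8, 0.8, 0.8, 0.8],
    mu0=[0.4, 0.4, 0.4, 0.4],
)
```

The estimate stays consistent if either the propensity model or the outcome model is correctly specified, which is why the method is called doubly robust.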


Evaluating ChatGPT's Performance in Classifying Pneumonia from Chest X-Ray Images

Prahallad, Pragna, Prahallad, Pranathi

arXiv.org Artificial Intelligence

In this study, we evaluate the ability of OpenAI's gpt-4o model to classify chest X-ray images as either NORMAL or PNEUMONIA in a zero-shot setting, without any prior fine-tuning. A balanced test set of 400 images (200 from each class) was used to assess performance across four distinct prompt designs, ranging from minimal instructions to detailed, reasoning-based prompts. The results indicate that concise, feature-focused prompts achieved the highest classification accuracy of 74%, whereas reasoning-oriented prompts resulted in lower performance. These findings highlight that while ChatGPT exhibits emerging potential for medical image interpretation, its diagnostic reliability remains limited. Continued advances in visual reasoning and domain-specific adaptation are required before such models can be safely applied in clinical practice.


DDO: Dual-Decision Optimization for LLM-Based Medical Consultation via Multi-Agent Collaboration

Jia, Zhihao, Jia, Mingyi, Duan, Junwen, Wang, Jianxin

arXiv.org Artificial Intelligence

Large Language Models (LLMs) demonstrate strong generalization and reasoning abilities, making them well-suited for complex decision-making tasks such as medical consultation (MC). However, existing LLM-based methods often fail to capture the dual nature of MC, which entails two distinct sub-tasks: symptom inquiry, a sequential decision-making process, and disease diagnosis, a classification problem. This mismatch often results in ineffective symptom inquiry and unreliable disease diagnosis. To address this, we propose DDO, a novel LLM-based framework that performs Dual-Decision Optimization by decoupling the two sub-tasks and optimizing them with distinct objectives through a collaborative multi-agent workflow. Experiments on three real-world MC datasets show that DDO consistently outperforms existing LLM-based approaches and achieves competitive performance with state-of-the-art generation-based methods, demonstrating its effectiveness in the MC task. The code is available at https://github.com/zh-jia/DDO.


Deep Learning-Based Pneumonia Detection from Chest X-ray Images: A CNN Approach with Performance Analysis and Clinical Implications

Dutta, P K, Chowdhury, Anushri, Bhattacharyya, Anouska, Chakraborty, Shakya, Dey, Sujatra

arXiv.org Artificial Intelligence

Deep learning integration into medical imaging systems has transformed disease detection and diagnosis processes, with a focus on pneumonia identification. The study introduces a deep learning system using Convolutional Neural Networks for automated pneumonia detection from chest X-ray images, which boosts diagnostic precision and speed. The proposed CNN architecture integrates separable convolutions, batch normalization, and dropout regularization to enhance feature extraction while reducing overfitting. Using data augmentation techniques and adaptive learning rate strategies, the model was trained on an extensive collection of chest X-ray images to improve its generalization. Evaluation metrics including accuracy, precision, recall, and F1 score collectively verify the model's performance, recording an accuracy rate of 91. Beyond model performance, this study tackles critical clinical implementation obstacles such as data privacy protection, model interpretability, and integration with current healthcare systems. The approach also integrates medical ontologies with semantic technology to improve diagnostic accuracy. The study enhances AI diagnostic reliability by integrating machine learning outputs with structured medical knowledge frameworks to boost interpretability. The findings demonstrate AI-powered healthcare tools as a scalable, efficient pneumonia detection solution. This study advances AI integration into clinical settings by developing more precise automated diagnostic methods that deliver consistent medical imaging results.


A Case for Computing on Unstructured Data

Sadia, Mushtari, Chowdhury, Amrita Roy, Chen, Ang

arXiv.org Artificial Intelligence

Unstructured data, such as text, images, audio, and video, comprises the vast majority of the world's information, yet it remains poorly supported by traditional data systems that rely on structured formats for computation. We argue for a new paradigm, which we call computing on unstructured data, built around three stages: extraction of latent structure, transformation of this structure through data processing techniques, and projection back into unstructured formats. This bi-directional pipeline allows unstructured data to benefit from the analytical power of structured computation, while preserving the richness and accessibility of unstructured representations for human and AI consumption. We illustrate this paradigm through two use cases and present the research components that need to be developed in a new data system called MXFlow.


Limitations of Public Chest Radiography Datasets for Artificial Intelligence: Label Quality, Domain Shift, Bias and Evaluation Challenges

Rafferty, Amy, Ramaesh, Rishi, Rajan, Ajitha

arXiv.org Artificial Intelligence

Artificial intelligence has shown significant promise in chest radiography, where deep learning models can approach radiologist-level diagnostic performance. Progress has been accelerated by large public datasets such as MIMIC-CXR, ChestX-ray14, PadChest, and CheXpert, which provide hundreds of thousands of labelled images with pathology annotations. However, these datasets also present important limitations. Automated label extraction from radiology reports introduces errors, particularly in handling uncertainty and negation, and radiologist review frequently disagrees with assigned labels. In addition, domain shift and population bias restrict model generalisability, while evaluation practices often overlook clinically meaningful measures. We conduct a systematic analysis of these challenges, focusing on label quality, dataset bias, and domain shift. Our cross-dataset domain shift evaluation across multiple model architectures revealed substantial external performance degradation, with pronounced reductions in AUPRC and F1 scores relative to internal testing. To assess dataset bias, we trained a source-classification model that distinguished datasets with near-perfect accuracy, and performed subgroup analyses showing reduced performance for minority age and sex groups. Finally, expert review by two board-certified radiologists identified significant disagreement with public dataset labels. Our findings highlight important clinical weaknesses of current benchmarks and emphasise the need for clinician-validated datasets and fairer evaluation frameworks.


Enhancing Radiographic Disease Detection with MetaCheX, a Context-Aware Multimodal Model

He, Nathan, Chen, Cody

arXiv.org Artificial Intelligence

To bridge this gap, we introduce MetaCheX, a novel multimodal framework that integrates chest X-ray images with structured patient metadata to replicate clinical decision-making. Our approach combines a convolutional neural network (CNN) backbone with metadata processed by a multilayer perceptron through a shared classifier. Evaluated on the CheXpert Plus dataset, MetaCheX consistently outperformed radiograph-only baseline models across multiple CNN architectures. By integrating metadata, the overall diagnostic accuracy was significantly improved, measured by an increase in AUROC. The results of this study demonstrate that metadata reduces algorithmic bias and enhances model generalizability across diverse patient populations.


A XAI-based Framework for Frequency Subband Characterization of Cough Spectrograms in Chronic Respiratory Disease

Amado-Caballero, Patricia, San-José-Revuelta, Luis M., Wang, Xinheng, Garmendia-Leiza, José Ramón, Alberola-López, Carlos, Casaseca-de-la-Higuera, Pablo

arXiv.org Artificial Intelligence

This paper presents an explainable artificial intelligence (XAI)-based framework for the spectral analysis of cough sounds associated with chronic respiratory diseases, with a particular focus on Chronic Obstructive Pulmonary Disease (COPD). A Convolutional Neural Network (CNN) is trained on time-frequency representations of cough signals, and occlusion maps are used to identify diagnostically relevant regions within the spectrograms. These highlighted areas are subsequently decomposed into five frequency subbands, enabling targeted spectral feature extraction and analysis. The results reveal that spectral patterns differ across subbands and disease groups, uncovering complementary and compensatory trends across the frequency spectrum. Notably, the approach distinguishes COPD from other respiratory conditions, and chronic from non-chronic patient groups, based on interpretable spectral markers. These findings provide insight into the underlying pathophysiological characteristics of cough acoustics and demonstrate the value of frequency-resolved, XAI-enhanced analysis for biomedical signal interpretation and translational respiratory disease diagnostics.
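The occlusion-map idea mentioned in this abstract can be sketched generically: mask one patch of the input at a time and record how much the model's score drops. This is an illustrative sketch rather than the paper's implementation. The scoring function stands in for a trained CNN, and the patch size, fill value, and toy data are assumptions.

```python
from typing import Callable, List

Grid = List[List[float]]


def occlusion_map(
    image: Grid,
    score: Callable[[Grid], float],
    patch: int = 1,
    fill: float = 0.0,
) -> Grid:
    """Return a grid where each cell holds the score drop caused by occluding its patch."""
    base = score(image)
    rows, cols = len(image), len(image[0])
    heat = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            # Copy the input and overwrite one patch with the fill value.
            occluded = [row[:] for row in image]
            cells = [
                (r, c)
                for r in range(r0, min(r0 + patch, rows))
                for c in range(c0, min(c0 + patch, cols))
            ]
            for r, c in cells:
                occluded[r][c] = fill
            drop = base - score(occluded)
            for r, c in cells:
                heat[r][c] = drop
    return heat


def toy_score(grid: Grid) -> float:
    """Stand-in for a classifier score; it depends on a single cell only."""
    return grid[0][1]


# Toy 2x2 "spectrogram" (rows as frequency bins, columns as time frames).
spectrogram = [[0.2, 0.9], [0.4, 0.1]]
heat = occlusion_map(spectrogram, toy_score)
```

Cells with a large drop mark regions the model relies on. In the abstract's framework, such regions are then split into frequency subbands for further analysis.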


On the Limits of Selective AI Prediction: A Case Study in Clinical Decision Making

Jabbour, Sarah, Fouhey, David, Banovic, Nikola, Shepard, Stephanie D., Kazerooni, Ella, Sjoding, Michael W., Wiens, Jenna

arXiv.org Artificial Intelligence

AI has the potential to augment human decision making. However, even high-performing models can produce inaccurate predictions when deployed. These inaccuracies, combined with automation bias, where humans overrely on AI predictions, can result in worse decisions. Selective prediction, in which potentially unreliable model predictions are hidden from users, has been proposed as a solution. This approach assumes that when AI abstains and informs the user so, humans make decisions as they would without AI involvement. To test this assumption, we study the effects of selective prediction on human decisions in a clinical context. We conducted a user study of 259 clinicians tasked with diagnosing and treating hospitalized patients. We compared their baseline performance without any AI involvement to their AI-assisted accuracy with and without selective prediction. Our findings indicate that selective prediction mitigates the negative effects of inaccurate AI in terms of decision accuracy. Compared to no AI assistance, clinician accuracy declined when shown inaccurate AI predictions (66% [95% CI: 56%-75%] vs. 56% [95% CI: 46%-66%]), but recovered under selective prediction (64% [95% CI: 54%-73%]). However, while selective prediction nearly maintains overall accuracy, our results suggest that it alters patterns of mistakes: when informed the AI abstains, clinicians underdiagnose (18% increase in missed diagnoses) and undertreat (35% increase in missed treatments) compared to no AI input at all. Our findings underscore the importance of empirically validating assumptions about how humans engage with AI within human-AI systems.
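A common way to implement selective prediction is to abstain whenever the model's confidence falls below a cutoff. The abstract does not say how unreliable predictions were identified in the study, so the confidence rule, the cutoff, and the function name below are assumptions made only for illustration.

```python
from typing import Optional


def selective_predict(prob_positive: float, min_confidence: float = 0.8) -> Optional[bool]:
    """Return a binary prediction, or None to signal that the model abstains."""
    # Confidence is the probability assigned to the more likely class.
    confidence = max(prob_positive, 1.0 - prob_positive)
    if confidence < min_confidence:
        return None  # abstain: the prediction is hidden from the user
    return prob_positive >= 0.5


# Hypothetical model outputs for three patients.
outputs = [selective_predict(p) for p in (0.95, 0.55, 0.10)]
```

A `None` result corresponds to the case the study examines, where the clinician is told the AI abstained and must decide without its prediction.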