Appendix A Proofs of Formal Claims

Neural Information Processing Systems

By pre-training the model on domain-specific data, PubMedBERT is expected to have a better understanding of biomedical concepts, terminology, and language patterns compared to general-domain models like BERT-base and BERT-large [95]. The main advantage of using PubMedBERT for biomedical text mining tasks is its domain-specific knowledge, which can lead to improved performance and more accurate results when fine-tuned on various downstream tasks, such as named entity recognition, relation extraction, document classification, and question answering. Since PubMedBERT is pre-trained on a large corpus of biomedical text, it is better suited to capturing the unique language patterns, complex terminology, and relationships between entities in the biomedical domain.
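
The domain-specific advantage comes from the pretraining objective itself: like BERT, PubMedBERT is pretrained with masked language modeling over its corpus. The sketch below illustrates BERT-style masking on a toy, made-up vocabulary and token sequence (the 80/10/10 split follows the original BERT recipe; nothing here is PubMedBERT's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and a biomedical-flavoured token sequence (hypothetical).
vocab = {"[MASK]": 0, "[UNK]": 1, "the": 2, "patient": 3, "received": 4,
         "metformin": 5, "for": 6, "type": 7, "2": 8, "diabetes": 9}
tokens = np.array([2, 3, 4, 5, 6, 7, 8, 9])

def mask_for_mlm(tokens, rng, p=0.15):
    """BERT-style masking: select ~p of positions as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged."""
    targets = rng.random(len(tokens)) < p
    masked = tokens.copy()
    for i in np.where(targets)[0]:
        r = rng.random()
        if r < 0.8:
            masked[i] = vocab["[MASK]"]
        elif r < 0.9:
            masked[i] = rng.integers(2, len(vocab))  # random real token
    return masked, targets

masked, targets = mask_for_mlm(tokens, rng)
```

The model is then trained to predict the original token at each target position; pretraining on PubMed text simply means sequences like the one above are drawn from biomedical abstracts rather than general web text.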


Auditing for Human Expertise

Neural Information Processing Systems

High-stakes prediction tasks (e.g., patient diagnosis) are often handled by trained human experts. A common source of concern about automation in these settings is that experts may exercise intuition that is difficult to model and/or have access to information (e.g., conversations with a patient) that is simply unavailable to a would-be algorithm. This raises a natural question: do human experts add value which could not be captured by an algorithmic predictor? We develop a statistical framework under which we can pose this question as a natural hypothesis test. Indeed, as our framework highlights, detecting human expertise is more subtle than simply comparing the accuracy of expert predictions to those made by a particular learning algorithm. Instead, we propose a simple procedure which tests whether expert predictions are statistically independent from the outcomes of interest after conditioning on the available inputs ('features'). A rejection of our test thus suggests that human experts may add value to any algorithm trained on the available data, and has direct implications for whether human-AI 'complementarity' is achievable in a given prediction task. We highlight the utility of our procedure using admissions data collected from the emergency department of a large academic hospital system, where we show that physicians' admit/discharge decisions for patients with acute gastrointestinal bleeding (AGIB) appear to be incorporating information that is not available to a standard algorithmic screening tool. This is despite the fact that the screening tool is arguably more accurate than physicians' discretionary decisions, highlighting that - even absent normative concerns about accountability or interpretability - accuracy is insufficient to justify algorithmic automation.
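
The core idea, testing whether expert predictions remain dependent on outcomes after conditioning on the features, can be approximated with a stratified permutation test. The sketch below is illustrative, not the paper's exact procedure: it conditions on a fitted risk score, then permutes expert labels within risk strata, which preserves the experts' dependence on the features but breaks any extra link to the outcomes. All data here is synthetic, with a `hidden` signal the expert sees but the features omit, so the test should tend to reject:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: X = features available to an algorithm, y = outcomes,
# expert = expert predictions that also use a signal absent from X.
n = 2000
X = rng.normal(size=(n, 5))
hidden = rng.normal(size=n)                        # seen by the expert, not in X
y = (X[:, 0] + hidden + 0.5 * rng.normal(size=n) > 0).astype(int)
expert = (X[:, 0] + hidden > 0).astype(int)

# Condition on X via a fitted risk score; bin the score into strata.
score = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
strata = np.digitize(score, np.quantile(score, [0.2, 0.4, 0.6, 0.8]))
stat = np.mean(expert == y)                        # expert/outcome agreement

# Null distribution: permute expert labels *within* strata.
perm_stats = []
for _ in range(1000):
    perm = expert.copy()
    for s in np.unique(strata):
        idx = np.where(strata == s)[0]
        perm[idx] = perm[rng.permutation(idx)]
    perm_stats.append(np.mean(perm == y))
p_value = np.mean(np.array(perm_stats) >= stat)
print(f"agreement={stat:.3f}, p={p_value:.3f}")
```

A small p-value indicates the experts track outcomes beyond what the feature-based risk score explains, i.e., evidence of expertise; it does not say whether the experts are more *accurate* than the algorithm, which is exactly the paper's distinction.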


The Role of Doctors Is Changing Forever

The New Yorker

Others say they don't need us. It's time for us to think of ourselves not as the high priests of health care but as what we have always been: healers. Not long ago, I cared for a middle-aged man I'll call Jim, who was generally healthy but had recently started to feel sluggish. One of his friends told him to try a hormone supplement. After Jim saw on social media that Robert F. Kennedy, Jr., the Trump Administration's Secretary of Health and Human Services, had endorsed supplements as a part of an "anti-aging" regimen, he ordered one from a telehealth company. A few months later, he noticed swelling and pain in his calf. ChatGPT warned him that he might have a blood clot.


Experts urge caution as Trump's big bill incentivizes AI in healthcare

The Guardian

For states to receive certain funding stipulated in the Trump administration's "big, beautiful" bill, they must meet three of 10 criteria - including integrating more artificial intelligence (AI) technology in healthcare settings - which experts say could have major benefits and liabilities for under-resourced hospitals, depending on how it's implemented.

The Rural Health Transformation Fund is a carveout that will provide $50bn over a period of five years to states that meet certain application criteria, including "consumer-facing, technology-driven solutions for the prevention and management of chronic diseases," and "providing training and technical assistance for the development and adoption of technology-enabled solutions that improve care delivery in rural hospitals, including remote monitoring, robotics, artificial intelligence, and other advanced technologies".

Analysts have noted that this $50bn will not be nearly enough to make up for the Congressional Budget Office's projected $911bn reduction in Medicaid spending over the next decade under the bill (OBBBA). These cuts will affect both patients who lose free health coverage under Medicaid, and hospitals that benefit from those patients' Medicaid reimbursements.

Chenhao Tan, associate professor of data science at the University of Chicago, and Karni Chagal-Feferkorn, an assistant professor at the University of South Florida's college of AI and cybersecurity, said AI technology could provide major benefits to rural hospitals that are frequently under-resourced and under-staffed.


Patient Safety Risks from AI Scribes: Signals from End-User Feedback

Dai, Jessica, Huang, Anwen, Nasrallah, Catherine, Croci, Rhiannon, Soleimani, Hossein, Pollet, Sarah J., Adler-Milstein, Julia, Murray, Sara G., Yazdany, Jinoos, Chen, Irene Y.

arXiv.org Artificial Intelligence

AI scribes are transforming clinical documentation at scale. However, their real-world performance remains understudied, especially regarding their impacts on patient safety. To this end, we initiate a mixed-methods study of patient safety issues raised in feedback submitted by AI scribe users (healthcare providers) in a large U.S. hospital system. Both quantitative and qualitative analysis suggest that AI scribes may induce various patient safety risks due to errors in transcription, most significantly regarding medication and treatment; however, further study is needed to contextualize the absolute degree of risk.


PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting

Maqbool, Danyal, Lee, Changhee, Huemann, Zachary, Church, Samuel D., Larson, Matthew E., Perlman, Scott B., Romero, Tomas A., Warner, Joshua D., Lubner, Meghan, Tie, Xin, Merkow, Jameson, Hu, Junjie, Cho, Steve Y., Bradshaw, Tyler J.

arXiv.org Artificial Intelligence

Generating automated reports for 3D positron emission tomography (PET) is an important and challenging task in medical imaging. PET plays a vital role in oncology, but automating report generation is difficult due to the complexity of whole-body 3D volumes, the wide range of potential clinical findings, and the limited availability of annotated datasets. To address these challenges, we first introduce PETARSeg-11K, the first large-scale, publicly available dataset that provides lesion-level correspondence between 3D PET/CT volumes and free-text radiological findings. It comprises 11,356 lesion descriptions paired with 3D segmentations. Second, we propose PETAR-4B, a 3D vision-language model designed for mask-aware, spatially grounded PET/CT reporting. PETAR-4B jointly encodes PET, CT, and 3D lesion segmentation masks, using a 3D focal prompt to capture fine-grained details of lesions that normally comprise less than 0.1% of the volume. Evaluations using automated metrics show PETAR-4B substantially outperforming all 2D and 3D baselines. A human study involving five physicians -- the first of its kind for automated PET reporting -- confirms the model's clinical utility and establishes correlations between automated metrics and expert judgment. This work provides a foundational dataset and a novel architecture, advancing 3D medical vision-language understanding in PET.
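
One way to picture a mask-aware "focal prompt" is a high-resolution context window cropped around a lesion's segmentation mask, so that a region occupying under 0.1% of the volume is not drowned out by whole-body context. The sketch below is a plausible reading of that idea on toy data, not the PETAR-4B implementation (shapes, margin, and function name are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a PET volume and a lesion mask (real whole-body
# volumes are far larger; here the lesion is 18 of 262,144 voxels).
volume = rng.random((64, 64, 64)).astype(np.float32)
mask = np.zeros(volume.shape, dtype=bool)
mask[30:33, 40:42, 10:13] = True                  # a tiny lesion

def focal_crop(volume, mask, margin=8):
    """Crop a context window around the lesion's bounding box, clipped
    to the volume bounds, so a model sees the lesion at full resolution."""
    coords = np.argwhere(mask)
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, volume.shape)
    sl = tuple(slice(l, h) for l, h in zip(lo, hi))
    return volume[sl], mask[sl]

crop, crop_mask = focal_crop(volume, mask)
print(crop.shape, int(crop_mask.sum()))
```

Feeding such a crop alongside the full volume gives the model both global anatomical context and fine-grained lesion detail, which is the tension the focal-prompt design addresses.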


The Alignment Paradox of Medical Large Language Models in Infertility Care: Decoupling Algorithmic Improvement from Clinical Decision-making Quality

Liu, Dou, Long, Ying, Zuoqiu, Sophia, Xie, Kaipeng, Yang, Runze, Liu, Di, Li, Kang, Lin, Yiting, Liu, Hanyi, Yin, Rong, Tang, Tian

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly adopted in clinical decision support, yet aligning them with the multifaceted reasoning pathways of real-world medicine remains a major challenge. Using more than 8,000 infertility treatment records, we systematically evaluate four alignment strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and In-Context Learning (ICL) through a dual-layer framework combining automatic benchmarks with blinded doctor-in-the-loop assessments. GRPO achieves the highest algorithmic accuracy across multiple decision layers, confirming the value of reinforcement-based optimization for structured prediction tasks. However, clinicians consistently prefer the SFT model, citing clearer reasoning processes (p = 0.035) and higher therapeutic feasibility (p = 0.019). In blinded pairwise comparisons, SFT attains the highest winning rate (51.2%), outperforming both GRPO (26.2%) and even physicians' original decisions (22.7%). These results reveal an alignment paradox: algorithmic improvements do not necessarily translate into higher clinical trust, and may diverge from human-centered preferences. Our findings highlight the need for alignment strategies that prioritize clinically interpretable and practically feasible reasoning, rather than solely optimizing decision-level accuracy.
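
For intuition on the reinforcement-based strategy evaluated here: GRPO's distinguishing step is a group-relative advantage, where each sampled completion's reward is normalized against the other completions for the same prompt, removing the need for a learned value function. A minimal sketch with made-up reward values (this is only the advantage computation, not a full training loop):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize each completion's reward
    against its own group's mean and standard deviation."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled completions for one prompt, scored by a reward model
# (hypothetical values): higher-than-group-average rewards get
# positive advantages, lower-than-average get negative ones.
rewards = [0.2, 0.8, 0.5, 0.5]
adv = grpo_advantages(rewards)
print(adv)
```

Because the advantage only encodes *relative* ranking within a group, GRPO optimizes decision-level accuracy against the reward signal; the paper's point is that this objective can drift away from the reasoning clarity and feasibility clinicians actually prefer.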