Goto

Collaborating Authors

 Michalson, Arne Edward


Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback

arXiv.org Artificial Intelligence

Radiologists play a crucial role by translating medical images into medical reports. However, the field faces staffing shortages and increasing workloads. While automated approaches using vision-language models (VLMs) show promise as assistants, they require exceptionally high accuracy. Most current VLMs in radiology rely solely on supervised fine-tuning (SFT). Meanwhile, in the general domain, additional preference fine-tuning has become standard practice. The challenge in radiology lies in the prohibitive cost of obtaining radiologist feedback. We propose a scalable automated preference alignment technique for VLMs in radiology, focusing on chest X-ray (CXR) report generation. Our method leverages publicly available datasets with an LLM-as-a-Judge mechanism, eliminating the need for additional expert radiologist feedback. We evaluate and benchmark five direct alignment algorithms (DAAs). Our results show up to a 57.4% improvement in average GREEN score, an LLM-based metric for evaluating CXR reports, and a 9.2% increase in the average across six metrics (domain-specific and general), compared to the SFT baseline. We study reward overoptimization via length exploitation, with reports lengthening by up to 3.2x. To assess a potential alignment tax, we benchmark on six additional diverse tasks, finding no significant degradations. A reader study involving four board-certified radiologists indicates win rates of up to 0.62 over the SFT baseline, while significantly penalizing verbosity. Our analysis provides actionable insights for the development of VLMs in high-stakes fields like radiology.
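To make the preference-alignment step concrete, below is a minimal sketch of direct preference optimization (DPO), one representative of the direct alignment algorithm (DAA) family the abstract benchmarks. This is an illustrative assumption of the setup, not the authors' released code: the function name, the beta value, and the toy log-probability tensors are all hypothetical.

```python
# Minimal DPO sketch over paired (chosen, rejected) reports. The "chosen"
# report is the one an LLM-as-a-Judge preferred, so the preference pairs
# require no radiologist feedback. All tensors below are toy examples.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss given summed token log-probs under the policy and a
    frozen reference (SFT) model; beta scales the implicit reward."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: random log-probabilities for a batch of 4 preference pairs.
torch.manual_seed(0)
pc, pr = torch.randn(4), torch.randn(4)
rc, rr = torch.randn(4), torch.randn(4)
print(dpo_loss(pc, pr, rc, rr))
```

Because the chosen/rejected labels come from an LLM judge rather than from radiologists, pairs like these can be produced at the scale of public datasets, which is the point of the proposed pipeline.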


GREEN: Generative Radiology Report Evaluation and Error Notation

arXiv.org Artificial Intelligence

Machine learning has enabled great progress in the automatic interpretation of images, where vision language models (VLMs) translate features of images into text (Radford et al., 2021; Liu et al., 2024). In the medical domain, patient images are interpreted by radiologists, which is referred to as radiology report generation (RRG). Automated and high-quality RRG has the potential to greatly reduce the repetitive work of radiologists, alleviate burdens arising from a shortage of radiologists, generally improve clinical communication (Kahn Jr et al., 2009), and increase the accuracy of radiology reports (Rajpurkar and Lungren, 2023). Commonly used evaluation metrics in the RRG literature (Lin, 2004; Zhang et al., 2019; Smit et al., 2020; Delbrouck et al., 2022) seek to evaluate a generated radiology report against a reference report written by a radiologist by leveraging simple n-gram overlap, general language similarity, pathology identification within specific imaging modalities and disease classes, and commercially available large language models.

Evaluating radiology reports is a challenging problem, as factual correctness is extremely important due to the need for accurate medical communication about medical images. Existing automatic evaluation metrics either fail to consider factual correctness (e.g., BLEU and ROUGE) or are limited in their interpretability (e.g., F1CheXpert and F1RadGraph). In this paper, we introduce GREEN (Generative Radiology Report Evaluation and Error Notation), a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports, both quantitatively and qualitatively.
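As a rough illustration of how such a metric could be aggregated once a language-model judge has counted matched findings and clinically significant errors per report, here is a minimal sketch. The JudgedReport container, the helper name, and the example counts are hypothetical, and the ratio below is only one plausible aggregation consistent with the description above, not necessarily the paper's exact formula.

```python
# Hedged sketch: turn an LLM judge's per-report error counts into a
# GREEN-style score in [0, 1]. All names and counts are illustrative.
from dataclasses import dataclass

@dataclass
class JudgedReport:
    matched_findings: int      # findings that agree with the reference
    significant_errors: int    # clinically significant errors flagged

def green_style_score(r: JudgedReport) -> float:
    """1.0 means every finding matched with no significant errors."""
    denom = r.matched_findings + r.significant_errors
    return r.matched_findings / denom if denom else 0.0

reports = [JudgedReport(5, 1), JudgedReport(3, 0), JudgedReport(0, 2)]
scores = [green_style_score(r) for r in reports]
print(scores, sum(scores) / len(scores))  # per-report scores and average
```

An average of per-report scores like the one printed above is the kind of quantity the preference fine-tuning paper listed earlier reports improving by up to 57.4% over its SFT baseline.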