When normalization hallucinates: unseen risks in AI-powered whole slide image processing

Moens, Karel, Blaschko, Matthew B., Tuytelaars, Tinne, Diricx, Bart, De Vylder, Jonas, Yousif, Mustafa

arXiv.org Artificial Intelligence

Whole slide image (WSI) normalization remains a vital preprocessing step in computational pathology. Increasingly, this step is performed by deep learning models that learn to approximate data distributions from training examples. This often results in outputs that gravitate toward the average, potentially masking diagnostically important features. More critically, these models can introduce hallucinated content: artifacts that appear realistic but are not present in the original tissue, posing a serious threat to downstream analysis. These hallucinations are nearly impossible to detect visually, and current evaluation practices often overlook them. In this work, we demonstrate that the risk of hallucinations is real and underappreciated. While many methods perform adequately on public datasets, we observe a concerning frequency of hallucinations when these same models are retrained and evaluated on real-world clinical data. To address this, we propose a novel image comparison measure designed to automatically detect hallucinations in normalized outputs. Using this measure, we systematically evaluate several well-cited normalization methods retrained on real-world data, revealing significant inconsistencies and failures that are not captured by conventional metrics. Our findings underscore the need for more robust, interpretable normalization techniques and stricter validation protocols in clinical deployment.
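
The paper's comparison measure is not described in this excerpt, so the sketch below is only a plausible stand-in: since stain normalization should change color but not tissue structure, patches whose grayscale structure diverges between input and output can be flagged as candidate hallucinations. The patch size and SSIM threshold are illustrative assumptions.

```python
# Hypothetical hallucination screen: compare grayscale structure of the
# original and normalized WSI tiles patch by patch. Color may legitimately
# change under normalization; structure should not.
from skimage.color import rgb2gray
from skimage.metrics import structural_similarity as ssim

def flag_hallucinated_patches(original, normalized, patch=256, threshold=0.85):
    """Return (row, col) offsets of patches whose structure changed."""
    gray_orig = rgb2gray(original)   # float image in [0, 1]
    gray_norm = rgb2gray(normalized)
    flagged = []
    for r in range(0, gray_orig.shape[0] - patch + 1, patch):
        for c in range(0, gray_orig.shape[1] - patch + 1, patch):
            score = ssim(gray_orig[r:r + patch, c:c + patch],
                         gray_norm[r:r + patch, c:c + patch],
                         data_range=1.0)
            if score < threshold:  # structure diverged: possible hallucination
                flagged.append((r, c))
    return flagged
```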


The Mysterious Math Behind the Brazilian Butt Lift

WIRED

For years, plastic surgeons thought the proportions of beautiful buttocks should follow the Fibonacci sequence. Now, people are looking for a more Kardashian shape. In the history of gluteal enhancement, Mexico City stands out. It was here, in 1979, that a plastic surgeon, Mario González-Ulloa, first installed a pair of silicone implants designed specifically for the buttocks. The textbook calls González-Ulloa the "grandfather of buttock augmentation." The early 2000s saw a new generation of Mexico City buttock transformation luminaries, notably Ramón Cuenca-Guerra. Cuenca-Guerra laid out four characteristics that "determine attractive buttocks" as well as the five types of "defects," with strategies for correcting each one. I, for instance, have defect type 5, the "senile buttock." While I understand the value of standardizing procedures and setting guidelines for surgical practice, I tripped over Cuenca-Guerra's methodology. How and by whom had the determinants been determined?


Evaluating Generative AI as an Educational Tool for Radiology Resident Report Drafting

Verdone, Antonio, Cardall, Aidan, Siddiqui, Fardeen, Nashawaty, Motaz, Rigau, Danielle, Kwon, Youngjoon, Yousef, Mira, Patel, Shalin, Kieturakis, Alex, Kim, Eric, Heacock, Laura, Reig, Beatriu, Shen, Yiqiu

arXiv.org Artificial Intelligence

Objective: Radiology residents require timely, personalized feedback to develop accurate image analysis and reporting skills. Increasing clinical workload often limits attendings' ability to provide guidance. This study evaluates a HIPAA-compliant GPT-4o system that delivers automated feedback on breast imaging reports drafted by residents in real clinical settings. Methods: We analyzed 5,000 resident-attending report pairs from routine practice at a multi-site U.S. health system. GPT-4o was prompted with clinical instructions to identify common errors and provide feedback. A reader study using 100 report pairs was conducted. Four attending radiologists and four residents independently reviewed each pair, determined whether predefined error types were present, and rated GPT-4o's feedback as helpful or not. Agreement between GPT-4o and readers was assessed using percent match. Inter-reader reliability was measured with Krippendorff's alpha. Educational value was measured as the proportion of cases rated helpful. Results: Three common error types were identified: (1) omission or addition of key findings, (2) incorrect use or omission of technical descriptors, and (3) final assessment inconsistent with findings. GPT-4o showed strong agreement with attending consensus: 90.5%, 78.3%, and 90.4% across error types. Inter-reader reliability showed moderate variability (α = 0.767, 0.595, 0.567), and replacing a human reader with GPT-4o did not significantly affect agreement (Δ = -0.004 to 0.002). GPT-4o's feedback was rated helpful in most cases: 89.8%, 83.0%, and 92.0%. Discussion: GPT-4o can reliably identify key educational errors. It may serve as a scalable tool to support radiology education.
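
The two statistics named in the abstract are standard and straightforward to reproduce. The sketch below shows percent match against a consensus and Krippendorff's alpha over multiple raters, using the third-party krippendorff package and made-up judgments; the study's actual data and pipeline are not public.

```python
# Illustrative agreement statistics; all data here are hypothetical.
import numpy as np
import krippendorff  # pip install krippendorff

# Binary judgments (1 = error present) for one error type across 5 reports.
gpt_labels = np.array([1, 0, 1, 1, 0])
reader_consensus = np.array([1, 0, 1, 0, 0])

# Percent match between GPT-4o and the attending consensus.
percent_match = 100.0 * np.mean(gpt_labels == reader_consensus)

# Inter-reader reliability: each row is one rater's labels over all reports.
ratings = np.array([
    [1, 0, 1, 1, 0],  # attending 1
    [1, 0, 1, 0, 0],  # attending 2
    [1, 0, 0, 1, 0],  # attending 3
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"percent match = {percent_match:.1f}%, alpha = {alpha:.3f}")
```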


A Density-Informed Multimodal Artificial Intelligence Framework for Improving Breast Cancer Detection Across All Breast Densities

Kakileti, Siva Teja, Govindaraju, Bharath, Sampangi, Sudhakar, Manjunath, Geetha

arXiv.org Artificial Intelligence

Mammography, the current standard for breast cancer screening, has reduced sensitivity in women with dense breast tissue, contributing to missed or delayed diagnoses. Thermalytix, an AI-based thermal imaging modality, captures functional vascular and metabolic cues that may complement mammographic structural data. This study investigates whether a breast density-informed multi-modal AI framework can improve cancer detection by dynamically selecting the appropriate imaging modality based on breast tissue composition. A total of 324 women underwent both mammography and thermal imaging. Mammography images were analyzed using a multi-view deep learning model, while Thermalytix assessed thermal images through vascular and thermal radiomics. The proposed framework utilized mammography AI for fatty breasts and Thermalytix AI for dense breasts, optimizing predictions based on tissue type. This multi-modal AI framework achieved a sensitivity of 94.55% (95% CI: 88.54-100) and specificity of 79.93% (95% CI: 75.14-84.71), outperforming standalone mammography AI (sensitivity 81.82%, specificity 86.25%) and Thermalytix AI (sensitivity 92.73%, specificity 75.46%). Importantly, the sensitivity of mammography AI dropped significantly in dense breasts (67.86%) versus fatty breasts (96.30%), whereas Thermalytix AI maintained high and consistent sensitivity in both (92.59% and 92.86%, respectively). This demonstrates that a density-informed multi-modal AI framework can overcome key limitations of unimodal screening and deliver high performance across diverse breast compositions. The proposed framework is interpretable, low-cost, and easily deployable, offering a practical path to improving breast cancer screening outcomes in both high-resource and resource-limited settings.
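
The abstract describes a simple gating rule: route dense breasts to Thermalytix AI and fatty breasts to mammography AI. Below is a minimal sketch of that routing, assuming BI-RADS density categories and per-model malignancy scores; the paper's exact interface is not given.

```python
# Hypothetical density-gated routing between the two models' scores.

def route_prediction(density_category, mammo_score, thermalytix_score):
    """density_category: BI-RADS 'A'-'D'; scores: malignancy probabilities."""
    dense = density_category in ("C", "D")  # heterogeneously/extremely dense
    return thermalytix_score if dense else mammo_score

# Example: a dense-breast case is scored by the thermal-imaging model.
print(route_prediction("C", mammo_score=0.12, thermalytix_score=0.71))
```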



Debugging Concept Bottleneck Models through Removal and Retraining

Enouen, Eric, Galhotra, Sainyam

arXiv.org Artificial Intelligence

Concept Bottleneck Models (CBMs) use a set of human-interpretable concepts to predict the final task label, enabling domain experts to not only validate the CBM's predictions, but also intervene on incorrect concepts at test time. However, these interventions fail to address systemic misalignment between the CBM and the expert's reasoning, such as when the model learns shortcuts from biased data. To address this, we present a general interpretable debugging framework for CBMs that follows a two-step process of Removal and Retraining. In the Removal step, experts use concept explanations to identify and remove any undesired concepts. In the Retraining step, we introduce CBDebug, a novel method that leverages the interpretability of CBMs as a bridge for converting concept-level user feedback into sample-level auxiliary labels. These labels are then used to apply supervised bias mitigation and targeted augmentation, reducing the model's reliance on undesired concepts. We evaluate our framework with both real and automated expert feedback, and find that CBDebug significantly outperforms prior retraining methods across multiple CBM architectures (PIP-Net, Post-hoc CBM) and benchmarks with known spurious correlations.
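
CBDebug itself is not reproduced here; the sketch below only illustrates the general idea of turning concept-level feedback into sample-level auxiliary labels, with an assumed activation threshold and a simple reweighting rule standing in for the paper's supervised bias mitigation.

```python
# Hedged sketch: concepts an expert marks as undesired are converted into
# per-sample bias labels, which can then drive reweighting at retraining time.
import torch

def bias_labels_from_concepts(concept_acts, undesired_idx, threshold=0.5):
    """concept_acts: (N, K) activations; undesired_idx: concepts to remove."""
    # A sample is "bias-affected" if any undesired concept fires strongly.
    return (concept_acts[:, undesired_idx] > threshold).any(dim=1).long()

def reweight(bias_labels, upweight=3.0):
    # Upweight bias-affected samples so retraining cannot ignore them.
    return 1.0 + (upweight - 1.0) * bias_labels.float()

acts = torch.rand(8, 10)  # 8 samples, 10 concept activations
labels = bias_labels_from_concepts(acts, undesired_idx=[2, 7])
weights = reweight(labels)
```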


Grounding DINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models

Rasaee, Hamza, Koleilat, Taha, Rivaz, Hassan

arXiv.org Artificial Intelligence

Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 (Segment Anything Model 2) to enable object segmentation across multiple ultrasound organs. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized. Fifteen of these datasets were used for fine-tuning and validating Grounding DINO with Low-Rank Adaptation (LoRA) in the ultrasound domain, while the remaining 3 were held out entirely for testing to evaluate performance on unseen distributions. Comprehensive experiments demonstrate that our approach outperforms state-of-the-art segmentation methods, including UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, and SAMUS, on most seen datasets while maintaining strong performance on unseen datasets without additional fine-tuning. These results underscore the promise of VLMs for scalable and robust ultrasound image analysis, reducing dependence on large, organ-specific annotated datasets. We will publish our code. Ultrasound imaging is extensively used in clinical practice due to its safety, affordability, portability, and real-time capabilities. It plays a vital role in cancer screening, disease staging, and image-guided interventions across various anatomies, including the breast, thyroid, liver, prostate, kidney, and musculoskeletal system. Despite these advantages, ultrasound imaging presents intrinsic challenges that complicate automated analysis. Issues like low tissue contrast, speckle noise, acoustic shadowing, and operator-dependent variability degrade image quality and hinder the precise delineation of anatomical structures, ultimately affecting the performance and generalizability of automated segmentation algorithms.
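
The fine-tuning code is not yet released, so the following is a generic sketch of the Low-Rank Adaptation technique the paper applies to Grounding DINO: a frozen pretrained linear layer augmented with a trainable rank-r update. Layer sizes and hyperparameters are illustrative.

```python
# Generic LoRA layer: y = Wx + (alpha/r) * B A x, with W frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r               # standard LoRA scaling

    def forward(self, x):
        # Only A and B receive gradients during fine-tuning.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(256, 256))
out = layer(torch.randn(4, 256))
```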


HistoViT: Vision Transformer for Accurate and Scalable Histopathological Cancer Diagnosis

Ahmed, Faisal

arXiv.org Artificial Intelligence

Accurate and scalable cancer diagnosis remains a critical challenge in modern pathology, particularly for malignancies such as breast, prostate, bone, and cervical cancers, which exhibit complex histological variability. In this study, we propose a transformer-based deep learning framework for multi-class tumor classification in histopathological images. Leveraging a fine-tuned Vision Transformer (ViT) architecture, our method addresses key limitations of conventional convolutional neural networks, offering improved performance, reduced preprocessing requirements, and enhanced scalability across tissue types. To adapt the model for histopathological cancer images, we implement a streamlined preprocessing pipeline that converts tiled whole-slide images into PyTorch tensors and standardizes them through data normalization. This ensures compatibility with the ViT architecture and enhances both convergence stability and overall classification performance. We evaluate our model on four benchmark datasets: ICIAR2018 (breast), SICAPv2 (prostate), UT-Osteosarcoma (bone), and SipakMed (cervical) -- demonstrating consistent outperformance over existing deep learning methods. Our approach achieves classification accuracies of 99.32%, 96.92%, 95.28%, and 96.94% for breast, prostate, bone, and cervical cancers respectively, with area under the ROC curve (AUC) scores exceeding 99% across all datasets. These results confirm the robustness, generalizability, and clinical potential of transformer-based architectures in digital pathology. Our work represents a significant advancement toward reliable, automated, and interpretable cancer diagnosis systems that can alleviate diagnostic burdens and improve healthcare outcomes.
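
The preprocessing described (tiled whole-slide images converted to tensors, then normalized) maps onto a standard torchvision pipeline. A minimal sketch, assuming 224x224 ViT input and ImageNet statistics; the paper's exact values may differ.

```python
# Sketch of the tile-preprocessing step before the ViT.
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                     # ViT input size
    transforms.ToTensor(),                             # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

tile = Image.new("RGB", (512, 512))    # stand-in for a real WSI tile
batch = preprocess(tile).unsqueeze(0)  # shape: (1, 3, 224, 224)
```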


Google's New AI Puts Breasts on Minors--And J. D. Vance

The Atlantic - Technology

Sorry to tell you this, but Google's new AI shopping tool appears eager to give J. D. Vance breasts. This week, at its annual software conference, Google released an AI tool called Try It On, which acts as a virtual dressing room: Upload images of yourself while shopping for clothes online, and Google will show you what you might look like in a selected garment. Curious to play around with the tool, we began uploading images of famous men--Vance, Sam Altman, Abraham Lincoln, Michelangelo's David, Pope Leo XIV--and dressed them in linen shirts and three-piece suits. But when we tested a number of articles designed for women on these famous men, the tool quickly adapted: Whether it was a mesh shirt, a low-cut top, or even just a T-shirt, Google's AI rapidly spun up images of the vice president, the CEO of OpenAI, and the vicar of Christ with breasts. It's not just men: When we uploaded images of women, the tool repeatedly enhanced their décolletage or added breasts that were not visible in the original images.


A Multi-Modal AI System for Screening Mammography: Integrating 2D and 3D Imaging to Improve Breast Cancer Detection in a Prospective Clinical Study

Park, Jungkyu, Witowski, Jan, Xu, Yanqi, Trivedi, Hari, Gichoya, Judy, Brown-Mulry, Beatrice, Westerhoff, Malte, Moy, Linda, Heacock, Laura, Lewin, Alana, Geras, Krzysztof J.

arXiv.org Artificial Intelligence

Although digital breast tomosynthesis (DBT) improves diagnostic performance over full-field digital mammography (FFDM), false-positive recalls remain a concern in breast cancer screening. We developed a multi-modal artificial intelligence system integrating FFDM, synthetic mammography, and DBT to provide breast-level predictions and bounding-box localizations of suspicious findings. Our AI system, trained on approximately 500,000 mammography exams, achieved 0.945 AUROC on an internal test set. It demonstrated capacity to reduce recalls by 31.7% and radiologist workload by 43.8% while maintaining 100% sensitivity, underscoring its potential to improve clinical workflows. External validation confirmed strong generalizability, reducing the gap to a perfect AUROC by 35.31%-69.14% relative to strong baselines. In prospective deployment across 18 sites, the system reduced recall rates for low-risk cases. An improved version, trained on over 750,000 exams with additional labels, further reduced the gap by 18.86%-56.62% across large external datasets. Overall, these results underscore the importance of utilizing all available imaging modalities, demonstrate the potential for clinical impact, and indicate that test error can be reduced further by increasing the training set when using large-capacity neural networks.
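
The claim of reducing recalls while maintaining 100% sensitivity implies a thresholding step like the one sketched below: pick the largest score threshold that still ranks every cancer above it, then count the negatives that fall below it. Data and variable names are hypothetical.

```python
# Illustrative triage threshold at 100% sensitivity (made-up data).
import numpy as np

def max_threshold_at_full_sensitivity(scores, labels):
    """scores: AI malignancy scores; labels: 1 = biopsy-proven cancer."""
    return scores[labels == 1].min()  # keep all cancers at or above threshold

scores = np.array([0.05, 0.10, 0.62, 0.88, 0.30, 0.94])
labels = np.array([0,    0,    1,    1,    0,    1   ])
t = max_threshold_at_full_sensitivity(scores, labels)
avoidable = np.mean(scores[labels == 0] < t)  # negatives triaged out
print(f"threshold = {t:.2f}, negatives below threshold = {avoidable:.0%}")
```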