Grounding DINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models

Hamza Rasaee, Taha Koleilat, Hassan Rivaz

arXiv.org Artificial Intelligence 

Abstract-- Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 (Segment Anything Model 2) to enable object segmentation across multiple organs in ultrasound images. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized. Of these, 15 were used to fine-tune and validate Grounding DINO, which was adapted to the ultrasound domain using Low-Rank Adaptation (LoRA), and 3 were held out entirely for testing to evaluate performance on unseen distributions. Comprehensive experiments demonstrate that our approach outperforms state-of-the-art segmentation methods, including UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, and SAMUS, on most seen datasets while maintaining strong performance on unseen datasets without additional fine-tuning. These results underscore the promise of VLMs for scalable and robust ultrasound image analysis and for reducing dependence on large, organ-specific annotated datasets. We will publish our code.

Ultrasound imaging is extensively used in clinical practice due to its safety, affordability, portability, and real-time capabilities. It plays a vital role in cancer screening, disease staging, and image-guided interventions across various anatomies, including the breast, thyroid, liver, prostate, kidney, and musculoskeletal system. Despite these advantages, ultrasound imaging presents intrinsic challenges that complicate automated analysis. Low tissue contrast, speckle noise, acoustic shadowing, and operator-dependent variability degrade image quality and hinder the precise delineation of anatomical structures, which in turn limits the performance and generalizability of automated segmentation algorithms.
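For readers unfamiliar with the Low-Rank Adaptation (LoRA) step mentioned in the abstract, the following is a minimal sketch of the general idea in plain Python. It is not the authors' implementation. It uses the standard LoRA formulation, in which a frozen pretrained weight W is augmented by a trainable low-rank update scaled by alpha / r. The class name `LoRALinear`, the helper `matvec`, and the values of `r` and `alpha` are hypothetical choices for illustration. This section does not state which Grounding DINO layers are adapted or which rank is used.

```python
import random


def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    """Illustrative linear layer: frozen W plus a low-rank update (alpha / r) * B @ A.

    Hypothetical sketch; the names and defaults are assumptions, not the paper's code.
    """

    def __init__(self, W, r=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # pretrained weight, kept frozen during fine-tuning
        # A (r x d_in) starts small and random; B (d_out x r) starts at zero,
        # so the adapted layer initially matches the pretrained layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path: W x
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path: B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        """Count only the entries of A and B, the parameters LoRA would train."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
    layer = LoRALinear(W, r=1, alpha=2.0)
    x = [1.0, 2.0, 3.0]
    # At initialization B is zero, so the output equals the frozen layer's output.
    print(layer.forward(x) == matvec(W, x))
    print(layer.trainable_params(), "trainable vs", len(W) * len(W[0]), "frozen")
```

Only A and B would be updated during fine-tuning. For large weight matrices this is a small fraction of the full parameter count, which is why this kind of adaptation suits settings with limited annotated data.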