VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
Rokuss, Maximilian, Langenberg, Moritz, Kirchhoff, Yannick, Isensee, Fabian, Hamm, Benjamin, Ulrich, Constantin, Regnery, Sebastian, Bauer, Lukas, Katsigiannopulos, Efthimios, Norajitra, Tobias, Maier-Hein, Klaus
–arXiv.org Artificial Intelligence
We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell
Nov-17-2025
- Country:
  - Africa (0.45)
  - Europe > Germany (0.28)
  - North America > United States (0.45)
- Genre:
  - Research Report (1.00)
- Industry:
  - Health & Medicine
    - Diagnostic Medicine > Imaging (1.00)
    - Health Care Technology (1.00)
    - Nuclear Medicine (1.00)
    - Therapeutic Area
      - Cardiology/Vascular Diseases (1.00)
      - Neurology (1.00)
      - Oncology
        - Carcinoma (0.67)
        - Lung Cancer (0.46)
      - Pulmonary/Respiratory Diseases (1.00)
- Technology: