Vision Transformers for Kidney Stone Image Classification: A Comparative Study with CNNs
Reyes-Amezcua, Ivan, Lopez-Tiro, Francisco, Larose, Clement, Mendez-Vazquez, Andres, Ochoa-Ruiz, Gilberto, Daul, Christian
arXiv.org Artificial Intelligence
Kidney stone classification from endoscopic images is critical for personalized treatment and recurrence prevention. While convolutional neural networks (CNNs) have shown promise in this task, their limited ability to capture long-range dependencies can hinder performance under variable imaging conditions. This study presents a comparative analysis of Vision Transformers (ViTs) and CNN-based models, evaluating their performance on two ex vivo datasets comprising CCD-camera and flexible-ureteroscope images. The ViT-base model pretrained on ImageNet-21k consistently outperformed a ResNet50 baseline across multiple imaging conditions. For instance, in the most visually complex subset (Section patches from endoscopic images), the ViT model achieved 95.2% accuracy and a 95.1% F1-score, compared to 64.5% and 59.3% with ResNet50. In the mixed-view subset from CCD-camera images, ViT reached 87.1% accuracy versus 78.4% with the CNN. These improvements extend to precision and recall as well. The results demonstrate that ViT-based architectures provide superior classification performance and offer a scalable alternative to conventional CNNs for kidney stone image analysis.
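The four metrics named in the abstract (accuracy, precision, recall, F1-score) can be illustrated with a minimal, dependency-free sketch. It assumes macro averaging over classes, which the abstract does not specify; the class labels "A", "B", "C" and the function names are hypothetical placeholders, not taken from the study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision_recall_f1(y_true, y_pred):
    """Per-class precision/recall/F1, averaged with equal class weight.

    Macro averaging is an assumption made for this sketch; the abstract
    does not state which averaging scheme was used.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        # A class with no predictions (or no true samples) scores 0.0.
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    # Hypothetical labels for six image patches, three stone classes.
    y_true = ["A", "A", "B", "B", "C", "C"]
    y_pred = ["A", "B", "B", "B", "C", "A"]
    print("accuracy:", accuracy(y_true, y_pred))
    print("macro P/R/F1:", macro_precision_recall_f1(y_true, y_pred))
```

Accuracy and F1 can diverge when errors concentrate in a few classes, which is one reason to read both figures together when comparing two models.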
Aug-22-2025
- Country:
- Europe > France
- Grand Est > Meurthe-et-Moselle > Nancy (0.04)
- North America > Mexico
- Jalisco > Guadalajara (0.04)
- Genre:
- Research Report > New Finding (0.88)
- Industry:
- Health & Medicine > Therapeutic Area
- Nephrology (0.89)
- Urology (1.00)
- Technology: