Vision Transformers for Kidney Stone Image Classification: A Comparative Study with CNNs

Reyes-Amezcua, Ivan, Lopez-Tiro, Francisco, Larose, Clement, Mendez-Vazquez, Andres, Ochoa-Ruiz, Gilberto, Daul, Christian

Aug-22-2025–arXiv.org Artificial Intelligence

Kidney stone classification from endoscopic images is critical for personalized treatment and recurrence prevention. While convolutional neural networks (CNNs) have shown promise in this task, their limited ability to capture long-range dependencies can hinder performance under variable imaging conditions. This study presents a comparative analysis of Vision Transformers (ViTs) and CNN-based models, evaluating their performance on two ex vivo datasets comprising CCD camera and flexible ureteroscope images. The ViT-base model pretrained on ImageNet-21k consistently outperformed a ResNet50 baseline across multiple imaging conditions. For instance, in the most visually complex subset (Section patches from endoscopic images), the ViT model achieved 95.2% accuracy and a 95.1% F1-score, compared to 64.5% and 59.3% with ResNet50. In the mixed-view subset from CCD-camera images, ViT reached 87.1% accuracy versus 78.4% with the CNN. These improvements extend to precision and recall as well. The results demonstrate that ViT-based architectures provide superior classification performance and offer a scalable alternative to conventional CNNs for kidney stone image analysis.
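The comparison above is reported in terms of accuracy, precision, recall, and F1-score. As a minimal sketch of how such figures can be computed from predicted and true class labels, the following uses only the Python standard library. The function name `classification_metrics`, the placeholder class labels, and the choice of macro-averaging across classes are assumptions made for illustration; the abstract does not state which averaging scheme the authors used.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro-averaging (unweighted mean over classes) is an assumption here;
    the abstract does not specify the averaging scheme.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # predicted p, but the sample was not p
            fn[t] += 1      # sample of class t was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with placeholder class names (not the paper's stone types).
metrics = classification_metrics(
    y_true=["A", "A", "B", "B"],
    y_pred=["A", "B", "B", "B"],
)
print(metrics)
```

With imbalanced classes, macro-averaged F1 can differ noticeably from accuracy, which is one reason studies like this one report both.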

Country:
- Europe > France
  - Grand Est > Meurthe-et-Moselle > Nancy (0.04)
- North America > Mexico
  - Jalisco > Guadalajara (0.04)

Genre:
- Research Report > New Finding (0.88)

Industry:
- Health & Medicine > Therapeutic Area
  - Nephrology (0.89)
  - Urology (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning
    - Neural Networks > Deep Learning (0.68)
    - Performance Analysis > Accuracy (1.00)
  - Vision (1.00)