DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications

P. Malaisree, S. Youwai, T. Kitkobsin, S. Janrungautai, D. Amorndechaphon, P. Rojanavasu

arXiv.org Artificial Intelligence 

Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves a 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows an 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5). The 2-4× inference overhead (21-33 ms versus 8-16 ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.

Keywords: object detection, DINO pre-trained weights, transfer learning, YOLO, self-supervised learning, small datasets

1. Introduction

Object detection has emerged as a fundamental computer vision task with widespread applications across numerous domains, from autonomous vehicles to industrial inspection systems. The evolution of deep learning architectures, particularly the You Only Look Once (YOLO) family of models (Khanam and Hussain, 2024; Tian et al., 2025; Wang et al., 2024; Wang and Liao, 2024; Youwai et al., 2024), has significantly advanced real-time object detection capabilities by achieving a remarkable balance between accuracy and computational efficiency.
However, conventional object detection frameworks face persistent challenges when deployed in specialized domains with limited training data, where traditional random weight initialization strategies often lead to suboptimal convergence and inadequate feature representation learning.
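
The latency and frame-rate figures quoted in the abstract can be cross-checked with simple arithmetic. The following is a minimal sketch, assuming that frame rate is the reciprocal of single-image latency (batch size 1, no pipelining) and that the two latency ranges are paired like-for-like (fastest with fastest, slowest with slowest). Neither assumption is stated in the text, and the helper name `fps_from_latency_ms` is hypothetical.

```python
# Cross-check of the abstract's latency and frame-rate figures.
# All numeric inputs come from the abstract; the pairing of the two
# ranges and the helper name are assumptions made for illustration.


def fps_from_latency_ms(latency_ms: float) -> float:
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


dino_yolo_ms = (21.0, 33.0)  # DINO-YOLO latency range (fastest, slowest)
baseline_ms = (8.0, 16.0)    # baseline YOLO latency range (fastest, slowest)

fps_fast = fps_from_latency_ms(dino_yolo_ms[0])  # frame rate at 21 ms
fps_slow = fps_from_latency_ms(dino_yolo_ms[1])  # frame rate at 33 ms

# Overhead ratios under the assumed like-for-like pairing.
ratio_fast = dino_yolo_ms[0] / baseline_ms[0]
ratio_slow = dino_yolo_ms[1] / baseline_ms[1]

print(f"Implied frame rate: {fps_slow:.1f} to {fps_fast:.1f} FPS")
print(f"Overhead ratio: {ratio_slow:.2f}x to {ratio_fast:.2f}x")
```

Under these assumptions, the 21-33 ms range corresponds to roughly 30.3 to 47.6 FPS, which agrees with the reported 30-47 FPS.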