yolov12
DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications
Malaisree, P., Youwai, S., Kitkobsin, T., Janrungautai, S., Amorndechaphon, D., Rojanavasu, P.
Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5). The 2-4× inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.
Keywords: object detection, DINO pre-trained weights, transfer learning, YOLO, self-supervised learning, small datasets
1. Introduction
Object detection has emerged as a fundamental computer vision task with widespread applications across numerous domains, from autonomous vehicles to industrial inspection systems. The evolution of deep learning architectures, particularly the You Only Look Once (YOLO) family of models (Khanam and Hussain, 2024; Tian et al., 2025; Wang et al., 2024; Wang and Liao, 2024; Youwai et al., 2024), has significantly advanced real-time object detection capabilities by achieving a remarkable balance between accuracy and computational efficiency. However, conventional object detection frameworks face persistent challenges when deployed in specialized domains with limited training data, where traditional random weight initialization strategies often lead to suboptimal convergence and inadequate feature representation learning.
- Transportation (1.00)
- Construction & Engineering (0.68)
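The latency and frame-rate figures quoted in the DINO-YOLO abstract above can be cross-checked with a short calculation. The sketch below is an illustration only. It assumes throughput is simply the reciprocal of per-image latency (single-image batches, no pipelining), and the function and constant names are hypothetical rather than taken from the paper.

```python
def latency_ms_to_fps(latency_ms: float) -> float:
    """Convert per-image latency in milliseconds to frames per second.

    Assumes one image is processed at a time, so FPS = 1000 / latency.
    """
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Latency range quoted in the abstract for DINO-YOLO (milliseconds).
DINO_YOLO_LATENCY_MS = (21.0, 33.0)

if __name__ == "__main__":
    fastest_ms, slowest_ms = DINO_YOLO_LATENCY_MS
    # The slowest latency gives the lowest frame rate, and vice versa.
    print(f"{latency_ms_to_fps(slowest_ms):.1f} FPS at {slowest_ms:.0f} ms")
    print(f"{latency_ms_to_fps(fastest_ms):.1f} FPS at {fastest_ms:.0f} ms")
```

Under that assumption, 21-33 ms corresponds to roughly 30-47 frames per second, which is consistent with the FPS range stated in the abstract.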
A Deep Learning-Based CCTV System for Automatic Smoking Detection in Fire Exit Zones
Sadat, Sami, Hossain, Mohammad Irtiza, Sifat, Junaid Ahmed, Rafi, Suhail Haque, Alvi, Md. Waseq Alauddin, Rhaman, Md. Khalilur
A real-time deep learning smoking detection system for CCTV surveillance of fire exit areas is proposed in this research, motivated by the critical safety requirements of these areas. The dataset contained 8,124 images from 20 different scenarios, along with images from 2,708 raw samples showing low-light areas. We evaluated three advanced object detection models (YOLOv8, YOLOv11, and YOLOv12) and then developed a custom model derived from YOLOv8, with added structures for demanding surveillance contexts. The proposed model outperformed the other evaluated models, reaching a recall of 78.90% and mAP@50 of 83.70% and delivering the best detection results across different environments. Inference performance was evaluated on multiple edge devices using multithreaded operations. The Jetson Xavier NX processed information at the fastest real-time rate of 52-97 ms, which establishes its suitability for time-sensitive operations. The study establishes that the proposed system delivers a fair and adjustable platform for monitoring public safety processes while enabling automatic regulatory compliance checks.
Robust Pan-Cancer Mitotic Figure Detection with YOLOv12
Bourgade, Raphaël, Balezo, Guillaume, Feki, Hana, Monier, Lily, Blons, Matthieu, Blondel, Alice, Loussouarn, Delphine, Vincent-Salomon, Anne, Walter, Thomas
Detecting mitotic figures (MFs) in histopathology images remains a challenging task. Their quantification traditionally relies on the manual identification of "hot spots" by pathologists, followed by visual counting--an approach that is inherently subjective and may not reliably reflect the true proliferative activity of a tumor. With the rise of digital pathology and artificial intelligence, numerous efforts have been made to automate mitosis detection in order to enhance accuracy, reproducibility, and scalability. Among these, the MItosis DOmain Generalization (MIDOG) challenges have emerged as a key benchmark for evaluating the generalizability of detection algorithms under realistic domain shifts. The 2021 edition (1) addressed scanner-induced variability using breast cancer WSIs, while the 2022 edition (2) extended the scope to include multiple tissue types and species, introducing further biological diversity. The 2025 MIDOG challenge (3) builds on these foundations with the most comprehensive mitosis-annotated dataset to date, and introduces two tasks: (1) detecting mitotic figures in arbitrary tumor tissue, and (2) determining whether a mitotic figure is atypical or normal. These tasks represent a significant step toward developing robust mitosis detection systems that generalize across diverse and complex histological conditions. In this work, we present a high-performance detection pipeline based on the YOLOv12 object detection architecture.
A Computer Vision Pipeline for Individual-Level Behavior Analysis: Benchmarking on the Edinburgh Pig Dataset
Yang, Haiyu, Liu, Enhong, Sun, Jennifer, Sharma, Sumit, van Leerdam, Meike, Franceschini, Sebastien, Niu, Puchun, Hostens, Miel
Animal behavior analysis plays a crucial role in understanding animal welfare, health status, and productivity in agricultural settings. However, traditional manual observation methods are time-consuming, subjective, and limited in scalability. We present a modular pipeline that leverages open-sourced state-of-the-art computer vision techniques to automate animal behavior analysis in a group housing environment. Our approach combines state-of-the-art models for zero-shot object detection, motion-aware tracking and segmentation, and advanced feature extraction using vision transformers for robust behavior recognition. The pipeline addresses challenges including animal occlusions and group housing scenarios as demonstrated in indoor pig monitoring. We validated our system on the Edinburgh Pig Behavior Video Dataset for multiple behavioral tasks. Our temporal model achieved 94.2% overall accuracy, representing a 21.2 percentage point improvement over existing methods. The pipeline demonstrated robust tracking capabilities with 93.3% identity preservation score and 89.3% object detection precision. The modular design suggests potential for adaptation to other contexts, though further validation across species would be required. The open-source implementation provides a scalable solution for behavior monitoring, contributing to precision pig farming and welfare assessment through automated, objective, and continuous analysis.
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
- Health & Medicine (1.00)
- Food & Agriculture > Agriculture (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
The Impact of Image Resolution on Face Detection: A Comparative Analysis of MTCNN, YOLOv XI and YOLOv XII models
Ömercikoğlu, Ahmet Can, Yönügül, Mustafa Mansur, Erdoğmuş, Pakize
Face detection is a crucial component in many AI-driven applications such as surveillance, biometric authentication, and human-computer interaction. However, real-world conditions like low-resolution imagery present significant challenges that degrade detection performance. In this study, we systematically investigate the impact of input resolution on the accuracy and robustness of three prominent deep learning-based face detectors: YOLOv11, YOLOv12, and MTCNN. Using the WIDER FACE dataset, we conduct extensive evaluations across multiple image resolutions (160x160, 320x320, and 640x640) and assess each model's performance using metrics such as precision, recall, mAP50, mAP50-95, and inference time. Results indicate that YOLOv11 outperforms YOLOv12 and MTCNN in terms of detection accuracy, especially at higher resolutions, while YOLOv12 exhibits slightly better recall. MTCNN, although competitive in landmark localization, lags in real-time inference speed. Our findings provide actionable insights for selecting resolution-aware face detection models suitable for varying operational constraints.
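The precision, recall, and mAP50 metrics named in the abstract above all rest on matching predicted boxes to ground-truth boxes at an intersection-over-union (IoU) threshold of 0.5. The following is a minimal sketch of that building block, not the study's evaluation code. It assumes boxes are given as `(x1, y1, x2, y2)` corners and that predictions are already sorted by descending confidence, and it uses a simple greedy one-to-one matching. All names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds: List[Box], truths: List[Box],
                     threshold: float = 0.5) -> Tuple[float, float]:
    """Greedy matching: each ground-truth box is matched at most once.

    `preds` is assumed to be sorted by descending confidence.
    """
    matched = set()
    true_positives = 0
    for p in preds:
        best_j, best_iou = -1, threshold
        for j, t in enumerate(truths):
            if j in matched:
                continue
            overlap = iou(p, t)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall


if __name__ == "__main__":
    truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
    preds = [(0, 0, 10, 10), (50, 50, 60, 60)]  # one hit, one false alarm
    print(precision_recall(preds, truths))
```

mAP50 extends this idea by averaging precision over recall levels as the confidence threshold is swept, and mAP50-95 additionally averages over IoU thresholds from 0.50 to 0.95.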
Strong Baseline: Multi-UAV Tracking via YOLOv12 with BoT-SORT-ReID
Detecting and tracking multiple unmanned aerial vehicles (UAVs) in thermal infrared video is inherently challenging due to low contrast, environmental noise, and small target sizes. This paper provides a straightforward approach to address multi-UAV tracking in thermal infrared video, leveraging recent advances in detection and tracking. Instead of relying on the YOLOv5 with the DeepSORT pipeline, we present a tracking framework built on YOLOv12 and BoT-SORT, enhanced with tailored training and inference strategies. We evaluate our approach following the metrics from the 4th Anti-UAV Challenge and demonstrate competitive performance. Notably, we achieve strong results without using contrast enhancement or temporal information fusion to enrich UAV features, highlighting our approach as a "Strong Baseline" for the multi-UAV tracking task. We provide implementation details, in-depth experimental analysis, and a discussion of potential improvements. The code is available at https://github.com/wish44165/YOLOv12-BoT-SORT-ReID .
- Oceania > Australia (0.04)
- North America > United States (0.04)
- North America > Canada (0.04)
- Information Technology (0.48)
- Aerospace & Defense > Aircraft (0.34)
Improved YOLOv12 with LLM-Generated Synthetic Data for Enhanced Apple Detection and Benchmarking Against YOLOv11 and YOLOv10
Sapkota, Ranjan, Karkee, Manoj
This study evaluated the performance of the YOLOv12 object detection model and compared it against YOLOv11 and YOLOv10 for apple detection in commercial orchards, with model training completed entirely on synthetic images generated by Large Language Models (LLMs). The YOLOv12n configuration achieved the highest precision at 0.916, the highest recall at 0.969, and the highest mean Average Precision (mAP@50) at 0.978. In comparison, the YOLOv11 series was led by YOLO11x, which achieved the highest precision at 0.857, recall at 0.85, and mAP@50 at 0.91. For the YOLOv10 series, YOLOv10b and YOLOv10l both achieved the highest precision at 0.85, with YOLOv10n achieving the highest recall at 0.8 and mAP@50 at 0.89. These findings demonstrated that YOLOv12, when trained on realistic LLM-generated datasets, surpassed its predecessors in key performance metrics. The technique also offered a cost-effective solution by reducing the need for extensive manual data collection in the agricultural field. In addition, this study compared the computational efficiency of all versions of YOLOv12, YOLOv11, and YOLOv10: YOLOv11n reported the lowest inference time at 4.7 ms, compared to YOLOv12n's 5.6 ms and YOLOv10n's 5.9 ms. Although YOLOv12 is newer and more accurate than YOLOv11 and YOLOv10, YOLO11n remains the fastest model among the YOLOv10, YOLOv11, and YOLOv12 series. (Index: YOLOv12, YOLOv11, YOLOv10, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO Object detection)
- North America > United States > Washington (0.04)
- North America > United States > New York > Tompkins County > Ithaca (0.04)
- Asia > Nepal (0.04)
YOLOv12: A Breakdown of the Key Architectural Features
Alif, Mujadded Al Rabbani, Hussain, Muhammad
This paper presents an architectural analysis of YOLOv12, a significant advancement in single-stage, real-time object detection that builds upon the strengths of its predecessors while introducing key improvements. The model incorporates an optimised backbone (R-ELAN), 7x7 separable convolutions, and FlashAttention-driven area-based attention, yielding improved feature extraction, enhanced efficiency, and robust detections. With multiple model variants, similar to its predecessors, YOLOv12 offers scalable solutions for both latency-sensitive and high-accuracy applications. Experimental results show consistent gains in mean average precision (mAP) and inference speed, making YOLOv12 a compelling choice for applications in autonomous systems, security, and real-time analytics. By achieving an optimal balance between computational efficiency and performance, YOLOv12 sets a new benchmark for real-time computer vision, facilitating deployment across diverse hardware platforms, from edge devices to high-performance clusters.
YOLOv12: Attention-Centric Real-Time Object Detectors
Tian, Yunjie, Ye, Qixiang, Doermann, David
Enhancing the network architecture of the YOLO framework has been crucial for a long time, but has focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capabilities. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming advanced YOLOv10-N / YOLOv11-N by 2.1%/1.2% mAP with a comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Indiana > Marion County > Lawrence (0.04)
- Asia > China > Guangxi Province > Nanning (0.04)
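The YOLOv12 abstract above reports the YOLOv12-N score together with its margins over two earlier models, so the baseline scores it implies can be recovered by subtraction. The sketch below does only that arithmetic. It assumes the quoted gains are absolute percentage points of mAP (not relative improvements), and the helper names are hypothetical.

```python
# Figures quoted in the abstract: YOLOv12-N mAP and its gains over baselines.
YOLOV12_N_MAP = 40.6
GAIN_POINTS = {"YOLOv10-N": 2.1, "YOLOv11-N": 1.2}


def implied_baseline(map_new: float, gain_points: float) -> float:
    """Baseline mAP implied by a new score and an absolute gain in points."""
    return round(map_new - gain_points, 1)


def implied_baselines() -> dict:
    """Implied baseline mAP for every model listed in GAIN_POINTS."""
    return {
        name: implied_baseline(YOLOV12_N_MAP, gain)
        for name, gain in GAIN_POINTS.items()
    }


if __name__ == "__main__":
    for name, score in implied_baselines().items():
        print(f"{name}: implied mAP of {score}%")
```

Read this way, the abstract implies 38.5% mAP for YOLOv10-N and 39.4% mAP for YOLOv11-N.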