When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models
arXiv.org Artificial Intelligence
Object detection is a foundational computer vision capability that enables applications ranging from autonomous vehicles to retail analytics, and modern deep learning approaches exceed 90% mean Average Precision on standardized benchmarks [1, 2]. Technical accuracy, however, is only one dimension of deployment viability: real-world system selection also requires evaluating cost-effectiveness, the relationship between detection performance and the total economic investment required to achieve it [3, 4]. Traditional supervised detectors, exemplified by the YOLO architecture family [2, 5], depend on manually annotated training data. Industry reports estimate annotation costs of $0.10 to $0.50 per bounding box [6, 7], which translates to $9,000-$45,000 to establish a 100-category detection system with sufficient training data. Vision-Language Models (VLMs) offer an alternative paradigm that performs object detection through zero-shot inference, without task-specific supervision [8-10]. Pre-trained on billions of image-text pairs, VLMs accept natural language object descriptions and generate bounding box predictions through learned visual-linguistic alignment, eliminating the annotation requirement.
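The annotation cost range quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the abstract gives the per-box rates and the totals but not the dataset size, so the figure of 900 boxes per category is an assumption inferred by dividing the totals by the rates ($9,000 / $0.10 = $45,000 / $0.50 = 90,000 boxes over 100 categories). The function name and constants are hypothetical.

```python
def annotation_cost_range(num_boxes, low_cents=10, high_cents=50):
    """Return the (low, high) annotation cost in dollars for num_boxes boxes.

    Rates are given in cents per bounding box ($0.10 and $0.50 in the
    abstract) so the multiplication stays in exact integer arithmetic.
    """
    return num_boxes * low_cents / 100, num_boxes * high_cents / 100


# Assumption: 900 boxes per category, inferred from the quoted totals;
# the abstract does not state this number directly.
BOXES_PER_CATEGORY = 900
NUM_CATEGORIES = 100

total_boxes = BOXES_PER_CATEGORY * NUM_CATEGORIES  # 90,000 boxes
low, high = annotation_cost_range(total_boxes)
print(f"{total_boxes:,} boxes: ${low:,.0f} - ${high:,.0f}")
# prints "90,000 boxes: $9,000 - $45,000"
```

Under this assumption the quoted range follows directly from the per-box rates; a different dataset size scales both bounds linearly.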
Oct-21-2025
- Country:
- Asia > Middle East
- Iraq > Baghdad Governorate > Baghdad (0.04)
- North America > United States (0.14)
- Genre:
- Research Report
- Experimental Study (0.46)
- New Finding (0.46)
- Industry:
- Automobiles & Trucks > Manufacturer (0.68)
- Health & Medicine > Diagnostic Medicine (0.68)
- Information Technology (1.00)
- Semiconductors & Electronics (0.68)
- Transportation
- Electric Vehicle (0.93)
- Ground > Road (1.00)
- Technology: