When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

Al-Hamadani, Samer

arXiv.org Artificial Intelligence 

Object detection is a foundational computer vision capability enabling diverse applications from autonomous vehicles to retail analytics, with modern deep learning approaches achieving remarkable technical performance exceeding 90% mean Average Precision on standardized benchmarks [1, 2]. However, technical accuracy represents only one dimension of deployment viability: real-world system selection requires evaluating cost-effectiveness, the relationship between detection performance and the total economic investment required to achieve it [3, 4]. Traditional supervised detectors, exemplified by the YOLO architecture family [2, 5], rely fundamentally on manually annotated training data; industry reports estimate annotation costs between $0.10 and $0.50 per bounding box [6, 7], translating to $9,000-$45,000 to establish a 100-category detection system with sufficient training data. Vision-Language Models (VLMs) represent an alternative paradigm, achieving object detection through zero-shot inference without task-specific supervision [8-10]. Pre-trained on billions of image-text pairs, VLMs accept natural language object descriptions and generate bounding box predictions through learned visual-linguistic alignment, fundamentally eliminating annotation requirements.
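The quoted cost range can be reproduced with a back-of-envelope calculation. Note the box count below is an inference, not a figure from the text: a $9,000-$45,000 total at $0.10-$0.50 per box implies roughly 90,000 annotated bounding boxes for the 100-category system (e.g., ~900 boxes per category).

```python
def annotation_cost(num_boxes: int, cost_per_box: float) -> float:
    """Total manual-annotation cost for a supervised detector,
    modeled as a simple per-bounding-box rate."""
    return num_boxes * cost_per_box

# Hypothetical dataset size implied by the abstract's figures:
# 100 categories x ~900 boxes each = 90,000 boxes.
NUM_BOXES = 100 * 900

low = annotation_cost(NUM_BOXES, 0.10)   # lower bound of industry rate
high = annotation_cost(NUM_BOXES, 0.50)  # upper bound of industry rate
print(f"Estimated annotation cost: ${low:,.0f} - ${high:,.0f}")
# → Estimated annotation cost: $9,000 - $45,000
```

Under this model, annotation cost scales linearly with dataset size, which is why the zero-shot VLM alternative changes the economics: its annotation term is zero regardless of category count.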