HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection With Multichannel Audio and Multiscale Visual Cues

Li, Xiwen, Tang, Xiaoya, Tasdizen, Tolga

Oct-27-2025–arXiv.org Artificial Intelligence

ABSTRACT Idling vehicle detection (IVD) uses surveillance video and multichannel audio to localize and classify vehicles in the last frame as moving, idling, or engine-off in pick-up zones. IVD faces three challenges: (i) modality heterogeneity between visual cues and audio patterns; (ii) large box scale variation requiring multi-resolution detection; and (iii) training instability due to coupled detection heads. The previous end-to-end (E2E) model [1] with simple CBAM-based [2] bi-modal attention fails to handle these issues and often misses vehicles. We propose HAVT-IVD, a heterogeneity-aware network with a visual feature pyramid and decoupled heads. Experiments show HAVT-IVD improves mAP by 7.66 over the disjoint baseline and 9.42 over the E2E baseline.

artificial intelligence, detection, machine learning

Country:
- North America > United States > Utah (0.14)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks (0.46)
  - Vision > Image Understanding (0.36)
