HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection With Multichannel Audio and Multiscale Visual Cues
Li, Xiwen, Tang, Xiaoya, Tasdizen, Tolga
arXiv.org Artificial Intelligence
ABSTRACT Idling vehicle detection (IVD) uses surveillance video and multichannel audio to localize and classify vehicles in the last frame as moving, idling, or engine-off in pick-up zones. IVD faces three challenges: (i) modality heterogeneity between visual cues and audio patterns; (ii) large box scale variation requiring multi-resolution detection; and (iii) training instability due to coupled detection heads. The previous end-to-end (E2E) model [1] with simple CBAM-based [2] bi-modal attention fails to handle these issues and often misses vehicles. We propose HAVT-IVD, a heterogeneity-aware network with a visual feature pyramid and decoupled heads. Experiments show HAVT-IVD improves mAP by 7.66 over the disjoint baseline and 9.42 over the E2E baseline.
Oct-27-2025
- Country:
- Asia > Myanmar
- Tanintharyi Region > Dawei (0.04)
- North America > United States
- Utah > Salt Lake County > Salt Lake City (0.05)
- Genre:
- Research Report (0.64)
- Technology: