Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation

Zeng, Runhao, Deng, Qi, Zhang, Ronghao, Niu, Shuaicheng, Chen, Jian, Hu, Xiping, Leung, Victor C. M.

Jun-17-2025–arXiv.org Artificial Intelligence

Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels, a new concept in the context of video TTA. Specifically, we propose an audio-to-video label mapping method by first employing pre-trained audio models to classify audio signals extracted from videos and then mapping the audio-based predictions to video label spaces through large language models, thereby establishing a connection between the audio categories and video labels. To effectively leverage the generated pseudo-labels, we present a flexible adaptation cycle that determines the optimal number of adaptation iterations for each sample, based on changes in loss and consistency across different views. This enables a customized adaptation process for each sample. Experimental results on two widely used datasets (UCF101-C and Kinetics-Sounds-C), as well as on two newly constructed audio-video TTA datasets (AVE-C and AVMIT-C) with various corruption types, demonstrate the superiority of our approach.

Deep neural networks have achieved significant success in various video analysis tasks [1]-[4], but most methods assume that training and testing data come from the same distribution.

This work was supported by the National Natural Science Foundation of China (NSFC) (Grant Nos.

Qi Deng, Ronghao Zhang and Jian Chen are with the School of Software Engineering, South China University of Technology, Guangzhou, 510000, China. Shuaicheng Niu is with the College of Computing and Data Science, Nanyang Technological University, 639798, Singapore.
Existing video test-time adaptation methods rely on visual supervision, overlooking the rich information inherent in audio. We propose a novel approach that involves extracting audio from videos and mapping the results of an open-source audio model to the video label space.
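
The audio-to-video label mapping described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `AUDIO_TO_VIDEO` dictionary is a hypothetical stand-in for what a large language model might return when asked which video labels an audio class could correspond to, and the function name `audio_pseudo_label`, the even split of probability across matching labels, and the `min_score` threshold are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from audio classes to video labels. In the paper this
# link is established with a large language model; here it is hard-coded.
AUDIO_TO_VIDEO = {
    "guitar strumming": ["PlayingGuitar"],
    "water splashing": ["Diving", "Rowing"],
    "applause": [],  # no plausible video label
}


def audio_pseudo_label(audio_probs, mapping, video_labels, min_score=0.5):
    """Project audio class probabilities onto the video label space.

    audio_probs:  dict of audio class -> probability from an audio classifier.
    mapping:      dict of audio class -> list of candidate video labels.
    video_labels: set of labels the video model can predict.
    min_score:    assumed threshold below which no pseudo-label is issued.

    Returns (label, score), or None when the audio gives no confident cue.
    """
    scores = defaultdict(float)
    for audio_cls, prob in audio_probs.items():
        targets = [v for v in mapping.get(audio_cls, []) if v in video_labels]
        for video_label in targets:
            # Assumption: split the audio probability evenly across candidates.
            scores[video_label] += prob / len(targets)
    if not scores:
        return None
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return (label, score) if score >= min_score else None


if __name__ == "__main__":
    labels = {"PlayingGuitar", "Diving", "Rowing"}
    probs = {"guitar strumming": 0.8, "water splashing": 0.2}
    print(audio_pseudo_label(probs, AUDIO_TO_VIDEO, labels))
```

Returning `None` for ambiguous or unmapped audio reflects one possible design choice: a sample whose audio does not point clearly at a single video label would simply fall back to visual-only adaptation rather than receive a noisy pseudo-label.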

large language model, machine learning, natural language, (18 more...)


Country:
- Asia > China > Guangdong Province > Guangzhou (0.24)

Genre:
- Research Report > New Finding (0.93)

Industry:
- Leisure & Entertainment (1.00)
- Media > Music (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (0.93)
  - Natural Language > Large Language Model (0.91)
  - Machine Learning > Neural Networks (0.66)
