Does Video-Text Pretraining Help Open-Vocabulary Online Action Detection?
–Neural Information Processing Systems
Video understanding relies on accurate action detection for temporal analysis. However, existing mainstream methods have limitations in real-world applications due to their offline and closed-set evaluation approaches, as well as their dependence on manual annotations. To address these challenges and enable real-time action understanding in open-world scenarios, we propose OV-OAD, a zero-shot online action detector that leverages vision-language models and learns solely from text supervision.
Dec-25-2025, 22:55:33 GMT