Does Video-Text Pretraining Help Open-Vocabulary Online Action Detection?

Mar-20-2026, 16:06:03 GMT–Neural Information Processing Systems

Video understanding relies on accurate action detection for temporal analysis. However, existing mainstream methods have limitations in real-world applications due to their offline and closed-set evaluation approaches, as well as their dependence on manual annotations. To address these challenges and enable real-time action understanding in open-world scenarios, we propose OV-OAD, a zero-shot online action detector that leverages vision-language models and learns solely from text supervision.

artificial intelligence, large language model, natural language

Technology:
- Information Technology > Artificial Intelligence
  - Vision (0.60)
  - Natural Language > Large Language Model (0.47)