Streaming Detection of Queried Event Start
Neural Information Processing Systems
Robotics, autonomous driving, augmented reality, and many other embodied computer vision applications must react quickly to user-defined events unfolding in real time. We address this setting by proposing a novel task for multimodal video understanding: Streaming Detection of Queried Event Start (SDQES). The goal of SDQES is to identify the beginning of a complex event, as described by a natural language query, with high accuracy and low latency. We introduce a new benchmark based on the Ego4D dataset, along with new task-specific metrics, to study streaming multimodal detection of diverse events in an egocentric video setting. Inspired by parameter-efficient fine-tuning methods in NLP and for video tasks, we propose adapter-based baselines that enable image-to-video transfer learning, allowing for efficient online video modeling. We evaluate four vision-language backbones and three adapter architectures in both short-clip and untrimmed video settings.
May-25-2025, 15:03:28 GMT
- Country:
  - Europe (0.28)
  - North America > United States (0.28)
- Genre:
  - Research Report > New Finding (1.00)
- Industry:
  - Government (0.92)
  - Information Technology > Security & Privacy (0.46)
  - Law (0.92)