Training-free Online Video Step Grounding
Neural Information Processing Systems
Given a task and the set of steps that compose it, Video Step Grounding (VSG) aims to detect which steps are performed in a video. Standard approaches to this task require a labeled training set (e.g., with step-level annotations or narrations), which can be costly to collect. Moreover, they process the full video offline, which limits their applicability in scenarios that require online decisions. Thus, in this work, we explore how to perform VSG online and without training. We achieve this by exploiting the zero-shot capabilities of recent Large Multimodal Models (LMMs).
Jun-19-2026, 17:39:17 GMT
- Country:
- Europe (0.28)
- Genre:
- Workflow (1.00)
- Research Report
- Experimental Study (1.00)
- New Finding (0.93)
- Industry:
- Education > Educational Setting > Online (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Representation & Reasoning (1.00)
- Natural Language > Large Language Model (1.00)
- Machine Learning
- Neural Networks > Deep Learning (0.69)
- Performance Analysis > Accuracy (0.46)