keyframe
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- Information Technology > Artificial Intelligence > Vision (0.96)
- Information Technology > Artificial Intelligence > Robots (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
Appendix: Scalable Neural Video Representations with Learnable Positional Features
We train the network by adopting mean-squared error as our loss function and using the AdamW optimizer [27] with a learning rate of 0.01. Specifically, we first apply a 2-layer MLP on the output of the positional encoding layer, and then we stack 5 NeRV blocks with upscale factors 5, 3, 2, 2, 2, respectively. To be specific, on the UVG-HD benchmark, we set the number of levels as 15, the number of features per level as 2, the maximum entries per level as 2^24, and the coarsest resolution as 16. Table 7: Decoding time of coordinate-based representations measured with FPS (higher is better).
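The stacked upscale factors quoted above multiply into one overall spatial scale, which a short calculation makes concrete. This is only a sketch of the arithmetic, not the authors' code. The 9x16 base feature map is an assumption taken for illustration (the snippet does not state the size of the MLP output), and `output_resolution` is a hypothetical helper name.

```python
from math import prod

# Upscale factors of the five stacked NeRV blocks, as listed in the text.
UPSCALE_FACTORS = [5, 3, 2, 2, 2]

# Assumption (not stated in the text): the MLP output is reshaped into a
# 9x16 (height x width) feature map before the first NeRV block.
BASE_HW = (9, 16)


def output_resolution(base_hw, factors):
    """Return (height, width) after applying every upscale factor in turn."""
    scale = prod(factors)
    height, width = base_hw
    return height * scale, width * scale


total_scale = prod(UPSCALE_FACTORS)
out_h, out_w = output_resolution(BASE_HW, UPSCALE_FACTORS)
print(f"total upscale: {total_scale}x, output: {out_w}x{out_h}")
```

Under that assumed base size, the five blocks scale the feature map by 120x in each spatial dimension, which gives a 1920x1080 frame.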
- North America > United States > New York > New York County > New York City (0.04)
- Asia > South Korea > Gyeongsangbuk-do > Pohang (0.04)
Self-Chained Image-Language Model for Video Localization and Question Answering
Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on and rewind the moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and question answering on videos.
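A toy comparison can show why uniform frame sampling may miss the relevant moment. The sketch below is not SeViLA's localizer. The per-frame relevance scores are made-up numbers standing in for whatever a query-aware model would produce, and `uniform_sample` and `topk_keyframes` are hypothetical helper names.

```python
def uniform_sample(num_frames, k):
    """Pick k frame indices evenly spaced over the video (query-agnostic)."""
    step = num_frames / k
    return [int(step * i + step / 2) for i in range(k)]


def topk_keyframes(scores, k):
    """Pick the k highest-scoring frame indices, returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical relevance scores for a 32-frame video: only frames 21-24
# matter for the question, every other frame is near-irrelevant.
RELEVANT = {21, 22, 23, 24}
scores = [0.9 if i in RELEVANT else 0.05 for i in range(32)]

uniform = uniform_sample(len(scores), 4)
localized = topk_keyframes(scores, 4)

print("uniform:  ", uniform, "hits:", len(RELEVANT & set(uniform)))
print("localized:", localized, "hits:", len(RELEVANT & set(localized)))
```

With these illustrative scores, the evenly spaced indices all fall outside the relevant span, while score-based selection recovers it. The cost of obtaining such scores is the difficulty the abstract points to.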