AITopics | keyframe

We train the network by adopting mean-squared error as our loss function and using the AdamW optimizer [27] with a learning rate of 0.01. Specifically, we first apply a 2-layer MLP on the output of the positional encoding layer, and then we stack 5 NeRV blocks with upscale factors 5, 3, 2, 2, 2, respectively. To be specific, on the UVG-HD benchmark, we set the number of levels as 15, the number of features per level as 2, the maximum entries per level as 224, and the coarsest resolution as 16. Table 7: Decoding time of coordinate-based representations measured with FPS (higher is better).
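The arithmetic implied by these settings can be checked with a minimal sketch. The upscale factors, number of levels, and features per level come from the excerpt above; the 9x16 base feature map and the idea that per-level features are concatenated are assumptions made here for illustration and are not stated in the excerpt.

```python
from math import prod

# Upscale factors of the five stacked NeRV blocks, as quoted in the excerpt.
UPSCALE_FACTORS = [5, 3, 2, 2, 2]

# Grid settings quoted for the UVG-HD benchmark.
NUM_LEVELS = 15
FEATURES_PER_LEVEL = 2


def output_resolution(base_hw, factors=UPSCALE_FACTORS):
    """Return (height, width) after applying each upscale factor in turn."""
    height, width = base_hw
    for factor in factors:
        height, width = height * factor, width * factor
    return height, width


def grid_feature_dim(levels=NUM_LEVELS, feats=FEATURES_PER_LEVEL):
    """Feature size per coordinate, ASSUMING per-level features are concatenated."""
    return levels * feats


# Overall spatial upscaling of the five blocks combined.
total_scale = prod(UPSCALE_FACTORS)

# HYPOTHETICAL base feature map of 9x16; the excerpt does not give this value.
full_hw = output_resolution((9, 16))

print(total_scale, full_hw, grid_feature_dim())
```

Under the assumed 9x16 starting map, the combined factor of 120 would yield a 1080x1920 frame, which matches an HD resolution; a different base size would scale accordingly.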

artificial intelligence, machine learning, readysetgo 0, (18 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.50)


5297e56ac65ba2bfa70ee9fc4818c042-Paper-Conference.pdf

Neural Information Processing Systems, Feb-8-2026, 23:57:46 GMT

computer vision and pattern recognition, representation, video, (12 more...)


Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > South Korea > Gyeongsangbuk-do > Pohang (0.04)

Genre: Research Report (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


Self-Chained Image-Language Model for Video Localization and Question Answering

Neural Information Processing Systems, Dec-27-2025, 05:02:09 GMT

Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on and rewind the moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and question answering on videos.
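The failure mode of uniform frame sampling described above can be illustrated with a small sketch. This is not the SeViLA implementation; the function names, the segment-center sampling rule, and the frame counts are all hypothetical choices made for illustration.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Pick one frame index at the center of each equal-length segment."""
    segment = num_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


def frames_in_interval(indices, start, end):
    """Return the sampled indices that fall inside the half-open range [start, end)."""
    return [i for i in indices if start <= i < end]


# HYPOTHETICAL video of 300 frames from which 4 frames are sampled uniformly.
sampled = uniform_frame_indices(num_frames=300, num_samples=4)

# HYPOTHETICAL query-relevant moment covering frames 200 to 249.
relevant = frames_in_interval(sampled, start=200, end=250)

print(sampled)   # [37, 112, 187, 262]
print(relevant)  # []
```

In this toy setting none of the four sampled frames lands inside the relevant moment, which is the situation a query-aware keyframe localizer is meant to avoid.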

name change, self-chained image-language model, video localization, (5 more...)


Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)



Overleaf Example

7ac484b0f1a1719ad5be9aa8c8455fbb-Paper-Conference.pdf

5c594bf6223b67109441c9e0c97542ed-Paper-Conference.pdf

Appendix for EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought Yao Mu

d93ed5b6db83be78efb0d05ae420158e-AuthorFeedback.pdf

bcf9d6bd14a2095866ce8c950b702341-AuthorFeedback.pdf

7503cfacd12053d309b6bed5c89de212-Paper.pdf

Appendix: Scalable Neural Video Representations with Learnable Positional Features
