Slot-VLM: Object-Event Slots for Video-Language Modeling

Neural Information Processing Systems 

Video-Language Models (VLMs), powered by advances in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is developing an effective method to encapsulate video content into a set of representative tokens that align with LLMs. In this work, we introduce Slot-VLM, a new framework that generates semantically decomposed video tokens, in the form of object-wise and event-wise visual representations, to facilitate LLM inference.
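The abstract does not spell out the token-generation module, but the slot terminology suggests the slot-attention family of mechanisms (Locatello et al., 2020), in which a small set of learned slots compete for input features and thereby yield a compact, decomposed token set. Below is a minimal PyTorch sketch of that standard mechanism for context; the module name, dimensions, and the omission of the residual MLP refinement step are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    """Minimal slot attention (after Locatello et al., 2020): slots compete
    for input features, producing a small set of decomposed tokens."""

    def __init__(self, num_slots: int, dim: int, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        # Slots are initialized by sampling from a learned Gaussian.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (batch, num_features, dim) -- e.g. patch features of one
        # frame (spatial/object-wise decomposition) or per-frame features
        # stacked over time (temporal/event-wise decomposition).
        b, n, d = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            b, self.num_slots, d, device=inputs.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis: slots compete for each feature,
            # which is what drives the semantic decomposition.
            attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)
            updates = attn @ v  # (batch, num_slots, dim)
            slots = self.gru(updates.reshape(-1, d),
                             slots.reshape(-1, d)).view(b, self.num_slots, d)
        return slots  # a compact set of decomposed tokens to feed the LLM

# Usage: 196 patch tokens of one frame reduced to 8 object-style slots.
feats = torch.randn(2, 196, 256)
print(SlotAttention(num_slots=8, dim=256)(feats).shape)  # torch.Size([2, 8, 256])
```

Under this reading, running such a module over spatial patch features would give the abstract's object-wise tokens, and running it over features across time would give the event-wise tokens; the paper itself should be consulted for the actual architecture.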