RAVEN: An Agentic Framework for Multimodal Entity Discovery from Large-Scale Video Collections

Apr-10-2025–arXiv.org Artificial Intelligence

We present RAVEN (Recognition and Adaptation of Video ENtities), an adaptive AI agent framework designed for multimodal entity discovery and retrieval in large-scale video collections. Synthesizing information across visual, audio, and textual modalities, RAVEN autonomously processes video data to produce structured, actionable representations for downstream tasks. Key contributions include (1) a category understanding step to infer video themes and general-purpose entities, (2) a schema generation mechanism that dynamically defines domain-specific entities and attributes, and (3) a rich entity extraction process that leverages semantic retrieval and schema-guided prompting. RAVEN is designed to be model-agnostic, allowing the integration of different vision-language models (VLMs) and large language models (LLMs) based on application-specific requirements. This flexibility supports diverse applications in personalized search, content discovery, and scalable information retrieval, enabling practical applications across vast datasets.
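The three steps named in the abstract can be illustrated with a minimal sketch. This is not RAVEN's implementation: the abstract gives no API, so every name below (`Model`, `Schema`, `understand_category`, `generate_schema`, `extract_entities`, `run_pipeline`, `stub_model`) and every prompt string is hypothetical. The sketch works on a text transcript only, omits the semantic retrieval step, and represents model-agnosticism as a plain callable that any VLM or LLM wrapper could satisfy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model interface: any callable mapping a prompt to text.
# It stands in for a VLM or LLM wrapper; it is not RAVEN's actual API.
Model = Callable[[str], str]


@dataclass
class Schema:
    category: str
    fields: List[str]


def understand_category(model: Model, transcript: str) -> str:
    """Step 1 (assumed form): infer the video's theme from its transcript."""
    return model(f"Name the category of this video: {transcript}").strip()


def generate_schema(model: Model, category: str) -> Schema:
    """Step 2 (assumed form): request entity fields as a comma-separated list."""
    raw = model(f"List entity fields for category '{category}', comma-separated.")
    fields = [f.strip() for f in raw.split(",") if f.strip()]
    return Schema(category=category, fields=fields)


def extract_entities(model: Model, schema: Schema, transcript: str) -> Dict[str, str]:
    """Step 3 (assumed form): schema-guided prompting, one prompt per field."""
    return {
        field: model(f"From the text, extract '{field}': {transcript}").strip()
        for field in schema.fields
    }


def run_pipeline(model: Model, transcript: str) -> Dict[str, str]:
    """Chain the three steps; swapping `model` changes the backend only."""
    category = understand_category(model, transcript)
    schema = generate_schema(model, category)
    return extract_entities(model, schema, transcript)


def stub_model(prompt: str) -> str:
    """Deterministic stand-in so the sketch runs without a real model."""
    if prompt.startswith("Name the category"):
        return "cooking"
    if prompt.startswith("List entity fields"):
        return "dish, chef"
    if "'dish'" in prompt:
        return "ramen"
    return "unknown"


if __name__ == "__main__":
    print(run_pipeline(stub_model, "A chef prepares ramen."))
```

Because each step takes the model as an argument, a different backend can be substituted per step without touching the pipeline logic, which is one plausible reading of the model-agnostic design described above.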

information retrieval, large language model, machine learning

Country:
- North America > United States > California > San Francisco County > San Francisco (0.14)

Genre:
- Research Report (0.40)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Information Retrieval (0.90)
  - Machine Learning > Neural Networks
    - Deep Learning (0.96)