Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time

Jan-14-2025–arXiv.org Artificial Intelligence

Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time in an explainable and programmatic way, to connect learning-based vision and language state-of-the-art models and provide a solution to the long-standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich and relevant textual descriptions on videos collected from a variety of datasets, using both standard metrics (e.g. BLEU, ROUGE) and the modern LLM-as-a-Jury approach.

Moreover, such models suffer from overfitting such that once given a video from an unseen context or distribution the quality and accuracy of the description drops, as our evaluations prove. On the other hand, VLLMs have shown impressive results, being capable of generating long, rich descriptions of videos. Unfortunately VLLMs still share some of the same weaknesses as previous methods: they are largely unexplainable and they still rely on sampling frames to process a video. Moreover, top-performing models such as GPT, Claude or Gemini are not open and are only accessible via a paid API. We argue that one of the main reasons why this interdisciplinary cross-domain task is still far from being solved is that we still lack an explainable way to bridge this apparently insurmountable gap. Explainability could provide a more analytical and stage-wise way to make the transition from vision to language that is both trustworthy and makes ...

large language model, machine learning, natural language

Genre:
- Research Report > Promising Solution (0.54)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
