Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition

Aditya K. Surikuchi, Raquel Fernández, Sandro Pezzelle

arXiv.org Artificial Intelligence 

For both human speakers and machine learning models, the task requires connecting the visual data causally, to generate a narrative consistent with the contents of the images. As for model-generated stories, evaluation is one of the key challenges due to the inherently creative nature of the task. Since human-written stories are typically used to train visual storytelling models--under the assumption that these stories provide a good learning signal--most previous work evaluated model-generated stories by directly comparing [...] which we test in a zero-shot manner. We show that LLaVA (Liu et al., 2024), a powerful foundation model, performs best on the task, but only slightly better than TAPM (Yu et al., 2021), a model designed for visual storytelling that is 50 times smaller than LLaVA. Second, given insights derived from our proposed distance-based evaluation method, we upgrade the visual and language components of TAPM, resulting in a model that achieves comparable performance to LLaVA with a significantly lower number of parameters.
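The abstract mentions a distance-based evaluation method that compares model-generated stories against human-written ones along coherence, grounding, and repetition, but does not define it here. Purely as a minimal sketch of the general idea--not the authors' actual metric--the Python snippet below assumes each story has already been scored on the three criteria, and measures human-likeness as the average Wasserstein distance between the model's and humans' score distributions. The function name, criteria keys, and all numbers are invented for illustration.

    # Hedged illustration of a distance-based evaluation; the paper's
    # actual metric definitions may differ. Each story gets one score per
    # criterion, and model quality is the distance between the model's
    # score distribution and the human one (lower = more human-like).
    from scipy.stats import wasserstein_distance

    CRITERIA = ("coherence", "grounding", "repetition")

    def distribution_distance(model_scores, human_scores):
        """Average Wasserstein distance across the three criteria.

        model_scores / human_scores: dicts mapping each criterion name
        to a list of per-story scores.
        """
        return sum(
            wasserstein_distance(model_scores[c], human_scores[c])
            for c in CRITERIA
        ) / len(CRITERIA)

    # Toy usage (numbers are not from the paper):
    human = {"coherence": [0.8, 0.9, 0.7],
             "grounding": [0.6, 0.7, 0.8],
             "repetition": [0.2, 0.3, 0.1]}
    model = {"coherence": [0.5, 0.6, 0.4],
             "grounding": [0.5, 0.6, 0.7],
             "repetition": [0.4, 0.5, 0.3]}
    print(distribution_distance(model, human))

Under this framing, a lower distance means the model's score distribution is statistically closer to the human one, which matches the abstract's notion of human-likeness rather than absolute story quality.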
