Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition

Aditya K. Surikuchi, Raquel Fernández, Sandro Pezzelle

arXiv.org Artificial Intelligence 

For both human speakers and machine learning models, the task requires connecting the visual data causally, to generate a narrative consistent with the contents of the images. As for model-generated stories, evaluation is one of the key challenges due to the inherently creative nature of the task. Since human-written stories are typically used to train visual storytelling models--under the assumption that these stories provide a good learning signal--most previous work evaluated model-generated stories by directly comparing [...] which we test in a zero-shot manner. We show that LLaVA (Liu et al., 2024), a powerful foundation model, performs best on the task, but only slightly better than TAPM (Yu et al., 2021), a model designed for visual storytelling that is 50 times smaller than LLaVA. Second, given insights derived from our proposed distance-based evaluation method, we upgrade the visual and language components of TAPM, resulting in a model that achieves comparable performance to LLaVA with a significantly lower number of parameters.
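The abstract mentions a distance-based evaluation method that compares model-generated stories against human-written ones along coherence, grounding, and repetition, but does not define it here. Purely as a minimal sketch of the general idea--not the authors' actual metric--the Python snippet below assumes each story has already been scored on the three criteria, and measures human-likeness as the average Wasserstein distance between the model's and humans' score distributions. The function name, criteria keys, and all numbers are invented for illustration.

    # Hedged illustration of a distance-based evaluation; the paper's
    # actual metric definitions may differ. Each story gets one score per
    # criterion, and model quality is the distance between the model's
    # score distribution and the human one (lower = more human-like).
    from scipy.stats import wasserstein_distance

    CRITERIA = ("coherence", "grounding", "repetition")

    def distribution_distance(model_scores, human_scores):
        """Average Wasserstein distance across the three criteria.

        model_scores / human_scores: dicts mapping each criterion name
        to a list of per-story scores.
        """
        return sum(
            wasserstein_distance(model_scores[c], human_scores[c])
            for c in CRITERIA
        ) / len(CRITERIA)

    # Toy usage (numbers are not from the paper):
    human = {"coherence": [0.8, 0.9, 0.7],
             "grounding": [0.6, 0.7, 0.8],
             "repetition": [0.2, 0.3, 0.1]}
    model = {"coherence": [0.5, 0.6, 0.4],
             "grounding": [0.5, 0.6, 0.7],
             "repetition": [0.4, 0.5, 0.3]}
    print(distribution_distance(model, human))

Under this framing, a lower distance means the model's score distribution is statistically closer to the human one, which matches the abstract's notion of human-likeness rather than absolute story quality.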
