Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition
Surikuchi, Aditya K, Fernández, Raquel, Pezzelle, Sandro
–arXiv.org Artificial Intelligence
For both human speakers and machine learning models, the task requires connecting the visual data causally, to generate a narrative consistent with the contents of the images. As for model-generated stories, evaluation is one of the key challenges due to the inherently creative nature of the task. Since human-written stories are typically used to train visual storytelling models--under the assumption that these stories provide a good learning signal--most previous work evaluated model-generated stories by directly comparing [...] which we test in a zero-shot manner. We show that LLaVA (Liu et al., 2024), a powerful foundation model, performs best on the task, but only slightly better than TAPM (Yu et al., 2021), a model designed for visual storytelling which is 50 times smaller than LLaVA. Second, given insights derived from our proposed distance-based evaluation method, we upgrade the visual and language components of TAPM, resulting in a model that achieves comparable performance to LLaVA with a significantly lower number of parameters.
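To make the idea of a distance-based evaluation concrete, below is a minimal, purely illustrative sketch, not the paper's actual method: it compares how similarly a simple repetition metric is distributed over model-generated and human-written stories. The `repetition_score` metric, the toy story data, and the choice of the Wasserstein distance are all assumptions introduced here for illustration.

```python
# Illustrative only: compare the distribution of a toy repetition metric
# over model-generated vs. human-written stories. A smaller distance means
# the model's scores are distributed more like the human ones.
from scipy.stats import wasserstein_distance

def repetition_score(story: str) -> float:
    """Toy proxy metric: fraction of repeated tokens in a story."""
    tokens = story.lower().split()
    return 1.0 - len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical example stories; a real evaluation would use corpus data.
human_stories = ["the dog ran to the park", "a child laughed at the fair"]
model_stories = ["the dog dog ran ran to the park", "a child child laughed"]

human_scores = [repetition_score(s) for s in human_stories]
model_scores = [repetition_score(s) for s in model_stories]

print(wasserstein_distance(human_scores, model_scores))
```

Under this framing, a lower distance would indicate that the model's stories are more human-like along the measured dimension.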
Jul-5-2024
- Country:
  - Asia > Middle East
    - Qatar (0.14)
  - North America > United States
    - California (0.14)
    - Michigan (0.14)
- Genre:
  - Research Report > New Finding (0.46)
- Technology: