Keep it Consistent: Topic-Aware Storytelling from an Image Stream via Iterative Multi-agent Communication
Wang, Ruize, Wei, Zhongyu, Li, Piji, Shan, Haijun, Zhang, Ji, Zhang, Qi, Huang, Xuanjing
–arXiv.org Artificial Intelligence
Ruize Wang 1, Zhongyu Wei 2, Piji Li 3, Haijun Shan 4, Ji Zhang 4, Qi Zhang 5, Xuanjing Huang 5
1 Academy for Engineering and Technology, Fudan University, China
2 School of Data Science, Fudan University, China
3 Tencent AI Lab, China
4 Zhejiang Lab, China
5 School of Computer Science, Fudan University, China
{rzwang18,zywei,qz,xjhuang}@fudan.edu.cn

Abstract
Visual storytelling aims to automatically generate a narrative paragraph from a sequence of images. Existing approaches construct a text description independently for each image and roughly concatenate the descriptions into a story, which leads to semantically incoherent content. In this paper, we propose a new approach to visual storytelling that introduces a topic description task to detect the global semantic context of an image stream. A story is then constructed under the guidance of the topic description. To combine the two generation tasks, we propose a multi-agent communication framework that regards the topic description generator and the story generator as two agents and learns them simultaneously via an iterative updating mechanism. We validate our approach on VIST, where quantitative results, ablations, and human evaluation show that our method generates higher-quality stories than state-of-the-art methods.

1 Introduction
Image-to-text generation is an important topic in artificial intelligence (AI) that connects computer vision (CV) and natural language processing (NLP). Popular tasks include image captioning (Karpathy and Fei-Fei 2015; Ren et al. 2017; Vinyals et al. 2017) and question answering (Antol et al. 2015; Yu et al. 2017; Fan et al. 2018a; Fan et al. 2018b), which aim to generate a short sentence or a phrase conditioned on certain visual information.
Visual storytelling requires the model to understand the main idea of the image stream and generate coherent sentences. Most existing methods (Huang et al. 2016; Liu et al. 2017; Yu, Bansal, and Berg 2017; Wang et al. 2018a) for visual storytelling extend image captioning approaches without considering the topic information of the image sequence, which leads to semantically incoherent content.
arXiv.org Artificial Intelligence
Nov-11-2019
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Representation & Reasoning > Agents (1.00)
- Natural Language (1.00)
- Machine Learning > Neural Networks
- Deep Learning (0.46)