Multimodal Event Transformer for Image-guided Story Ending Generation
–arXiv.org Artificial Intelligence
Image-guided story ending generation (IgSEG) is the task of generating a story ending from a given story plot and an ending image. Existing methods focus on cross-modal feature fusion but overlook reasoning over, and mining implicit information from, the story plot and ending image. To address this drawback, we propose a multimodal event transformer, an event-based reasoning framework for IgSEG. Specifically, we construct visual and semantic event graphs from the story plot and ending image, and use event-based reasoning to mine implicit information within each single modality. Next, we connect the visual and semantic event graphs and use cross-modal fusion to integrate features from the different modalities. In addition, we propose a multimodal injector that adaptively passes essential information to the decoder. We also present incoherence detection to improve the model's understanding of story plot context and the robustness of its graph modeling. Experimental results show that our method achieves state-of-the-art performance on image-guided story ending generation.
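The pipeline described in the abstract (reasoning within each modality, connecting the two graphs, then passing fused information to the decoder) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the function names (`mean_aggregate`, `chain_edges`, `complete_edges`, `cross_edges`, `gated_inject`, `fuse`) are hypothetical, plain mean message passing stands in for the paper's event-based reasoning, and a scalar sigmoid gate stands in for the multimodal injector.

```python
import math


def mean_aggregate(features, edges):
    """One round of mean message passing on an undirected graph.

    Each node is replaced by the average of itself and its neighbours.
    This is a stand-in for event-based reasoning, not the paper's operator.
    """
    n = len(features)
    neighbours = {i: [i] for i in range(n)}  # self-loop for every node
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in neighbours[i]) / len(neighbours[i])
         for d in range(dim)]
        for i in range(n)
    ]


def chain_edges(n):
    """Assumed semantic graph: consecutive plot events are linked."""
    return [(i, i + 1) for i in range(n - 1)]


def complete_edges(n):
    """Assumed visual graph: all image-region nodes are linked."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def cross_edges(n_sem, n_vis):
    """Connect every semantic node to every visual node.

    Semantic nodes are indexed 0..n_sem-1, visual nodes follow them.
    """
    return [(s, n_sem + v) for s in range(n_sem) for v in range(n_vis)]


def gated_inject(decoder_state, context):
    """Hypothetical injector: a sigmoid gate mixes context into the state."""
    score = sum(h * c for h, c in zip(decoder_state, context))
    gate = 1.0 / (1.0 + math.exp(-score))
    return [gate * c + (1.0 - gate) * h
            for h, c in zip(decoder_state, context)]


def fuse(semantic, visual, decoder_state):
    """Single-modality reasoning, cross-modal fusion, then injection."""
    semantic = mean_aggregate(semantic, chain_edges(len(semantic)))
    visual = mean_aggregate(visual, complete_edges(len(visual)))
    joint = mean_aggregate(semantic + visual,
                           cross_edges(len(semantic), len(visual)))
    dim = len(joint[0])
    pooled = [sum(node[d] for node in joint) / len(joint) for d in range(dim)]
    return gated_inject(decoder_state, pooled)


if __name__ == "__main__":
    plot_events = [[1.0, 0.0], [0.0, 1.0]]   # toy sentence-level features
    image_regions = [[1.0, 1.0]]             # toy region-level features
    print(fuse(plot_events, image_regions, [0.0, 0.0]))
```

In a real system the node features would come from text and image encoders, and the reasoning and fusion steps would be learned layers rather than fixed averages; the sketch only shows the order of operations the abstract describes.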
Jan-26-2023
- Country:
- Oceania > Australia
- North America
- United States
- Michigan (0.04)
- Utah > Salt Lake County
- Salt Lake City (0.04)
- New York > New York County
- New York City (0.04)
- Nevada > Clark County
- Las Vegas (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Massachusetts > Suffolk County
- Boston (0.04)
- Hawaii > Honolulu County
- Honolulu (0.04)
- California
- San Diego County > San Diego (0.04)
- Los Angeles County > Long Beach (0.04)
- Canada > British Columbia
- Europe
- Asia
- China (0.04)
- Macao (0.04)
- Taiwan > Taiwan Province
- Taipei (0.04)
- Genre:
- Research Report (0.84)
- Technology: