pororo
TemporalStory: Enhancing Consistency in Story Visualization using Spatial-Temporal Attention
Story visualization presents a challenging task in text-to-image generation, requiring not only the rendering of visual details from text prompt but also ensuring consistency across images. Recently, most approaches address inconsistency problem using an auto-regressive manner conditioned on previous image-sentence pairs. However, they overlook the fact that story context is dispersed across all sentences. The auto-regressive approach fails to encode information from susequent image-sentence pairs, thus unable to capture the entirety of the story context. To address this, we introduce TemporalStory, leveraging Spatial-Temporal attention to model complex spatial and temporal dependencies in images, enabling the generation of coherent images based on a given storyline. In order to better understand the storyline context, we introduce a text adapter capable of integrating information from other sentences into the embedding of the current sentence. Additionally, to utilize scene changes between story images as guidance for the model, we propose the StoryFlow Adapter to measure the degree of change between images. Through extensive experiments on two popular benchmarks, PororoSV and FlintstonesSV, our TemporalStory outperforms the previous state-of-the-art in both story visualization and story continuation tasks.
Masked Generative Story Transformer with Character Guidance and Caption Augmentation
Papadimitriou, Christos, Filandrianos, Giorgos, Lymperaiou, Maria, Stamou, Giorgos
Story Visualization (SV) is a challenging generative vision task, that requires both visual quality and consistency between different frames in generated image sequences. Previous approaches either employ some kind of memory mechanism to maintain context throughout an auto-regressive generation of the image sequence, or model the generation of the characters and their background separately, to improve the rendering of characters. On the contrary, we embrace a completely parallel transformer-based approach, exclusively relying on Cross-Attention with past and future captions to achieve consistency. Additionally, we propose a Character Guidance technique to focus on the generation of characters in an implicit manner, by forming a combination of text-conditional and character-conditional logits in the logit space. We also employ a caption-augmentation technique, carried out by a Large Language Model (LLM), to enhance the robustness of our approach. The combination of these methods culminates into state-of-the-art (SOTA) results over various metrics in the most prominent SV benchmark (Pororo-SV), attained with constraint resources while achieving superior computational complexity compared to previous arts. The validity of our quantitative results is supported by a human survey.
- North America > United States (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
Story Visualization by Online Text Augmentation with Context Memory
Ahn, Daechul, Kim, Daneul, Song, Gwangmo, Kim, Seung Hwan, Lee, Honglak, Kang, Dongyeop, Choi, Jonghyun
Story visualization (SV) is a challenging text-to-image generation task for the difficulty of not only rendering visual details from the text descriptions but also encoding a long-term context across multiple sentences. While prior efforts mostly focus on generating a semantically relevant image for each sentence, encoding a context spread across the given paragraph to generate contextually convincing images (e.g., with a correct character or with a proper background of the scene) remains a challenge. To this end, we propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation that generates multiple pseudo-descriptions as supplementary supervision during training for better generalization to the language variation at inference. In extensive experiments on the two popular SV benchmarks, i.e., the Pororo-SV and Flintstones-SV, the proposed method significantly outperforms the state of the arts in various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision with similar or less computational complexity.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Minnesota (0.04)
- North America > United States > Michigan (0.04)
- Instructional Material > Online (0.61)
- Instructional Material > Course Syllabus & Notes (0.61)
Improved Visual Story Generation with Adaptive Context Modeling
Feng, Zhangyin, Ren, Yuchen, Yu, Xinmiao, Feng, Xiaocheng, Tang, Duyu, Shi, Shuming, Qin, Bing
Diffusion models developed on top of powerful text-to-image generation models like Stable Diffusion achieve remarkable success in visual story generation. However, the best-performing approach considers historically generated results as flattened memory cells, ignoring the fact that not all preceding images contribute equally to the generation of the characters and scenes at the current stage. To address this, we present a simple method that improves the leading system with adaptive context modeling, which is not only incorporated in the encoder but also adopted as additional guidance in the sampling stage to boost the global consistency of the generated story. We evaluate our model on PororoSV and FlintstonesSV datasets and show that our approach achieves state-of-the-art FID scores on both story visualization and continuation scenarios. We conduct detailed model analysis and show that our model excels at generating semantically consistent images for stories.
- North America > Dominican Republic (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > China > Heilongjiang Province > Harbin (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)