StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

May-27-2025, 16:07:15 GMT–Neural Information Processing Systems

For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a simple but effective self-attention mechanism, termed Consistent Self-Attention, that boosts the consistency between the generated images. It can be used to augment pre-trained diffusion-based text-to-image models in a zero-shot manner. Building on the resulting images with consistent content, we further show that our method can be extended to long-range video generation by introducing a temporal motion prediction module that operates in semantic space, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in semantic space.

consistent self-attention, long-range image and video generation, storydiffusion

Technology:
- Information Technology > Artificial Intelligence > Natural Language (0.62)