StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

2405.01434

Published 5/3/2024 by Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, Qibin Hou

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

Abstract

For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of self-attention calculation, termed Consistent Self-Attention, that significantly boosts the consistency between the generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic spaces. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are significantly more stable than the modules based on latent spaces only, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of contents. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with the presentation of images and videos, which we hope could inspire more research from the aspect of architectural modifications. Our code is made publicly available at https://github.com/HVision-NKU/StoryDiffusion.

Get summaries of the top AI research delivered straight to your inbox:

Overview

This paper presents "StoryDiffusion," a novel approach for consistent long-range image and video generation using self-attention.
The key idea is to introduce a consistency mechanism that enforces coherence across the generated frames or images, addressing a common issue in diffusion models.
The proposed method outperforms state-of-the-art techniques in various benchmarks for image and video generation tasks.

Plain English Explanation

The paper introduces a new way to create realistic-looking images and videos using a technique called "diffusion models." Diffusion models work by gradually adding random noise to an image or video, then gradually removing that noise to generate new content. However, one problem with these models is that the generated images or frames can sometimes be inconsistent or incoherent, especially when generating long sequences.

To address this, the researchers developed "StoryDiffusion," which includes a special mechanism to help maintain consistency across the generated images or frames. This ensures that the overall story or narrative remains cohesive, even as new content is being generated. The researchers show that StoryDiffusion outperforms other state-of-the-art techniques on a variety of benchmarks for image and video generation.

In other words, StoryDiffusion makes it easier to generate high-quality, consistent images and videos using diffusion models, which could be useful for applications like video-editing, 3D content generation, and video inpainting.

Technical Explanation

The key innovation in StoryDiffusion is the introduction of a "consistency mechanism" that helps maintain coherence across the generated frames or images. This is achieved by modifying the self-attention layers in the diffusion model to capture long-range dependencies between the generated content.

Specifically, the researchers propose a Swap Attention module that computes attention across both the spatial and temporal dimensions of the input. This allows the model to maintain a consistent narrative or story, even as new content is being generated. The Swap Attention module is integrated into the diffusion model's architecture, and the entire system is trained end-to-end.

The researchers evaluate StoryDiffusion on several image and video generation benchmarks, including datasets like COCO, Kinetics, and BAIR. They show that StoryDiffusion outperforms state-of-the-art techniques in terms of both visual quality and consistency, as measured by various quantitative metrics.

Critical Analysis

The researchers acknowledge several limitations of their approach. First, while StoryDiffusion improves consistency, there is still room for further improvements, especially when generating long video sequences. Additionally, the model's computational and memory requirements are higher than simpler diffusion models, which could limit its practical deployment.

Another potential issue is that the consistency mechanism might constrain the model's ability to generate highly diverse or creative content, as it is biased towards maintaining coherence. The researchers do not explore this tradeoff in depth, and it would be interesting to see how StoryDiffusion compares to less constrained models in terms of the variety of generated outputs.

Finally, the paper does not provide a detailed analysis of failure cases or potential biases in the generated content. As with any generative model, it is important to carefully evaluate the safety and fairness implications of such systems, especially when they are used for tasks like video-editing or video inpainting.

Conclusion

In summary, the StoryDiffusion paper presents a novel approach for improving the consistency of long-range image and video generation using diffusion models. By introducing a Swap Attention mechanism, the researchers are able to maintain coherence across the generated content, which is a significant advancement over prior techniques.

While the method has some limitations and areas for further research, the overall approach represents an important step towards more semantically-consistent and controllable generation of visual content. As diffusion models continue to mature, techniques like StoryDiffusion will likely play a crucial role in enabling more realistic and coherent image and video generation for a wide range of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🛸

Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation

Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, Jiaying Liu

With the explosive popularity of AI-generated content (AIGC), video generation has recently received a lot of attention. Generating videos guided by text instructions poses significant challenges, such as modeling the complex relationship between space and time, and the lack of large-scale text-video paired data. Existing text-video datasets suffer from limitations in both content quality and scale, or they are not open-source, rendering them inaccessible for study and use. For model design, previous approaches extend pretrained text-to-image generation models by adding temporal 1D convolution/attention modules for video generation. However, these approaches overlook the importance of jointly modeling space and time, inevitably leading to temporal distortions and misalignment between texts and videos. In this paper, we propose a novel approach that strengthens the interaction between spatial and temporal perceptions. In particular, we utilize a swapped cross-attention mechanism in 3D windows that alternates the ``query'' role between spatial and temporal blocks, enabling mutual reinforcement for each other. Moreover, to fully unlock model capabilities for high-quality video generation and promote the development of the field, we curate a large-scale and open-source video dataset called HD-VG-130M. This dataset comprises 130 million text-video pairs from the open-domain, ensuring high-definition, widescreen and watermark-free characters. A smaller-scale yet more meticulously cleaned subset further enhances the data quality, aiding models in achieving superior performance. Experimental quantitative and qualitative results demonstrate the superiority of our approach in terms of per-frame quality, temporal correlation, and text-video alignment, with clear margins.

4/9/2024

cs.CV

📶

Semantically Consistent Video Inpainting with Conditional Diffusion Models

Dylan Green, William Harvey, Saeid Naderiparizi, Matthew Niedoba, Yunpeng Liu, Xiaoxuan Liang, Jonathan Lavington, Ke Zhang, Vasileios Lioutas, Setareh Dabiri, Adam Scibior, Berend Zwartsenberg, Frank Wood

Current state-of-the-art methods for video inpainting typically rely on optical flow or attention-based approaches to inpaint masked regions by propagating visual information across frames. While such approaches have led to significant progress on standard benchmarks, they struggle with tasks that require the synthesis of novel content that is not present in other frames. In this paper we reframe video inpainting as a conditional generative modeling problem and present a framework for solving such problems with conditional video diffusion models. We highlight the advantages of using a generative approach for this task, showing that our method is capable of generating diverse, high-quality inpaintings and synthesizing new content that is spatially, temporally, and semantically consistent with the provided context.

5/2/2024

cs.CV cs.LG

🐍

High-fidelity Person-centric Subject-to-Image Synthesis

Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin

Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn the semantic scene and person generation by fine-tuning a common pre-trained diffusion, which involves an irreconcilable training imbalance. Precisely, to generate realistic persons, they need to sufficiently tune the pre-trained model, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation over-fit to the training data. Moreover, even with sufficient fine-tuning, these methods can still not generate high-fidelity persons since joint learning of the scene and person generation also lead to quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline to eliminate the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., Text-driven Diffusion Model (TDM) and Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM respectively. The subject-scene fusion stage, that is the collaboration achieved through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). Specifically, it is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. In each time step, SNF leverages the unique strengths of each model and allows for the spatial blending of predicted noises from both models automatically in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of the Face-diffuser.

5/6/2024

cs.CV cs.AI

Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Orthogonal Diffusion Models

Zeyu Yang, Zijie Pan, Chun Gu, Li Zhang

Recent advancements in 3D generation are predominantly propelled by improvements in 3D-aware image diffusion models which are pretrained on Internet-scale image data and fine-tuned on massive 3D data, offering the capability of producing highly consistent multi-view images. However, due to the scarcity of synchronized multi-view video data, it is impractical to adapt this paradigm to 4D generation directly. Despite that, the available video and 3D data are adequate for training video and multi-view diffusion models that can provide satisfactory dynamic and geometric priors respectively. In this paper, we present Diffusion$^2$, a novel framework for dynamic 3D content creation that leverages the knowledge about geometric consistency and temporal smoothness from these models to directly sample dense multi-view and multi-frame images which can be employed to optimize continuous 4D representation. Specifically, we design a simple yet effective denoising strategy via score composition of video and multi-view diffusion models based on the probability structure of the images to be generated. Owing to the high parallelism of the image generation and the efficiency of the modern 4D reconstruction pipeline, our framework can generate 4D content within few minutes. Furthermore, our method circumvents the reliance on 4D data, thereby having the potential to benefit from the scalability of the foundation video and multi-view diffusion models. Extensive experiments demonstrate the efficacy of our proposed framework and its capability to flexibly adapt to various types of prompts.

4/23/2024

cs.CV