Controllable Longer Image Animation with Diffusion Models






Published 5/29/2024 by Qiang Wang, Minghua Liu, Junjun Hu, Fan Jiang, Mu Xu


Generating realistic animated videos from static images is an important area of research in computer vision. Methods based on physical simulation and motion prediction have achieved notable advances, but they are often limited to specific object textures and motion trajectories and fail to handle highly complex environments and physical dynamics. In this paper, we introduce an open-domain controllable image animation method using motion priors with video diffusion models. Our method achieves precise control over the direction and speed of motion in the movable region by extracting motion field information from videos and learning motion trajectories and strengths. Current pretrained video generation models are typically limited to very short videos of fewer than 30 frames. In contrast, we propose an efficient long-duration video generation method based on noise rescheduling, specifically tailored for image animation tasks, which facilitates the creation of videos over 100 frames in length while maintaining consistency in scene content and motion coordination. Specifically, we decompose the denoising process into two distinct phases: the shaping of scene contours and the refining of motion details. We then reschedule the noise so that the generated frame sequences maintain long-distance noise correlation. We conducted extensive experiments with 10 baselines, encompassing both commercial tools and academic methodologies, which demonstrate the superiority of our method. Our project page:



Overview

  • This paper presents a method for generating long, controllable image animations using video diffusion models.
  • The approach extracts motion field information from videos and learns motion trajectories and strengths, giving control over the direction and speed of motion in the movable region of an image.
  • A noise rescheduling scheme tailored to image animation extends generation to more than 100 frames while keeping scene content and motion consistent.
  • The authors report experiments against 10 baselines, covering both commercial tools and academic methods.

Plain English Explanation

The paper describes a new way to create long, animated videos using a type of AI model called a "diffusion model." Diffusion models are good at generating images, but until now, it's been difficult to use them to create videos that are both long and controllable.

The key idea in this paper is to give the diffusion model motion instructions that guide the animation. For example, you could start from a still image, pick the region that should move, and say in which direction and how fast it should move. The model then generates the frames needed to turn that into a smooth, coherent animation.

This approach gives a lot of creative control over video generation: you describe how things should move, and the model fills in the details. The paper also targets length, generating videos of more than 100 frames where pretrained video models typically produce fewer than 30.

Overall, this research represents an important step forward in making AI-powered video generation more accessible and controllable for a wide range of applications, from visual effects to creative storytelling.

Technical Explanation

The paper introduces a new method for generating long, controllable image animations using diffusion models. Diffusion models are a type of generative AI model that has shown great success in generating high-quality static images. However, extending these models to video has proven challenging, because it requires capturing long-range temporal dependencies and maintaining visual coherence over extended sequences.
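The abstract attributes long-video consistency to rescheduling the noise so that frames keep long-distance noise correlation, but this summary does not give the algorithm. As a rough illustration only (not the paper's method), the sketch below builds per-frame Gaussian noise from a shared component plus an independent one, so that any two frames, however far apart, have the same correlation `rho`. The function names, `rho`, and the sizes are assumptions made for the example.

```python
import math
import random


def correlated_frame_noise(num_frames, dim, rho, seed=0):
    """Return per-frame noise vectors that share a common component.

    Each frame's noise is sqrt(rho) * shared + sqrt(1 - rho) * independent,
    so every entry keeps unit variance while any two frames have
    correlation rho, regardless of how far apart they are.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    rng = random.Random(seed)
    shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    frames = []
    for _ in range(num_frames):
        indep = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        frames.append([a * s + b * e for s, e in zip(shared, indep)])
    return frames


def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / math.sqrt(vx * vy)


# Hypothetical sizes: 120 frames, 4096 noise values per frame.
noise = correlated_frame_noise(num_frames=120, dim=4096, rho=0.6)
# The first and last frames are still correlated at roughly rho.
print(round(correlation(noise[0], noise[119]), 2))
```

With `rho = 0` the frames get fully independent noise; with `rho = 1` they all share one noise vector. Values in between trade per-frame variation against sequence-wide coherence.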

To address these challenges, the authors propose a framework in which motion priors, extracted from the motion fields of real videos, guide the animation. The model learns motion trajectories and strengths, which gives control over the direction and speed of motion in the movable region. For long videos, the denoising process is split into two phases, first shaping scene contours and then refining motion details, and the noise is rescheduled so that the generated frames keep long-distance noise correlation.
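The abstract describes control over the direction and speed of motion in the movable region, learned from motion fields extracted from videos. A minimal sketch of what such a control signal could look like, assuming a dense flow field stored as (dx, dy) pairs and a binary mask for the movable region. `make_control_flow` and `motion_strength` are hypothetical helpers for illustration, not functions from the paper:

```python
import math


def motion_strength(flow, mask):
    """Mean flow magnitude (pixels per frame) over masked, movable pixels."""
    total, count = 0.0, 0
    for flow_row, mask_row in zip(flow, mask):
        for (dx, dy), m in zip(flow_row, mask_row):
            if m:
                total += math.hypot(dx, dy)
                count += 1
    return total / count if count else 0.0


def make_control_flow(mask, angle_deg, speed):
    """Build a flow field that moves masked pixels at `speed` pixels per
    frame in direction `angle_deg`; unmasked pixels stay still."""
    dx = speed * math.cos(math.radians(angle_deg))
    dy = speed * math.sin(math.radians(angle_deg))
    return [[(dx, dy) if m else (0.0, 0.0) for m in row] for row in mask]


# A 3x4 image whose central 2x2 block is the movable region.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
# Ask for motion along the x-axis at 2 pixels per frame.
flow = make_control_flow(mask, angle_deg=0.0, speed=2.0)
print(motion_strength(flow, mask))
```

Here the direction and speed are set by hand. In the paper these quantities are learned from video data, so this only shows the shape of the signal, not how it is obtained.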

The authors evaluate their method against 10 baselines, including both commercial tools and academic methods, and report that their approach outperforms them.

Critical Analysis

The paper presents a compelling approach to long-form, controllable image animation using diffusion models. The authors have clearly put a lot of thought into the technical challenges involved and have developed novel solutions to address them.

That said, the paper does not delve deeply into the limitations of the proposed method. For example, it's unclear how the approach scales to very long videos or highly complex scene dynamics. Additionally, the paper does not provide a detailed analysis of the computational and memory requirements of the model, which could be an important consideration for real-world applications.
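To give a feel for why compute cost matters as videos get longer, here is a back-of-the-envelope calculation. It assumes full temporal self-attention, in which every spatial token attends across all frames, and a hypothetical 1024 tokens per frame; the paper's actual architecture and costs may differ. The frame counts 30 and 100 come from the abstract.

```python
def temporal_attention_pairs(num_frames, tokens_per_frame):
    """Number of frame-to-frame attention pairs if every spatial token
    attends across all frames (full temporal self-attention)."""
    return tokens_per_frame * num_frames * num_frames


# tokens_per_frame=1024 is an illustrative value, not taken from the paper.
short = temporal_attention_pairs(num_frames=30, tokens_per_frame=1024)
long_ = temporal_attention_pairs(num_frames=100, tokens_per_frame=1024)
print(short, long_, round(long_ / short, 1))  # prints: 921600 10240000 11.1
```

Under this assumption, going from 30 to 100 frames multiplies the temporal attention work by about 11, which is why a per-frame cost analysis would be useful.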

Furthermore, the authors do not discuss potential biases or artifacts that could arise in the generated videos, nor do they address potential ethical concerns around the use of such technology. As AI-powered video generation becomes more advanced, these are important issues that will need to be carefully considered.

Overall, the research presented in this paper is a significant contribution to the field of generative AI, and the authors have demonstrated a promising approach to long-form, controllable image animation. However, there are still important questions and challenges that will need to be addressed in future work.


Conclusion

This paper introduces a method for generating long, controllable image animations using video diffusion models. By combining motion priors learned from video motion fields with a two-phase denoising process and noise rescheduling, the authors give users control over the direction and speed of motion while extending generation beyond 100 frames.

The authors report experiments against 10 baselines, spanning commercial tools and academic methods, in which their method comes out ahead.

While the paper represents an important step forward in making AI-powered video generation more accessible and controllable, there are still important questions and challenges that will need to be addressed in future work. As this technology continues to evolve, it will be crucial to carefully consider the potential implications and ensure that it is developed and deployed in a responsible and ethical manner.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, Liefeng Bo





Character Animation aims to generate character videos from still images through driving signals. Currently, diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities. However, challenges persist in the realm of image-to-video, especially in character animation, where maintaining temporal consistency with the character's detailed appearance remains a formidable problem. In this paper, we leverage the power of diffusion models and propose a novel framework tailored for character animation. To preserve consistency of intricate appearance features from the reference image, we design ReferenceNet to merge detail features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct the character's movements and employ an effective temporal modeling approach to ensure smooth transitions between video frames. By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on benchmarks for fashion video and human dance synthesis, achieving state-of-the-art results.



StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation


Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, Qibin Hou





For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of self-attention calculation, termed Consistent Self-Attention, that significantly boosts the consistency between the generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic spaces. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are significantly more stable than the modules based on latent spaces only, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of contents. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with the presentation of images and videos, which we hope could inspire more research from the aspect of architectural modifications. Our code is made publicly available at




Generative Image Dynamics

Zhengqi Li, Richard Tucker, Noah Snavely, Aleksander Holynski





We present an approach to modeling an image-space prior on scene motion. Our prior is learned from a collection of motion trajectories extracted from real video sequences depicting natural, oscillatory dynamics such as trees, flowers, candles, and clothes swaying in the wind. We model this dense, long-term motion prior in the Fourier domain: given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a spectral volume, which can be converted into a motion texture that spans an entire video. Along with an image-based rendering module, these trajectories can be used for a number of downstream applications, such as turning still images into seamlessly looping videos, or allowing users to realistically interact with objects in real pictures by interpreting the spectral volumes as image-space modal bases, which approximate object dynamics.



Training-free Camera Control for Video Generation


Chen Hou, Guoqiang Wei, Yan Zeng, Zhibo Chen





We propose a training-free and robust solution that offers camera movement control for off-the-shelf video diffusion models. Unlike previous work, our method does not require any supervised finetuning on camera-annotated datasets or self-supervised training via data augmentation. Instead, it works plug-and-play with most pretrained video diffusion models and generates camera-controllable videos from a single image or text prompt. Our work is inspired by the layout prior that intermediate latents hold over the generated results: rearranging the noisy pixels in the latents rearranges the output content as well. Since camera movement can also be seen as a kind of pixel rearrangement caused by perspective change, videos can be reorganized to follow a specific camera motion if their noisy latents change accordingly. Building on this, we propose our method CamTrol, which enables robust camera control for video diffusion models. It is achieved by a two-stage process. First, we model image layout rearrangement through explicit camera movement in 3D point cloud space. Second, we generate videos with camera motion using the layout prior of noisy latents formed by a series of rearranged images. Extensive experiments demonstrate the robustness of our method in controlling the camera motion of generated videos. Furthermore, we show that our method can produce impressive results in generating 3D rotation videos with dynamic content. Project page at

