UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

2406.01188

Published 6/4/2024 by Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, Nong Sang

cs.CV

Abstract

Recent diffusion-based human image animation techniques have demonstrated impressive success in synthesizing videos that faithfully follow a given reference identity and a sequence of desired movement poses. Despite this, there are still two limitations: i) an extra reference model is required to align the identity image with the main video branch, which significantly increases the optimization burden and model parameters; ii) the generated video is usually short in time (e.g., 24 frames), hampering practical applications. To address these shortcomings, we present a UniAnimate framework to enable efficient and long-term human video generation. First, to reduce the optimization difficulty and ensure temporal coherence, we map the reference image along with the posture guidance and noise video into a common feature space by incorporating a unified video diffusion model. Second, we propose a unified noise input that supports random noised input as well as first frame conditioned input, which enhances the ability to generate long-term video. Finally, to further efficiently handle long sequences, we explore an alternative temporal modeling architecture based on state space model to replace the original computation-consuming temporal Transformer. Extensive experimental results indicate that UniAnimate achieves superior synthesis results over existing state-of-the-art counterparts in both quantitative and qualitative evaluations. Notably, UniAnimate can even generate highly consistent one-minute videos by iteratively employing the first frame conditioning strategy. Code and models will be publicly available. Project page: https://unianimate.github.io/.
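The iterative first-frame conditioning mentioned at the end of the abstract can be pictured with a short sketch. This is an illustration under stated assumptions, not the authors' code: `generate_segment` is a hypothetical stand-in for one sampler call (it returns random frames), and the segment length of 32 is an arbitrary choice. Only the chaining logic is the point: each new segment is conditioned on the last frame of the previous one, and the duplicated frame is dropped.

```python
import numpy as np

def generate_segment(pose_clip, first_frame=None, rng=None):
    """Hypothetical stand-in for one diffusion sampling call.

    pose_clip: (T, H, W) pose maps. Returns frames of shape (T, H, W, 3).
    When first_frame is given, frame 0 is pinned to it, which mimics
    first-frame conditioning; otherwise the segment starts from noise.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    t, h, w = pose_clip.shape[:3]
    frames = rng.random((t, h, w, 3))
    if first_frame is not None:
        frames[0] = first_frame
    return frames

def generate_long_video(poses, segment_len=32, rng=None):
    """Chain segments so that each one starts from the previous last frame."""
    assert segment_len >= 2
    if rng is None:
        rng = np.random.default_rng(0)
    # The first segment has no previous frame to condition on.
    video = [generate_segment(poses[:segment_len], None, rng)]
    done = len(video[0])
    while done < len(poses):
        # Overlap by one pose: the conditioned frame is already in the video.
        clip = poses[done - 1 : done - 1 + segment_len]
        seg = generate_segment(clip, first_frame=video[-1][-1], rng=rng)
        video.append(seg[1:])  # drop the duplicated conditioning frame
        done += len(seg) - 1
    return np.concatenate(video, axis=0)
```

With 100 pose maps and a segment length of 32, the loop produces segments contributing 32, 31, 31, and 6 new frames, so the output has exactly one frame per input pose.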

Overview

This paper presents UniAnimate, a framework for efficient, long-term human image animation built on a unified video diffusion model.
The key innovations are: mapping the reference image, the pose guidance, and the noised video into a common feature space, which removes the need for a separate reference model; a unified noise input that supports both random noise and first-frame conditioning, which enables long videos; and a temporal module based on a state space model that replaces the more expensive temporal Transformer.
The authors report that UniAnimate outperforms existing state-of-the-art methods in both quantitative and qualitative evaluations.

Plain English Explanation

The paper introduces a new technique called "UniAnimate" that can animate human images in a consistent and realistic way. Traditional video animation methods often struggle to maintain coherence and realism, especially for human subjects. UniAnimate addresses this by "taming" the diversity of unified video diffusion models - a powerful class of machine learning models used for generating video.

Specifically, UniAnimate leverages information about the human keypoints (e.g. joints, facial features) and their motion to improve the quality and consistency of the generated animations. This allows it to produce more natural-looking and coherent animations of human subjects compared to prior approaches.

The paper demonstrates the benefits of UniAnimate through experiments on several video animation benchmarks, showing that it outperforms existing state-of-the-art methods. This is an important advance, as consistent and realistic animation of humans has many potential applications in areas like virtual reality, digital entertainment, and human-computer interaction.

Technical Explanation

The key innovation in UniAnimate is its approach to "taming" the diversity of unified video diffusion models. Diffusion models are a powerful class of generative AI models that can be used to synthesize high-quality video. However, their unconstrained nature can lead to inconsistencies and artifacts when applied to the task of human image animation.

UniAnimate addresses this by conditioning the diffusion process on human pose. Specifically, the desired motion is supplied as a sequence of target poses (keypoints such as joint locations, typically obtained with human pose estimation), and the reference image, the pose guidance, and the noised video are mapped into a common feature space inside a single video diffusion model. This helps the model generate animations that are more coherent and realistic, as the motion is grounded in the underlying human structure.
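The conditioning described above can be pictured as assembling one input tensor for a single network. The sketch below is an assumption-heavy illustration, not the paper's implementation: the function name, the tensor shapes, and the specific layout (pose features concatenated along the channel axis, the reference prepended along the time axis) are all hypothetical choices made here to show what "a common feature space" could look like in practice.

```python
import numpy as np

def build_unified_input(noise_latents, pose_feats, ref_latent, ref_pose_feat):
    """Assemble one tensor from noise, pose guidance, and the reference image.

    Assumed (hypothetical) shapes:
      noise_latents: (T, C, H, W)   noised video latents
      pose_feats:    (T, Cp, H, W)  per-frame target pose features
      ref_latent:    (C, H, W)      encoded reference image
      ref_pose_feat: (Cp, H, W)     pose features of the reference image
    Returns an array of shape (T + 1, C + Cp, H, W).
    """
    # Attach each frame's pose guidance to its noised latent (channel axis).
    video_part = np.concatenate([noise_latents, pose_feats], axis=1)
    # Give the reference image the same channel layout as the video frames.
    ref_part = np.concatenate([ref_latent, ref_pose_feat], axis=0)[None]
    # Prepend the reference as an extra "frame" so one model sees everything.
    return np.concatenate([ref_part, video_part], axis=0)
```

Under this layout the reference identity and the video frames pass through the same layers, which is the property the abstract credits with removing the extra reference model.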

The authors extensively evaluate UniAnimate on several video animation benchmarks, including datasets of human subjects. They demonstrate that UniAnimate outperforms state-of-the-art methods in terms of both objective metrics and human evaluation of the generated animations.
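The abstract also proposes replacing the temporal Transformer with a state space model to handle long sequences more efficiently. The abstract does not say which state space model is used, so the sketch below uses a plain time-invariant diagonal recurrence as a stand-in; the function name, the parameter names, and the shapes are assumptions for illustration. It shows the property that matters for long videos: one pass over the frames, with cost linear in the number of frames, whereas self-attention over time scales quadratically.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Run a diagonal linear state space recurrence along the time axis.

    x:       (T, D) one feature vector per frame
    a, b, c: (D, N) per-channel state transition, input, and output weights
    Recurrence: h_t = a * h_{t-1} + b * x_t,  y_t = sum over N of c * h_t
    """
    num_frames, dim = x.shape
    h = np.zeros_like(a, dtype=float)  # hidden state, one row per channel
    y = np.empty((num_frames, dim))
    for t in range(num_frames):
        # Update the state from the previous state and the current frame.
        h = a * h + b * x[t][:, None]
        # Read out one value per channel from the state.
        y[t] = (c * h).sum(axis=1)
    return y
```

For example, feeding a single impulse `x = [[1.0], [0.0], [0.0]]` with `a = 0.5` and `b = c = 1.0` gives an output that decays geometrically, which shows how information from an early frame persists in the state of later frames.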

Critical Analysis

The UniAnimate approach represents an important step forward in consistently animating human images using diffusion models. By incorporating human keypoint and motion information, the method is able to generate more coherent and realistic animations compared to unconstrained diffusion.

However, the paper does acknowledge some limitations. For example, the method still struggles with handling occlusions and complex interactions, and the animation quality can degrade for very fast or erratic motions. Additionally, the system currently requires manual specification of the target motion, which may limit its real-world applicability.

Further research could explore ways to make UniAnimate more robust to these challenges, such as by incorporating more advanced motion modeling or self-supervised learning of motion patterns. There may also be opportunities to extend the approach to handle a wider range of subjects beyond just human figures.

Overall, UniAnimate represents a promising advance in the field of video animation, with the potential to enable more consistent and realistic depictions of human motion in a variety of applications.

Conclusion

The UniAnimate paper presents an effective method for animating human images using unified video diffusion models. By leveraging human keypoint and motion information, the approach is able to generate animations that are more coherent and realistic compared to prior state-of-the-art techniques.

This work represents an important step forward in enabling high-quality, consistent human animation, which has many potential applications in areas like virtual reality, digital entertainment, and human-computer interaction. While the method still has some limitations, the authors' innovative approach and strong experimental results suggest that UniAnimate is a valuable contribution to the field of video animation.

This summary was produced with help from an AI and may contain inaccuracies.

Related Papers

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, Liefeng Bo

Character Animation aims to generate character videos from still images through driving signals. Currently, diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities. However, challenges persist in the realm of image-to-video, especially in character animation, where temporally maintaining consistency with detailed information from character remains a formidable problem. In this paper, we leverage the power of diffusion models and propose a novel framework tailored for character animation. To preserve consistency of intricate appearance features from reference image, we design ReferenceNet to merge detail features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct character's movements and employ an effective temporal modeling approach to ensure smooth inter-frame transitions between video frames. By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on benchmarks for fashion video and human dance synthesis, achieving state-of-the-art results.

6/14/2024

cs.CV

LoopAnimate: Loopable Salient Object Animation

Fanyi Wang, Peng Liu, Haotian Hu, Dan Meng, Jingwen Su, Jinjin Xu, Yanhao Zhang, Xiaoming Ren, Zhiwang Zhang

Research on diffusion model-based video generation has advanced rapidly. However, limitations in object fidelity and generation length hinder its practical applications. Additionally, specific domains like animated wallpapers require seamless looping, where the first and last frames of the video match seamlessly. To address these challenges, this paper proposes LoopAnimate, a novel method for generating videos with consistent start and end frames. To enhance object fidelity, we introduce a framework that decouples multi-level image appearance and textual semantic information. Building upon an image-to-image diffusion model, our approach incorporates both pixel-level and feature-level information from the input image, injecting image appearance and textual semantic embeddings at different positions of the diffusion model. Existing UNet-based video generation models require the entire video as input during training to encode temporal and positional information at once. However, due to limitations in GPU memory, the number of frames is typically restricted to 16. To address this, this paper proposes a three-stage training strategy with progressively increasing frame numbers and reducing fine-tuning modules. Additionally, we introduce the Temporal Enhanced Motion Module (TEMM) to extend the capacity for encoding temporal and positional information up to 36 frames. The proposed LoopAnimate for the first time extends the single-pass generation length of UNet-based video generation models to 35 frames while maintaining high-quality video generation. Experiments demonstrate that LoopAnimate achieves state-of-the-art performance in both objective metrics, such as fidelity and temporal consistency, and subjective evaluation results.

4/17/2024

cs.CV cs.AI

Controllable Longer Image Animation with Diffusion Models

Qiang Wang, Minghua Liu, Junjun Hu, Fan Jiang, Mu Xu

Generating realistic animated videos from static images is an important area of research in computer vision. Methods based on physical simulation and motion prediction have achieved notable advances, but they are often limited to specific object textures and motion trajectories, failing to exhibit highly complex environments and physical dynamics. In this paper, we introduce an open-domain controllable image animation method using motion priors with video diffusion models. Our method achieves precise control over the direction and speed of motion in the movable region by extracting the motion field information from videos and learning moving trajectories and strengths. Current pretrained video generation models are typically limited to producing very short videos, typically less than 30 frames. In contrast, we propose an efficient long-duration video generation method based on noise reschedule specifically tailored for image animation tasks, facilitating the creation of videos over 100 frames in length while maintaining consistency in content scenery and motion coordination. Specifically, we decompose the denoise process into two distinct phases: the shaping of scene contours and the refining of motion details. Then we reschedule the noise to control the generated frame sequences maintaining long-distance noise correlation. We conducted extensive experiments with 10 baselines, encompassing both commercial tools and academic methodologies, which demonstrate the superiority of our method. Our project page: https://wangqiang9.github.io/Controllable.github.io/

5/29/2024

cs.CV

VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation

Qilin Wang, Zhengkai Jiang, Chengming Xu, Jiangning Zhang, Yabiao Wang, Xinyi Zhang, Yun Cao, Weijian Cao, Chengjie Wang, Yanwei Fu

Human image animation involves generating a video from a static image by following a specified pose sequence. Current approaches typically adopt a multi-stage pipeline that separately learns appearance and motion, which often leads to appearance degradation and temporal inconsistencies. To address these issues, we propose VividPose, an innovative end-to-end pipeline based on Stable Video Diffusion (SVD) that ensures superior temporal stability. To enhance the retention of human identity, we propose an identity-aware appearance controller that integrates additional facial information without compromising other appearance details such as clothing texture and background. This approach ensures that the generated videos maintain high fidelity to the identity of human subject, preserving key facial features across various poses. To accommodate diverse human body shapes and hand movements, we introduce a geometry-aware pose controller that utilizes both dense rendering maps from SMPL-X and sparse skeleton maps. This enables accurate alignment of pose and shape in the generated videos, providing a robust framework capable of handling a wide range of body shapes and dynamic hand movements. Extensive qualitative and quantitative experiments on the UBCFashion and TikTok benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset. Codes and models will be available.

5/29/2024

cs.CV