Human Video Translation via Query Warping

Read original: arXiv:2402.12099 - Published 5/22/2024 by Haiming Zhu, Yangyang Xu, Shengfeng He

Overview

This paper presents QueryWarp, a new framework for temporally coherent human motion video translation.
Existing diffusion-based video editing approaches rely solely on key and value tokens to enforce temporal consistency, which sacrifices the preservation of local and structural regions.
QueryWarp instead adds complementary query priors by constructing temporal correlations among query tokens from different frames.
The framework first extracts appearance flows from source poses to capture continuous human foreground motion.
During the denoising process, QueryWarp employs these appearance flows to warp the previous frame's query token, aligning it with the current frame's query.
This query warping imposes explicit constraints on the outputs of self-attention layers, ensuring temporally coherent translation.

Plain English Explanation

QueryWarp: A Novel Framework for Temporally Coherent Human Motion Video Translation

Imagine you have a video of a person moving around, and you want to edit how that video looks while keeping the person's original movements. Existing methods that use diffusion models (a type of machine learning algorithm) to do this can make the result look off, with details of the person flickering or shifting from one frame to the next.

The QueryWarp framework aims to fix this by taking a different approach. Instead of just relying on the key and value tokens that the diffusion model uses, QueryWarp also considers additional information about the person's movements in the form of "appearance flows." These appearance flows are used to align the previous frame's information with the current frame, ensuring that the person's movements look smooth and consistent throughout the video.

The key innovation in QueryWarp is this idea of "query warping," where the framework uses the appearance flows to adjust the model's internal representations (the "query tokens") to better match the current frame. This helps the model produce video edits that look much more natural and coherent over time.

The researchers tested QueryWarp on a variety of human motion video translation tasks and found that it outperformed other state-of-the-art methods, both in terms of the quality of the resulting videos and the consistency of the person's movements.

Technical Explanation

The key contribution of this paper is the QueryWarp framework, which aims to address the issue of temporal incoherence in existing diffusion-based video editing approaches. These existing methods rely solely on key and value tokens to ensure temporal consistency, which can lead to the loss of local and structural information.
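To make the role of the three token types concrete, the block below writes out standard scaled dot-product self-attention, followed by one common form of key/value sharing across frames. The notation ($X_i$ for the tokens of frame $i$, $W_Q, W_K, W_V$ for the projections, "ref" for a reference frame) is assumed here for illustration and is not taken from the paper.

```latex
% Standard self-attention for frame i, with token matrix X_i and
% learned projections W_Q, W_K, W_V (d is the token dimension):
\[
Q_i = X_i W_Q,\quad K_i = X_i W_K,\quad V_i = X_i W_V,\qquad
\mathrm{Attn}(Q_i, K_i, V_i)
  = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) V_i .
\]
% Key/value sharing (assumed form): every frame attends to the keys and
% values of a reference frame instead of its own.
\[
\mathrm{Attn}(Q_i, K_{\mathrm{ref}}, V_{\mathrm{ref}})
  = \mathrm{softmax}\!\left(\frac{Q_i K_{\mathrm{ref}}^{\top}}{\sqrt{d}}\right) V_{\mathrm{ref}} .
\]
```

In the second form the queries $Q_i$ are still computed independently for each frame, so nothing ties the query at one spatial location to the query at the corresponding location in the previous frame. That is the gap the query priors are meant to fill.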

To overcome this, QueryWarp constructs temporal correlations among query tokens from different frames, considering complementary query priors. The framework first extracts appearance flows from source poses to capture continuous human foreground motion. During the denoising process of the diffusion model, QueryWarp then employs these appearance flows to warp the previous frame's query token, aligning it with the current frame's query. This query warping imposes explicit constraints on the outputs of self-attention layers, effectively guaranteeing temporally coherent translation.
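The warping step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`warp_tokens`, `query_warp_attention`), the nearest-neighbor backward warp, the `(dy, dx)` flow layout, and the linear blend weight `alpha` are all hypothetical choices made for the example, and a real system would operate on GPU tensors inside the diffusion model's attention layers.

```python
import math

def warp_tokens(prev_q, flow):
    """Backward-warp a grid of query tokens from the previous frame.

    prev_q: H x W grid of token vectors (nested lists).
    flow:   H x W grid of (dy, dx) offsets; the token at (y, x) in the
            current frame is fetched from (y + dy, x + dx) in the previous
            frame, rounded to the nearest cell and clamped to the grid.
    """
    h, w = len(prev_q), len(prev_q[0])
    warped = []
    for y in range(h):
        row = []
        for x in range(w):
            dy, dx = flow[y][x]
            sy = min(max(round(y + dy), 0), h - 1)
            sx = min(max(round(x + dx), 0), w - 1)
            row.append(list(prev_q[sy][sx]))
        warped.append(row)
    return warped

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, k)) / scale for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(wt * v[j] for wt, v in zip(weights, values)) for j in range(dim)]

def query_warp_attention(cur_q, prev_q, flow, keys, values, alpha=0.5):
    """Blend the warped previous-frame queries into the current queries,
    then run attention. alpha is an assumed blend weight (0 = no warping)."""
    warped = warp_tokens(prev_q, flow)
    out = []
    for cur_row, warped_row in zip(cur_q, warped):
        out_row = []
        for cq, wq in zip(cur_row, warped_row):
            fused = [alpha * a + (1 - alpha) * b for a, b in zip(wq, cq)]
            out_row.append(attend(fused, keys, values))
        out.append(out_row)
    return out
```

Because the attention output depends on the query, pulling each current query toward the warped previous-frame query also pulls the two frames' attention outputs at corresponding locations toward each other. With `alpha=0.0` the sketch reduces to ordinary attention on the current frame's queries.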

The researchers evaluated QueryWarp on various human motion video translation tasks and found that it outperforms state-of-the-art methods both qualitatively and quantitatively. The results demonstrate the effectiveness of the framework in preserving local and structural regions while ensuring temporal coherence in the generated videos.

Critical Analysis

The QueryWarp paper presents a novel and promising approach to addressing the challenge of temporally coherent human motion video translation. By incorporating appearance flows to align query tokens across frames, the framework effectively imposes constraints on the self-attention mechanism, leading to more consistent and natural-looking video edits.

One potential limitation of the research is that it has only been evaluated on human motion video translation tasks. It would be interesting to see how the QueryWarp framework might perform on other types of video editing or translation problems, such as those involving non-human subjects or more complex scenes.

Additionally, the paper does not provide much insight into the computational and memory requirements of the QueryWarp framework compared to other diffusion-based approaches. This information would be valuable for understanding the practical implications and potential deployment scenarios for the technology.

Overall, the QueryWarp paper presents a compelling approach to improving the temporal coherence of diffusion-based video editing. The reported results are encouraging, and the query warping mechanism offers a clear direction for further research and development in this area.

Conclusion

The QueryWarp paper introduces a novel framework for temporally coherent human motion video translation. By leveraging appearance flows to align query tokens across frames, the QueryWarp approach effectively addresses the limitations of existing diffusion-based methods that rely solely on key and value tokens.

The results demonstrate that QueryWarp outperforms state-of-the-art techniques, both qualitatively and quantitatively, in preserving local and structural regions while ensuring smooth and consistent human movements throughout the edited videos. This research represents an important step forward in improving the quality and realism of video editing and translation applications, with potential implications for a wide range of multimedia and visual effects use cases.

Related Papers

Human Video Translation via Query Warping

Haiming Zhu, Yangyang Xu, Shengfeng He

In this paper, we present QueryWarp, a novel framework for temporally coherent human motion video translation. Existing diffusion-based video editing approaches rely solely on key and value tokens to ensure temporal consistency, which sacrifices the preservation of local and structural regions. In contrast, we aim to consider complementary query priors by constructing the temporal correlations among query tokens from different frames. Initially, we extract appearance flows from source poses to capture continuous human foreground motion. Subsequently, during the denoising process of the diffusion model, we employ appearance flows to warp the previous frame's query token, aligning it with the current frame's query. This query warping imposes explicit constraints on the outputs of self-attention layers, effectively guaranteeing temporally coherent translation. We perform experiments on various human motion video translation tasks, and the results demonstrate that our QueryWarp framework surpasses state-of-the-art methods both qualitatively and quantitatively.

5/22/2024

Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment

Shuaiying Hou, Hongyu Tao, Junheng Fang, Changqing Zou, Hujun Bao, Weiwei Xu

Learning 3D human motion from 2D inputs is a fundamental task in the realms of computer vision and computer graphics. Many previous methods grapple with this inherently ambiguous task by introducing motion priors into the learning process. However, these approaches face difficulties in defining the complete configurations of such priors or training a robust model. In this paper, we present the Video-to-Motion Generator (VTM), which leverages motion priors through cross-modal latent feature space alignment between 3D human motion and 2D inputs, namely videos and 2D keypoints. To reduce the complexity of modeling motion priors, we model the motion data separately for the upper and lower body parts. Additionally, we align the motion data with a scale-invariant virtual skeleton to mitigate the interference of human skeleton variations to the motion priors. Evaluated on AIST++, the VTM showcases state-of-the-art performance in reconstructing 3D human motion from monocular videos. Notably, our VTM exhibits the capabilities for generalization to unseen view angles and in-the-wild videos.

4/16/2024

Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

Zichen Geng, Caren Han, Zeeshan Hayder, Jian Liu, Mubarak Shah, Ajmal Mian

Text-driven human motion generation is an emerging task in animation and humanoid robot design. Existing algorithms directly generate the full sequence, which is computationally expensive and prone to errors as it does not pay special attention to key poses, a process that has been the cornerstone of animation for decades. We propose KeyMotion, which generates plausible human motion sequences corresponding to input text by first generating keyframes followed by in-filling. We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the keyframes into a latent space to reduce dimensionality and further accelerate the subsequent diffusion process. For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the keyframe latents and text condition. To complete the motion sequence, we propose a text-guided Transformer designed to perform motion-in-filling, ensuring the preservation of both fidelity and adherence to the physical constraints of human motion. Experiments show that our method achieves state-of-the-art results on the HumanML3D dataset, outperforming others on all R-precision metrics and MultiModal Distance. KeyMotion also achieves competitive performance on the KIT dataset, achieving the best results on Top3 R-precision, FID, and Diversity metrics.

5/27/2024

Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation

Jinlin Liu, Kai Yu, Mengyang Feng, Xiefan Guo, Miaomiao Cui

Recent advancements in human video synthesis have enabled the generation of high-quality videos through the application of stable diffusion models. However, existing methods predominantly concentrate on animating solely the human element (the foreground) guided by pose information, while leaving the background entirely static. Contrary to this, in authentic, high-quality videos, backgrounds often dynamically adjust in harmony with foreground movements, eschewing stagnancy. We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations. Human figures are animated leveraging pose-based motion, capturing intricate actions. Conversely, for backgrounds, we employ sparse tracking points to model motion, thereby reflecting the natural interaction between foreground activity and environmental changes. Training on real-world videos enhanced with this innovative motion depiction approach, our model generates videos exhibiting coherent movement in both foreground subjects and their surrounding contexts. To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy, introducing global features at each step. To ensure seamless continuity across these segments, we ingeniously link the final frame of a produced clip with input noise to spawn the succeeding one, maintaining narrative flow. Throughout the sequential generation process, we infuse the feature representation of the initial reference image into the network, effectively curtailing any cumulative color inconsistencies that may otherwise arise. Empirical evaluations attest to the superiority of our method in producing videos that exhibit harmonious interplay between foreground actions and responsive background dynamics, surpassing prior methodologies in this regard.

5/29/2024