Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers

Read original: arXiv:2409.01591 - Published 9/4/2024 by Sohan Anisetty, James Hays

Overview

The paper presents a method for generating dynamic human motion from a combination of audio and text input.
It uses a Spatio-Temporal Transformer model to learn the relationship between audio, text, and motion, and can generate realistic full-body animations.
The model is trained on a large dataset of motion capture data aligned with audio and text, enabling it to learn the complex connections between these modalities.

Plain English Explanation

The researchers have developed a system that can create animated human movements based on a combination of audio and text inputs. This is a challenging task, as human motion is complex and nuanced, involving the coordination of many different body parts.

The key innovation is the use of a Spatio-Temporal Transformer model, which is able to learn the intricate relationships between the audio, text, and corresponding motion data. By training this model on a large dataset of motion capture recordings aligned with audio and text, it can learn to generate realistic full-body animations that match the provided inputs.

For example, the model might learn that a certain combination of spoken words and sounds corresponds to a particular pattern of body movements. It can then use this knowledge to create new animations that fit the given audio and text.

This technology has a wide range of potential applications, from enhancing video games and films to enabling more natural interactions with virtual assistants. By bridging the gap between audio, language, and motion, the researchers have developed a powerful tool for creating dynamic, responsive animations.

Technical Explanation

The paper introduces a novel Spatio-Temporal Transformer model for generating dynamic human motion from a combination of audio and text inputs. The model uses a transformer-based architecture to learn the complex relationships between the audio, text, and corresponding motion data.
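To make "learning the relationships between modalities" more concrete, the following is a minimal sketch of scaled dot-product cross-attention, the standard transformer mechanism by which one sequence (here, motion positions) reads from another (audio or text features). It is an illustration under stated assumptions, not the paper's architecture: the function names, the toy feature values, and the omission of learned projection matrices are all simplifications introduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row mixes the value rows.

    queries: motion-side vectors, shape (n_motion, d)
    keys, values: condition-side vectors (audio or text), shape (n_cond, d)
    Returns (outputs, weights); weights[i] sums to 1 over the condition rows.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        mixed = [sum(w[j] * values[j][c] for j in range(len(values)))
                 for c in range(len(values[0]))]
        outputs.append(mixed)
        weights.append(w)
    return outputs, weights

# Hypothetical toy features: 2 motion positions, 3 condition frames, d = 2.
motion_queries = [[1.0, 0.0], [0.0, 1.0]]
condition_feats = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended, attn = cross_attention(motion_queries, condition_feats, condition_feats)
```

In this toy setup each motion position puts most of its attention weight on the condition frame it is most similar to. A trained model would additionally apply learned query, key, and value projections and use multiple attention heads.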

The key components of the model include:

Audio and Text Encoders: These sub-models encode the input audio and text into compact representations that can be efficiently processed by the transformer.
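Spatio-Temporal Transformer: This is the core of the system, which learns to map the encoded audio and text features to the desired human motion by attending to the relevant parts of the input. According to the paper's abstract, motion is first discretized into tokens with a VQ-VAE, and the tokens are predicted with a bidirectional masked language modeling strategy rather than strictly one frame at a time.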
Motion Decoder: This sub-model takes the transformer's output and generates the final full-body animation, ensuring that the motion is smooth and consistent.
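The paper's abstract states that motion is discretized with Vector Quantized Variational Autoencoders (VQ-VAEs). The core of that discretization is a nearest-neighbor lookup into a codebook, sketched below. The codebook values, feature vectors, and function names are hypothetical and chosen only for illustration; a real VQ-VAE wraps this lookup in learned encoder and decoder networks and learns the codebook during training.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
            for f in features]

def dequantize(tokens, codebook):
    """Look the token indices back up to recover approximate feature vectors."""
    return [list(codebook[t]) for t in tokens]

# Hypothetical 2-D codebook with three entries; a trained model would learn these.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
pose_features = [[0.1, -0.1], [0.9, 1.2], [-0.8, 1.7]]

tokens = quantize(pose_features, codebook)          # discrete motion tokens
reconstruction = dequantize(tokens, codebook)       # input to a motion decoder
```

Working with discrete tokens is what allows motion generation to be framed as a token prediction problem, in the same way language models predict word tokens.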

The model is trained end-to-end on a large dataset of motion capture data aligned with audio and text. This allows it to learn the complex connections between these different modalities, enabling it to generate realistic animations that match the provided inputs.
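The abstract describes training with a bidirectional Masked Language Modeling (MLM) strategy. A minimal sketch of such an objective follows: hide a random subset of motion tokens, ask the model to predict them, and score only the hidden positions. Everything here is an assumption made for illustration, including the mask id, the 50% mask ratio, the toy token sequence, and the `uniform_model` stand-in, which ignores the audio and text conditioning that the real model would use.

```python
import math
import random

MASK = -1  # hypothetical id for the special mask token

def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK; return (corrupted, masked positions)."""
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return corrupted, positions

def masked_cross_entropy(probs, targets, positions):
    """Average negative log-likelihood of the true token, over masked positions only."""
    losses = [-math.log(probs[i][targets[i]]) for i in positions]
    return sum(losses) / len(losses)

def uniform_model(corrupted, vocab_size):
    """Stand-in for the transformer: a uniform distribution at every position."""
    return [[1.0 / vocab_size] * vocab_size for _ in corrupted]

rng = random.Random(0)                      # fixed seed for reproducibility
motion_tokens = [3, 1, 4, 1, 5, 2, 6, 5]    # hypothetical VQ token ids
vocab_size = 8

corrupted, positions = mask_tokens(motion_tokens, mask_ratio=0.5, rng=rng)
probs = uniform_model(corrupted, vocab_size)
loss = masked_cross_entropy(probs, motion_tokens, positions)
```

With the uniform stand-in, the loss equals the natural logarithm of the vocabulary size, which is the baseline an untrained model starts from. Because every masked position is predicted from context on both sides, many tokens can be filled in per forward pass instead of one at a time.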

The researchers evaluate the model on several benchmarks, demonstrating its ability to outperform previous state-of-the-art methods in terms of motion quality and synchronization with the audio and text. They also show that the model can generalize to new scenarios and handle a wide range of motion types.

Critical Analysis

The paper presents a compelling approach to the challenging problem of dynamic human motion synthesis from multi-modal inputs. The use of a Spatio-Temporal Transformer model is a novel and effective way to model the complex relationships between audio, text, and motion.

One potential limitation of the approach is the reliance on a large, high-quality dataset of motion capture data aligned with audio and text. Acquiring and preprocessing such a dataset can be a significant challenge, and the model's performance may be heavily dependent on the quality and diversity of the training data.

Additionally, the paper does not delve deeply into the interpretability of the model's internal representations and decision-making processes. Understanding how the model learns to map audio and text to motion could lead to further insights and improvements.

Future research could explore ways to make the model more sample-efficient, allowing it to be trained on smaller datasets or to generalize more reliably to audio and text that differ substantially from the training data. Incorporating additional modalities, such as visual cues, could also enhance the model's capabilities and realism.

Conclusion

The paper presents a novel Spatio-Temporal Transformer model that can generate realistic human motion from a combination of audio and text inputs. By learning the complex relationships between these modalities, the model is able to create dynamic animations that are well-synchronized with the provided cues.

This technology has promising applications in areas such as video game development, film production, and virtual assistant interactions, where the ability to generate natural, responsive human motion can greatly enhance the user experience. As the field of multi-modal AI continues to advance, techniques like the one described in this paper will likely play an increasingly important role in bridging the gap between digital and physical worlds.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers

Sohan Anisetty, James Hays

Our research presents a novel motion generation framework designed to produce whole-body motion sequences conditioned on multiple modalities simultaneously, specifically text and audio inputs. Leveraging Vector Quantized Variational Autoencoders (VQVAEs) for motion discretization and a bidirectional Masked Language Modeling (MLM) strategy for efficient token prediction, our approach achieves improved processing efficiency and coherence in the generated motions. By integrating spatial attention mechanisms and a token critic we ensure consistency and naturalness in the generated motions. This framework expands the possibilities of motion generation, addressing the limitations of existing approaches and opening avenues for multimodal motion synthesis.

9/4/2024

Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà

Video-to-audio (V2A) generation leverages visual-only video features to render plausible sounds that match the scene. Importantly, the generated sound onsets should match the visual actions that are aligned with them, otherwise unnatural synchronization artifacts arise. Recent works have explored the progression of conditioning sound generators on still images and then video features, focusing on quality and semantic matching while ignoring synchronization, or by sacrificing some amount of quality to focus on improving synchronization only. In this work, we propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model. This combination allows modeling both high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results on one hand, whilst being competitive with the state of the art of non-codec generative audio models. Sample videos and generated audios are available at https://maskvat.github.io .

7/16/2024

Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

Zichen Geng, Caren Han, Zeeshan Hayder, Jian Liu, Mubarak Shah, Ajmal Mian

Text-driven human motion generation is an emerging task in animation and humanoid robot design. Existing algorithms directly generate the full sequence which is computationally expensive and prone to errors as it does not pay special attention to key poses, a process that has been the cornerstone of animation for decades. We propose KeyMotion, that generates plausible human motion sequences corresponding to input text by first generating keyframes followed by in-filling. We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the keyframes into a latent space to reduce dimensionality and further accelerate the subsequent diffusion process. For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the keyframe latents and text condition. To complete the motion sequence, we propose a text-guided Transformer designed to perform motion-in-filling, ensuring the preservation of both fidelity and adherence to the physical constraints of human motion. Experiments show that our method achieves state-of-the-art results on the HumanML3D dataset outperforming others on all R-precision metrics and MultiModal Distance. KeyMotion also achieves competitive performance on the KIT dataset, achieving the best results on Top3 R-precision, FID, and Diversity metrics.

5/27/2024

SpeechAct: Towards Generating Whole-body Motion from Speech

Jinsong Zhang, Minjie Zhu, Yuxiang Zhang, Yebin Liu, Kun Li

This paper addresses the problem of generating whole-body motion from speech. Despite great successes, prior methods still struggle to produce reasonable and diverse whole-body motions from speech. This is due to their reliance on suboptimal representations and a lack of strategies for generating diverse results. To address these challenges, we present a novel hybrid point representation to achieve accurate and continuous motion generation, e.g., avoiding foot skating, and this representation can be transformed into an easy-to-use representation, i.e., SMPL-X body mesh, for many applications. To generate whole-body motion from speech, for facial motion, closely tied to the audio signal, we introduce an encoder-decoder architecture to achieve deterministic outcomes. However, for the body and hands, which have weaker connections to the audio signal, we aim to generate diverse yet reasonable motions. To boost diversity in motion generation, we propose a contrastive motion learning method to encourage the model to produce more distinctive representations. Specifically, we design a robust VQ-VAE to learn a quantized motion codebook using our hybrid representation. Then, we regress the motion representation from the audio signal by a translation model employing our contrastive motion learning method. Experimental results validate the superior performance and the correctness of our model. The project page is available for research purposes at http://cic.tju.edu.cn/faculty/likun/projects/SpeechAct.

6/17/2024