Bidirectional Autoregressive Diffusion Model for Dance Generation

arXiv: 2402.04356

Published 6/26/2024 by Canyu Zhang, Youbao Tang, Ning Zhang, Ruei-Sung Lin, Mei Han, Jing Xiao, Song Wang

Abstract

Dance serves as a powerful medium for expressing human emotions, but the lifelike generation of dance is still a considerable challenge. Recently, diffusion models have showcased remarkable generative abilities across various domains. They hold promise for human motion generation due to their adaptable many-to-many nature. Nonetheless, current diffusion-based motion generation models often create entire motion sequences directly and unidirectionally, lacking focus on the motion with local and bidirectional enhancement. When choreographing high-quality dance movements, people need to take into account not only the musical context but also the nearby music-aligned dance motions. To authentically capture human behavior, we propose a Bidirectional Autoregressive Diffusion Model (BADM) for music-to-dance generation, where a bidirectional encoder is built to enforce that the generated dance is harmonious in both the forward and backward directions. To make the generated dance motion smoother, a local information decoder is built for local motion enhancement. The proposed framework is able to generate new motions based on the input conditions and nearby motions, which foresees individual motion slices iteratively and consolidates all predictions. To further refine the synchronicity between the generated dance and the beat, the beat information is incorporated as an input to generate better music-aligned dance movements. Experimental results demonstrate that the proposed model achieves state-of-the-art performance compared to existing unidirectional approaches on the prominent benchmark for music-to-dance generation.

Overview

This paper presents a diffusion-based approach to music-to-dance generation that aims to keep generated motion aligned with the musical beat.
The proposed model, the Bidirectional Autoregressive Diffusion Model (BADM), combines a bidirectional encoder, which keeps the dance coherent in both the forward and backward directions, with a local information decoder that smooths nearby motion.
The model predicts individual motion slices iteratively, conditioned on the music and on neighboring motion, and takes beat information as an additional input to improve synchronization.

Plain English Explanation

The paper describes a new way to generate realistic dance moves that follow a piece of music. The key idea is to use a machine learning model that learns the patterns and structure of dance motions from a large dataset of motion capture data. This model, called the Bidirectional Autoregressive Diffusion Model (BADM), generates new dance moves that follow the rhythm and beat of the input music.

According to the abstract, the model does not generate the whole sequence in a single one-directional pass. It predicts individual slices of motion one at a time, conditioning each on the music and on nearby motion, and then combines the predictions. A bidirectional encoder keeps the dance consistent in both the forward and backward directions, a local information decoder smooths the motion, and beat information is fed in as an additional input to improve alignment with the music.

This approach lets the model generate a variety of dance moves for a given music track without manual choreography or animation. This could have applications in the entertainment industry, such as interactive dance-based video games or virtual performances.

Technical Explanation

The core of the proposed model is a bidirectional autoregressive diffusion model for dance motion. The model takes a sequence of music features, including beat information, and generates a corresponding sequence of dance poses. A bidirectional encoder enforces consistency with both preceding and following motion, and a local information decoder enhances local motion so that the result is smoother.
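One way to picture beat information as an extra input is a per-frame 0/1 flag appended to the music features. The sketch below is only an illustration under stated assumptions: the function names, the frame rate, and the one-flag-per-frame encoding are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def beat_indicator(beat_times_s: Sequence[float], num_frames: int, fps: float) -> List[int]:
    """Return a 0/1 flag per motion frame, 1 where a beat falls on that frame.

    Assumption: beats are given as non-negative timestamps in seconds.
    """
    flags = [0] * num_frames
    for t in beat_times_s:
        frame = int(t * fps + 0.5)  # nearest frame index for non-negative t
        if 0 <= frame < num_frames:
            flags[frame] = 1
    return flags


def add_beat_channel(music_features: Sequence[Sequence[float]],
                     flags: Sequence[int]) -> List[List[float]]:
    """Append the beat flag as one extra feature dimension per frame."""
    if len(music_features) != len(flags):
        raise ValueError("music_features and flags must have the same length")
    return [list(feat) + [float(flag)] for feat, flag in zip(music_features, flags)]


# Hypothetical usage: 4 frames at 2 frames per second, beats at 0.0 s and 1.0 s.
flags = beat_indicator([0.0, 1.0], num_frames=4, fps=2.0)
conditioned = add_beat_channel([[0.1, 0.2]] * 4, flags)
```

Each frame's conditioning vector then carries both the music features and an explicit marker of where the beats fall.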

Generation itself follows the diffusion paradigm. During training, noise is gradually added to dance poses, and the model learns to reverse this process. At inference time, the model starts from noise and denoises step by step, conditioned on the music, to produce natural-looking dance moves that follow the input music.
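The noising step can be sketched with the standard DDPM forward process, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where abar_t is the cumulative product of (1 - beta). This is a generic illustration rather than the paper's implementation: the linear beta schedule, the number of steps, and the flat list of pose values are all assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def linear_betas(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Linear noise schedule (an assumed default, not confirmed by the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas: Sequence[float]) -> List[float]:
    """Cumulative product of (1 - beta_t), i.e. abar_t for every step t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0: Sequence[float], t: int, abar: Sequence[float],
             rng: random.Random) -> Tuple[List[float], List[float]]:
    """Noise a clean pose vector x0 to step t; return (x_t, noise used)."""
    a = abar[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


# Hypothetical usage: noise a toy 3-value "pose" to the middle of the schedule.
abar = alpha_bars(linear_betas(1000))
x_t, eps = q_sample([0.5, -0.2, 1.0], t=500, abar=abar, rng=random.Random(0))
```

A denoising network is typically trained to recover the noise (or the clean pose) from x_t, the step index, and the conditioning signal. That network is omitted here.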

The model is trained on a large dataset of motion capture data, which allows it to learn the underlying patterns and structures of dance motions. During inference, the model takes a new music track and generates a corresponding sequence of dance poses, predicting individual motion slices iteratively from the music and nearby motion and then consolidating the predictions.
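The slice-by-slice idea can be illustrated with a toy loop in which each slice is predicted from its music and from whatever neighboring motion already exists on either side. This is a loose sketch, not the authors' algorithm: `predict_slice` is a hypothetical stand-in for the trained model, and the two-pass refinement is an assumption made so that later slices can inform earlier ones.

```python
from typing import Callable, List, Optional, Sequence

Slice = List[float]
Predictor = Callable[[Sequence[float], Optional[Slice], Optional[Slice]], Slice]


def generate_by_slices(music_slices: Sequence[Sequence[float]],
                       predict_slice: Predictor,
                       num_passes: int = 2) -> List[float]:
    """Predict each motion slice from its music plus previous and next motion."""
    motion: List[Optional[Slice]] = [None] * len(music_slices)
    for _ in range(num_passes):
        for i, music in enumerate(music_slices):
            prev_motion = motion[i - 1] if i > 0 else None
            next_motion = motion[i + 1] if i + 1 < len(motion) else None
            motion[i] = predict_slice(music, prev_motion, next_motion)
    # Consolidate: concatenate the per-slice predictions into one sequence.
    return [value for sl in motion if sl is not None for value in sl]


def toy_predictor(music: Sequence[float],
                  prev_motion: Optional[Slice],
                  next_motion: Optional[Slice]) -> Slice:
    """Hypothetical model stand-in: adds the number of neighbors it could see."""
    seen = int(prev_motion is not None) + int(next_motion is not None)
    return [m + seen for m in music]


# Three slices of two frames each; after two passes the middle slice has
# seen both of its neighbors, the edge slices only one.
result = generate_by_slices([[0.0, 0.0]] * 3, toy_predictor)
```

In the first pass only the preceding context is available. The second pass lets each slice also use the motion that follows it, which is the sense of "bidirectional" in this toy example.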

Critical Analysis

The paper presents a promising approach for generating dance motions synchronized with input music. Combining bidirectional autoregressive modeling with a diffusion-based generation process is intended to yield realistic, coherent dance moves that are well aligned with the music.

However, the paper does not address potential limitations of the approach, such as how well it handles a diverse range of dance styles or whether it can produce unnatural or repetitive motion. The abstract reports state-of-the-art results against existing unidirectional approaches on a prominent music-to-dance benchmark, but a more detailed comparison with other recent dance generation models, such as Dance Any Beat or TAMING, would help contextualize the contributions of this work.

Further research could explore the model's ability to generalize to new dance styles, its scalability to longer dance sequences, and its potential integration with other technologies, such as virtual reality or interactive game environments.

Conclusion

The Bidirectional Autoregressive Diffusion Model (BADM) presented in this paper offers a novel approach for generating dance motions synchronized with input music. By combining bidirectional autoregressive modeling with a diffusion-based generation process, the model aims to produce realistic and coherent dance moves that are well aligned with the rhythm and tempo of the music.

This research could benefit the entertainment industry by enabling interactive dance-based experiences and virtual performances without manual choreography or animation. While the paper highlights the strengths of the proposed approach, further research is needed to address its potential limitations and to realize the full capabilities of this technology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Dance Any Beat: Blending Beats with Visuals in Dance Video Generation

Xuanchen Wang, Heng Wang, Dongnan Liu, Weidong Cai

The task of generating dance from music is crucial, yet current methods, which mainly produce joint sequences, lead to outputs that lack intuitiveness and complicate data collection due to the necessity for precise joint annotations. We introduce a Dance Any Beat Diffusion model, namely DabFusion, that employs music as a conditional input to directly create dance videos from still images, utilizing conditional image-to-video generation principles. This approach pioneers the use of music as a conditioning factor in image-to-video synthesis. Our method unfolds in two stages: training an auto-encoder to predict latent optical flow between reference and driving frames, eliminating the need for joint annotation, and training a U-Net-based diffusion model to produce these latent optical flows guided by music rhythm encoded by CLAP. Although capable of producing high-quality dance videos, the baseline model struggles with rhythm alignment. We enhance the model by adding beat information, improving synchronization. We introduce a 2D motion-music alignment score (2D-MM Align) for quantitative assessment. Evaluated on the AIST++ dataset, our enhanced model shows marked improvements in 2D-MM Align score and established metrics. Video results can be found on our project page: https://DabFusion.github.io.

5/16/2024

cs.CV cs.AI cs.MM cs.SD eess.AS

BAMM: Bidirectional Autoregressive Motion Model

Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Pu Wang, Minwoo Lee, Srijan Das, Chen Chen

Generating human motion from text has been dominated by denoising motion models either through diffusion or generative masking process. However, these models face great limitations in usability by requiring prior knowledge of the motion length. Conversely, autoregressive motion models address this limitation by adaptively predicting motion endpoints, at the cost of degraded generation quality and editing capabilities. To address these challenges, we propose Bidirectional Autoregressive Motion Model (BAMM), a novel text-to-motion generation framework. BAMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and (2) a masked self-attention transformer that autoregressively predicts randomly masked tokens via a hybrid attention masking strategy. By unifying generative masked modeling and autoregressive modeling, BAMM captures rich and bidirectional dependencies among motion tokens, while learning the probabilistic mapping from textual inputs to motion outputs with dynamically-adjusted motion sequence length. This feature enables BAMM to simultaneously achieve high-quality motion generation with enhanced usability and built-in motion editability. Extensive experiments on HumanML3D and KIT-ML datasets demonstrate that BAMM surpasses current state-of-the-art methods in both qualitative and quantitative measures. Our project page is available at https://github.com/exitudio/BAMM-page.

4/1/2024

cs.CV

Flexible Music-Conditioned Dance Generation with Style Description Prompts

Hongsong Wang, Yin Zhu, Xin Geng

Dance plays an important role as an artistic form and expression in human culture, yet the creation of dance remains a challenging task. Most dance generation methods rely solely on music, seldom taking into consideration intrinsic attributes such as music style or genre. In this work, we introduce Flexible Dance Generation with Style Description Prompts (DGSDP), a diffusion-based framework suitable for diversified tasks of dance generation by fully leveraging the semantics of music style. The core component of this framework is Music-Conditioned Style-Aware Diffusion (MCSAD), which comprises a Transformer-based network and a music Style Modulation module. The MCSAD seamlessly integrates music conditions and style description prompts into the dance generation framework, ensuring that generated dances are consistent with the music content and style. To facilitate flexible dance generation and accommodate different tasks, a spatial-temporal masking strategy is effectively applied in the backward diffusion process. The proposed framework successfully generates realistic dance sequences that are accurately aligned with music for a variety of tasks such as long-term generation, dance in-betweening, and dance inpainting. We hope that this work has the potential to inspire dance generation and creation, with promising applications in entertainment, art, and education.

6/13/2024

cs.CV cs.MM cs.SD eess.AS

Shape Conditioned Human Motion Generation with Diffusion Model

Kebing Xue, Hyewon Seo

Human motion synthesis is an important task in computer graphics and computer vision. While focusing on various conditioning signals such as text, action class, or audio to guide the generation process, most existing methods utilize skeleton-based pose representation, requiring additional skinning to produce renderable meshes. Given that human motion is a complex interplay of bones, joints, and muscles, considering solely the skeleton for generation may neglect their inherent interdependency, which can limit the variability and precision of the generated results. To address this issue, we propose a Shape-conditioned Motion Diffusion model (SMD), which enables the generation of motion sequences directly in mesh format, conditioned on a specified target mesh. In SMD, the input meshes are transformed into spectral coefficients using graph Laplacian, to efficiently represent meshes. Subsequently, we propose a Spectral-Temporal Autoencoder (STAE) to leverage cross-temporal dependencies within the spectral domain. Extensive experimental evaluations show that SMD not only produces vivid and realistic motions but also achieves competitive performance in text-to-motion and action-to-motion tasks when compared to state-of-the-art methods.

5/14/2024

cs.CV cs.GR