SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

2405.18503

Published 6/12/2024 by Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji

cs.SD cs.LG eess.AS

Abstract

Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for the creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model enables flexible transitioning between high-quality 1-step sound generation and superior sound quality through multi-step generation. This allows creators to initially control sounds with 1-step samples before refining them through multi-step generation. While CTM fundamentally achieves flexible 1-step and multi-step generation, its impressive performance heavily depends on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. Thus, we reframe CTM's training framework and introduce a novel feature distance by utilizing the teacher's network for a distillation loss. Additionally, while distilling classifier-free guided trajectories, we train conditional and unconditional student models simultaneously and interpolate between these models during inference. We also propose training-free controllable frameworks for SoundCTM, leveraging its flexible sampling capability. SoundCTM achieves both promising 1-step and multi-step real-time sound generation without using any extra off-the-shelf networks. Furthermore, we demonstrate SoundCTM's capability of controllable sound generation in a training-free manner. Our codes, pretrained models, and audio samples are available at https://github.com/sony/soundctm.

Overview

This paper introduces SoundCTM (Sound Consistency Trajectory Models), a text-to-sound generation model that combines score-based and consistency models.
SoundCTM aims to remove the slow inference of diffusion-based sound generators while keeping their output quality.
A single model supports both fast 1-step generation and higher-quality multi-step generation, so creators can draft a sound quickly and then refine it.

Plain English Explanation

SoundCTM is a new AI system that can generate sound from text. It combines two different techniques - "score-based" and "consistency" models - to create realistic and coherent audio outputs.

The <a href="https://aimodels.fyi/papers/arxiv/music-consistency-models">consistency model</a> side lets the system produce a sound in very few steps, even a single one, instead of the many iterations a standard diffusion model needs. The <a href="https://aimodels.fyi/papers/arxiv/soundlocd-efficient-conditional-discrete-contrastive-latent-diffusion">score-based approach</a> side is the diffusion-style machinery behind high-quality audio. By bringing these two methods together, SoundCTM lets a creator get a quick 1-step draft and then spend more sampling steps when higher quality is needed.

This addresses a practical problem with earlier diffusion-based sound generators: they produce high-quality sound but are slow at inference, which burdens creators who refine their sounds through trial and error.

Technical Explanation

SoundCTM is a text-to-sound generation model that combines <a href="https://aimodels.fyi/papers/arxiv/phased-consistency-model">score-based</a> and <a href="https://aimodels.fyi/papers/arxiv/c3llm-conditional-multimodal-content-generation-using-large">consistency-based</a> approaches. It builds on Consistency Trajectory Models (CTM), which support both 1-step and multi-step generation from one network: a single step gives a fast sample, and additional steps trade compute for higher sound quality.
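To make the 1-step versus multi-step idea concrete, here is a minimal sketch of sampling with a "trajectory jump" function that maps a noisy sample at time t directly to time s. Everything in it is an illustrative assumption rather than SoundCTM's implementation: `toy_jump` is a hypothetical closed-form stand-in for a trained network (it assumes all data sits at a single point `MU`), `sample` is a purely deterministic chain of jumps, and the time grids are arbitrary.

```python
import random

MU = 1.5   # hypothetical "data point" of the toy problem
T = 80.0   # hypothetical largest noise level

def toy_jump(x, t, s, mu=MU):
    """Exact jump from time t to time s for the toy problem.

    Assumes noising x_t = x_0 + t * eps with all data at `mu`, so the
    probability-flow ODE dx/dt = (x - mu) / t has the closed-form
    solution x_s = mu + (s / t) * (x_t - mu).
    """
    return mu + (s / t) * (x - mu)

def sample(jump, x_start, times):
    """Chain jumps along a decreasing time grid such as [T, ..., 0.0]."""
    x = x_start
    for t, s in zip(times[:-1], times[1:]):
        x = jump(x, t, s)
    return x

random.seed(0)
x_T = random.gauss(MU, T)  # start from noise at the largest time

one_step = sample(toy_jump, x_T, [T, 0.0])                    # 1 network call
four_step = sample(toy_jump, x_T, [T, 20.0, 5.0, 1.0, 0.0])   # 4 network calls
print(one_step, four_step)
```

Because the toy jump is exact, both schedules land on `MU`. With a learned model the jumps are only approximate, which is why extra steps can improve quality; samplers used in practice may also differ from this sketch, for example by re-injecting noise between jumps.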

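CTM's strong results depend on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. SoundCTM therefore reframes the training framework and defines its distillation loss with a feature distance computed using the teacher's network. While distilling classifier-free guided trajectories, it trains conditional and unconditional student models simultaneously and interpolates between them during inference. The authors also propose training-free controllable generation frameworks that build on this flexible sampling.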
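The abstract states that the conditional and unconditional student models are interpolated during inference. The sketch below shows one plausible reading of that step. The linear form, the weight name `nu`, and the toy output vectors are assumptions made for illustration; the abstract does not give the exact formula.

```python
def interpolate_students(cond_out, uncond_out, nu):
    """Linearly blend text-conditional and unconditional student outputs.

    nu = 1.0 returns the conditional output, nu = 0.0 the unconditional one.
    The linear blend itself is an assumption, not a confirmed formula.
    """
    if len(cond_out) != len(uncond_out):
        raise ValueError("outputs must have the same length")
    return [nu * c + (1.0 - nu) * u for c, u in zip(cond_out, uncond_out)]

# Hypothetical outputs of the two students for the same noisy input.
cond = [0.8, -0.2, 0.5]
uncond = [0.2, 0.0, 0.1]

blended = interpolate_students(cond, uncond, nu=0.5)
print(blended)
```

In a setup like this, `nu` acts as an inference-time knob: it can be changed per sample without retraining either student.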

According to the abstract, SoundCTM achieves promising 1-step and multi-step real-time sound generation without using any extra off-the-shelf networks, and the authors demonstrate controllable sound generation in a training-free manner.
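The abstract describes a feature distance that uses the teacher's network for the distillation loss. As a rough sketch of what such a distance could look like, the code below compares two inputs through the intermediate activations of a frozen network. The tiny two-layer `TEACHER_WEIGHTS` network, the choice of layers, and the mean-squared form are all hypothetical; the abstract does not specify these details.

```python
import math

# Hypothetical frozen "teacher": two tiny dense layers with fixed weights.
TEACHER_WEIGHTS = [
    [[0.5, -0.3], [0.1, 0.8]],   # layer 1: 2 inputs -> 2 units
    [[1.0, 0.5], [-0.7, 0.2]],   # layer 2: 2 inputs -> 2 units
]

def teacher_features(x):
    """Return the activations after every teacher layer for input vector x."""
    feats = []
    h = x
    for weights in TEACHER_WEIGHTS:
        h = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in weights]
        feats.append(h)
    return feats

def feature_distance(a, b):
    """Sum over teacher layers of the mean squared activation difference."""
    total = 0.0
    for fa, fb in zip(teacher_features(a), teacher_features(b)):
        total += sum((p - q) ** 2 for p, q in zip(fa, fb)) / len(fa)
    return total

student_out = [0.9, -0.1]   # hypothetical student prediction
target = [1.0, 0.0]         # hypothetical distillation target
print(feature_distance(student_out, target))
```

The appeal of this design, as the abstract frames it, is that the teacher is already available for distillation, so no separate pretrained feature extractor is needed.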

Critical Analysis

The authors acknowledge that SoundCTM still has room for improvement, particularly in terms of handling more complex and diverse text prompts. The model may struggle with generating audio for prompts that involve abstract concepts or require deeper understanding of language and meaning.

Additionally, the training process for SoundCTM is computationally intensive, which could limit its practical deployment in real-world applications. Further research is needed to optimize the model's efficiency and scalability.

While the results demonstrate the benefits of uniting score-based and consistency-based approaches, the paper does not provide a deep analysis of the relative contributions of each component. Understanding the specific strengths and weaknesses of these two techniques could help guide future improvements to the model.

Conclusion

SoundCTM advances text-to-sound generation by combining the strengths of score-based and consistency-based techniques. This hybrid approach lets one model switch between fast 1-step generation and higher-quality multi-step generation, addressing the slow inference of earlier diffusion-based methods.

The successful implementation of SoundCTM suggests that integrating complementary modeling strategies can lead to substantial performance gains in multimodal generation tasks. As the field of AI-powered content creation continues to evolve, this work highlights the value of exploring novel architectural designs that leverage the unique capabilities of different modeling paradigms.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Music Consistency Models

Zhengcong Fei, Mingyuan Fan, Junshi Huang

Consistency models have exhibited remarkable capabilities in facilitating efficient image/video generation, enabling synthesis with minimal sampling steps. They have proven to be advantageous in mitigating the computational burdens associated with diffusion models. Nevertheless, the application of consistency models in music generation remains largely unexplored. To address this gap, we present Music Consistency Models (MusicCM), which leverages the concept of consistency models to efficiently synthesize mel-spectrograms for music clips, maintaining high quality while minimizing the number of sampling steps. Building upon existing text-to-music diffusion models, the MusicCM model incorporates consistency distillation and adversarial discriminator training. Moreover, we find it beneficial to generate extended coherent music by incorporating multiple diffusion processes with shared constraints. Experimental results reveal the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness. Notably, MusicCM achieves seamless music synthesis with a mere four sampling steps, e.g., only one second per minute of the music clip, showcasing the potential for real-time application.

4/23/2024

cs.SD cs.AI eess.AS

ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi

Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation. To address this bottleneck, we introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve this by proposing a CFG-aware latent consistency model, which adapts consistency generation into a latent space and incorporates classifier-free guidance (CFG) into model training. Moreover, unlike diffusion models, ConsistencyTTA can be finetuned closed-loop with audio-space text-aware metrics, such as CLAP score, to further enhance the generations. Our objective and subjective evaluation on the AudioCaps dataset shows that compared to diffusion-based counterparts, ConsistencyTTA reduces inference computation by 400x while retaining generation quality and diversity.

6/26/2024

cs.SD cs.LG cs.MM eess.AS

AudioLCM: Text-to-Audio Generation with Latent Consistency Models

Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. AudioLCM integrates Consistency Models into the generation process, facilitating rapid inference through a mapping from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands to dozens of steps while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to optimize the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational framework of transformers. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-sound generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audios, while it maintains sample quality competitive with state-of-the-art models using hundreds of steps. AudioLCM enables a sampling speed of 333x faster than real-time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio generation deployment. Our extensive preliminary analysis shows that each design in AudioLCM is effective.

6/4/2024

eess.AS cs.SD

SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

Xinlei Niu, Jing Zhang, Christian Walder, Charles Patrick Martin

We present SoundLoCD, a novel text-to-sound generation framework, which incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be efficiently trained under limited computational resources. The integration of a contrastive learning strategy further enhances the connection between text conditions and the generated outputs, resulting in coherent and high-fidelity performance. Our experiments demonstrate that SoundLoCD outperforms the baseline with greatly reduced computational resources. A comprehensive ablation study further validates the contribution of each component within SoundLoCD. Demo page: https://XinleiNIU.github.io/demo-SoundLoCD/.

5/27/2024

cs.SD eess.AS