Complex Image-Generative Diffusion Transformer for Audio Denoising

2406.09161

Published 6/14/2024 by Junhui Li, Pu Wang, Jialu Li, Youshan Zhang

Complex Image-Generative Diffusion Transformer for Audio Denoising

Abstract

The audio denoising technique has captured widespread attention in the deep neural network field. Recently, the audio denoising problem has been converted into an image generation task, and deep learning-based approaches have been applied to tackle this problem. However, its performance is still limited, leaving room for further improvement. In order to enhance audio denoising performance, this paper introduces a complex image-generative diffusion transformer that captures more information from the complex Fourier domain. We explore a novel diffusion transformer by integrating the transformer with a diffusion model. Our proposed model demonstrates the scalability of the transformer and expands the receptive field of sparse attention using attention diffusion. Our work is among the first to utilize diffusion transformers to deal with the image generation task for audio denoising. Extensive experiments on two benchmark datasets demonstrate that our proposed model outperforms state-of-the-art methods.

Create account to get full access

Overview

This paper introduces a novel "Complex Image-Generative Diffusion Transformer" model for audio denoising.
The model leverages diffusion-based generative models and transformer architectures to effectively remove noise from audio signals.
The authors demonstrate the model's capability to restore high-quality audio from noisy inputs across a range of noise levels and types.

Plain English Explanation

The researchers have developed a new type of artificial intelligence (AI) model that can effectively remove unwanted noise or distortion from audio recordings. This model combines two powerful AI techniques: diffusion models, which can generate high-quality audio, and transformers, which are specialized for processing sequential data like audio.

By combining these approaches, the researchers created a "Complex Image-Generative Diffusion Transformer" that can take in noisy audio and output clean, high-quality versions. This is useful for applications like speech recognition, music production, and audio-based user interfaces, where clear audio is essential.

The model was tested on a variety of noisy audio samples and was able to significantly reduce the noise while preserving the original audio content. This suggests the technique could be valuable for real-world audio denoising tasks, like cleaning up recordings made in noisy environments.

Technical Explanation

The core of the model is a diffusion-based generative architecture that learns to progressively remove noise from audio inputs. This is combined with a transformer-based neural network that can effectively model the complex relationships within the audio data.

The authors leverage recent advancements in diffusion Gaussian mixture models to enable the model to handle a wide range of noise levels and types. This flexibility is key for real-world audio denoising applications.

Experiments demonstrate the model's ability to restore high-fidelity audio from noisy inputs, outperforming previous state-of-the-art approaches on standard audio denoising benchmarks. The authors also provide analyses showing the model's robustness and its potential for further improvements.

Critical Analysis

The paper presents a technically sophisticated approach to audio denoising that builds on recent advancements in generative models and transformer architectures. The authors have demonstrated impressive results, but it's important to note that the model's performance may be dependent on the availability of high-quality training data.

Additionally, while the model is designed to handle a variety of noise types, there may be specific scenarios or edge cases where its performance could be limited. Further research and testing would be needed to fully understand the model's limitations and potential areas for improvement.

It would also be valuable to explore how this approach could be integrated with other audio processing techniques, such as speech enhancement or music separation, to create more comprehensive solutions for real-world audio applications.

Conclusion

The "Complex Image-Generative Diffusion Transformer" represents an innovative approach to audio denoising that leverages the strengths of diffusion models and transformers. The authors have demonstrated the model's ability to effectively remove noise from audio signals while preserving the original content.

This research contributes to the ongoing efforts to develop more robust and versatile audio processing techniques, which could have significant impact on a wide range of applications, from voice-based user interfaces to high-quality audio production. As the field of AI-powered audio processing continues to evolve, this work provides a promising direction for future exploration and development.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation

Gwanghyun Kim, Alonso Martinez, Yu-Chuan Su, Brendan Jou, Jos'e Lezama, Agrim Gupta, Lijun Yu, Lu Jiang, Aren Jansen, Jacob Walker, Krishna Somandepalli

Training diffusion models for audiovisual sequences allows for a range of generation tasks by learning conditional distributions of various input-output combinations of the two modalities. Nevertheless, this strategy often requires training a separate model for each task which is expensive. Here, we propose a novel training approach to effectively learn arbitrary conditional distributions in the audiovisual space.Our key contribution lies in how we parameterize the diffusion timestep in the forward diffusion process. Instead of the standard fixed diffusion timestep, we propose applying variable diffusion timesteps across the temporal dimension and across modalities of the inputs. This formulation offers flexibility to introduce variable noise levels for various portions of the input, hence the term mixture of noise levels. We propose a transformer-based audiovisual latent diffusion model and show that it can be trained in a task-agnostic fashion using our approach to enable a variety of audiovisual generation tasks at inference time. Experiments demonstrate the versatility of our method in tackling cross-modal and multimodal interpolation tasks in the audiovisual space. Notably, our proposed approach surpasses baselines in generating temporally and perceptually consistent samples conditioned on the input. Project page: avdit2024.github.io

5/24/2024

cs.CV cs.LG cs.MM cs.SD eess.AS

SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models

Burak Can Biner, Farrin Marouf Sofian, Umur Berkay Karakac{s}, Duygu Ceylan, Erkut Erdem, Aykut Erdem

We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using multi-modal input. While spatial control using cues such as depth, sketch, and other images has attracted a lot of research, we argue that another equally effective modality is audio since sound and sight are two main components of human perception. Hence, we propose a method to enable audio-conditioning in large scale image diffusion models. Our method first maps features obtained from audio clips to tokens that can be injected into the diffusion model in a fashion similar to text tokens. We introduce additional audio-image cross attention layers which we finetune while freezing the weights of the original layers of the diffusion model. In addition to audio conditioned image generation, our method can also be utilized in conjuction with diffusion based editing methods to enable audio conditioned image editing. We demonstrate our method on a wide range of audio and image datasets. We perform extensive comparisons with recent methods and show favorable performance.

5/3/2024

cs.CV

Diffusion Gaussian Mixture Audio Denoise

Pu Wang, Junhui Li, Jialu Li, Liangdong Guo, Youshan Zhang

Recent diffusion models have achieved promising performances in audio-denoising tasks. The unique property of the reverse process could recover clean signals. However, the distribution of real-world noises does not comply with a single Gaussian distribution and is even unknown. The sampling of Gaussian noise conditions limits its application scenarios. To overcome these challenges, we propose a DiffGMM model, a denoising model based on the diffusion and Gaussian mixture models. We employ the reverse process to estimate parameters for the Gaussian mixture model. Given a noisy audio signal, we first apply a 1D-U-Net to extract features and train linear layers to estimate parameters for the Gaussian mixture model, and we approximate the real noise distributions. The noisy signal is continuously subtracted from the estimated noise to output clean audio signals. Extensive experimental results demonstrate that the proposed DiffGMM model achieves state-of-the-art performance.

6/14/2024

cs.SD cs.CL eess.AS

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

Shiqi Yang, Zhi Zhong, Mengjie Zhao, Shusuke Takahashi, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

In recent years, with the realistic generation results and a wide range of personalized applications, diffusion-based generative models gain huge attention in both visual and audio generation areas. Compared to the considerable advancements of text2image or text2audio generation, research in audio2visual or visual2audio generation has been relatively slow. The recent audio-visual generation methods usually resort to huge large language model or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back showing a simple and lightweight generative transformer, which is not fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space, and is trained in the mask denoising manner. After training, the classifier-free guidance could be deployed off-the-shelf achieving better performance, without any extra training or modification. Since the transformer model is modality symmetrical, it could also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at https://docs.google.com/presentation/d/1ZtC0SeblKkut4XJcRaDsSTuCRIXB3ypxmSi7HTY3IyQ/

5/27/2024

cs.CV cs.LG cs.MM cs.SD eess.AS