SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models

2405.00878

Published 5/3/2024 by Burak Can Biner, Farrin Marouf Sofian, Umur Berkay Karakac{s}, Duygu Ceylan, Erkut Erdem, Aykut Erdem

cs.CV

SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models

Abstract

We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using multi-modal input. While spatial control using cues such as depth, sketch, and other images has attracted a lot of research, we argue that another equally effective modality is audio since sound and sight are two main components of human perception. Hence, we propose a method to enable audio-conditioning in large scale image diffusion models. Our method first maps features obtained from audio clips to tokens that can be injected into the diffusion model in a fashion similar to text tokens. We introduce additional audio-image cross attention layers which we finetune while freezing the weights of the original layers of the diffusion model. In addition to audio conditioned image generation, our method can also be utilized in conjuction with diffusion based editing methods to enable audio conditioned image editing. We demonstrate our method on a wide range of audio and image datasets. We perform extensive comparisons with recent methods and show favorable performance.

Create account to get full access

Overview

This paper introduces SonicDiffusion, a system that can generate and edit images based on audio input, using pretrained diffusion models.
The key idea is to use a diffusion model trained on images, and condition it on audio features to produce relevant images.
This allows for audio-driven image generation and editing, enabling new creative applications.

Plain English Explanation

SonicDiffusion is a system that can create and modify images using only sound as input. It works by taking a diffusion model that has been trained on a large dataset of images, and then conditioning that model on audio features. This means the model can use the information in the audio - such as the sounds, music, or speech - to generate new images or edit existing ones in interesting ways.

For example, you could play a piano melody and SonicDiffusion would generate an image of a piano or a pianist. Or you could describe an object or scene in words, and the system would try to depict that using the information in your speech. The Tango 2: Aligning Diffusion-based Text-to paper explores similar ideas for text-to-image generation.

This audio-driven approach opens up new creative possibilities compared to traditional image editing tools. Instead of laboriously manipulating pixels or layers, you can simply describe what you want to see or play relevant sounds, and the system will translate that into a visual output. The DiscFFusion: Discriminative Diffusion Models as Few-shot paper discusses how diffusion models can be adapted for few-shot image generation tasks.

Technical Explanation

SonicDiffusion builds on recent advancements in diffusion models, which are a type of generative AI system that can produce highly realistic images. The core idea is to take an existing diffusion model that has been trained on a large dataset of images, and then condition that model on audio features extracted from input sounds.

The authors experiment with different ways of incorporating the audio information, such as feeding it directly into the diffusion model or using it to modulate the noise schedule. They also explore using contrastive learning to better align the audio and visual representations.

Through extensive experiments, the authors demonstrate that SonicDiffusion can generate relevant images from audio input, as well as allow for audio-driven editing of existing images. They show results across a variety of domains, including musical instruments, scenes, and objects.

The TextSliders: Diffusion-based Texture Editing in CLIP Space paper explores related ideas for text-guided texture editing, while the StoryDiffusion: Consistent Self-Attention for Long-Range Image paper looks at ways to improve the consistency of diffusion-based image generation.

Critical Analysis

One potential limitation of SonicDiffusion is that it relies on having a pretrained diffusion model that can produce high-quality images. The performance of the system is ultimately bounded by the capabilities of the underlying image generation model.

Additionally, the authors note that the audio-to-image mapping can be somewhat unpredictable, and the generated images may not always closely match the input audio. Further research may be needed to improve the robustness and reliability of the audio-driven image generation and editing.

The MaxFusion: Plug-and-Play Multi-Modal Generation from Text-to paper discusses some strategies for more reliable multimodal generation that could be relevant here.

Overall, SonicDiffusion represents an exciting step forward in the field of audio-visual AI, demonstrating the potential to create new forms of creative expression by combining these modalities. As the underlying technologies continue to improve, we can expect to see increasingly sophisticated and versatile audio-driven image systems in the future.

Conclusion

The SonicDiffusion paper presents a novel approach for generating and editing images using audio input. By conditioning a pretrained diffusion model on audio features, the system can translate sounds, speech, and music into corresponding visual outputs.

This audio-driven image generation and editing capability opens up new creative possibilities compared to traditional tools. Users can simply describe what they want to see or play relevant sounds, and the system will try to translate that into a visual representation.

While the system has some limitations and unpredictability, the core idea of combining diffusion models with audio inputs is a promising direction for future research. As the underlying technologies continue to advance, we can expect to see increasingly sophisticated and versatile audio-visual AI systems emerge.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation

Gwanghyun Kim, Alonso Martinez, Yu-Chuan Su, Brendan Jou, Jos'e Lezama, Agrim Gupta, Lijun Yu, Lu Jiang, Aren Jansen, Jacob Walker, Krishna Somandepalli

Training diffusion models for audiovisual sequences allows for a range of generation tasks by learning conditional distributions of various input-output combinations of the two modalities. Nevertheless, this strategy often requires training a separate model for each task which is expensive. Here, we propose a novel training approach to effectively learn arbitrary conditional distributions in the audiovisual space.Our key contribution lies in how we parameterize the diffusion timestep in the forward diffusion process. Instead of the standard fixed diffusion timestep, we propose applying variable diffusion timesteps across the temporal dimension and across modalities of the inputs. This formulation offers flexibility to introduce variable noise levels for various portions of the input, hence the term mixture of noise levels. We propose a transformer-based audiovisual latent diffusion model and show that it can be trained in a task-agnostic fashion using our approach to enable a variety of audiovisual generation tasks at inference time. Experiments demonstrate the versatility of our method in tackling cross-modal and multimodal interpolation tasks in the audiovisual space. Notably, our proposed approach surpasses baselines in generating temporally and perceptually consistent samples conditioned on the input. Project page: avdit2024.github.io

5/24/2024

cs.CV cs.LG cs.MM cs.SD eess.AS

Complex Image-Generative Diffusion Transformer for Audio Denoising

Junhui Li, Pu Wang, Jialu Li, Youshan Zhang

The audio denoising technique has captured widespread attention in the deep neural network field. Recently, the audio denoising problem has been converted into an image generation task, and deep learning-based approaches have been applied to tackle this problem. However, its performance is still limited, leaving room for further improvement. In order to enhance audio denoising performance, this paper introduces a complex image-generative diffusion transformer that captures more information from the complex Fourier domain. We explore a novel diffusion transformer by integrating the transformer with a diffusion model. Our proposed model demonstrates the scalability of the transformer and expands the receptive field of sparse attention using attention diffusion. Our work is among the first to utilize diffusion transformers to deal with the image generation task for audio denoising. Extensive experiments on two benchmark datasets demonstrate that our proposed model outperforms state-of-the-art methods.

6/14/2024

cs.SD eess.AS

Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria

Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many of the recent diffusion-based text-to-audio models focus on training increasingly sophisticated diffusion models on a large set of datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or events and their temporal ordering in the output audio with respect to the input prompt. Our hypothesis is focusing on how these aspects of audio generation could improve audio generation performance in the presence of limited data. As such, in this work, using an existing text-to-audio model Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the publicly available Tango text-to-audio model using diffusion-DPO (direct preference optimization) loss on our preference dataset and show that it leads to improved audio output over Tango and AudioLDM2, in terms of both automatic- and manual-evaluation metrics.

4/17/2024

cs.SD cs.AI cs.CL eess.AS

Prompt-guided Precise Audio Editing with Diffusion Models

Manjie Xu, Chenxing Li, Duzhen zhang, Dan Su, Wei Liang, Dong Yu

Audio editing involves the arbitrary manipulation of audio content through precise control. Although text-guided diffusion models have made significant advancements in text-to-audio generation, they still face challenges in finding a flexible and precise way to modify target events within an audio track. We present a novel approach, referred to as PPAE, which serves as a general module for diffusion models and enables precise audio editing. The editing is based on the input textual prompt only and is entirely training-free. We exploit the cross-attention maps of diffusion models to facilitate accurate local editing and employ a hierarchical local-global pipeline to ensure a smoother editing process. Experimental results highlight the effectiveness of our method in various editing tasks.

6/10/2024

cs.SD cs.AI cs.LG eess.AS