
Maintainer: riffusion

Last updated 5/28/2024




Model overview

riffusion-model-v1 is a latent text-to-image diffusion model that generates spectrogram images from any text input. These spectrograms can be converted into audio clips. The model was created by fine-tuning the Stable-Diffusion-v1-5 checkpoint and was developed by Seth Forsgren and Hayk Martiros as a hobby project.

Model inputs and outputs

riffusion-model-v1 takes a text prompt as input and generates a spectrogram image as output. The spectrogram can then be converted into an audio clip.


Inputs

  • Text prompt: Any text input that describes the desired audio clip.


Outputs

  • Spectrogram image: An image containing a visual representation of the audio signal's frequency content over time.


Capabilities

riffusion-model-v1 can generate a wide variety of audio content from text prompts, from musical melodies to sound effects. Because it builds on Stable Diffusion, it produces each clip as a spectrogram image that follows the text prompt and is then converted into audio.

What can I use it for?

The riffusion-model-v1 model is intended for research purposes only. Possible use cases include the generation of artistic audio content, exploration of the limitations and biases of generative audio models, and the development of educational or creative tools. The model should not be used to intentionally create or disseminate harmful or offensive content.

Things to try

Experiment with different text prompts to see the variety of audio outputs the riffusion-model-v1 can generate. Try prompts that describe specific genres, instruments, or sound effects to see how the model responds. Additionally, you can explore the model's capabilities by combining text prompts with the Riffusion web app to create interactive audio experiences.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

riffusion

riffusion is a library for real-time music and audio generation using the Stable Diffusion text-to-image diffusion model. It was developed by Seth Forsgren and Hayk Martiros as a hobby project. riffusion fine-tunes Stable Diffusion to generate spectrogram images that can be converted into audio clips, allowing for the creation of music based on text prompts. This is in contrast to other similar models like inkpunk-diffusion and multidiffusion, which focus on visual art generation.

Model inputs and outputs

riffusion takes in a text prompt, an optional second prompt for interpolation, a seed image ID, and parameters controlling the diffusion process. It outputs a spectrogram image and the corresponding audio clip.

Inputs

  • Prompt A: The primary text prompt describing the desired audio
  • Prompt B: An optional second prompt to interpolate with the first
  • Alpha: The interpolation value between the two prompts, from 0 to 1
  • Denoising: How much to transform the input spectrogram, from 0 to 1
  • Seed Image ID: The ID of a seed spectrogram image to use
  • Num Inference Steps: The number of steps to run the diffusion model

Outputs

  • Spectrogram Image: A spectrogram visualization of the generated audio
  • Audio Clip: The generated audio clip in MP3 format

Capabilities

riffusion can generate a wide variety of musical styles and genres based on the provided text prompts. For example, it can create "funky synth solos", "jazz with piano", or "church bells on Sunday". The model is able to capture complex musical concepts and translate them into coherent audio clips.

What can I use it for?

The riffusion model is intended for research and creative applications. It could be used to generate audio for educational or creative tools, or as part of artistic projects exploring the intersection of language and music. Additionally, researchers studying generative models and the connection between text and audio may find riffusion useful for their work.

Things to try

One interesting aspect of riffusion is its ability to interpolate between two text prompts. By adjusting the alpha parameter, you can create a smooth transition from one style of music to another, allowing for the generation of unique and unexpected audio clips. Another interesting area to explore is the model's handling of seed images: by providing different starting spectrograms, you can influence the character and direction of the generated music.
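The role of the alpha parameter can be sketched as a weighted blend of two prompt embeddings. This is an illustration under stated assumptions, not riffusion's own code: the function name `interpolate_prompts` is hypothetical, the random arrays stand in for real text-encoder outputs, and the `(77, 768)` shape is assumed to match the CLIP text encoder used by Stable Diffusion v1. The library may blend prompts differently (for example with spherical rather than linear interpolation).

```python
import numpy as np

def interpolate_prompts(embed_a, embed_b, alpha):
    """Linear blend: alpha=0 returns embed_a, alpha=1 returns embed_b."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return (1.0 - alpha) * embed_a + alpha * embed_b

rng = np.random.default_rng(0)
embed_a = rng.normal(size=(77, 768))  # stand-in for the embedding of Prompt A
embed_b = rng.normal(size=(77, 768))  # stand-in for the embedding of Prompt B

# Five evenly spaced alpha values sweep from Prompt A to Prompt B.
blends = [interpolate_prompts(embed_a, embed_b, a) for a in np.linspace(0, 1, 5)]
```

Feeding each blended embedding to the diffusion model in turn would yield a sequence of clips that moves gradually from the first prompt's style to the second's.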

stable-diffusion-v1-1

stable-diffusion-v1-1 is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. It was trained for 237,000 steps at resolution 256x256 on laion2B-en, followed by 194,000 steps at resolution 512x512 on laion-high-resolution. The model is intended to be used with the Diffusers library. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (CLIP ViT-L/14), as suggested in the Imagen paper. Similar models such as stable-diffusion-v1-4 have been trained for longer and usually produce better images. The stable-diffusion model provides an overview of the various Stable Diffusion model checkpoints.

Model inputs and outputs

Inputs

  • Text prompt: A text description of the desired image to generate.

Outputs

  • Generated image: A photo-realistic image matching the input text prompt.

Capabilities

stable-diffusion-v1-1 can generate a wide variety of images from text prompts, including realistic scenes, abstract art, and imaginative creations. For example, it can create images of "a photo of an astronaut riding a horse on mars", "a painting of a unicorn in a fantasy landscape", or "a surreal portrait of a robot musician".

What can I use it for?

The stable-diffusion-v1-1 model is intended for research purposes only. Possible use cases include:

  • Safe deployment of models that can generate potentially harmful content
  • Probing and understanding the limitations and biases of generative models
  • Generation of artworks and use in design and other creative processes
  • Applications in educational or creative tools
  • Research on generative models

The model should not be used to intentionally create or disseminate images that are disturbing, offensive, or propagate harmful stereotypes.

Things to try

Some interesting things to try with stable-diffusion-v1-1 include:

  • Experimenting with different text prompts to see the range of images the model can generate
  • Trying out different noise schedulers to see how they affect the output
  • Exploring the model's capabilities and limitations, such as its ability to render text or handle complex compositions
  • Investigating ways to mitigate potential biases and harmful outputs from the model
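Since the model is meant to be used with the Diffusers library, trying a different noise scheduler might look like the sketch below. The helper `generate_image` is hypothetical, the repository id `CompVis/stable-diffusion-v1-1` is assumed to be where the weights are hosted, and the step count and seed are arbitrary. Calling the function requires `diffusers` and `torch` to be installed and downloads the model weights on first use.

```python
def generate_image(prompt, model_id="CompVis/stable-diffusion-v1-1",
                   num_inference_steps=30, seed=0):
    """Generate one image from a prompt, swapping in an Euler scheduler."""
    # Imported lazily so that defining this function does not need the libraries.
    import torch
    from diffusers import EulerDiscreteScheduler, StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id)
    # Replace the default scheduler while reusing its configuration.
    pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    pipe = pipe.to(device)

    # A fixed seed makes runs comparable when only the scheduler changes.
    generator = torch.Generator(device=device).manual_seed(seed)
    result = pipe(prompt, num_inference_steps=num_inference_steps,
                  generator=generator)
    return result.images[0]

# Example call (not executed here because it downloads several GB of weights):
# image = generate_image("a photo of an astronaut riding a horse on mars")
# image.save("astronaut.png")
```

Keeping the prompt and seed fixed while changing only the scheduler class is one way to isolate the scheduler's effect on the output.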

stable-diffusion

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from text prompts. This model was developed by CompVis, and improves upon previous text-to-image models through a series of training iterations. The model is available in several versions, with higher versions usually producing better image quality. stable-diffusion-v1-4 is the latest of the CompVis v1 checkpoints, having been trained for 225,000 steps at 512x512 resolution on a filtered subset of the LAION-5B dataset with improved aesthetics. This version also uses 10% text conditioning dropout to improve classifier-free guidance sampling.

Model inputs and outputs

Stable Diffusion takes a text prompt as input and generates a corresponding photo-realistic image as output. The model encodes the text prompt using a pretrained text encoder, and then generates the image in a latent space before decoding it back to the pixel domain.

Inputs

  • Text prompt: A natural language description of the desired image content.

Outputs

  • Image: A photo-realistic image corresponding to the input text prompt.

Capabilities

Stable Diffusion is capable of generating a wide variety of photorealistic images from textual descriptions. It can create scenes, objects, characters, and more with a high level of detail and quality. The model has been found to excel at tasks like generating landscapes, portraits, and imaginative scenes.

What can I use it for?

Stable Diffusion can be used for a variety of creative and research applications. Artists and designers can use it to rapidly generate visual concepts and explore new ideas. Educators can incorporate it into lesson plans to spark creativity and visual thinking. Researchers can study the model's biases and limitations to better understand the capabilities and challenges of text-to-image generation.

While the model has impressive capabilities, it should not be used to generate harmful or deceptive content. The Stable Diffusion v2 Model Card outlines several excluded use cases, such as generating demeaning or discriminatory content, impersonating individuals without consent, and creating misinformation.

Things to try

One interesting aspect of Stable Diffusion is its ability to combine disparate concepts in novel ways. Try prompting the model with unusual juxtapositions, such as "a dragon riding a bicycle" or "a penguin in a spacesuit". Explore how the model integrates these elements and the types of images it generates.

Another area to experiment with is the model's treatment of scale and perspective. See how it handles requests for scenes with both small and large elements, or try varying the level of detail and realism in the prompt. The model's performance on these types of compositional challenges can provide insight into its underlying capabilities and limitations.
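The 10% text-conditioning dropout mentioned in the overview of this entry is what makes classifier-free guidance possible: the model occasionally trains on an empty prompt, so at sampling time it can produce both a conditional and an unconditional noise prediction, which are then combined. The sketch below illustrates both halves with NumPy. The function names are hypothetical, the `(4, 64, 64)` latent shape is an assumption for a 512x512 image, and the constant arrays stand in for real model predictions.

```python
import numpy as np

def drop_conditioning(prompts, drop_prob=0.1, rng=None):
    """Training time: replace each prompt with "" with probability drop_prob."""
    if rng is None:
        rng = np.random.default_rng()
    return ["" if rng.random() < drop_prob else p for p in prompts]

def guided_noise(noise_uncond, noise_cond, guidance_scale):
    """Sampling time: push the prediction away from the unconditional one."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

rng = np.random.default_rng(0)
prompts = ["a dragon riding a bicycle"] * 10_000
dropped = drop_conditioning(prompts, rng=rng)
drop_rate = dropped.count("") / len(dropped)  # close to 0.1 for a large batch

noise_uncond = np.zeros((4, 64, 64))  # stand-in for the empty-prompt prediction
noise_cond = np.ones((4, 64, 64))     # stand-in for the text-conditioned prediction
guided = guided_noise(noise_uncond, noise_cond, guidance_scale=7.5)
```

A guidance scale of 1 reproduces the conditional prediction unchanged, while larger values weight the prompt more heavily at the cost of sample diversity.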

stable-diffusion-v1-5

stable-diffusion-v1-5 is a latent text-to-image diffusion model developed by runwayml that can generate photo-realistic images from text prompts. It was initialized with the weights of the Stable-Diffusion-v1-2 checkpoint and then fine-tuned for 595k steps at 512x512 resolution on the "laion-aesthetics v2 5+" dataset. This fine-tuning dropped the text conditioning 10% of the time to improve classifier-free guidance sampling. Similar models include the Stable-Diffusion-v1-4 checkpoint, which was trained for 225k steps at 512x512 resolution on "laion-aesthetics v2 5+" with 10% text-conditioning dropping, as well as the coreml-stable-diffusion-v1-5 model, which is a version of the stable-diffusion-v1-5 model converted for use on Apple Silicon hardware.

Model inputs and outputs

Inputs

  • Text prompt: A textual description of the desired image to generate.

Outputs

  • Generated image: A photo-realistic image that matches the provided text prompt.

Capabilities

The stable-diffusion-v1-5 model can generate a wide variety of photo-realistic images from text prompts. For example, it can create images of imaginary scenes, like "a photo of an astronaut riding a horse on mars", as well as more realistic images, like "a photo of a yellow cat sitting on a park bench". The model is able to capture details like lighting, textures, and composition, resulting in highly convincing and visually appealing outputs.

What can I use it for?

The stable-diffusion-v1-5 model is intended for research purposes only. Potential use cases include:

  • Generating artwork and creative content for design, education, or personal projects (using the Diffusers library)
  • Probing the limitations and biases of generative models
  • Developing safe deployment strategies for models with the potential to generate harmful content

The model should not be used to create content that is disturbing, offensive, or propagates harmful stereotypes. Excluded uses include generating demeaning representations, impersonating individuals without consent, or sharing copyrighted material.

Things to try

One interesting aspect of the stable-diffusion-v1-5 model is its ability to generate highly detailed and visually compelling images, even for complex or fantastical prompts. Try experimenting with prompts that combine multiple elements, like "a photo of a robot unicorn fighting a giant mushroom in a cyberpunk city". The model's strong grasp of composition and lighting can result in surprisingly coherent and imaginative outputs.

Another area to explore is the model's flexibility in handling different styles and artistic mediums. Try prompts that reference specific art movements, like "a Monet-style painting of a sunset over a lake" or "a cubist portrait of a person". The model's latent diffusion approach allows it to capture a wide range of visual styles and aesthetics.
