mustango

Maintainer: declare-lab

Total Score: 289

Last updated 9/19/2024
  • Run this model: Run on Replicate
  • API spec: View on Replicate
  • Github link: View on Github
  • Paper link: View on Arxiv


Model overview

Mustango is a multimodal model for controlled music generation developed by the declare-lab team. It combines a Latent Diffusion Model (LDM), the Flan-T5 text encoder, and musical features to generate music from text prompts. It builds on similar models such as MusicGen and MusicGen Remixer, but with a focus on finer-grained control and improved overall music quality.

Model inputs and outputs

Mustango takes in a text prompt describing the desired music and generates an audio file in response. The model can be used to create a wide range of musical styles, from ambient to pop, by crafting the right prompts.

Inputs

  • Prompt: A text description of the desired music, including details about the instrumentation, genre, tempo, and mood.

Outputs

  • Audio file: A generated audio file containing the music based on the input prompt.
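
As a rough sketch of how these inputs and outputs map onto an API call, the snippet below uses Replicate's Python client. The model slug ("declare-lab/mustango") and the "prompt" field name are assumptions based on this listing; check the API spec linked above for the exact schema.

```python
# Minimal sketch (assumptions noted above): generate music from a text prompt.
import replicate

prompt = (
    "A mellow lo-fi hip hop track with a relaxed drum groove, "
    "warm electric piano chords, and a slow tempo around 70 BPM."
)

# replicate.run sends the prompt to the hosted model and returns its output,
# which for audio models is typically a URL to the generated audio file.
output = replicate.run("declare-lab/mustango", input={"prompt": prompt})
print(output)
```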

Capabilities

Mustango demonstrates impressive capabilities in generating music that closely matches the provided text prompt. The model is able to capture details like instrumentation, rhythm, and mood, and translate them into coherent musical compositions. Compared to earlier text-to-music models, Mustango shows significant improvements in terms of overall musical quality and coherence.

What can I use it for?

Mustango opens up a world of possibilities for content creators, musicians, and hobbyists alike. The model can be used to generate custom background music for videos, podcasts, or video games. Composers could leverage Mustango to quickly prototype musical ideas or explore new creative directions. Advertisers and marketers may find the model useful for generating jingles or soundtracks for their campaigns.

Things to try

One interesting aspect of Mustango is its ability to generate music in a variety of styles based on the input prompt. Try experimenting with different genres, moods, and levels of detail in your prompts to see the diverse range of musical compositions the model can produce. Additionally, the team has released several pre-trained models, including a Mustango Pretrained version, which may be worth exploring for specific use cases.
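
One way to run that experiment systematically is to hold the prompt structure fixed and sweep over genre and mood descriptors. The sketch below reuses the same assumed Replicate slug and "prompt" field as the earlier example and is meant only as a starting point.

```python
# Hypothetical sketch: compare generations across a few styles and moods.
import replicate

styles = [
    "an upbeat pop song with bright synths and a danceable beat at 120 BPM",
    "a calm ambient piece with slowly evolving pads and no percussion",
    "a tense orchestral cue with staccato strings and low brass swells",
]

for style in styles:
    # Assumed slug and input field; verify against the model's API spec.
    output = replicate.run("declare-lab/mustango", input={"prompt": style})
    print(f"{style}\n  -> {output}\n")
```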



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models


tango

declare-lab

Total Score: 21

Tango is a latent diffusion model (LDM) for text-to-audio (TTA) generation, capable of generating realistic audio including human sounds, animal sounds, natural and artificial sounds, and sound effects from textual prompts. It uses the frozen instruction-tuned language model Flan-T5 as the text encoder and trains a UNet-based diffusion model for audio generation. Compared to current state-of-the-art TTA models, Tango performs comparably across both objective and subjective metrics, despite training on a dataset 63 times smaller. The maintainer has released the model, training, and inference code for the research community. Tango 2 is a follow-up to Tango, built upon the same foundation but with additional alignment training using Direct Preference Optimization (DPO) on the Audio-alpaca dataset, a pairwise text-to-audio preference dataset. This helps Tango 2 generate higher-quality and more aligned audio outputs.

Model inputs and outputs

Inputs

  • Prompt: A textual description of the desired audio to be generated.
  • Steps: The number of steps to use for the diffusion-based audio generation process, with more steps typically producing higher-quality results at the cost of longer inference time.
  • Guidance: The guidance scale, which controls the trade-off between sample quality and sample diversity during the audio generation process.

Outputs

  • Audio: The generated audio clip corresponding to the input prompt, in WAV format.

Capabilities

Tango and Tango 2 can generate a wide variety of realistic audio clips, including human sounds, animal sounds, natural and artificial sounds, and sound effects. For example, they can generate sounds of an audience cheering and clapping, rolling thunder with lightning strikes, or a car engine revving.

What can I use it for?

The Tango and Tango 2 models can be used for a variety of applications, such as:

  • Audio content creation: Generating audio clips for videos, games, podcasts, and other multimedia projects.
  • Sound design: Creating custom sound effects for various applications.
  • Music composition: Generating musical elements or accompaniment for songwriting and composition.
  • Accessibility: Generating audio descriptions for visually impaired users.

Things to try

You can try generating various types of audio clips by providing different prompts to the Tango and Tango 2 models, such as:

  • Everyday sounds (e.g., a dog barking, water flowing, a car engine revving)
  • Natural phenomena (e.g., thunderstorms, wind, rain)
  • Musical instruments and soundscapes (e.g., a piano playing, a symphony orchestra)
  • Human vocalizations (e.g., laughter, cheering, singing)
  • Ambient and abstract sounds (e.g., a futuristic machine, alien landscapes)

Experiment with the number of steps and guidance scale to find the right balance between sample quality and generation time for your specific use case.
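
The Steps and Guidance inputs above are the main quality/diversity knobs when calling Tango. The sketch below shows one way this might look through Replicate's Python client; the "declare-lab/tango" slug and the lowercase field names are assumptions drawn from this description rather than a verified schema.

```python
# Hypothetical sketch: trading off quality, diversity, and speed with Tango.
import replicate

output = replicate.run(
    "declare-lab/tango",  # assumed slug; confirm on the model page
    input={
        "prompt": "rolling thunder with lightning strikes and heavy rain",
        "steps": 200,    # more diffusion steps: higher quality, slower inference
        "guidance": 3,   # guidance scale: higher favors prompt fidelity over diversity
    },
)
print(output)  # typically a URL to the generated WAV clip
```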



omnizart

music-and-culture-technology-lab

Total Score: 3

Omnizart is a Python library developed by the Music and Culture Technology (MCT) Lab that aims to democratize automatic music transcription. It can transcribe various musical elements such as pitched instruments, vocal melody, chords, drum events, and beat from polyphonic audio. Omnizart is powered by research outcomes from the MCT Lab and has been published in the Journal of Open Source Software (JOSS). Similar AI models in this domain include music-classifiers for music classification, piano-transcription for high-resolution piano transcription, mustango for controllable text-to-music generation, and musicgen for music generation from prompts or melodies.

Model inputs and outputs

Omnizart takes an audio file in MP3 or WAV format as input and can output transcriptions for various musical elements.

Inputs

  • audio: Path to the input music file in MP3 or WAV format.
  • mode: The specific transcription task to perform, such as music-piano, chord, drum, vocal, vocal-contour, or beat.

Outputs

The output is an array of objects, where each object contains:

  • file: The path to the input audio file.
  • text: The transcription result as text.

Capabilities

Omnizart can transcribe a wide range of musical elements, including pitched instruments, vocal melody, chords, drum events, and beat. This allows users to extract structured musical information from audio recordings, enabling applications such as music analysis, music information retrieval, and computer-assisted music composition.

What can I use it for?

With Omnizart, you can transcribe your favorite songs and explore the underlying musical structure. The transcriptions can be used for various purposes, such as:

  • Music analysis: Analyze the harmonic progressions, rhythmic patterns, and melodic lines of a piece of music.
  • Music information retrieval: Extract relevant metadata from audio recordings, such as chord changes, drum patterns, and melody, to enable more sophisticated music search and recommendations.
  • Computer-assisted music composition: Use the transcribed musical elements as a starting point for creating new compositions or arrangements.

Things to try

Try using Omnizart to transcribe different genres of music and explore the nuances in how it handles various musical elements. You can also experiment with the different transcription modes to see how the results vary and gain insights into the strengths and limitations of the model.
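
The audio/mode interface above maps onto a single prediction call. The sketch below assumes a Replicate-hosted version of Omnizart that accepts "audio" and "mode" fields as listed; for local use, the omnizart Python package exposes equivalent transcription commands, whose exact signatures should be checked against its documentation.

```python
# Hypothetical sketch: transcribe the chords of a local recording.
import replicate

with open("song.mp3", "rb") as audio_file:
    result = replicate.run(
        "music-and-culture-technology-lab/omnizart",  # assumed slug
        # mode can be music-piano, chord, drum, vocal, vocal-contour, or beat
        input={"audio": audio_file, "mode": "chord"},
    )

# Per the output description above, each element pairs the input file
# with its transcription text.
for item in result:
    print(item["file"], item["text"])
```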



musicgen

meta

Total Score: 2.0K

musicgen is a simple and controllable model for music generation developed by Meta. Unlike existing methods like MusicLM, musicgen doesn't require a self-supervised semantic representation and generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, the authors show they can predict them in parallel, thus having only 50 auto-regressive steps per second of audio. musicgen was trained on 20K hours of licensed music, including an internal dataset of 10K high-quality music tracks and music data from ShutterStock and Pond5.

Model inputs and outputs

musicgen takes in a text prompt or melody and generates corresponding music. The model's inputs include a description of the desired music, an optional input audio file to influence the generated output, and various parameters to control the generation process like temperature, top-k, and top-p sampling. The output is a generated audio file in WAV format.

Inputs

  • Prompt: A description of the music you want to generate.
  • Input Audio: An optional audio file that will influence the generated music. If "continuation" is set to true, the generated music will be a continuation of the input audio. Otherwise, it will mimic the input audio's melody.
  • Duration: The duration of the generated audio in seconds.
  • Continuation Start/End: The start and end times of the input audio to use for continuation.
  • Various generation parameters: Settings like temperature, top-k, top-p, etc. to control the diversity and quality of the generated output.

Outputs

  • Generated Audio: A WAV file containing the generated music.

Capabilities

musicgen can generate a wide variety of music styles and genres based on the provided text prompt. For example, you could ask it to generate "tense, staccato strings with plucked dissonant strings, like a scary movie soundtrack" and it would produce corresponding music. The model can also continue or mimic the melody of an input audio file, allowing for more coherent and controlled music generation.

What can I use it for?

musicgen could be used for a variety of applications, such as:

  • Background music generation: Automatically generating custom music for videos, games, or other multimedia projects.
  • Music composition assistance: Helping musicians and composers come up with new musical ideas or sketches to build upon.
  • Audio creation for content creators: Allowing YouTubers, podcasters, and other content creators to easily add custom music to their projects.

Things to try

One interesting aspect of musicgen is its ability to generate music in parallel by predicting the different codebook components separately. This allows for faster generation compared to previous autoregressive music models. You could try experimenting with different generation parameters to find the right balance between generation speed, diversity, and quality for your use case. Additionally, the model's ability to continue or mimic input audio opens up possibilities for interactive music creation workflows, where users could iterate on an initial seed melody or prompt to refine the generated output.
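
Because musicgen exposes both a text prompt and an optional input audio with continuation controls, a typical call combines the two. The sketch below uses Replicate's Python client with an assumed "meta/musicgen" slug; the field names mirror the input list above but should be checked against the real schema before use.

```python
# Hypothetical sketch: continue an existing seed clip in a described style.
import replicate

with open("seed_melody.wav", "rb") as seed:
    output = replicate.run(
        "meta/musicgen",  # assumed slug; confirm on the model page
        input={
            "prompt": "tense, staccato strings with plucked dissonant accents",
            "input_audio": seed,   # optional seed clip to continue or mimic
            "continuation": True,  # continue the seed rather than mimic its melody
            "duration": 15,        # seconds of audio to generate
            "temperature": 1.0,    # higher values sample more diversely
            "top_k": 250,
            "top_p": 0.0,
        },
    )
print(output)  # URL to the generated WAV file
```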



audio-ldm

haoheliu

Total Score: 36

audio-ldm is a text-to-audio generation model created by Haohe Liu, a researcher at CVSSP. It uses latent diffusion models to generate audio based on text prompts. The model is similar to stable-diffusion, a widely-used latent text-to-image diffusion model, but applied to the audio domain. It is also related to models like riffusion, which generates music from text, and whisperx, which transcribes audio. However, audio-ldm is focused specifically on generating a wide range of audio content from text.

Model inputs and outputs

The audio-ldm model takes in a text prompt as input and generates an audio clip as output. The text prompt can describe the desired sound, such as "a hammer hitting a wooden surface" or "children singing". The model then produces an audio clip that matches the text prompt.

Inputs

  • Text: A text prompt describing the desired audio to generate.
  • Duration: The duration of the generated audio clip in seconds. Higher durations may lead to out-of-memory errors.
  • Random Seed: An optional random seed to control the randomness of the generation.
  • N Candidates: The number of candidate audio clips to generate, with the best one selected.
  • Guidance Scale: A parameter that controls the balance between audio quality and diversity. Higher values lead to better quality but less diversity.

Outputs

  • Audio Clip: The generated audio clip that matches the input text prompt.

Capabilities

audio-ldm is capable of generating a wide variety of audio content from text prompts, including speech, sound effects, music, and beyond. It can also perform audio-to-audio generation, where it generates a new audio clip that has similar sound events to a provided input audio. Additionally, the model supports text-guided audio-to-audio style transfer, where it can transfer the sound of an input audio clip to match a text description.

What can I use it for?

audio-ldm could be useful for various applications, such as:

  • Creative content generation: Generating audio content for use in videos, games, or other multimedia projects.
  • Audio post-production: Automating the creation of sound effects or music to complement visual content.
  • Accessibility: Generating audio descriptions for visually impaired users.
  • Education and research: Exploring the capabilities of text-to-audio generation models.

Things to try

When using audio-ldm, try providing more detailed and descriptive text prompts to get better quality results. Experiment with different random seeds to see how they affect the generation. You can also try combining audio-ldm with other audio tools and techniques, such as audio editing or signal processing, to create even more interesting and compelling audio content.
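
The text, duration, seed, candidate, and guidance controls above translate into a single generation call. The sketch below assumes a Replicate-hosted "haoheliu/audio-ldm" model with snake_case field names guessed from the input list; treat both as placeholders to verify against the actual API.

```python
# Hypothetical sketch: generate a short sound effect with audio-ldm.
import replicate

output = replicate.run(
    "haoheliu/audio-ldm",  # assumed slug; confirm on the model page
    input={
        "text": "a hammer hitting a wooden surface",
        "duration": 5.0,        # seconds; long durations may run out of memory
        "random_seed": 42,      # fix the seed for reproducible output
        "n_candidates": 3,      # generate several candidates, keep the best
        "guidance_scale": 2.5,  # higher: closer to the prompt, less diverse
    },
)
print(output)  # URL to the generated audio clip
```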
