omnizart

Maintainer: music-and-culture-technology-lab

Total Score

3

Last updated 9/17/2024

  • Run this model: Run on Replicate
  • API spec: View on Replicate
  • Github link: View on Github
  • Paper link: View on Arxiv

Model overview

Omnizart is a Python library developed by the Music and Culture Technology (MCT) Lab that aims to democratize automatic music transcription. It can transcribe various musical elements such as pitched instruments, vocal melody, chords, drum events, and beat from polyphonic audio. Omnizart is powered by research outcomes from the MCT Lab and has been published in the Journal of Open Source Software (JOSS).

Similar AI models in this domain include music-classifiers for music classification, piano-transcription for high-resolution piano transcription, mustango for controllable text-to-music generation, and musicgen for music generation from prompts or melodies.

Model inputs and outputs

Omnizart takes an audio file in MP3 or WAV format as input and can output transcriptions for various musical elements; a minimal usage sketch follows the input and output lists below.

Inputs

  • audio: Path to the input music file in MP3 or WAV format.
  • mode: The specific transcription task to perform, such as music-piano, chord, drum, vocal, vocal-contour, or beat.

Outputs

  • The output is an array of objects, where each object contains:
    • file: The path to the input audio file.
    • text: The transcription result as text.
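To try this input schema programmatically, a minimal sketch using the Replicate Python client might look like the following. The model reference is an assumption based on the maintainer name shown above; check the "Run on Replicate" link for the exact identifier and version.

    import replicate

    # Assumed model reference; verify the exact slug and version on the
    # model's Replicate page before running.
    with open("song.mp3", "rb") as audio_file:
        output = replicate.run(
            "music-and-culture-technology-lab/omnizart",
            input={
                "audio": audio_file,  # MP3 or WAV, per the input schema above
                "mode": "chord",      # or music-piano, drum, vocal, vocal-contour, beat
            },
        )

Each element of output should then carry the file and text fields described above.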

Capabilities

Omnizart can transcribe a wide range of musical elements, including pitched instruments, vocal melody, chords, drum events, and beat. This allows users to extract structured musical information from audio recordings, enabling applications such as music analysis, music information retrieval, and computer-assisted music composition.

What can I use it for?

With Omnizart, you can transcribe your favorite songs and explore the underlying musical structure. The transcriptions can be used for various purposes (a short post-processing sketch follows this list), such as:

  • Music analysis: Analyze the harmonic progressions, rhythmic patterns, and melodic lines of a piece of music.
  • Music information retrieval: Extract relevant metadata from audio recordings, such as chord changes, drum patterns, and melody, to enable more sophisticated music search and recommendations.
  • Computer-assisted music composition: Use the transcribed musical elements as a starting point for creating new compositions or arrangements.
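Since each result object carries the source path and the transcription text (per the output schema above), one simple way to organize results for any of these workflows is to write each transcription next to its audio file. This sketch continues from the output value returned in the earlier example:

    from pathlib import Path

    # "output" is the array of {"file", "text"} objects described under Outputs.
    for item in output:
        audio_path = Path(item["file"])
        # Save the raw transcription text beside the audio for later analysis,
        # e.g. inspecting chord changes or the detected beat positions.
        text_path = audio_path.with_suffix(".transcription.txt")
        text_path.write_text(item["text"])
        print(f"{audio_path.name}: wrote {len(item['text'])} characters")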

Things to try

Try using Omnizart to transcribe different genres of music and explore the nuances in how it handles various musical elements. You can also experiment with the different transcription modes to see how the results vary and gain insights into the strengths and limitations of the model.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

piano-transcription

bytedance

Total Score

4

The piano-transcription model is a high-resolution piano transcription system developed by ByteDance that can detect piano notes from audio. It is a powerful tool for converting piano recordings into MIDI files, enabling efficient storage and manipulation of musical performances. This model can be compared to similar music AI models like cantable-diffuguesion for generating and harmonizing Bach chorales, stable-diffusion for generating photorealistic images from text, musicgen-fine-tuner for fine-tuning music generation models, and whisperx for accelerated audio transcription.

Model inputs and outputs

The piano-transcription model takes an audio file as input and outputs a MIDI file representing the transcribed piano performance. The model can detect piano notes, their onsets, offsets, and velocities with high accuracy, enabling detailed, high-resolution transcription.

Inputs

  • audio_input: The input audio file to be transcribed.

Outputs

  • Output: The transcribed MIDI file representing the piano performance.

Capabilities

The piano-transcription model is capable of accurately detecting and transcribing piano performances, even for complex, virtuosic pieces. It can capture nuanced details like pedal use, note velocity, and precise onset and offset times. This makes it a valuable tool for musicians, composers, and music enthusiasts who want to digitize and analyze piano recordings.

What can I use it for?

The piano-transcription model can be used for a variety of applications, such as converting legacy analog recordings into digital MIDI files, creating sheet music from live performances, and building large-scale classical piano MIDI datasets like the GiantMIDI-Piano dataset developed by the model's creators. This can enable further research and development in areas like music information retrieval, score-informed source separation, and music generation.

Things to try

Experiment with the piano-transcription model by transcribing a variety of piano performances, from classical masterpieces to modern pop songs. Observe how the model handles different styles, dynamics, and pedal use. You can also try combining the transcribed MIDI files with other music AI tools, such as musicgen, to create new and innovative music compositions.
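Besides the hosted endpoint described above, the model's authors also publish an open-source piano_transcription_inference package. A minimal local-transcription sketch, assuming that package's documented interface, could look like this:

    from piano_transcription_inference import PianoTranscription, sample_rate, load_audio

    # Load the recording at the model's expected sample rate.
    audio, _ = load_audio("performance.mp3", sr=sample_rate, mono=True)

    # Build the transcriptor (use device="cpu" if no GPU is available)
    # and write the detected notes and pedal events to a MIDI file.
    transcriptor = PianoTranscription(device="cuda")
    transcriptor.transcribe(audio, "performance.mid")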

music-classifiers

mtg

Total Score

9

The music-classifiers model is a set of transfer learning models developed by MTG for classifying music by genres, moods, and instrumentation. It is part of the Essentia replicate demos hosted on Replicate.ai's MTG site. The model is built on top of the open-source Essentia library and provides pre-trained models for a variety of music classification tasks. This model can be seen as complementary to other music-related models like music-approachability-engagement, music-arousal-valence, and the MusicGen and MusicGen-melody models from Meta/Facebook, which focus on music generation and analysis.

Model inputs and outputs

Inputs

  • Url: YouTube URL to process (overrides audio input).
  • Audio: Audio file to process.
  • model_type: Model type (embeddings).

Outputs

  • Output: URI representing the processed audio.

Capabilities

The music-classifiers model can be used to classify music along several dimensions, including genre, mood, and instrumentation. This can be useful for applications like music recommendation, audio analysis, and music information retrieval. The model provides pre-trained embeddings that can be used as features for a variety of downstream music-related tasks.

What can I use it for?

Researchers and developers working on music-related applications can use the music-classifiers model to add music classification capabilities to their projects. For example, the model could be integrated into a music recommendation system to suggest songs based on genre or mood, or used in an audio analysis tool to automatically extract information about the instrumentation in a piece of music. The model's pre-trained embeddings can also be used as input features for custom machine learning models, enabling more sophisticated music analysis and classification tasks.

Things to try

One interesting thing to try with the music-classifiers model is to experiment with different music genres, moods, and instrumentation combinations to see how the model classifies and represents them. You could also try using the model's embeddings as input features for your own custom music classification models, and see how they perform compared to other approaches. Additionally, you could explore ways to combine the music-classifiers model with other music-related models, such as the MusicGen and MusicGen-melody models, to create more comprehensive music analysis and generation pipelines.
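As a rough sketch of the input schema listed above, a call through the Replicate Python client might look like the following. The model reference and the exact field names are assumptions; check MTG's Replicate page for the real values.

    import replicate

    # Assumed model reference; verify the slug and version on Replicate.
    output = replicate.run(
        "mtg/music-classifiers",
        input={
            "url": "https://www.youtube.com/watch?v=<video-id>",  # overrides the audio input
            "model_type": "embeddings",
        },
    )
    print(output)  # URI pointing at the processed result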

mustango

declare-lab

Total Score

289

Mustango is an exciting addition to the world of Multimodal Large Language Models designed for controlled music generation. Developed by the declare-lab team, Mustango leverages a Latent Diffusion Model (LDM), Flan-T5, and musical features to generate music from text prompts. It builds upon the work of similar models like MusicGen and MusicGen Remixer, but with a focus on more fine-grained control and improved overall music quality.

Model inputs and outputs

Mustango takes in a text prompt describing the desired music and generates an audio file in response. The model can be used to create a wide range of musical styles, from ambient to pop, by crafting the right prompts.

Inputs

  • Prompt: A text description of the desired music, including details about the instrumentation, genre, tempo, and mood.

Outputs

  • Audio file: A generated audio file containing the music based on the input prompt.

Capabilities

Mustango demonstrates impressive capabilities in generating music that closely matches the provided text prompt. The model is able to capture details like instrumentation, rhythm, and mood, and translate them into coherent musical compositions. Compared to earlier text-to-music models, Mustango shows significant improvements in terms of overall musical quality and coherence.

What can I use it for?

Mustango opens up a world of possibilities for content creators, musicians, and hobbyists alike. The model can be used to generate custom background music for videos, podcasts, or video games. Composers could leverage Mustango to quickly prototype musical ideas or explore new creative directions. Advertisers and marketers may find the model useful for generating jingles or soundtracks for their campaigns.

Things to try

One interesting aspect of Mustango is its ability to generate music in a variety of styles based on the input prompt. Try experimenting with different genres, moods, and levels of detail in your prompts to see the diverse range of musical compositions the model can produce. Additionally, the team has released several pre-trained models, including a Mustango Pretrained version, which may be worth exploring for specific use cases.
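A minimal prompt-to-audio sketch with the Replicate Python client might look like the following. The model reference is assumed from the maintainer name; verify the exact slug and version on Replicate.

    import replicate

    # Assumed model reference; verify on declare-lab's Replicate page.
    audio_url = replicate.run(
        "declare-lab/mustango",
        input={
            "prompt": (
                "A mellow lo-fi track around 80 BPM with soft electric piano, "
                "warm bass, and light vinyl crackle, suitable for studying."
            ),
        },
    )
    print(audio_url)  # points at the generated audio file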

musicgen

meta

Total Score

2.0K

musicgen is a simple and controllable model for music generation developed by Meta. Unlike existing methods like MusicLM, musicgen doesn't require a self-supervised semantic representation and generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, the authors show they can predict them in parallel, thus having only 50 auto-regressive steps per second of audio. musicgen was trained on 20K hours of licensed music, including an internal dataset of 10K high-quality music tracks and music data from ShutterStock and Pond5.

Model inputs and outputs

musicgen takes in a text prompt or melody and generates corresponding music. The model's inputs include a description of the desired music, an optional input audio file to influence the generated output, and various parameters to control the generation process like temperature, top-k, and top-p sampling. The output is a generated audio file in WAV format.

Inputs

  • Prompt: A description of the music you want to generate.
  • Input Audio: An optional audio file that will influence the generated music. If "continuation" is set to true, the generated music will be a continuation of the input audio. Otherwise, it will mimic the input audio's melody.
  • Duration: The duration of the generated audio in seconds.
  • Continuation Start/End: The start and end times of the input audio to use for continuation.
  • Various generation parameters: Settings like temperature, top-k, top-p, etc. to control the diversity and quality of the generated output.

Outputs

  • Generated Audio: A WAV file containing the generated music.

Capabilities

musicgen can generate a wide variety of music styles and genres based on the provided text prompt. For example, you could ask it to generate "tense, staccato strings with plucked dissonant strings, like a scary movie soundtrack" and it would produce corresponding music. The model can also continue or mimic the melody of an input audio file, allowing for more coherent and controlled music generation.

What can I use it for?

musicgen could be used for a variety of applications, such as:

  • Background music generation: Automatically generating custom music for videos, games, or other multimedia projects.
  • Music composition assistance: Helping musicians and composers come up with new musical ideas or sketches to build upon.
  • Audio creation for content creators: Allowing YouTubers, podcasters, and other content creators to easily add custom music to their projects.

Things to try

One interesting aspect of musicgen is its ability to generate music in parallel by predicting the different codebook components separately. This allows for faster generation compared to previous autoregressive music models. You could try experimenting with different generation parameters to find the right balance between generation speed, diversity, and quality for your use case. Additionally, the model's ability to continue or mimic input audio opens up possibilities for interactive music creation workflows, where users could iterate on an initial seed melody or prompt to refine the generated output.
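The "small delay between the codebooks" mentioned above can be illustrated with a toy sketch: if codebook k is shifted by k frames, a T-frame clip needs only T + 3 decoding steps instead of roughly 4*T, which is how the model stays at about 50 auto-regressive steps per second of audio at its 50 Hz frame rate. The token labels below are purely illustrative.

    # Toy illustration of the codebook delay pattern: at decoding step t the model
    # emits frame t of codebook 0, frame t-1 of codebook 1, and so on, so all four
    # codebooks are predicted in the same pass.
    PAD = "."              # placeholder where a codebook has no token yet
    num_codebooks = 4
    T = 6                  # frames in this toy clip (the real model uses 50 frames/sec)

    # tokens[k][t] = label of codebook k at frame t
    tokens = [[f"c{k}t{t}" for t in range(T)] for k in range(num_codebooks)]

    steps = T + num_codebooks - 1   # 6 + 3 = 9 steps, versus 4 * 6 = 24 if flattened
    for step in range(steps):
        emitted = []
        for k in range(num_codebooks):
            t = step - k            # codebook k lags k frames behind codebook 0
            emitted.append(tokens[k][t] if 0 <= t < T else PAD)
        print(f"step {step}: {emitted}")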
