audiosep

Maintainer: cjwbw

Total Score: 2

Last updated 9/17/2024
  • Run this model: Run on Replicate
  • API spec: View on Replicate
  • Github link: View on Github
  • Paper link: View on Arxiv


Model overview

audiosep is a foundation model for open-domain sound separation with natural language queries, maintained on Replicate by cjwbw. It demonstrates strong separation performance and impressive zero-shot generalization on tasks such as audio event separation, musical instrument separation, and speech enhancement. Other models from the same maintainer, such as video-retalking, openvoice, voicecraft, whisper-diarization, and depth-anything, cover related audio, speech, and vision tasks.

Model inputs and outputs

audiosep takes an audio file and a textual description as inputs, and outputs the separated audio based on the provided description. The model processes audio at a 32 kHz sampling rate.

Inputs

  • Audio File: The input audio file to be separated.
  • Text: The textual description of the audio content to be separated.

Outputs

  • Separated Audio: The output audio file with the requested components separated.
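
For programmatic use, audiosep can be called through the Replicate API. The snippet below is a minimal sketch using the official Replicate Python client; the input field names (audio_file, text) are assumptions inferred from the inputs listed above, so confirm them against the API spec linked at the top of this page.

```python
# Minimal sketch of querying audiosep via the Replicate Python client.
# Assumed input names: "audio_file" and "text" -- verify against the API spec.
import replicate

output = replicate.run(
    "cjwbw/audiosep",
    input={
        "audio_file": open("mixture.wav", "rb"),  # mixture to separate (32 kHz)
        "text": "a dog barking",                  # natural-language query
    },
)
print(output)  # typically a URL or file object for the separated audio
```

Changing only the text query (for example, "acoustic guitar" instead of "a dog barking") re-runs the same mixture against a different separation target.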

Capabilities

audiosep can separate a wide range of audio content, from musical instruments to speech and environmental sounds, based on natural language descriptions. It demonstrates impressive zero-shot generalization, allowing users to separate audio in novel ways beyond the training data.

What can I use it for?

You can use audiosep for a variety of audio processing tasks, such as music production, audio editing, speech enhancement, and audio analytics. The model's ability to separate audio based on natural language descriptions allows for highly customizable and flexible audio manipulation. For example, you could use audiosep to isolate specific instruments in a music recording, remove background noise from a speech recording, or extract environmental sounds from a complex audio scene.

Things to try

Try using audiosep to separate audio in novel ways, such as isolating a specific sound effect from a movie soundtrack, extracting individual vocals from a choir recording, or separating a specific bird call from a nature recording. The model's flexibility and zero-shot capabilities allow for a wide range of creative and practical applications.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models


demucs

Maintainer: cjwbw

Total Score: 130

demucs is a state-of-the-art music source separation model developed by researchers at Facebook AI Research. It is capable of separating drums, bass, vocals, and other accompaniment from audio tracks. The latest version, Hybrid Transformer Demucs (v4), uses a hybrid spectrogram and waveform architecture with a Transformer encoder-decoder to achieve high-quality separation performance. This builds on the previous Hybrid Demucs (v3) model, which won the Sony MDX challenge. demucs is similar to other advanced source separation models like Wave-U-Net, Open-Unmix, and D3Net, but achieves state-of-the-art results on standard benchmarks.

Model inputs and outputs

demucs takes as input an audio file in a variety of formats including WAV, MP3, FLAC, and more. It outputs the separated audio stems for drums, bass, vocals, and other accompaniment as individual stereo WAV or MP3 files. Users can also choose to output just the vocals or other specific stems.

Inputs

  • audio: The input audio file to be separated
  • stem: The specific stem to separate (e.g. vocals, drums, bass) or "no_stem" to separate all stems
  • model_name: The pre-trained model to use for separation, such as htdemucs, htdemucs_ft, or mdx_extra
  • shifts: The number of random shifts to use for equivariant stabilization, which improves quality but increases inference time
  • overlap: The amount of overlap between prediction windows
  • clip_mode: The strategy for avoiding clipping in the output, either "rescale" or "clamp"
  • float32: Whether to output the audio as 32-bit float instead of 16-bit integer
  • mp3_bitrate: The bitrate to use when outputting the audio as MP3

Outputs

  • drums.wav: The separated drums stem
  • bass.wav: The separated bass stem
  • vocals.wav: The separated vocals stem
  • other.wav: The separated other/accompaniment stem
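
Like audiosep, demucs can be invoked through the Replicate Python client. The call below is a hypothetical sketch that reuses the input names listed above; the values are illustrative rather than defaults, and the exact schema should be checked against the model's API spec on Replicate.

```python
# Hypothetical demucs call via the Replicate Python client, reusing the input
# names documented above; parameter values are illustrative, not defaults.
import replicate

stems = replicate.run(
    "cjwbw/demucs",
    input={
        "audio": open("song.mp3", "rb"),
        "stem": "vocals",             # or "no_stem" to return all four stems
        "model_name": "htdemucs_ft",  # fine-tuned Hybrid Transformer Demucs
        "shifts": 1,                  # more shifts improve quality but add time
        "overlap": 0.25,
        "clip_mode": "rescale",
        "float32": False,
        "mp3_bitrate": 320,
    },
)
print(stems)  # URLs (or file objects) for the separated stems
```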
Capabilities

demucs is a highly capable music source separation model that can extract individual instrument and vocal tracks from complex audio mixes with high accuracy. It outperforms many previous state-of-the-art models on standard benchmarks like the MUSDB18 dataset. The latest Hybrid Transformer Demucs (v4) model achieves 9.0 dB SDR, which is a significant improvement over earlier versions and other leading approaches.

What can I use it for?

demucs can be used for a variety of music production and audio engineering tasks. It enables users to isolate individual elements of a song, which is useful for tasks like:

  • Karaoke or music removal: extracting just the vocals to create a karaoke track
  • Remixing or mash-ups: separating the drums, bass, and other elements to remix a song
  • Audio post-production: cleaning up or enhancing specific elements of a mix
  • Music education: isolating instrument tracks for practicing or study
  • Music information retrieval: analyzing the individual components of a song

The model's state-of-the-art performance and flexible interface make it a powerful tool for both professionals and hobbyists working with audio.

Things to try

Some interesting things to try with demucs include:

  • Experimenting with the different pre-trained models to find the best fit for your audio
  • Trying the "two-stems" mode to extract just the vocals or another specific element
  • Utilizing the "shifts" option to improve separation quality, especially for complex mixes
  • Applying the model to a diverse range of musical genres and styles to see how it performs

The maintainer, cjwbw, has also released several other impressive audio models like audiosep, video-retalking, and voicecraft that may be of interest to explore further.


openvoice

Maintainer: cjwbw

Total Score: 9

The openvoice model, developed by the team at MyShell, is a versatile instant voice cloning AI that can accurately clone the tone color and generate speech in multiple languages and accents. It offers flexible control over voice styles, such as emotion and accent, as well as other style parameters like rhythm, pauses, and intonation. The model also supports zero-shot cross-lingual voice cloning, allowing it to generate speech in languages not present in the training dataset. The openvoice model builds upon several excellent open-source projects, including TTS, VITS, and VITS2. It has been powering the instant voice cloning capability of myshell.ai since May 2023 and has been used tens of millions of times by users worldwide, witnessing explosive growth on the platform.

Model inputs and outputs

Inputs

  • Audio: The reference audio used to clone the tone color.
  • Text: The text to be spoken by the cloned voice.
  • Speed: The speed scale of the output audio.
  • Language: The language of the audio to be generated.

Outputs

  • Output: The generated audio in the cloned voice.

Capabilities

The openvoice model excels at accurate tone color cloning, flexible voice style control, and zero-shot cross-lingual voice cloning. It can generate speech in multiple languages and accents, while allowing for granular control over voice styles, including emotion and accent, as well as other parameters like rhythm, pauses, and intonation.

What can I use it for?

The openvoice model can be used for a variety of applications, such as:

  • Instant voice cloning for audio, video, or gaming content
  • Customized text-to-speech for assistants, chatbots, or audiobooks
  • Multilingual voice acting and dubbing
  • Voice conversion and style transfer

Things to try

With the openvoice model, you can experiment with different input reference audios to clone a wide range of voices and accents. You can also play with the style parameters to create unique and expressive speech outputs. Additionally, you can explore the model's cross-lingual capabilities by generating speech in languages not present in the training data.
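
As with the models above, openvoice can be driven through the Replicate Python client. The sketch below reuses the input names listed earlier; the values, including the language code, are placeholders and should be checked against the model's API spec.

```python
# Sketch of an openvoice voice-cloning call via the Replicate Python client.
# Input names follow the list above; values are placeholders, and the exact
# language codes accepted by the model should be checked in its API spec.
import replicate

output = replicate.run(
    "cjwbw/openvoice",
    input={
        "audio": open("reference_voice.mp3", "rb"),  # tone-color reference
        "text": "Hello, this is a cloned voice speaking.",
        "speed": 1.0,
        "language": "EN",
    },
)
print(output)  # URL of the generated speech in the cloned voice
```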


video-retalking

Maintainer: cjwbw

Total Score: 65

video-retalking is a system developed by researchers at Tencent AI Lab and Xidian University that enables audio-based lip synchronization and expression editing for talking head videos. It builds on prior work like Wav2Lip, PIRenderer, and GFP-GAN to create a pipeline for generating high-quality, lip-synced videos from talking head footage and audio. Unlike models like voicecraft, which focus on speech editing, or tokenflow, which aims for consistent video editing, video-retalking is specifically designed for synchronizing lip movements with audio.

Model inputs and outputs

video-retalking takes two main inputs: a talking head video and an audio file. The model then generates a new video with the facial expressions and lip movements synchronized to the provided audio. This allows users to edit the appearance and emotion of a talking head video while preserving the original audio.

Inputs

  • Face: Input video file of a talking head.
  • Input Audio: Input audio file to synchronize with the video.

Outputs

  • Output: The generated video with synchronized lip movements and expressions.

Capabilities

video-retalking can generate high-quality, lip-synced videos even in the wild, meaning it can handle real-world footage without the need for extensive pre-processing or manual alignment. The model disentangles the task into three key steps: generating a canonical face expression, synchronizing the lip movements to the audio, and enhancing the photo-realism of the final output.

What can I use it for?

video-retalking can be a powerful tool for content creators, video editors, and anyone looking to edit or enhance talking head videos. Its ability to preserve the original audio while modifying the visual elements opens up possibilities for a wide range of applications, such as:

  • Dubbing or re-voicing videos in different languages
  • Adjusting the emotion or expression of a speaker
  • Repairing or improving the lip sync in existing footage
  • Creating animated avatars or virtual presenters

Things to try

One interesting aspect of video-retalking is its ability to control the expression of the upper face using pre-defined templates like "smile" or "surprise". This allows for more nuanced expression editing beyond just lip sync. Additionally, the model's sequential pipeline means each step can be examined and potentially fine-tuned for specific use cases.


sadtalker

Maintainer: cjwbw

Total Score: 99

sadtalker is an AI model developed by researchers at Tencent AI Lab and Xi'an Jiaotong University that enables stylized audio-driven single image talking face animation. It extends the popular video-retalking model, which focuses on audio-based lip synchronization for talking head video editing. sadtalker takes this a step further by generating a 3D talking head animation from a single portrait image and an audio clip.

Model inputs and outputs

sadtalker takes two main inputs: a source image (which can be a still image or a short video) and an audio clip. The model then generates a talking head video that animates the person in the source image to match the audio. This can be used to create expressive, stylized talking head videos from just a single photo.

Inputs

  • Source Image: The portrait image or short video that will be animated
  • Driven Audio: The audio clip that will drive the facial animation

Outputs

  • Talking Head Video: An animated video of the person in the source image speaking in sync with the driven audio

Capabilities

sadtalker is capable of generating realistic 3D facial animations from a single portrait image and an audio clip. The animations capture natural head pose, eye blinks, and lip sync, resulting in a stylized talking head video. The model can handle a variety of facial expressions and is able to preserve the identity of the person in the source image.

What can I use it for?

sadtalker can be used to create custom talking head videos for a variety of applications, such as:

  • Generating animated content for games, films, or virtual avatars
  • Creating personalized videos for marketing, education, or entertainment
  • Dubbing or re-voicing existing videos with new audio
  • Animating portraits or headshots to add movement and expression

The model's ability to work from a single image input makes it particularly useful for quickly creating talking head content without the need for complex 3D modeling or animation workflows.

Things to try

Some interesting things to experiment with using sadtalker include:

  • Trying different source images, from portraits to more stylized or cartoon-like illustrations, to see how the model handles various artistic styles
  • Combining sadtalker with other AI models like stable-diffusion to generate entirely new talking head characters
  • Exploring the model's capabilities with different types of audio, such as singing, accents, or emotional speech
  • Integrating sadtalker into larger video or animation pipelines to streamline content creation

The versatility and ease of use of sadtalker make it a powerful tool for anyone looking to create expressive, personalized talking head videos.
