spleeter

Maintainer: soykertje

Total Score: 121

Last updated 10/2/2024


  • Run this model: Run on Replicate
  • API spec: View on Replicate
  • Github link: View on Github
  • Paper link: View on Arxiv


Model overview

spleeter is a source separation library developed by Deezer that can split audio into individual instrument or vocal tracks. It uses a deep learning model trained on a large dataset to isolate different components of a song, such as vocals, drums, bass, and other instruments. This can be useful for tasks like music production, remixing, and audio analysis. Compared to similar models like whisper, speaker-diarization-3.0, and audiosep, spleeter is specifically focused on separating musical sources rather than speech or general audio.

Model inputs and outputs

The spleeter model takes an audio file as input and outputs individual tracks for the different components it has detected. The model is flexible and can separate the audio into 2, 4, or 5 stems, depending on the user's needs: the 2-stem configuration splits vocals from accompaniment, the 4-stem configuration produces vocals, drums, bass, and other, and the 5-stem configuration adds a dedicated piano track. A minimal usage sketch follows the input and output lists below.

Inputs

  • Audio: An audio file in a supported format (e.g. WAV, MP3, FLAC)

Outputs

  • Separated audio tracks: The input audio separated into individual instrument or vocal tracks, such as:
    • Vocals
    • Drums
    • Bass
    • Other instruments
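
For readers who want to run the separation themselves, here is a minimal sketch using the open-source spleeter Python package (the same engine behind the hosted model); the file paths are placeholders.

```python
# Minimal sketch: splitting a track into vocals and accompaniment with the
# open-source spleeter package (pip install spleeter). Paths are placeholders.
from spleeter.separator import Separator

# "spleeter:2stems" yields vocals + accompaniment; "spleeter:4stems" and
# "spleeter:5stems" produce progressively finer splits.
separator = Separator("spleeter:2stems")

# Writes vocals.wav and accompaniment.wav into output/song/
separator.separate_to_file("song.mp3", "output/")
```

The same separation is available from the command line with something like spleeter separate -p spleeter:2stems -o output song.mp3.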

Capabilities

spleeter can effectively isolate the different elements of a complex musical mix, allowing users to manipulate and process the individual components. This can be particularly useful for music producers, sound engineers, and audio enthusiasts who want to access the individual parts of a song for tasks like remixing, sound design, and audio analysis.

What can I use it for?

The spleeter model can be used in a variety of music-related applications, such as:

  • Music production: Isolate individual instruments or vocals to edit, process, or remix a song.
  • Karaoke and backing tracks: Extract the vocal stem from a song to create karaoke tracks or backing instrumentals.
  • Audio analysis: Separate the different components of a song to study their individual characteristics or behavior.
  • Sound design: Use the isolated instrument tracks to create new sound effects or samples.

Things to try

One interesting thing to try with spleeter is to experiment with the different output configurations (2, 4, or 5 stems) to see how the separation quality and level of detail varies. You can also try applying various audio processing techniques to the isolated tracks, such as EQ, compression, or reverb, to create unique sound effects or explore new creative possibilities.
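
As a starting point, the sketch below (again using the open-source spleeter package, with placeholder paths) runs all three configurations over the same track so the resulting stems can be compared side by side.

```python
# Minimal sketch: running the 2-, 4-, and 5-stem configurations on the same
# track so their outputs can be compared. Paths are placeholders.
from spleeter.separator import Separator

configs = ["spleeter:2stems", "spleeter:4stems", "spleeter:5stems"]

for config in configs:
    separator = Separator(config)
    # Each configuration writes its stems into its own subdirectory
    out_dir = f"output/{config.split(':')[1]}"
    separator.separate_to_file("song.mp3", out_dir)
```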



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models


whisper

Maintainer: soykertje

Total Score: 12

Whisper is a state-of-the-art speech recognition model developed by OpenAI. It is capable of transcribing audio into text with high accuracy, making it a valuable tool for a variety of applications. The model is implemented as a Cog model by the maintainer soykertje, allowing it to be easily integrated into various projects. Similar models like Whisper, Whisper Diarization, Whisper Large v3, WhisperSpeech Small, and WhisperX Spanish offer different variations and capabilities, catering to diverse speech recognition needs.

Model inputs and outputs

The Whisper model takes an audio file as input and generates a text transcription of the speech. The model also supports additional options, such as language specification, translation, and adjusting parameters like temperature and patience for the decoding process.

Inputs

  • Audio: The audio file to be transcribed
  • Model: The specific Whisper model to use
  • Language: The language spoken in the audio
  • Translate: Whether to translate the text to English
  • Transcription: The format for the transcription (e.g., plain text)
  • Temperature: The temperature to use for sampling
  • Patience: The patience value to use in beam decoding
  • Suppress Tokens: A comma-separated list of token IDs to suppress during sampling
  • Word Timestamps: Whether to include word-level timestamps in the transcription
  • Logprob Threshold: The threshold for the average log probability to consider the decoding as successful
  • No Speech Threshold: The threshold for the probability of the token to consider the segment as silence
  • Condition On Previous Text: Whether to provide the previous output as a prompt for the next window
  • Compression Ratio Threshold: The threshold for the gzip compression ratio to consider the decoding as successful
  • Temperature Increment On Fallback: The temperature increase when falling back due to the above thresholds

Outputs

  • The transcribed text, with optional formatting and additional information such as word-level timestamps

Capabilities

Whisper is a powerful speech recognition model that can accurately transcribe a wide range of audio content, including interviews, lectures, and spontaneous conversations. The model's ability to handle various accents, background noise, and speaker variations makes it a versatile tool for a variety of applications.

What can I use it for?

The Whisper model can be utilized in a range of applications, such as:

  • Automated transcription of audio recordings for content creators, journalists, or researchers
  • Real-time captioning for video conferencing or live events
  • Voice-to-text conversion for accessibility purposes or hands-free interaction
  • Language translation services, where the transcribed text can be further translated
  • Developing voice-controlled interfaces or intelligent assistants

Things to try

Experimenting with the various input parameters of the Whisper model can help fine-tune the transcription quality for specific use cases. For example, adjusting the temperature and patience values can influence the model's sampling behavior, leading to more fluent or more conservative transcriptions. Additionally, leveraging the word-level timestamps can enable synchronized subtitles or captions in multimedia applications.
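
To see how these inputs map onto an actual call, here is a minimal sketch using the Replicate Python client. The input field names are taken from the list above but are assumptions about the deployed model's schema, and a pinned version hash may be required.

```python
# Minimal sketch: transcribing a file with the soykertje/whisper Cog model
# via the Replicate Python client. Input field names are assumed from the
# listing above and may differ from the deployed model's actual schema.
import replicate

output = replicate.run(
    "soykertje/whisper",           # a pinned version hash may be required
    input={
        "audio": open("interview.mp3", "rb"),
        "language": "en",          # or omit to let the model detect it
        "translate": False,
        "temperature": 0.0,
        "word_timestamps": True,
    },
)
print(output)
```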


audiosep

Maintainer: cjwbw

Total Score: 2

audiosep is a foundation model for open-domain sound separation with natural language queries, developed by cjwbw. It demonstrates strong separation performance and impressive zero-shot generalization ability on numerous tasks such as audio event separation, musical instrument separation, and speech enhancement. audiosep can be compared to similar models like video-retalking, openvoice, voicecraft, whisper-diarization, and depth-anything from the same maintainer, which also focus on audio and video processing tasks.

Model inputs and outputs

audiosep takes an audio file and a textual description as inputs, and outputs the separated audio based on the provided description. The model processes audio at a 32 kHz sampling rate.

Inputs

  • Audio File: The input audio file to be separated
  • Text: The textual description of the audio content to be separated

Outputs

  • Separated Audio: The output audio file with the requested components separated

Capabilities

audiosep can separate a wide range of audio content, from musical instruments to speech and environmental sounds, based on natural language descriptions. It demonstrates impressive zero-shot generalization, allowing users to separate audio in novel ways beyond the training data.

What can I use it for?

You can use audiosep for a variety of audio processing tasks, such as music production, audio editing, speech enhancement, and audio analytics. The model's ability to separate audio based on natural language descriptions allows for highly customizable and flexible audio manipulation. For example, you could use audiosep to isolate specific instruments in a music recording, remove background noise from a speech recording, or extract environmental sounds from a complex audio scene.

Things to try

Try using audiosep to separate audio in novel ways, such as isolating a specific sound effect from a movie soundtrack, extracting individual vocals from a choir recording, or separating a specific bird call from a nature recording. The model's flexibility and zero-shot capabilities allow for a wide range of creative and practical applications.
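
A text-queried separation call might look like the following sketch with the Replicate Python client; the field names audio_file and text are assumptions based on the inputs listed above, so check the model's API spec for the actual schema.

```python
# Minimal sketch: text-queried separation with cjwbw/audiosep on Replicate.
# The input field names ("audio_file", "text") are assumptions based on the
# inputs listed above; check the model's API spec for the real schema.
import replicate

separated = replicate.run(
    "cjwbw/audiosep",              # a pinned version hash may be required
    input={
        "audio_file": open("street_scene.wav", "rb"),
        "text": "a dog barking",   # natural language description of the target sound
    },
)
print(separated)  # reference to the separated audio output
```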


sabuhi-model

Maintainer: sabuhigr

Total Score: 25

The sabuhi-model is an AI model developed by sabuhigr that builds upon the popular Whisper AI model. This model incorporates channel separation and speaker diarization, allowing it to transcribe audio with multiple speakers and distinguish between them. The sabuhi-model can be seen as an extension of similar Whisper-based models like whisper-large-v3, whisper-subtitles, and the original whisper model. It offers additional capabilities for handling multi-speaker audio, making it a useful tool for transcribing interviews, meetings, and other scenarios with multiple participants.

Model inputs and outputs

The sabuhi-model takes in an audio file, along with several optional parameters to customize the transcription process. These include the choice of Whisper model, a Hugging Face token for speaker diarization, language settings, and various decoding options.

Inputs

  • audio: The audio file to be transcribed
  • model: The Whisper model to use, with "large-v2" as the default
  • hf_token: Your Hugging Face token for speaker diarization
  • language: The language spoken in the audio (can be left as "None" for language detection)
  • translate: Whether to translate the transcription to English
  • temperature: The temperature to use for sampling
  • max_speakers: The maximum number of speakers to detect (default is 1)
  • min_speakers: The minimum number of speakers to detect (default is 1)
  • transcription: The format for the transcription (e.g., "plain text")
  • initial_prompt: Optional text to provide as a prompt for the first window
  • suppress_tokens: Comma-separated list of token IDs to suppress during sampling
  • logprob_threshold: The average log probability threshold for considering the decoding as failed
  • no_speech_threshold: The probability threshold for considering a segment as silence
  • condition_on_previous_text: Whether to provide the previous output as a prompt for the next window
  • compression_ratio_threshold: The gzip compression ratio threshold for considering the decoding as failed
  • temperature_increment_on_fallback: The temperature increment to use when falling back due to threshold issues

Outputs

  • The transcribed text, with speaker diarization and other formatting options as specified in the inputs

Capabilities

The sabuhi-model inherits the core speech recognition capabilities of the Whisper model, but also adds the ability to separate and identify multiple speakers within the audio. This makes it a useful tool for transcribing meetings, interviews, and other scenarios where multiple people are speaking.

What can I use it for?

The sabuhi-model can be used for a variety of applications that involve transcribing audio with multiple speakers, such as:

  • Generating transcripts for interviews, meetings, or conference calls
  • Creating subtitles or captions for videos with multiple speakers
  • Improving the accessibility of audio-based content by providing text-based alternatives
  • Enabling better search and indexing of audio-based content by generating transcripts

Companies working on voice assistants, video conferencing tools, or media production workflows may find the sabuhi-model particularly useful for their needs.

Things to try

One interesting aspect of the sabuhi-model is its ability to handle audio with multiple speakers and identify who is speaking at any given time. This could be particularly useful for analyzing the dynamics of a conversation, tracking who speaks the most, or identifying the main speakers in a meeting or interview. Additionally, the model's various decoding options, such as the ability to suppress certain tokens or adjust the temperature, provide opportunities to experiment and fine-tune the transcription output to better suit specific use cases or preferences.
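
As an illustration, a multi-speaker transcription request through the Replicate Python client might look like the sketch below; the field names follow the inputs listed above, but the deployed model's schema (and any required version hash) should be confirmed against the API spec.

```python
# Minimal sketch: multi-speaker transcription with sabuhigr/sabuhi-model on
# Replicate. Field names follow the inputs listed above; the deployed model
# may require a pinned version hash and could name its fields differently.
import replicate

transcript = replicate.run(
    "sabuhigr/sabuhi-model",
    input={
        "audio": open("panel_discussion.mp3", "rb"),
        "model": "large-v2",
        "hf_token": "<your Hugging Face token>",  # needed for speaker diarization
        "min_speakers": 2,
        "max_speakers": 4,
        "transcription": "plain text",
    },
)
print(transcript)
```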


whisper

Maintainer: openai

Total Score: 34.1K

Whisper is a general-purpose speech recognition model developed by OpenAI. It is capable of converting speech in audio to text, with the ability to translate the text to English if desired. Whisper is based on a large Transformer model trained on a diverse dataset of multilingual and multitask speech recognition data. This allows the model to handle a wide range of accents, background noises, and languages. Similar models like whisper-large-v3, incredibly-fast-whisper, and whisper-diarization offer various optimizations and additional features built on top of the core Whisper model.

Model inputs and outputs

Whisper takes an audio file as input and outputs a text transcription. The model can also translate the transcription to English if desired. The input audio can be in various formats, and the model supports a range of parameters to fine-tune the transcription, such as temperature, patience, and language.

Inputs

  • Audio: The audio file to be transcribed
  • Model: The specific version of the Whisper model to use, currently only large-v3 is supported
  • Language: The language spoken in the audio, or None to perform language detection
  • Translate: A boolean flag to translate the transcription to English
  • Transcription: The format for the transcription output, such as "plain text"
  • Initial Prompt: An optional initial text prompt to provide to the model
  • Suppress Tokens: A list of token IDs to suppress during sampling
  • Logprob Threshold: The minimum average log probability threshold for a successful transcription
  • No Speech Threshold: The threshold for considering a segment as silence
  • Condition on Previous Text: Whether to provide the previous output as a prompt for the next window
  • Compression Ratio Threshold: The maximum compression ratio threshold for a successful transcription
  • Temperature Increment on Fallback: The temperature increase when the decoding fails to meet the specified thresholds

Outputs

  • Transcription: The text transcription of the input audio
  • Language: The detected language of the audio (if language input is None)
  • Tokens: The token IDs corresponding to the transcription
  • Timestamp: The start and end timestamps for each word in the transcription
  • Confidence: The confidence score for each word in the transcription

Capabilities

Whisper is a powerful speech recognition model that can handle a wide range of accents, background noises, and languages. The model is capable of accurately transcribing audio and optionally translating the transcription to English. This makes Whisper useful for a variety of applications, such as real-time captioning, meeting transcription, and audio-to-text conversion.

What can I use it for?

Whisper can be used in various applications that require speech-to-text conversion, such as:

  • Captioning and Subtitling: Automatically generate captions or subtitles for videos, improving accessibility for viewers.
  • Meeting Transcription: Transcribe audio recordings of meetings, interviews, or conferences for easy review and sharing.
  • Podcast Transcription: Convert audio podcasts to text, making the content more searchable and accessible.
  • Language Translation: Transcribe audio in one language and translate the text to another, enabling cross-language communication.
  • Voice Interfaces: Integrate Whisper into voice-controlled applications, such as virtual assistants or smart home devices.

Things to try

One interesting aspect of Whisper is its ability to handle a wide range of languages and accents. You can experiment with the model's performance on audio samples in different languages or with various background noises to see how it handles different real-world scenarios. Additionally, you can explore the impact of the different input parameters, such as temperature, patience, and language detection, on the transcription quality and accuracy.
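
For example, a transcribe-and-translate call through the Replicate Python client might look like the following sketch; the field names mirror the inputs listed above, but the exact schema and output format should be confirmed against the model's API spec.

```python
# Minimal sketch: transcribing non-English speech and translating it to
# English with openai/whisper on Replicate. Field names mirror the inputs
# listed above and may differ from the deployed model's actual schema.
import replicate

result = replicate.run(
    "openai/whisper",
    input={
        "audio": open("spanish_podcast.mp3", "rb"),
        "model": "large-v3",
        "translate": True,          # return the English translation
        "transcription": "plain text",
        # omit "language" to let the model detect the spoken language
    },
)
print(result)
```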
