salmonn

Maintainer: nateraw

Total Score

3

Last updated 7/2/2024
  • Run this model: Run on Replicate
  • API spec: View on Replicate
  • Github link: View on Github
  • Paper link: View on Arxiv

Model overview

SALMONN is a large language model (LLM) developed by the Department of Electronic Engineering at Tsinghua University and ByteDance. Unlike traditional speech-only or audio-event-only models, SALMONN can perceive and understand a variety of audio inputs, including speech, audio events, and music. This multi-modal capability allows SALMONN to perform tasks like multilingual speech recognition, translation, and audio-speech co-reasoning, making it a step towards hearing-enabled artificial general intelligence.

SALMONN builds on Whisper, a general-purpose speech recognition model, which it uses as its speech encoder, and is comparable to dedicated systems like Parakeet RNNT, a high-accuracy and efficient speech-to-text conversion system. However, SALMONN extends these capabilities by fusing speech, audio, and language processing into a single, versatile model.

Model inputs and outputs

Inputs

  • wav_path: The path to an audio file up to 30 seconds long.
  • prompt: A text prompt related to the audio file.

Outputs

  • Text response: The model's response to the given audio file and prompt.
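To make the interface concrete, here is a minimal sketch of a call using the Replicate Python client. The model identifier and the audio URL are placeholders rather than values taken from the model page, so copy the exact versioned model string (and point wav_path at your own clip) before running.

```python
import replicate

# Placeholder model identifier -- copy the exact "owner/model:version" string
# from the SALMONN page on Replicate.
output = replicate.run(
    "nateraw/salmonn",
    input={
        "wav_path": "https://example.com/sample_clip.wav",  # audio up to ~30 seconds
        "prompt": "Please transcribe the speech in this recording.",
    },
)
print(output)  # the model's text response
```

The same two fields drive every task described below; only the prompt changes.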

Capabilities

SALMONN can perform a wide range of audio-related tasks, leveraging the general knowledge and cognitive abilities of the LLM. This includes tasks like:

  • Transcribing and translating speech in multiple languages
  • Recognizing and describing audio events and sounds
  • Analyzing and generating music-related content
  • Answering open-ended questions about the audio inputs

Unlike traditional speech and audio processing models, SALMONN goes beyond simple recognition and processing tasks and engages in more cognitively oriented audio perception, which dramatically broadens the versatility and richness of its capabilities.

What can I use it for?

With its multi-modal capabilities, SALMONN can be applied to a wide range of projects and use cases, such as:

  • Developing smart home and assistive technologies that can understand and respond to spoken commands and audio events
  • Building language learning and translation applications that can leverage audio input
  • Creating intelligent music production and analysis tools
  • Enhancing video and audio editing workflows with intelligent audio processing
  • Powering conversational AI agents with the ability to understand and reason about audio

The maintainer's profile also showcases other related models, such as OpenChat and Goliath-120B, which may be of interest for similar applications.

Things to try

One interesting aspect of SALMONN is its ability to understand and respond to spoken commands, even though the model was only trained on textual prompts. This cross-modal emergent capability demonstrates the model's potential to go beyond traditional speech recognition and engage in more natural, human-like interaction.

You can try providing SALMONN with both text prompts and audio clips, and see how the model responds. For example, you could ask it to "Describe the sounds in this audio file" or "Translate this spoken phrase into English". The model's versatility and cognitive abilities will be on full display as it processes and reasons about the multi-modal inputs.
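As a concrete sketch of that experiment, and reusing the same hypothetical model identifier as above, you could sweep a few prompts over one clip and compare the responses:

```python
import replicate

prompts = [
    "Describe the sounds in this audio file.",
    "Translate this spoken phrase into English.",
    "What emotion does the speaker convey, and why?",
]

for prompt in prompts:
    # Placeholder model identifier -- replace with the exact versioned string
    # from the SALMONN page on Replicate.
    response = replicate.run(
        "nateraw/salmonn",
        input={"wav_path": "https://example.com/sample_clip.wav", "prompt": prompt},
    )
    print(f"{prompt}\n  -> {response}\n")
```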



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models


parakeet-rnnt-1.1b

nvlabs

Total Score

1

The parakeet-rnnt-1.1b is an advanced speech recognition model developed by NVIDIA and Suno.ai. It features the FastConformer architecture and is available in both RNNT and CTC versions, making it well-suited for transcribing English speech in noisy audio environments while maintaining accuracy in silent segments. This model outperforms the popular OpenAI Whisper model on the Open ASR Leaderboard, reclaiming the top spot for speech recognition accuracy.

Model inputs and outputs

Inputs

  • audio_file: The input audio file to be transcribed by the ASR model, in a supported audio format.

Outputs

  • Output: The transcribed text output from the speech recognition model.

Capabilities

The parakeet-rnnt-1.1b model is capable of high-accuracy speech transcription, particularly in challenging audio environments. It has been trained on a diverse 65,000-hour dataset, enabling robust performance across a variety of use cases. Compared to the OpenAI Whisper model, the parakeet-rnnt-1.1b achieves lower Word Error Rates (WER) on benchmarks like AMI, Earnings22, Gigaspeech, and Common Voice 9.

What can I use it for?

The parakeet-rnnt-1.1b model is designed for precision ASR tasks in voice recognition and transcription, making it suitable for a range of applications such as voice-to-text conversion, meeting minutes generation, and closed captioning. It can be integrated into the NeMo toolkit for a broader set of use cases. However, users should be mindful of data privacy and potential biases in speech recognition, ensuring fair and responsible use of the technology.

Things to try

Experimenting with the parakeet-rnnt-1.1b model in various audio scenarios, such as noisy environments or recordings with silent segments, can help evaluate its performance and suitability for specific use cases. Additionally, testing the model's accuracy and efficiency on different benchmarks can provide valuable insights into its capabilities.
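For illustration, a transcription call through the Replicate Python client might look like the sketch below. The model identifier and file name are placeholder assumptions; the audio_file input key is the one listed above, but confirm the exact versioned model string on the parakeet-rnnt-1.1b page before running.

```python
import replicate

# Placeholder model identifier -- take the exact "owner/model:version" string
# from the parakeet-rnnt-1.1b page on Replicate.
output = replicate.run(
    "nvlabs/parakeet-rnnt-1.1b",
    input={"audio_file": open("meeting_recording.wav", "rb")},
)
print(output)  # transcribed text
```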



whisper-large-v3

nateraw

Total Score

3

The whisper-large-v3 model is a general-purpose speech recognition model developed by OpenAI. It is a large Transformer-based model trained on a diverse dataset of audio data, allowing it to perform multilingual speech recognition, speech translation, and language identification. The model is highly capable and can transcribe speech across a wide range of languages, although its performance varies based on the specific language. Similar models like incredibly-fast-whisper, whisper-diarization, and whisperx-a40-large offer various optimizations and additional features built on top of the base whisper-large-v3 model.

Model inputs and outputs

The whisper-large-v3 model takes in audio files and can perform speech recognition, transcription, and translation tasks. It supports a wide range of input audio formats, including common formats like FLAC, MP3, and WAV. The model can identify the source language of the audio and optionally translate the transcribed text into English.

Inputs

  • Filepath: Path to the audio file to transcribe.
  • Language: The source language of the audio, if known (e.g., "English", "French").
  • Translate: Whether to translate the transcribed text to English.

Outputs

  • The transcribed text from the input audio file.

Capabilities

The whisper-large-v3 model is a highly capable speech recognition model that can handle a diverse range of audio data. It demonstrates strong performance across many languages, with the ability to identify the source language and optionally translate the transcribed text to English. The model can also perform tasks like speaker diarization and generating word-level timestamps, as showcased by similar models like whisper-diarization and whisperx-a40-large.

What can I use it for?

The whisper-large-v3 model can be used for a variety of applications that involve transcribing speech, such as live captioning, audio-to-text conversion, and language learning. It can be particularly useful for transcribing multilingual audio, as it can identify the source language and provide accurate transcriptions. Additionally, the model's ability to translate the transcribed text to English opens up opportunities for cross-lingual communication and accessibility.

Things to try

One interesting aspect of the whisper-large-v3 model is its ability to handle a wide range of audio data, from high-quality studio recordings to low-quality field recordings. You can experiment with different types of audio input and observe how the model's performance varies. Additionally, you can try using the model's language identification capabilities to transcribe audio in unfamiliar languages and explore its translation functionality to bridge language barriers.
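As a rough sketch, a transcription-plus-translation request via the Replicate Python client could look like the following. The model string, file name, and the lowercased input keys are assumptions to verify against the whisper-large-v3 API spec on Replicate.

```python
import replicate

# Placeholder model identifier and assumed lowercase input keys -- confirm
# both against the whisper-large-v3 page on Replicate before running.
output = replicate.run(
    "nateraw/whisper-large-v3",
    input={
        "filepath": open("interview.mp3", "rb"),
        "language": "French",   # optional: source language, if known
        "translate": True,      # translate the transcript into English
    },
)
print(output)  # transcribed (and translated) text
```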



openchat_3.5-awq

nateraw

Total Score

101

openchat_3.5-awq is an innovative open-source language model packaged for Replicate by nateraw. It is part of the OpenChat library, which includes a series of high-performing models fine-tuned using a strategy called C-RLFT (Conditioned Reinforcement Learning Fine-Tuning). This approach allows the models to learn from mixed-quality data without explicit preference labels, delivering exceptional performance on par with ChatGPT despite being a relatively compact 7B model. The OpenChat models outperform other open-source alternatives like OpenHermes 2.5, OpenOrca Mistral, and Zephyr-β on various benchmarks, including reasoning, coding, and mathematical tasks. The latest version, openchat_3.5-0106, even surpasses the capabilities of ChatGPT (March) and Grok-1 on several key metrics.

Model inputs and outputs

Inputs

  • prompt: The input text prompt for the model to generate a response.
  • max_new_tokens: The maximum number of tokens the model should generate as output.
  • temperature: The value used to modulate the next token probabilities.
  • top_p: A probability threshold for nucleus filtering; only the smallest set of most probable tokens whose cumulative probability reaches top_p is kept for generation.
  • top_k: The number of highest-probability tokens to consider when generating the output. If > 0, only the top k tokens with the highest probability are kept (top-k filtering).
  • prompt_template: The template used to format the prompt. The input prompt is inserted into the template using the {prompt} placeholder.
  • presence_penalty: The penalty applied to tokens based on their presence in the generated text.
  • frequency_penalty: The penalty applied to tokens based on their frequency in the generated text.

Outputs

  • A sequence of generated tokens, which can be concatenated to form the model's response.

Capabilities

openchat_3.5-awq demonstrates strong performance in a variety of tasks, including:

  • Reasoning and coding: The model outperforms ChatGPT (March) and other open-source alternatives on coding and reasoning benchmarks like HumanEval, BBH MC, and AGIEval.
  • Mathematical reasoning: The model achieves state-of-the-art results on mathematical reasoning tasks like GSM8K, showcasing its ability to tackle complex numerical problems.
  • General language understanding: The model performs well on MMLU, a broad benchmark for general language understanding, indicating its versatility in handling diverse language tasks.

What can I use it for?

The openchat_3.5-awq model can be leveraged for a wide range of applications, such as:

  • Conversational AI: The model can be deployed as a conversational agent, engaging users in natural language interactions and providing helpful responses.
  • Content generation: The model can be used to generate high-quality text, such as articles, stories, or creative writing, by fine-tuning on specific domains or datasets.
  • Task-oriented dialogue: The model can be fine-tuned for task-oriented dialogues, such as customer service, technical support, or virtual assistance.
  • Code generation: The model's strong performance on coding tasks makes it a valuable tool for automating code generation, programming assistance, or code synthesis.

Things to try

Here are some ideas for what you can try with openchat_3.5-awq:

  • Explore the model's capabilities: Test the model on a variety of tasks, such as open-ended conversations, coding challenges, or mathematical problems, to understand its strengths and limitations.
  • Fine-tune the model: Leverage the model's strong foundation by fine-tuning it on your specific dataset or domain to create a customized language model for your applications.
  • Combine with other technologies: Integrate the model with other AI or automation tools, such as voice interfaces or robotic systems, to create more comprehensive and intelligent solutions.
  • Contribute to the open-source ecosystem: As an open-source model, you can explore ways to improve or extend the OpenChat library, such as by contributing to the codebase, providing feedback, or collaborating on research and development.
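To illustrate the input schema, here is a hedged sketch of a generation call with the Replicate Python client. The model identifier and the parameter values are illustrative assumptions; check the openchat_3.5-awq page for the exact versioned string and defaults.

```python
import replicate

# Placeholder "owner/model" identifier -- copy the exact versioned string
# from the openchat_3.5-awq page on Replicate.
output = replicate.run(
    "nateraw/openchat_3.5-awq",
    input={
        "prompt": "Write a Python function that reverses a linked list.",
        "max_new_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 50,
    },
)
# The model returns a sequence of tokens; concatenate them for the full response.
print("".join(output))
```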



audio-super-resolution

nateraw

Total Score

46

audio-super-resolution is a versatile audio super-resolution model developed by Replicate creator nateraw. It is capable of upscaling various types of audio, including music, speech, and environmental sounds, to higher fidelity across different sampling rates. This model can be seen as complementary to other audio-focused models like whisper-large-v3, which focuses on speech recognition, and salmonn, which handles a broader range of audio tasks.

Model inputs and outputs

audio-super-resolution takes in an audio file and generates an upscaled version of the input. The model supports both single-file processing and batch processing of multiple audio files.

Inputs

  • Input Audio File: The audio file to be upscaled, which can be in various formats.
  • Input File List: A file containing a list of audio files to be processed in batch.

Outputs

  • Upscaled Audio File: The super-resolved version of the input audio, saved in the specified output directory.

Capabilities

audio-super-resolution can handle a wide variety of audio types, from music and speech to environmental sounds, and it can work with different sampling rates. The model is capable of enhancing the fidelity and quality of the input audio, making it a useful tool for tasks such as audio restoration, content creation, and audio post-processing.

What can I use it for?

The audio-super-resolution model can be leveraged in various applications where high-quality audio is required, such as music production, podcast editing, sound design, and audio archiving. By upscaling lower-quality audio files, users can create more polished and professional-sounding audio content. Additionally, the model's versatility makes it suitable for use in creative projects, content creation workflows, and audio-related research and development.

Things to try

To get started with audio-super-resolution, you can experiment with processing both individual audio files and batches of files. Try using the model on a variety of audio types, such as music, speech, and environmental sounds, to see how it performs. Additionally, you can adjust the model's parameters, such as the DDIM steps and guidance scale, to explore the trade-offs between audio quality and processing time.
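The snippet below is a speculative sketch of an upscaling call with the Replicate Python client. The model identifier and the input key names (input_file, ddim_steps, guidance_scale) are guesses rather than documented parameters, so check the model's API spec on Replicate for the real field names.

```python
import replicate

# All identifiers below are assumptions -- verify the model version string and
# the input field names against the audio-super-resolution API spec.
output = replicate.run(
    "nateraw/audio-super-resolution",
    input={
        "input_file": open("low_quality_clip.wav", "rb"),
        "ddim_steps": 50,       # more steps: higher quality, slower inference
        "guidance_scale": 3.5,  # trade-off between fidelity and artifacts
    },
)
print(output)  # URL or path of the upscaled audio
```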
