parakeet-rnnt-1.1b

Maintainer: nvidia

Total Score

98

Last updated 5/28/2024

🛸

  • Run this model: Run on HuggingFace
  • API spec: View on HuggingFace
  • Github link: No Github link provided
  • Paper link: No paper link provided


Model Overview

The parakeet-rnnt-1.1b is an ASR (Automatic Speech Recognition) model developed jointly by the NVIDIA NeMo and Suno.ai teams. It uses the FastConformer Transducer architecture, an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. This XXL model has around 1.1 billion parameters and transcribes speech into the lower-case English alphabet with high accuracy.

The model is similar to other high-performing ASR models like Canary-1B, which also uses the FastConformer architecture but supports multiple languages. In contrast, the parakeet-rnnt-1.1b is focused solely on English speech transcription.

Model Inputs and Outputs

Inputs

  • 16000 Hz mono-channel audio (WAV files)

Outputs

  • Transcribed speech as a string for a given audio sample
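
To make this input/output contract concrete, here is a minimal inference sketch using the NVIDIA NeMo toolkit. It assumes NeMo's ASR collection is installed and that sample.wav is a 16 kHz mono recording you supply; note that the exact return type of transcribe() has varied across NeMo releases.

    # Minimal NeMo inference sketch (assumes nemo_toolkit[asr] is installed).
    import nemo.collections.asr as nemo_asr

    # Download the pre-trained checkpoint from NVIDIA's model hub.
    asr_model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-rnnt-1.1b"
    )

    # "sample.wav" is a placeholder 16 kHz mono-channel WAV file.
    # Depending on the NeMo version, each result may be a plain string
    # or a Hypothesis object carrying the text.
    transcripts = asr_model.transcribe(["sample.wav"])
    print(transcripts[0])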

Capabilities

The parakeet-rnnt-1.1b model demonstrates state-of-the-art performance on English speech recognition tasks. It was trained on a large, diverse dataset of 85,000 hours of speech data from various public and private sources, including LibriSpeech, Fisher Corpus, Switchboard, and more.

What Can I Use It For?

The parakeet-rnnt-1.1b model is well-suited for a variety of speech-to-text applications, such as voice transcription, dictation, and audio captioning. It could be particularly useful in scenarios where high-accuracy English speech recognition is required, such as in media production, customer service, or educational applications.

Things to Try

One interesting aspect of the parakeet-rnnt-1.1b model is its ability to handle a wide range of audio inputs, from clear studio recordings to noisier real-world audio. You could experiment with feeding it different types of audio samples and observe how it performs in terms of transcription accuracy and robustness.
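
Because the model expects 16,000 Hz mono audio, real-world recordings usually need a resampling step first. Below is a small sketch of that preprocessing, using the librosa and soundfile libraries as one assumed choice (the model card itself does not mandate any particular resampler):

    # Convert an arbitrary recording to the 16 kHz mono WAV the model expects.
    import librosa
    import soundfile as sf

    def to_model_format(src_path: str, dst_path: str) -> None:
        # librosa resamples and downmixes to mono in a single call.
        audio, sr = librosa.load(src_path, sr=16000, mono=True)
        sf.write(dst_path, audio, sr)

    # "field_recording.mp3" is a placeholder for your own audio file.
    to_model_format("field_recording.mp3", "field_recording_16k.wav")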

Additionally, since the model was trained on a large and diverse dataset, you could try fine-tuning it on a more specialized domain or genre of audio to see if you can further improve its performance for your specific use case.



This summary was produced with help from an AI and may contain inaccuracies; check out the links to read the original source documents!

Related Models

🤯

parakeet-tdt-1.1b

nvidia

Total Score

61

The parakeet-tdt-1.1b is an ASR (Automatic Speech Recognition) model that transcribes speech in the lower-case English alphabet. This model is jointly developed by the NVIDIA NeMo and Suno.ai teams. It uses a FastConformer-TDT architecture, which is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. The model has around 1.1 billion parameters. Similar models include the parakeet-rnnt-1.1b, which is also a large ASR model developed by NVIDIA and Suno.ai; it uses a FastConformer Transducer architecture and has similar performance characteristics.

Model Inputs and Outputs

Inputs

  • 16000 Hz mono-channel audio (WAV files)

Outputs

  • Transcribed speech as a string for a given audio sample

Capabilities

The parakeet-tdt-1.1b model is capable of transcribing English speech with high accuracy. It was trained on a large corpus of speech data, including 64,000 hours of English speech from various public and private datasets.

What Can I Use It For?

You can use the parakeet-tdt-1.1b model for a variety of speech-to-text applications, such as transcribing audio recordings, live speech recognition, or integrating it into your own voice-enabled products and services. The model can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset using the NVIDIA NeMo toolkit (see the manifest sketch below).

Things to Try

One interesting thing to try with the parakeet-tdt-1.1b model is to experiment with fine-tuning it on a specific domain or dataset. This could help improve the model's performance on your particular use case. You could also try combining the model with other components, such as language models or audio preprocessing modules, to further enhance its capabilities.
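
Since the card mentions fine-tuning with the NVIDIA NeMo toolkit, here is a hedged sketch of the JSONL manifest format that NeMo's ASR fine-tuning recipes typically consume. The file names and transcripts are placeholders, and the full fine-tuning loop (configs, trainer) is omitted:

    # Sketch: build the JSONL manifest NeMo expects for ASR fine-tuning.
    import json
    import soundfile as sf

    # Placeholder (audio file, transcript) pairs from your own dataset.
    samples = [
        ("clip_001.wav", "hello world"),
        ("clip_002.wav", "speech recognition example"),
    ]

    with open("train_manifest.json", "w") as f:
        for path, text in samples:
            info = sf.info(path)  # read the duration from the audio header
            entry = {
                "audio_filepath": path,
                "duration": info.duration,
                "text": text,
            }
            f.write(json.dumps(entry) + "\n")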


👁️

canary-1b

nvidia

Total Score

191

The canary-1b model is a part of the NVIDIA NeMo Canary family of multi-lingual, multi-tasking models. With 1 billion parameters, the Canary-1B model supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC). The model uses a FastConformer-Transformer encoder-decoder architecture.

Model Inputs and Outputs

Inputs

  • Audio files or a JSONL manifest file containing audio data

Outputs

  • Transcribed text in the specified language (English, German, French, Spanish)
  • Translated text to/from the specified language pair

Capabilities

The Canary-1B model demonstrates state-of-the-art performance on multiple benchmarks for ASR and translation tasks in the supported languages. It can handle various accents, background noise, and technical language well.

What Can I Use It For?

The canary-1b model is well-suited for research on robust, multi-lingual speech recognition and translation. It can also be fine-tuned on specific datasets to improve performance for particular domains or applications. Developers may find it useful as a pre-trained model for building ASR or translation tools, especially for the supported languages.

Things to Try

You can experiment with the canary-1b model by loading it using the NVIDIA NeMo toolkit (a loading sketch follows below). Try transcribing or translating audio samples in different languages, and compare the results to your expectations or other models. You can also fine-tune the model on your own data to see how it performs on specific tasks or domains.
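
As a starting point, here is a hedged loading sketch for English transcription via NeMo, using the EncDecMultiTaskModel class named on NVIDIA's model card. Task controls such as source/target language and punctuation are typically configured through a JSONL manifest and have varied across NeMo releases, so only the simplest call is shown:

    # Sketch: English ASR with Canary-1B via NeMo.
    from nemo.collections.asr.models import EncDecMultiTaskModel

    canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")

    # "speech_en.wav" is a placeholder 16 kHz mono recording.
    predictions = canary.transcribe(["speech_en.wav"], batch_size=1)
    print(predictions[0])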



parakeet-rnnt-1.1b

nvlabs

Total Score

1

The parakeet-rnnt-1.1b is an advanced speech recognition model developed by NVIDIA and Suno.ai. It features the FastConformer architecture and is available in both RNNT and CTC versions, making it well-suited for transcribing English speech in noisy audio environments while maintaining accuracy in silent segments. This model outperforms the popular OpenAI Whisper model on the Open ASR Leaderboard, reclaiming the top spot for speech recognition accuracy.

Model Inputs and Outputs

Inputs

  • audio_file: The input audio file to be transcribed by the ASR model, in a supported audio format

Outputs

  • Output: The transcribed text output from the speech recognition model

Capabilities

The parakeet-rnnt-1.1b model is capable of high-accuracy speech transcription, particularly in challenging audio environments. It has been trained on a diverse 65,000-hour dataset, enabling robust performance across a variety of use cases. Compared to the OpenAI Whisper model, the parakeet-rnnt-1.1b achieves lower Word Error Rates (WER) on benchmarks like AMI, Earnings22, Gigaspeech, and Common Voice 9.

What Can I Use It For?

The parakeet-rnnt-1.1b model is designed for precision ASR tasks in voice recognition and transcription, making it suitable for a range of applications such as voice-to-text conversion, meeting minutes generation, and closed captioning. It can be integrated into the NeMo toolkit for a broader set of use cases. However, users should be mindful of data privacy and potential biases in speech recognition, ensuring fair and responsible use of the technology.

Things to Try

Experimenting with the parakeet-rnnt-1.1b model in various audio scenarios, such as noisy environments or recordings with silent segments, can help evaluate its performance and suitability for specific use cases. Additionally, testing the model's accuracy and efficiency on different benchmarks can provide valuable insights into its capabilities (a WER-scoring sketch follows below).
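
The WER comparisons mentioned above are straightforward to reproduce on your own data: score the model's hypothesis against a reference transcript. A minimal sketch using the jiwer library (an assumed choice; any WER implementation works):

    # Score a hypothesis transcript against a reference with jiwer.
    # The strings below are placeholders; in practice the hypothesis would
    # come from the ASR model and the reference from a human transcript.
    import jiwer

    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over the lazy dog"

    wer = jiwer.wer(reference, hypothesis)
    print(f"WER: {wer:.3f}")  # 1 substitution over 9 words, about 0.111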


⚙️

speakerverification_en_titanet_large

nvidia

Total Score

55

The speakerverification_en_titanet_large model is a speaker embedding extraction model developed by NVIDIA. It is a "large" version of the TitaNet model, with around 23 million parameters. The model can be used as the backbone for speaker verification and diarization tasks, extracting speaker embeddings from audio input. The model is available for use in the NVIDIA NeMo toolkit, and can be used as a pre-trained checkpoint for inference or fine-tuning. Similar models include the parakeet-rnnt-1.1b and parakeet-tdt-1.1b models, which are large ASR models developed by NVIDIA and Suno.ai.

Model Inputs and Outputs

Inputs

  • 16000 Hz mono-channel audio (WAV files)

Outputs

  • Speaker embeddings for an audio file

Capabilities

The speakerverification_en_titanet_large model can extract speaker embeddings from audio input, which are useful for speaker verification and diarization tasks. For example, the model can be used to verify whether two audio files are from the same speaker.

What Can I Use It For?

The speaker embeddings produced by the speakerverification_en_titanet_large model can be used in a variety of applications, such as speaker identification, speaker diarization, and voice biometrics. These embeddings can be used as input to downstream models for tasks like speaker verification, where the goal is to determine whether two audio samples are from the same speaker (a minimal sketch follows below).

Things to Try

One interesting thing to try with the speakerverification_en_titanet_large model is to use it for large-scale speaker diarization. By extracting speaker embeddings for each audio segment and clustering them, you can automatically identify the different speakers in a multi-speaker audio recording. This could be useful for applications like meeting transcription or content moderation.
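
For the verification use case, here is a minimal sketch following the get_embedding() and verify_speakers() helpers shown on NVIDIA's model card for this checkpoint; the WAV paths are placeholders for your own 16 kHz mono recordings:

    # Sketch: speaker verification with TitaNet-Large via NeMo.
    import nemo.collections.asr as nemo_asr

    spk_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(
        model_name="nvidia/speakerverification_en_titanet_large"
    )

    # Fixed-size speaker embedding for a single utterance.
    emb = spk_model.get_embedding("speaker_a_utt1.wav")

    # Decision: do two recordings come from the same speaker?
    same = spk_model.verify_speakers("speaker_a_utt1.wav", "speaker_b_utt1.wav")
    print("same speaker" if same else "different speakers")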
