whisper-large-v3-turbo

Maintainer: openai

412

Last updated 10/4/2024

Model overview

The whisper-large-v3-turbo model is a fine-tuned version of a pruned Whisper large-v3 model. It has the same architecture as large-v3, except that the number of decoder layers has been reduced from 32 to 4, which makes the model significantly faster at the cost of a minor degradation in quality. Whisper was proposed by Alec Radford et al. from OpenAI and generalizes well across many datasets and domains in a zero-shot setting.
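To get a feel for what removing 28 decoder layers means in terms of model size, the following back-of-the-envelope sketch counts the weight matrices in one Transformer decoder layer. The hidden size of 1280 and the 4x feed-forward expansion are assumptions about the large Whisper configuration that are not stated in this summary, and biases and layer norms are ignored, so treat the result as an order-of-magnitude estimate only.

```python
# Rough estimate of the weights removed by dropping decoder layers.
# ASSUMPTIONS (not taken from the summary above): hidden size 1280,
# feed-forward width 4x the hidden size, biases and layer norms ignored.

D_MODEL = 1280   # assumed hidden size of the large Whisper configuration
FFN_MULT = 4     # assumed feed-forward expansion factor


def decoder_layer_params(d_model: int = D_MODEL, ffn_mult: int = FFN_MULT) -> int:
    """Approximate number of weights in one Transformer decoder layer."""
    self_attention = 4 * d_model * d_model            # query, key, value, output projections
    cross_attention = 4 * d_model * d_model           # same four projections over encoder states
    feed_forward = 2 * ffn_mult * d_model * d_model   # up-projection and down-projection
    return self_attention + cross_attention + feed_forward


def params_removed(layers_before: int = 32, layers_after: int = 4) -> int:
    """Approximate number of weights removed by pruning decoder layers."""
    return (layers_before - layers_after) * decoder_layer_params()


if __name__ == "__main__":
    print(f"~{decoder_layer_params() / 1e6:.1f}M weights per decoder layer")
    print(f"~{params_removed() / 1e6:.0f}M weights removed going from 32 to 4 layers")
```

Under these assumptions the pruning removes on the order of 700 million weights. The encoder is untouched, which is why the speed-up shows up mainly in the autoregressive decoding step.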

Model inputs and outputs

The whisper-large-v3-turbo model is designed for automatic speech recognition (ASR) and speech translation. It takes audio samples as input and outputs text transcriptions.

Inputs

Audio samples: The model accepts arbitrary length audio inputs, which it can process efficiently using a chunked inference algorithm.

Outputs

Text transcriptions: The model outputs text transcriptions of the input audio, either in the same language as the audio (for ASR) or in a different language (for speech translation).
Timestamps: The model can optionally provide timestamps for each transcribed sentence or word.
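Timestamps are most useful once they are turned into something like subtitles. The sketch below converts a list of timestamped segments into SubRip (SRT) text. It assumes each segment is a dictionary with a `"timestamp"` pair of start and end seconds and a `"text"` string, which is the shape the Transformers speech recognition pipeline is understood to return under its `"chunks"` key; the function names are hypothetical, and the handling of a missing end time is a defensive assumption.

```python
from typing import Iterable, Mapping


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: Iterable[Mapping]) -> str:
    """Turn timestamped segments into SRT subtitle text."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: a final segment may lack an end time; reuse the start.
            end = start
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    example = [
        {"timestamp": (0.0, 2.5), "text": " Hello there."},
        {"timestamp": (2.5, 4.0), "text": " Goodbye."},
    ]
    print(chunks_to_srt(example))
```

The same function works for sentence-level or word-level segments, since both are just lists of timed text spans.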

Capabilities

The whisper-large-v3-turbo model is more robust to accents, background noise, and technical language than many existing ASR systems. It also demonstrates zero-shot translation capabilities, taking speech in another language and producing English text.
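Claims about robustness are usually checked by measuring word error rate (WER) on recordings with accents, noise, or specialist vocabulary. The following is a minimal sketch of WER using a word-level edit distance; the lowercasing and whitespace tokenization are simplifying assumptions, and real evaluations normally apply a more careful text normalizer.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Comparing WER on clean and noisy versions of the same recordings gives a simple, if coarse, picture of how much accuracy degrades.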

What can I use it for?

The whisper-large-v3-turbo model is primarily intended for AI researchers studying the capabilities, biases, and limitations of speech recognition models. However, it can also be a useful ASR solution for developers, especially for English speech recognition tasks. The speed and accuracy of the model suggest that others may be able to build applications on top of it that allow for near-real-time speech recognition and translation.

Things to try

One key capability to explore with the whisper-large-v3-turbo model is its ability to handle long-form audio. By using the chunked inference algorithm provided in the Transformers library, the model can efficiently transcribe audio files of arbitrary length. Developers could experiment with using this feature to build applications that provide accurate transcriptions of podcasts, interviews, or other long-form audio content.
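The idea behind chunked inference can be illustrated by computing the time windows a long recording would be split into. This is a simplified sketch, not the Transformers implementation: the 30-second window and 5-second overlap are illustrative assumptions, and the real algorithm also has to merge the overlapping transcripts, which is not shown here.

```python
from typing import List, Tuple


def chunk_windows(
    total_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Split a recording of total_s seconds into overlapping (start, end) windows."""
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")

    step = chunk_s - overlap_s  # how far each new window advances
    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break  # the last window reached the end of the audio
        start += step
    return windows


if __name__ == "__main__":
    for window in chunk_windows(70.0):
        print(window)
```

The overlap exists so that a word cut off at the end of one window appears whole in the next one; each window can then be transcribed independently, and in a batch.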

Another interesting aspect to investigate is the model's performance on non-English languages and its zero-shot translation capabilities. Users could try transcribing audio in different languages and evaluating the quality of the translations to English, as well as exploring ways to fine-tune the model for specific language pairs or domains.
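A starting point for these experiments is sketched below. It assumes the Transformers `pipeline` API with the `openai/whisper-large-v3-turbo` checkpoint and the `generate_kwargs` and `return_timestamps` call arguments as they are commonly documented; check the names against the installed library version. The helper names `build_call_kwargs` and `transcribe` are hypothetical, and the library is imported lazily so the option-building logic can be used without loading a model.

```python
from typing import Any, Dict, Optional, Union

MODEL_ID = "openai/whisper-large-v3-turbo"  # assumed Hugging Face Hub identifier


def build_call_kwargs(
    language: Optional[str] = None,
    translate: bool = False,
    timestamps: Union[bool, str] = False,
) -> Dict[str, Any]:
    """Build the per-call options for a speech recognition pipeline."""
    generate_kwargs: Dict[str, Any] = {
        "task": "translate" if translate else "transcribe"
    }
    if language is not None:
        # Leaving the language unset lets the model detect it from the audio.
        generate_kwargs["language"] = language
    return {"return_timestamps": timestamps, "generate_kwargs": generate_kwargs}


def transcribe(audio_path: str, **options: Any) -> Dict[str, Any]:
    """Run the model on one audio file (downloads the checkpoint on first use)."""
    from transformers import pipeline  # imported lazily; requires transformers

    asr = pipeline(
        "automatic-speech-recognition", model=MODEL_ID, chunk_length_s=30
    )
    return asr(audio_path, **build_call_kwargs(**options))


if __name__ == "__main__":
    print(build_call_kwargs(language="french", translate=True, timestamps="word"))
```

For example, `transcribe("interview.mp3", language="french")` would request a French transcript, while adding `translate=True` would request English output instead; the file name here is a placeholder.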

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

whisper-large-v3-turbo

ylacombe

412

The whisper-large-v3-turbo model is a fine-tuned version of the Whisper large-v3 model, a state-of-the-art automatic speech recognition (ASR) and speech translation model proposed by Alec Radford et al. from OpenAI. Trained on over 5 million hours of labeled data, Whisper generalizes well to many datasets and domains without fine-tuning. In whisper-large-v3-turbo the number of decoder layers is reduced from 32 to 4, which makes the model faster at the cost of a minor quality degradation.

Model inputs and outputs: The model takes audio samples as input and generates transcribed text as output. It can be used for speech recognition, where the output is in the same language as the input audio, and for speech translation, where the output is in a different language.

Inputs: Audio samples. The model operates on audio sampled at 16 kHz; recordings at other rates, such as 44.1 kHz, need to be resampled first.

Outputs: Transcribed text of the input audio and, optionally, timestamps indicating the start and end time of each transcribed segment.

Capabilities: The Whisper models perform strongly on speech recognition and translation tasks and are more robust to accents, background noise, and technical language than many existing ASR systems. The models can also perform zero-shot translation from multiple languages into English.

What can I use it for? The whisper-large-v3-turbo model can be useful for a variety of applications:

Transcription and translation: The model can transcribe audio in various languages and translate it into English.
Accessibility tools: The model's transcription capabilities can be used to improve accessibility, for example through live captioning or subtitling of audio and video content.
Voice interaction and assistants: The model's ASR and translation abilities can be integrated into voice-based interfaces and digital assistants.

Things to try: Whisper models detect the language of the input audio automatically, so transcription works without specifying the source language; the task (recognition or translation) is selected separately. You can experiment with this by providing audio samples in different languages and observing how the model handles them. The models also support word-level timestamps, which are useful for applications that need precise alignment between the transcribed text and the audio. Try the return_timestamps="word" parameter to see the word-level timing information.

Audio-to-Text

whisper-large-v3

openai

2.6K

The whisper-large-v3 model is a general-purpose speech recognition model developed by OpenAI. It is the latest version of the Whisper model, building on the previous Whisper large models. It has a few minor architectural differences from the previous large models: it uses 128 Mel frequency bins instead of 80 and adds a new language token for Cantonese.

The original Whisper models were trained on 680,000 hours of audio data: 65% English data, 18% non-English data with English transcripts, and 17% non-English data with non-English transcripts covering 98 languages. This allows the model to perform well on a diverse range of speech recognition and translation tasks without fine-tuning on specific datasets. Similar Whisper models include the Whisper medium, Whisper tiny, and the whisper-large-v3 model developed by Nate Raw. There is also an incredibly fast version of the Whisper large model by Vaibhav Srivastav.

Model inputs and outputs: The whisper-large-v3 model takes audio samples as input and generates text transcripts as output. The audio can be in any of the 98 languages covered by the training data. The model can also be used for speech translation, where it generates text in a different language than the audio.

Inputs: Audio samples in any of the 98 languages the model was trained on.

Outputs: Text transcripts of the audio in the same language, or translated text transcripts in a different language.

Capabilities: The whisper-large-v3 model performs strongly on a variety of speech recognition and translation tasks, with 10-20% lower error rates than the previous Whisper large model. It is robust to accents, background noise, and technical language, and can perform zero-shot translation from multiple languages into English. However, its performance is uneven across languages, with lower accuracy on low-resource and low-discoverability languages where less training data was available. It also has a tendency to generate repetitive or hallucinated text that is not actually present in the audio input.

What can I use it for? The primary intended use of the Whisper models is for AI researchers studying model capabilities, robustness, and limitations. However, the models can also be quite useful as a speech recognition solution for developers, especially for English transcription tasks. The Whisper models could be used to build applications that improve accessibility, such as closed captioning or voice-to-text transcription. While the models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build near-real-time applications on top of them.

Things to try: One interesting aspect of the Whisper models is their ability to perform speech translation, generating text transcripts in a different language than the audio input. Developers could experiment with using the model for tasks like simultaneous interpretation or multilingual subtitling. Another avenue to explore is fine-tuning the pre-trained Whisper model on specific datasets or domains. The blog post Fine-Tune Whisper with Transformers provides a guide on how to fine-tune the model with as little as 5 hours of labeled data, which can improve performance on particular languages or use cases.

Audio-to-Text

whisper-large

openai

438

The whisper-large model is a pre-trained AI model for automatic speech recognition (ASR) and speech translation, developed by OpenAI. Trained on 680k hours of labelled data, the Whisper models generalize well to many datasets and domains without fine-tuning. The whisper-large-v2 model is a newer version that surpasses the performance of the original whisper-large model, with no architecture changes. The whisper-medium model is a smaller version with 769M parameters, while the whisper-tiny model is the smallest at 39M parameters. All of these Whisper models are available on the Hugging Face Hub.

Model inputs and outputs

Inputs: Audio samples, which the model converts to log-Mel spectrograms.

Outputs: Textual transcriptions of the input audio, either in the same language as the audio (for speech recognition) or in a different language (for speech translation). The model can also output timestamps for the transcriptions.

Capabilities: The Whisper models perform strongly on a variety of speech recognition and translation tasks and show improved robustness to accents, background noise, and technical language. They can also perform zero-shot translation from multiple languages into English. However, the models may occasionally produce text that is not actually spoken in the audio input, a phenomenon known as "hallucination". Their performance also varies across languages, with lower accuracy on low-resource and less common languages.

What can I use it for? The Whisper models are primarily intended for use by AI researchers studying model robustness, generalization, capabilities, biases, and constraints. However, the models can also be useful for developers looking to build speech recognition or translation applications, especially for English speech. The models' speed and accuracy make them well-suited for applications that require transcription or translation of large volumes of audio data, such as accessibility tools, media production, and language learning. Developers can build applications on top of the models to enable near-real-time speech recognition and translation.

Things to try: One interesting aspect of the Whisper models is their ability to perform long-form transcription of audio samples longer than 30 seconds. This is achieved through a chunking algorithm that allows the model to process audio of arbitrary length. Another feature is the model's ability to automatically detect the language of the input audio. Developers can also control this behavior directly by providing the model with "context tokens" that inform it of the desired task and language.

Finally, the pre-trained Whisper models can be fine-tuned on smaller datasets to further improve their performance on specific languages or domains. The Fine-Tune Whisper with Transformers blog post provides a step-by-step guide on how to do this.
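The "context tokens" mentioned above are special tokens placed at the start of the decoder input. The sketch below builds that prefix as a list of token strings. The token spellings follow the format commonly documented for Whisper (`<|startoftranscript|>`, a language tag, a task tag, and an optional `<|notimestamps|>`), and the helper name is hypothetical; in practice a tokenizer or processor normally builds this prefix for you, so treat this as an illustration of the layout rather than a required step.

```python
from typing import List

TASKS = ("transcribe", "translate")


def build_context_tokens(
    language_code: str, task: str = "transcribe", with_timestamps: bool = False
) -> List[str]:
    """Build the decoder prefix that tells the model the language and task."""
    if task not in TASKS:
        raise ValueError(f"task must be one of {TASKS}")

    tokens = [
        "<|startoftranscript|>",
        f"<|{language_code}|>",  # language of the audio, e.g. "en" or "fr"
        f"<|{task}|>",           # same-language transcript or English translation
    ]
    if not with_timestamps:
        tokens.append("<|notimestamps|>")  # ask for plain text without timing
    return tokens


if __name__ == "__main__":
    print(build_context_tokens("fr", task="translate"))
```

Fixing the language token skips automatic language detection, which can help with short or noisy clips where detection is unreliable.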

Audio-to-Text

whisper-large-v2

openai

1.6K

The whisper-large-v2 model is a pre-trained Transformer-based encoder-decoder model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labeled data by OpenAI, Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. Compared to the original Whisper large model, the whisper-large-v2 model has been trained for 2.5x more epochs with added regularization for improved performance.

Model inputs and outputs

Inputs: Audio samples. The model takes audio samples as input and performs either speech recognition or speech translation.

Outputs: Text transcription of the input audio. For speech recognition, the transcription is in the same language as the audio. For speech translation, the transcription is in a different language than the audio. The model can optionally output timestamps for the transcribed text.

Capabilities: The whisper-large-v2 model exhibits improved robustness to accents, background noise, and technical language compared to many existing ASR systems. It also demonstrates strong zero-shot translation capabilities, allowing it to translate speech from multiple languages into English with high accuracy.

What can I use it for? The whisper-large-v2 model can be a useful tool for developers building speech recognition and translation applications. Its strong generalization capabilities suggest it may be particularly valuable for tasks like improving accessibility through real-time captioning, language translation, and other speech-to-text use cases. However, the model's performance can vary across languages, accents, and demographics, so users should carefully evaluate its performance in their specific domain before deployment.

Things to try: One interesting aspect of the whisper-large-v2 model is its ability to perform long-form transcription of audio samples longer than 30 seconds. By using a chunking algorithm, the model can transcribe audio of arbitrary length, making it a useful tool for transcribing podcasts, lectures, and other long-form audio content. Users can also experiment with fine-tuning the model on their own data to further improve its performance for specific use cases.

Audio-to-Text