distil-medium.en

Maintainer: distil-whisper

Total Score

109

Last updated 5/28/2024

🌀

  • Run this model: Run on HuggingFace
  • API spec: View on HuggingFace
  • Github link: No Github link provided
  • Paper link: No paper link provided


Model overview

The distil-medium.en model is a distilled version of the Whisper medium.en model proposed in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling. It is 6.8 times faster and 49% smaller than the original Whisper medium.en model, and performs within 1% word error rate (WER) of it on out-of-distribution evaluation sets. This makes it an efficient alternative for English speech recognition tasks.

The model is part of the Distil-Whisper repository, which contains several distilled variants of the Whisper model. The distil-large-v2 model is another example; it performs within 1% WER of the original Whisper large-v2 model on short-form audio and outperforms it on long-form audio, while being significantly faster and smaller.

Model inputs and outputs

Inputs

  • Audio data: The model takes audio as input in the form of log-Mel spectrogram features computed from the raw waveform (see the feature-extraction sketch below).

Outputs

  • Transcription text: The model outputs transcribed text in the same language as the input audio.
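As a minimal sketch of this input format (the silent one-second waveform and the printed shape are illustrative placeholders, not values from the model card), the Transformers processor can turn a raw waveform into the log-Mel features the model consumes:

```python
import numpy as np
from transformers import WhisperProcessor

# Load the processor (feature extractor + tokenizer) for distil-medium.en.
processor = WhisperProcessor.from_pretrained("distil-whisper/distil-medium.en")

# Placeholder audio: one second of silence at the 16 kHz rate Whisper expects.
waveform = np.zeros(16_000, dtype=np.float32)

# Convert the waveform into log-Mel spectrogram features (padded to a 30 s window).
features = processor(waveform, sampling_rate=16_000, return_tensors="pt")
print(features.input_features.shape)  # e.g. (1, 80, 3000): 80 Mel bins over 3000 frames
```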

Capabilities

The distil-medium.en model demonstrates strong performance on English speech recognition tasks, achieving a short-form WER of 11.1% and a long-form WER of 12.4% on out-of-distribution evaluation sets. It is significantly more efficient than the original Whisper medium.en model, running 6.8 times faster with 49% fewer parameters.

What can I use it for?

The distil-medium.en model is well-suited for a variety of English speech recognition applications, such as transcribing audio recordings, live captioning, and voice-to-text conversion. Its efficiency makes it a practical choice for real-world deployment, particularly in scenarios where latency and model size are important considerations.

Things to try

You can use the distil-medium.en model with the Hugging Face Transformers library to perform short-form transcription of audio samples. The model can also be used for long-form transcription by leveraging the chunking capabilities of the pipeline class, allowing it to handle audio files of arbitrary length.
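A hedged sketch of both modes with the pipeline class follows; the audio file names, chunk length, and batch size are placeholder choices rather than values taken from the model card:

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Short-form transcription: clips up to ~30 seconds are handled in a single pass.
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-medium.en",
    device=device,
)
print(asr("short_clip.wav")["text"])  # placeholder audio file

# Long-form transcription: chunk the audio into 15 s windows and batch them,
# so files of arbitrary length can be transcribed.
long_asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-medium.en",
    chunk_length_s=15,
    batch_size=8,
    device=device,
)
print(long_asr("long_recording.mp3")["text"])  # placeholder audio file
```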

Additionally, the Distil-Whisper repository provides training code that you can use to distill the Whisper model on other languages, expanding the model's capabilities beyond English. If you're interested in distilling Whisper for your language, be sure to check out the training code.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

distil-small.en

distil-whisper

Total Score

78

The distil-small.en model is a distilled version of the Whisper model proposed in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling. It is the smallest Distil-Whisper checkpoint, with just 166M parameters, making it the ideal choice for memory-constrained applications. Compared to the Whisper small.en model, distil-small.en is 6 times faster, 49% smaller, and performs within 1% WER on out-of-distribution evaluation sets. For most other applications, the distil-medium.en or distil-large-v2 checkpoints are recommended, since they are both faster and achieve better WER results.

Model inputs and outputs

The distil-small.en model is an automatic speech recognition (ASR) model that takes audio as input and generates a text transcript as output. It uses an encoder-decoder architecture, where the encoder maps the audio input to a sequence of hidden representations and the decoder auto-regressively generates the output text.

Inputs

  • Audio data in the form of a raw waveform or log-Mel spectrogram

Outputs

  • A text transcript of the input audio

Capabilities

The distil-small.en model is capable of transcribing English speech with high accuracy, even on out-of-distribution datasets. It demonstrates robust performance in the presence of accents, background noise, and technical language. The distilled model maintains performance close to the larger Whisper small.en model while being significantly faster and smaller.

What can I use it for?

The distil-small.en model is well-suited for deployment in memory-constrained environments, such as on-device applications, where the small model size is a key requirement. It can be used to add high-quality speech transcription capabilities to a wide range of applications, from accessibility tools to voice interfaces.

Things to try

One interesting thing to try with the distil-small.en model is to use it as an assistant model for speculative decoding with the larger Whisper models. By combining distil-small.en with Whisper, you can obtain the exact same outputs as Whisper while being 2 times faster, making it a drop-in replacement for existing Whisper pipelines.
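Below is a rough sketch of that speculative-decoding setup. The pairing with openai/whisper-medium.en as the main model, and the max_new_tokens value, are assumptions for illustration; any English-only Whisper checkpoint that shares the same tokenizer should behave similarly.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Main Whisper model (assumed checkpoint, chosen for illustration only).
model_id = "openai/whisper-medium.en"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=dtype).to(device)
processor = AutoProcessor.from_pretrained(model_id)

# distil-small.en acts as the draft/assistant model for speculative decoding.
assistant = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-small.en", torch_dtype=dtype
).to(device)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant},
    device=device,
)
print(pipe("short_clip.wav")["text"])  # placeholder audio file
```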

Read more


📊

distil-large-v2

distil-whisper

Total Score

490

The distil-large-v2 model is a distilled version of the Whisper large-v2 model. It is 6 times faster, 49% smaller, and performs within 1% WER on out-of-distribution evaluation sets compared to the larger Whisper model. This makes it a more efficient alternative for speech recognition tasks. The Distil-Whisper repository provides the training code used to create this model.

Model inputs and outputs

The distil-large-v2 model is a speech recognition model that takes audio as input and outputs text transcriptions. It transcribes audio of up to 30 seconds in a single pass and can handle longer recordings via chunking, making it suitable for both short-form and long-form transcription.

Inputs

  • Audio data (e.g. wav, mp3, etc.)

Outputs

  • Text transcription of the input audio
  • Optional: timestamps for the transcribed text

Capabilities

The distil-large-v2 model demonstrates strong performance on speech recognition tasks, performing within 1% WER of the larger Whisper large-v2 model. It is particularly adept at handling accents, background noise, and technical language. Note that, unlike the multilingual Whisper checkpoints, the Distil-Whisper models currently support English speech recognition only.

What can I use it for?

The distil-large-v2 model is well-suited for applications that require efficient and accurate speech recognition, such as automated transcription, accessibility tools, and language learning applications. Its speed and size also suggest that it could be used as a building block for more complex speech-to-text systems.

Things to try

One interesting aspect of the distil-large-v2 model is its ability to perform long-form transcription through the use of a chunking algorithm. This allows the model to transcribe audio samples of arbitrary length, which could be useful for transcribing podcasts, lectures, or other long-form audio content.
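As a small sketch of that chunked long-form usage (the file name, chunk length, and batch size are placeholders), segment-level timestamps can be requested alongside the text:

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    chunk_length_s=15,  # split long audio into 15 s windows
    batch_size=4,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)

# return_timestamps=True adds (start, end) timestamps for each transcribed segment.
result = pipe("lecture.mp3", return_timestamps=True)  # placeholder audio file
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```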

Read more


🤔

whisper-medium.en

openai

Total Score

41

The whisper-medium.en model is an English-only version of Whisper, OpenAI's pre-trained model family for automatic speech recognition (ASR) and speech translation. Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. The model was trained on 680k hours of labelled speech data using large-scale weak supervision. Similar models in the Whisper family include the whisper-tiny.en, whisper-small, and whisper-large checkpoints, which vary in size and performance. The whisper-medium.en model sits in the middle of this range, with 769 million parameters.

Model inputs and outputs

Inputs

  • Audio waveform as a numpy array
  • Sampling rate of the input audio

Outputs

  • Text transcription of the input audio, in the same language as the input
  • Optionally, timestamps for the start and end of each transcribed text chunk

Capabilities

The whisper-medium.en model exhibits improved robustness to accents, background noise, and technical language compared to many existing ASR systems. Zero-shot translation from multiple languages into English is provided by the multilingual Whisper checkpoints (such as whisper-medium) rather than this English-only version. The model's accuracy on speech recognition tasks is near state-of-the-art level.

However, the model's weakly supervised training on large-scale noisy data means it may generate text that is not actually spoken in the audio input (hallucination). The Whisper family also performs unevenly across languages, with lower accuracy on low-resource and low-discoverability languages. The model's sequence-to-sequence architecture makes it prone to generating repetitive text.

What can I use it for?

The whisper-medium.en model is primarily intended for use by AI researchers studying the robustness, generalization, capabilities, biases, and limitations of large language models. However, it may also be useful as an ASR solution for developers, especially for English speech recognition. The model's transcription capabilities could potentially be used to improve accessibility tools.

While the model cannot be used for real-time transcription out of the box, its speed and size suggest that others may be able to build applications on top of it that enable near-real-time speech recognition. There are also potential concerns around dual use, as the model's capabilities could enable more actors to build surveillance technologies or scale up existing efforts. The model may also have some ability to recognize specific individuals, which raises safety and privacy concerns.

Things to try

One interesting comparison is between this English-only checkpoint and the multilingual whisper-medium checkpoint, which adds speech translation into English; you could compare their performance on transcription versus translation tasks.

Another area to explore is the model's robustness to different types of audio input, such as recordings with background noise, accents, or technical terminology. You could also investigate how performance varies across different demographics and speaking styles.

Finally, you could look into fine-tuning the pre-trained whisper-medium.en model on a specific dataset or task, as described in the Fine-Tune Whisper with Transformers blog post. This could help improve the model's predictive capabilities for certain use cases.
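A minimal sketch of the waveform-plus-sampling-rate interface described above, using a public LibriSpeech dummy split as placeholder audio:

```python
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-medium.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en")

# Placeholder audio: a short 16 kHz LibriSpeech validation clip.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

# Waveform (numpy array) + sampling rate in, token IDs out, decoded to text.
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```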

Read more


🔮

whisper-medium

openai

Total Score

176

The whisper-medium model is a pre-trained speech recognition and translation model developed by OpenAI. It is part of the Whisper family of models, which demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. The whisper-medium checkpoint has 769 million parameters and is the multilingual variant of this size (an English-only counterpart, whisper-medium.en, is also available). It can be used for both speech recognition, where it transcribes audio in the same language, and speech translation, where it transcribes audio into English. The Whisper models are available in a range of sizes, from whisper-tiny with 39 million parameters to whisper-large and whisper-large-v2 with 1.55 billion parameters.

Model inputs and outputs

Inputs

  • Audio samples in various formats and sampling rates

Outputs

  • Transcriptions of the input audio, either in the same language (speech recognition) or translated into English (speech translation)
  • Optionally, timestamps for the transcribed text

Capabilities

The Whisper models demonstrate strong performance on a variety of speech recognition and translation tasks, including handling accents, background noise, and technical language. They can be used for zero-shot translation, taking audio in one language and translating it into English without any fine-tuning. However, the models can also sometimes generate text that is not actually present in the audio input (known as "hallucination"), and their performance can vary across different languages and accents.

What can I use it for?

The whisper-medium model and the other Whisper models can be useful for developers and researchers working on improving accessibility tools, such as closed captioning or subtitle generation. The models' speed and accuracy suggest they could be used to build near-real-time speech recognition and translation applications. However, users should be aware of the models' limitations, particularly around potential biases and disparate performance across languages and accents.

Things to try

One interesting aspect of the Whisper models is their ability to handle audio of arbitrary length through a chunking algorithm. This allows the models to be used for long-form transcription, where the audio is split into smaller segments, transcribed, and then reassembled. Users can experiment with this functionality to see how it performs on their specific use cases. Additionally, the Whisper models can be fine-tuned on smaller, domain-specific datasets to improve their performance in particular areas; the blog post on fine-tuning Whisper provides a step-by-step guide on how to do this.
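To make the zero-shot translation idea concrete, here is a hedged sketch that translates French speech into English text with the multilingual checkpoint; the streamed Multilingual LibriSpeech sample is just a placeholder.

```python
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

# Ask the decoder to translate French speech into English text.
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")

# Placeholder audio: one French sample streamed from Multilingual LibriSpeech.
ds = load_dataset("facebook/multilingual_librispeech", "french", split="test", streaming=True)
sample = next(iter(ds))["audio"]

inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")
predicted_ids = model.generate(inputs.input_features, forced_decoder_ids=forced_decoder_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```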

Read more
