Ylacombe

Models by this creator


whisper-large-v3-turbo


Total Score: 412

The whisper-large-v3-turbo model is a finetuned version of the Whisper large-v3 model, a state-of-the-art automatic speech recognition (ASR) and speech translation model proposed by Alec Radford et al. from OpenAI. Trained on over 5 million hours of labeled data, Whisper generalizes well to many datasets and domains without fine-tuning. In whisper-large-v3-turbo, the number of decoding layers is reduced from 32 to 4, yielding a much faster model at the cost of a minor quality degradation.

Model inputs and outputs

The whisper-large-v3-turbo model takes audio samples as input and generates transcribed text as output. It can be used for speech recognition, where the output is in the same language as the input audio, as well as speech translation, where the output is in a different language.

Inputs

- **Audio samples**: Raw audio waveforms; inputs are resampled to 16kHz, the rate the model was trained on, before feature extraction.

Outputs

- **Transcribed text**: Text transcriptions of the input audio.
- **Timestamps (optional)**: Timestamps marking the start and end time of each transcribed segment.

Capabilities

The Whisper models demonstrate strong performance on speech recognition and translation tasks, with improved robustness to accents, background noise, and technical language compared to many existing ASR systems. The models can also perform zero-shot translation from multiple languages into English.

What can I use it for?

The whisper-large-v3-turbo model can be useful for a variety of applications, such as:

- **Transcription and translation**: Transcribe audio in various languages and translate it into English.
- **Accessibility tools**: Leverage the model's transcription capabilities for accessibility features such as live captioning or subtitling of audio/video content.
- **Voice interaction and assistants**: Integrate the model's ASR and translation abilities into voice-based interfaces and digital assistants.

Things to try

One interesting aspect of the Whisper models is their ability to automatically detect the language of the input audio, so both recognition and translation work without manually specifying the source language. You can experiment with this by providing audio samples in different languages and observing how the model handles them. Additionally, the models support returning word-level timestamps, which can be useful for applications that require precise alignment between the transcribed text and the audio; try the return_timestamps="word" parameter, as in the sketches below.
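As a concrete starting point, here is a minimal sketch of running the model through the Hugging Face transformers pipeline. The checkpoint name ylacombe/whisper-large-v3-turbo and the file name audio.mp3 are assumptions for illustration; substitute your own.

```python
import torch
from transformers import pipeline

# Load the model via the ASR pipeline (checkpoint name assumed for illustration).
pipe = pipeline(
    "automatic-speech-recognition",
    model="ylacombe/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device="cuda:0",  # use device="cpu" if no GPU is available
)

# Transcribe a local audio file; the pipeline resamples it to 16kHz internally.
result = pipe("audio.mp3")
print(result["text"])
```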
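Speech translation uses the same pipeline with the translate task passed through generate_kwargs. This is a sketch under the same assumptions; french_audio.mp3 is a hypothetical input file.

```python
# Translate non-English speech into English text.
# "french_audio.mp3" is a hypothetical input file.
result = pipe("french_audio.mp3", generate_kwargs={"task": "translate"})
print(result["text"])
```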
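To see the word-level timing information mentioned above, pass return_timestamps="word"; each entry in the returned chunks carries a (start, end) tuple in seconds. Again, audio.mp3 is an assumed file name.

```python
# Request word-level timestamps alongside the transcription.
result = pipe("audio.mp3", return_timestamps="word")

# Each chunk pairs a word with its (start, end) time in seconds.
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```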


Updated 10/4/2024