# incredibly-fast-whisper

Creator: vaibhavs10

Total Score: 2.4K
The incredibly-fast-whisper model is an opinionated CLI tool built on top of OpenAI's Whisper large-v3 model, designed for blazingly fast audio transcription. Powered by Hugging Face Transformers, Optimum, and Flash Attention 2, it can transcribe 150 minutes of audio in less than 98 seconds, a significant speedup over the standard Whisper model. The tool is part of a community-driven project started by vaibhavs10 to showcase advanced Transformers optimizations. It is comparable to other Whisper-based models such as whisperx, whisper-diarization, and metavoice, each of which offers its own set of features and optimizations for speech-to-text transcription.

## Model inputs and outputs

### Inputs

- **Audio file**: The primary input, provided as a local file path or a URL.
- **Task**: Either transcription (the default) or translation into another language.
- **Language**: The language of the input audio; leave as "None" to let the model auto-detect it.
- **Batch size**: The number of parallel batches to compute; lower it to avoid out-of-memory (OOM) errors.
- **Timestamp format**: Timestamps can be emitted at either the chunk or the word level.
- **Diarization**: Speaker diarization via Pyannote.audio, which requires providing a Hugging Face API token.

### Outputs

The primary output is a transcription of the input audio, which can be saved to a JSON file.

## Capabilities

The incredibly-fast-whisper model leverages several advanced optimizations, including Flash Attention 2 and BetterTransformer, to achieve its impressive transcription speed. These allow it to significantly outperform the standard Whisper large-v3 model in speed while maintaining high accuracy.
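The inputs above (checkpoint, batching, Flash Attention 2) map naturally onto the Hugging Face Transformers ASR pipeline. The sketch below is illustrative rather than the tool's exact internals; the checkpoint name, `torch_dtype`, and `attn_implementation` flag follow the public Transformers API, and the audio file name is a hypothetical placeholder.

```python
def build_asr_pipeline(checkpoint="openai/whisper-large-v3", use_flash_attn=True):
    """Build a batched ASR pipeline roughly along the lines the tool uses.

    Assumes a recent `transformers` install and a CUDA GPU with the flash-attn
    package; pass use_flash_attn=False on hardware without Flash Attention 2.
    """
    import torch  # imported lazily so the sketch can be defined without a GPU
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=checkpoint,
        torch_dtype=torch.float16,
        device="cuda:0",
        model_kwargs=(
            {"attn_implementation": "flash_attention_2"} if use_flash_attn else {}
        ),
    )

# Typical call (hypothetical file name): transcribe in 30 s chunks, 24 at a time.
# pipe = build_asr_pipeline()
# outputs = pipe("audio.mp3", chunk_length_s=30, batch_size=24,
#                return_timestamps=True)
```

Lowering `batch_size` is the usual remedy for the OOM errors mentioned above, and `return_timestamps="word"` switches the pipeline from chunk-level to word-level timestamps.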
## What can I use it for?

The incredibly-fast-whisper model is well-suited to applications that require real-time or near-real-time audio transcription, such as live captioning, podcast production, or meeting transcription. Its speed and efficiency make it a compelling choice for these use cases, especially when dealing with large amounts of audio data.

## Things to try

One interesting feature of the model is its support for the distil-whisper/large-v2 checkpoint, a smaller and more efficient version of the Whisper model; experimenting with it can help you find the right balance between speed and accuracy for your use case. The Flash Attention 2 and BetterTransformer optimizations also invite experimentation: different configurations can be compared for their impact on transcription speed and quality.
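The headline benchmark above works out to roughly 92x faster than real time, and the same numbers show how batched, chunked decoding divides the work. The chunk length and batch size below are illustrative values for the calculation, not fixed defaults of the tool.

```python
# Real-time factor implied by the stated benchmark: 150 min in under 98 s.
audio_seconds = 150 * 60        # 9000 s of input audio
wall_clock_seconds = 98
rtf = audio_seconds / wall_clock_seconds
print(f"~{rtf:.0f}x faster than real time")  # ~92x

# Chunked, batched decoding: 30 s chunks grouped into batches of 24.
chunk_s, batch_size = 30, 24
num_chunks = audio_seconds // chunk_s       # 300 chunks
num_batches = -(-num_chunks // batch_size)  # ceiling division -> 13 batches
print(num_chunks, num_batches)
```

Halving the batch size doubles the number of batches (and roughly the sequential passes) but cuts peak memory, which is the trade-off behind the OOM guidance in the inputs section.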


Updated 7/2/2024