whisper-large-zh-cv11

Maintainer: jonatasgrosman

Total Score: 64

Last updated: 5/28/2024


  • Run this model: Run on HuggingFace
  • API spec: View on HuggingFace
  • Github link: No Github link provided
  • Paper link: No paper link provided


Model overview

The whisper-large-zh-cv11 model is a fine-tuned version of the openai/whisper-large-v2 model for Chinese (Mandarin), trained on the train and validation splits of the Common Voice 11.0 dataset. It improves markedly on Chinese speech recognition over the original Whisper large model, with a 24-65% relative improvement on benchmarks such as AISHELL1, AISHELL2, WENETSPEECH, and HKUST.

Two similar models are the wav2vec2-large-xlsr-53-chinese-zh-cn and Belle-whisper-large-v3-zh models, which also target Chinese speech recognition with fine-tuning on various datasets.
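
For a quick test, the model can be loaded with the Hugging Face transformers ASR pipeline. The snippet below is a minimal sketch: the model ID matches the Hub listing above, while the audio path and the language/task generation hints are illustrative assumptions (recent transformers releases accept these hints for Whisper checkpoints).

```python
# Minimal sketch: transcribe a Chinese audio clip with the fine-tuned model.
# "sample_zh.wav" is a placeholder path; audio should be sampled at 16 kHz.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="jonatasgrosman/whisper-large-zh-cv11",
)

result = asr(
    "sample_zh.wav",
    generate_kwargs={"language": "zh", "task": "transcribe"},  # Whisper hints
)
print(result["text"])
```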

Model inputs and outputs

Inputs

  • Audio: The model takes audio files as input, in common formats such as .wav or .mp3, sampled at 16 kHz.

Outputs

  • Transcription: The model outputs a transcription of the input audio in Chinese (Mandarin). The transcription includes casing and punctuation.
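
Because the model expects 16 kHz input, audio recorded at other sample rates should be resampled first. Below is a minimal sketch using librosa and soundfile (assumed dependencies; any resampler, such as torchaudio or ffmpeg, works equally well):

```python
# Decode an arbitrary audio file and resample it to 16 kHz mono.
# "input.mp3" is a placeholder path.
import librosa
import soundfile as sf

audio, sr = librosa.load("input.mp3", sr=16000, mono=True)
sf.write("input_16k.wav", audio, sr)
```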

Capabilities

The whisper-large-zh-cv11 model demonstrates strong performance on Chinese speech recognition tasks, outperforming the original Whisper large model by a significant margin. It is able to handle a variety of accents, background noise, and technical language in the audio input.

What can I use it for?

This model can be used to build applications that require accurate Chinese speech transcription, such as:

  • Transcription of lecture recordings, interviews, or meetings
  • Subtitling and captioning for Chinese-language videos
  • Voice interfaces and virtual assistants for Mandarin speakers

The model's performance improvements over the original Whisper large model make it a more viable option for commercial deployment in Chinese-language applications.

Things to try

One interesting aspect of this model is its ability to transcribe numerical values alongside more complex language. You could test its performance on audio that mixes numbers and ordinary prose, and see how it compares to the original Whisper large model or other Chinese ASR models, as in the sketch below.
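
A minimal comparison sketch, assuming both model IDs as listed on the Hub and a placeholder audio clip containing mixed numbers and prose:

```python
# Run the same clip through the fine-tuned model and the original
# whisper-large-v2, printing both transcripts for side-by-side comparison.
from transformers import pipeline

MODELS = [
    "jonatasgrosman/whisper-large-zh-cv11",
    "openai/whisper-large-v2",
]

for model_id in MODELS:
    asr = pipeline("automatic-speech-recognition", model=model_id)
    out = asr(
        "numbers_and_text_zh.wav",  # placeholder 16 kHz clip
        generate_kwargs={"language": "zh", "task": "transcribe"},
    )
    print(f"{model_id}: {out['text']}")
```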

Another idea is to fine-tune the model further on your own domain-specific data to see if you can achieve even better results for your particular use case. The Fine-Tune Whisper with Transformers blog post provides a guide on how to approach fine-tuning Whisper models.
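
As a rough orientation, the sketch below condenses the recipe from that blog post. The dataset name and its "audio"/"sentence" columns are hypothetical placeholders, and the hyperparameters are illustrative rather than tuned.

```python
# Condensed fine-tuning sketch following the Fine-Tune Whisper with
# Transformers blog post. "your-org/your-domain-speech" is a hypothetical
# dataset with "audio" and "sentence" columns and a "train" split.
import torch
from dataclasses import dataclass
from typing import Any, Dict, List
from datasets import load_dataset, Audio
from transformers import (WhisperProcessor, WhisperForConditionalGeneration,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

model_id = "jonatasgrosman/whisper-large-zh-cv11"
processor = WhisperProcessor.from_pretrained(model_id, language="zh", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.config.forced_decoder_ids = None  # let training control prompting

ds = load_dataset("your-org/your-domain-speech")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

def prepare(batch):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

ds = ds.map(prepare, remove_columns=ds["train"].column_names)

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """Pads log-Mel features and label ids separately, masking pad labels."""
    processor: Any

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        inputs = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(inputs, return_tensors="pt")
        labels = self.processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features], return_tensors="pt")
        batch["labels"] = labels["input_ids"].masked_fill(
            labels.attention_mask.ne(1), -100)  # ignore padding in the loss
        return batch

args = Seq2SeqTrainingArguments(
    output_dir="./whisper-zh-domain",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    max_steps=2000,
    fp16=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    data_collator=DataCollatorSpeechSeq2SeqWithPadding(processor),
    tokenizer=processor.feature_extractor,
)
trainer.train()
```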



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models


wav2vec2-large-xlsr-53-chinese-zh-cn

Maintainer: jonatasgrosman

Total Score: 73

wav2vec2-large-xlsr-53-chinese-zh-cn is a fine-tuned version of the facebook/wav2vec2-large-xlsr-53 model for speech recognition in Chinese. The model was fine-tuned on the train and validation splits of the Common Voice 6.1, CSS10, and ST-CMDS datasets, and can be used to transcribe Chinese speech audio sampled at 16 kHz.

Model inputs and outputs

Inputs

  • Audio files: The model takes in audio files sampled at 16 kHz.

Outputs

  • Transcripts: The model outputs transcripts of the input speech audio in Chinese.

Capabilities

The wav2vec2-large-xlsr-53-chinese-zh-cn model demonstrates strong performance for speech recognition in Chinese. It was fine-tuned on a diverse set of Chinese speech datasets, allowing it to handle a variety of accents and domains.

What can I use it for?

This model can be used to transcribe Chinese speech audio for applications such as automated captioning, voice interfaces, and speech-to-text pipelines. It could be particularly useful for developers building Chinese-language products or services that require speech recognition.

Things to try

One interesting thing to try is comparing the model's performance on different Chinese speech datasets or audio samples. This could help identify where the model excels or struggles, and inform future fine-tuning or model development efforts. Combining it with language models or other components in a larger speech processing pipeline could also lead to interesting applications.
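
As a minimal sketch, this model loads through the same transformers pipeline; since it is a CTC model rather than an encoder-decoder like Whisper, no language or task hints are needed (the audio path is a placeholder):

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn",
)
print(asr("sample_zh.wav")["text"])  # expects 16 kHz audio
```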



Belle-whisper-large-v3-zh

Maintainer: BELLE-2

Total Score: 66

The Belle-whisper-large-v3-zh model is a fine-tuned version of the Whisper large model, demonstrating a 24-65% relative improvement on Chinese ASR benchmarks compared to the original Whisper large model. Developed by the BELLE-2 team, it has been optimized for Chinese speech recognition. Whereas the whisper-large-v3 model improves performance across a wide variety of languages, Belle-whisper-large-v3-zh focuses specifically on accuracy for Chinese, having been fine-tuned on datasets such as AISHELL1, AISHELL2, WENETSPEECH, and HKUST.

Model inputs and outputs

Inputs

  • Audio files: The model takes audio files as input and performs speech recognition or transcription.

Outputs

  • Transcription text: The model outputs the transcribed text from the input audio file.

Capabilities

The Belle-whisper-large-v3-zh model demonstrates significantly improved performance on Chinese speech recognition tasks compared to the original Whisper large model. This makes it well-suited for applications that require accurate Chinese speech-to-text transcription, such as meeting transcripts, voice assistants, and captioning for Chinese media.

What can I use it for?

The model can be particularly useful for developers and researchers working on Chinese speech recognition applications. It could be integrated into products or services that require accurate Chinese transcription, such as:

  • Automated captioning and subtitling for Chinese videos and podcasts
  • Voice-controlled smart home devices and virtual assistants for Chinese-speaking users
  • Meeting and conference transcription services for Chinese-language businesses

Things to try

One interesting aspect of the Belle-whisper-large-v3-zh model is its ability to handle complex acoustic environments, such as the WENETSPEECH meeting dataset. Developers could experiment with transcribing audio from noisy or challenging settings, like crowded offices or public spaces, to see how it performs compared to other ASR systems. Additionally, the provided fine-tuning instructions offer an opportunity to further customize the model by training it on domain-specific data. Researchers could explore how fine-tuning on additional Chinese speech datasets or specialized vocabularies affects transcription accuracy for their particular use case.
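
A minimal usage sketch, assuming a placeholder path to 16 kHz meeting audio and language/task hints supported by recent transformers releases:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="BELLE-2/Belle-whisper-large-v3-zh",
)
out = asr(
    "meeting_zh.wav",  # placeholder path
    generate_kwargs={"language": "zh", "task": "transcribe"},
)
print(out["text"])
```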



whisper-large-v2

Maintainer: openai

Total Score: 1.6K

The whisper-large-v2 model is a pre-trained Transformer-based encoder-decoder model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labeled data by OpenAI, Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. Compared to the original Whisper large model, whisper-large-v2 was trained for 2.5x more epochs with added regularization, improving performance.

Model inputs and outputs

Inputs

  • Audio samples: The model takes audio samples as input and performs either speech recognition or speech translation.

Outputs

  • Text transcription: The model outputs text transcriptions of the input audio. For speech recognition, the transcription is in the same language as the audio; for speech translation, it is in a different language.
  • Timestamps (optional): The model can optionally output timestamps for the transcribed text.

Capabilities

The whisper-large-v2 model exhibits improved robustness to accents, background noise, and technical language compared to many existing ASR systems. It also demonstrates strong zero-shot translation capabilities, translating speech from multiple languages into English with high accuracy.

What can I use it for?

The whisper-large-v2 model can be a useful tool for developers building speech recognition and translation applications. Its strong generalization suggests it may be particularly valuable for tasks like real-time captioning for accessibility, language translation, and other speech-to-text use cases. However, performance can vary across languages, accents, and demographics, so users should carefully evaluate the model in their specific domain before deployment.

Things to try

One interesting aspect of the whisper-large-v2 model is its ability to perform long-form transcription of audio longer than 30 seconds. By using a chunking algorithm, the model can transcribe audio of arbitrary length, making it a useful tool for transcribing podcasts, lectures, and other long-form content. Users can also experiment with fine-tuning the model on their own data to further improve its performance for specific use cases.
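
The long-form chunked transcription mentioned above can be sketched with the pipeline's chunking options; the audio path is a placeholder:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,       # process audio longer than 30 s in chunks
    return_timestamps=True,  # emit per-segment timestamps
)
out = asr("podcast_episode.mp3")  # placeholder path
for chunk in out["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```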



whisper-large

Maintainer: openai

Total Score: 438

The whisper-large model is a pre-trained AI model for automatic speech recognition (ASR) and speech translation, developed by OpenAI. Trained on 680k hours of labelled data, the Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. The whisper-large-v2 model is a newer version that surpasses the performance of the original whisper-large model, with no architecture changes. The whisper-medium model is a slightly smaller version with 769M parameters, while the whisper-tiny model is the smallest at 39M parameters. All of these Whisper models are available on the Hugging Face Hub.

Model inputs and outputs

Inputs

  • Audio samples, which the model converts to log-Mel spectrograms

Outputs

  • Textual transcriptions of the input audio, either in the same language as the audio (speech recognition) or in a different language (speech translation)
  • Optional timestamps for the transcriptions

Capabilities

The Whisper models demonstrate strong performance on a variety of speech recognition and translation tasks, exhibiting improved robustness to accents, background noise, and technical language. They can also perform zero-shot translation from multiple languages into English. However, the models may occasionally produce text that is not actually spoken in the audio input, a phenomenon known as "hallucination". Their performance also varies across languages, with lower accuracy on low-resource and less common languages.

What can I use it for?

The Whisper models are primarily intended for use by AI researchers studying model robustness, generalization, capabilities, biases, and constraints. However, they can also be useful for developers looking to build speech recognition or translation applications, especially for English speech. The models' speed and accuracy make them well-suited for applications that require transcription or translation of large volumes of audio, such as accessibility tools, media production, and language learning. Developers can build applications on top of the models to enable near-real-time speech recognition and translation.

Things to try

One interesting aspect of the Whisper models is their ability to perform long-form transcription of audio samples longer than 30 seconds, achieved through a chunking algorithm that allows the model to process audio of arbitrary length. Another unique feature is the model's ability to automatically detect the language of the input audio and perform the appropriate recognition or translation task; developers can leverage this by providing "context tokens" that inform the model of the desired task and language. Finally, the pre-trained Whisper models can be fine-tuned on smaller datasets to further improve performance on specific languages or domains. The Fine-Tune Whisper with Transformers blog post provides a step-by-step guide.
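
The "context tokens" mechanism can be sketched with the lower-level processor/model API; the French audio path is a placeholder, and librosa is an assumed dependency for decoding:

```python
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")

# Context tokens pinning the task and language; omit forced_decoder_ids
# to let the model auto-detect the spoken language instead.
forced_ids = processor.get_decoder_prompt_ids(language="french", task="translate")

audio, _ = librosa.load("speech_fr.wav", sr=16000)  # placeholder path
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```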
