Belle-whisper-large-v3-zh

Maintainer: BELLE-2

Total Score

66

Last updated 5/28/2024


  • Run this model: Run on HuggingFace
  • API spec: View on HuggingFace
  • Github link: No Github link provided
  • Paper link: No paper link provided


Model overview

The Belle-whisper-large-v3-zh model is a fine-tuned version of the Whisper large model, demonstrating a 24-65% relative improvement in performance on Chinese ASR benchmarks compared to the original Whisper large model. Developed by the BELLE-2 team, this model has been optimized for enhanced Chinese speech recognition capabilities.

Compared to the Whisper-large-v3 model, which shows improved performance across a wide variety of languages, the Belle-whisper-large-v3-zh model focuses specifically on improving accuracy for Chinese speech recognition. It was fine-tuned on datasets like AISHELL1, AISHELL2, WENETSPEECH, and HKUST to achieve these gains.

Model inputs and outputs

Inputs

  • Audio files: The model takes audio files as input and performs speech recognition or transcription.

Outputs

  • Transcription text: The model outputs the transcribed text from the input audio file.
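
To make the input/output contract concrete, here is a minimal sketch of running the model through the Hugging Face transformers ASR pipeline. The checkpoint id is inferred from the maintainer and model name above, and audio.wav is a hypothetical local recording.

```python
# Minimal sketch: transcribe a Chinese audio file with the fine-tuned model.
# Assumes the transformers and torch packages are installed; "audio.wav" is
# a hypothetical recording (the pipeline resamples it to 16 kHz internally).
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="BELLE-2/Belle-whisper-large-v3-zh",
)

result = transcriber("audio.wav")
print(result["text"])  # the Chinese transcription
```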

Capabilities

The Belle-whisper-large-v3-zh model demonstrates significantly improved performance on Chinese speech recognition tasks compared to the original Whisper large model. This makes it well-suited for applications that require accurate Chinese speech-to-text transcription, such as meeting transcripts, voice assistants, and captioning for Chinese media.

What can I use it for?

The Belle-whisper-large-v3-zh model can be particularly useful for developers and researchers working on Chinese speech recognition applications. It could be integrated into products or services that require accurate Chinese transcription, such as:

  • Automated captioning and subtitling for Chinese videos and podcasts (see the timestamped sketch after this list)
  • Voice-controlled smart home devices and virtual assistants for Chinese-speaking users
  • Meeting and conference transcription services for Chinese-language businesses
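
For the captioning and subtitling use case, segment timestamps are the main extra requirement. Below is a rough sketch, assuming the transformers package; podcast_episode.wav is a hypothetical file, and the subtitle formatting is deliberately simplified (real SRT output needs HH:MM:SS,mmm times).

```python
# Rough sketch: timestamped transcription for simple subtitle generation.
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="BELLE-2/Belle-whisper-large-v3-zh",
    chunk_length_s=30,       # process long audio in 30-second windows
    return_timestamps=True,  # emit (start, end) times per segment
)

output = transcriber("podcast_episode.wav")  # hypothetical long recording
for i, chunk in enumerate(output["chunks"], start=1):
    start, end = chunk["timestamp"]
    end = end if end is not None else start  # end can be None on the last chunk
    print(f"{i}\n{start:.2f} --> {end:.2f}\n{chunk['text']}\n")
```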

Things to try

One interesting aspect of the Belle-whisper-large-v3-zh model is its ability to handle complex acoustic environments, such as the WENETSPEECH meeting dataset. Developers could experiment with using this model to transcribe audio from noisy or challenging settings, like crowded offices or public spaces, to see how it performs compared to other ASR systems.

Additionally, the provided fine-tuning instructions offer an opportunity to further customize the model's performance by training it on domain-specific data. Researchers could explore how fine-tuning the model on additional Chinese speech datasets or specialized vocabularies might impact its transcription accuracy for their particular use case.
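The exact recipe depends on the BELLE-2 instructions, but the data-preparation stage of Whisper fine-tuning generally looks like the sketch below. This is one common pattern rather than the official script; my_zh_corpus is a hypothetical dataset with "audio" and "sentence" columns.

```python
# Sketch of Whisper fine-tuning data preparation (a common pattern, not
# necessarily the official BELLE-2 script). "my_zh_corpus" is hypothetical.
from datasets import Audio, load_dataset
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("BELLE-2/Belle-whisper-large-v3-zh")

ds = load_dataset("my_zh_corpus", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # Whisper expects 16 kHz

def prepare(batch):
    audio = batch["audio"]
    # Encoder input: log-Mel features computed from the raw waveform.
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Decoder labels: token ids of the reference transcription.
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

ds = ds.map(prepare, remove_columns=ds.column_names)
# The processed dataset can then be fed to Seq2SeqTrainer, as described in
# the Fine-Tune Whisper with Transformers guide.
```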




Related Models


whisper-large-zh-cv11

jonatasgrosman

Total Score

64

The whisper-large-zh-cv11 model is a fine-tuned version of the openai/whisper-large-v2 model on Chinese (Mandarin), using the train and validation splits of the Common Voice 11 dataset. This model demonstrates improved performance on Chinese speech recognition compared to the original Whisper large model, with a 24-65% relative improvement on benchmarks like AISHELL1, AISHELL2, WENETSPEECH, and HKUST. Two similar models are the wav2vec2-large-xlsr-53-chinese-zh-cn and Belle-whisper-large-v3-zh models, which also target Chinese speech recognition with fine-tuning on various datasets.

Model inputs and outputs

Inputs

  • Audio: The model takes audio files as input, which can be in various formats like .wav, .mp3, etc. The audio should be sampled at 16kHz.

Outputs

  • Transcription: The model outputs a transcription of the input audio in Chinese (Mandarin). The transcription includes casing and punctuation.

Capabilities

The whisper-large-zh-cv11 model demonstrates strong performance on Chinese speech recognition tasks, outperforming the original Whisper large model by a significant margin. It is able to handle a variety of accents, background noise, and technical language in the audio input.

What can I use it for?

This model can be used to build applications that require accurate Chinese speech transcription, such as:

  • Transcription of lecture recordings, interviews, or meetings
  • Subtitling and captioning for Chinese-language videos
  • Voice interfaces and virtual assistants for Mandarin speakers

The model's performance improvements over the original Whisper large model make it a more viable option for commercial deployment in Chinese-language applications.

Things to try

One interesting aspect of this model is its ability to transcribe both numerical values and more complex language. You could try testing the model's performance on audio with a mix of numerical and text-based content, and see how it compares to the original Whisper large model or other Chinese ASR models. Another idea is to fine-tune the model further on your own domain-specific data to see if you can achieve even better results for your particular use case. The Fine-Tune Whisper with Transformers blog post provides a guide on how to approach fine-tuning Whisper models.
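
As with the other Whisper fine-tunes here, usage follows the standard pipeline pattern. A minimal sketch, assuming the transformers package; sample.mp3 is a hypothetical file, and pinning the language and task avoids accidental language detection or translation:

```python
# Minimal sketch: Chinese transcription with whisper-large-zh-cv11.
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="jonatasgrosman/whisper-large-zh-cv11",
)

result = pipe(
    "sample.mp3",  # hypothetical audio file
    generate_kwargs={"language": "zh", "task": "transcribe"},
)
print(result["text"])
```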



whisper-large-v3

openai

Total Score

2.6K

The whisper-large-v3 model is a general-purpose speech recognition model developed by OpenAI. It is the latest version of the Whisper model, building on the previous Whisper large models. The whisper-large-v3 model has a few minor architectural differences from the previous large models, including using 128 Mel frequency bins instead of 80 and adding a new language token for Cantonese. The Whisper model was trained on a massive 680,000 hours of audio data: 65% English data, 18% non-English data with English transcripts, and 17% non-English data with non-English transcripts, covering 98 languages. This allows the model to perform well on a diverse range of speech recognition and translation tasks without needing to be fine-tuned on specific datasets. Similar Whisper models include the Whisper medium, Whisper tiny, and the whisper-large-v3 model developed by Nate Raw. There is also an incredibly fast version of the Whisper large model by Vaibhav Srivastav.

Model inputs and outputs

The whisper-large-v3 model takes audio samples as input and generates text transcripts as output. The audio can be in any of the 98 languages covered by the training data. The model can also be used for speech translation, where it generates text in a different language than the audio.

Inputs

  • Audio samples in any of the 98 languages the model was trained on

Outputs

  • Text transcripts of the audio in the same language
  • Translated text transcripts in a different language

Capabilities

The whisper-large-v3 model demonstrates strong performance on a variety of speech recognition and translation tasks, with 10-20% lower error rates compared to the previous Whisper large model. It is robust to accents, background noise, and technical language, and can perform zero-shot translation from multiple languages into English. However, the model's performance is uneven across languages, with lower accuracy on low-resource and low-discoverability languages where less training data was available. It also has a tendency to generate repetitive or hallucinated text that is not actually present in the audio input.

What can I use it for?

The primary intended use of the Whisper models is for AI researchers studying model capabilities, robustness, and limitations. However, the models can also be quite useful as a speech recognition solution for developers, especially for English transcription tasks. The Whisper models could be used to build applications that improve accessibility, such as closed captioning or voice-to-text transcription. While the models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build near-real-time applications on top of them.

Things to try

One interesting aspect of the Whisper models is their ability to perform speech translation, generating text transcripts in a different language than the audio input. Developers could experiment with using the model for tasks like simultaneous interpretation or multilingual subtitling. Another avenue to explore is fine-tuning the pre-trained Whisper model on specific datasets or domains. The blog post Fine-Tune Whisper with Transformers provides a guide on how to fine-tune the model with as little as 5 hours of labeled data, which can improve performance on particular languages or use cases.
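
The speech-translation capability described above can be exercised by switching the generation task. A hedged sketch, where french_interview.wav is a hypothetical non-English recording:

```python
# Hedged sketch: zero-shot speech translation into English with
# whisper-large-v3. "french_interview.wav" is a hypothetical recording.
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

result = pipe(
    "french_interview.wav",
    generate_kwargs={"task": "translate"},  # emit English text instead
)
print(result["text"])
```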



whisper-large-v2

openai

Total Score

1.6K

The whisper-large-v2 model is a pre-trained Transformer-based encoder-decoder model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labeled data by OpenAI, Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. Compared to the original Whisper large model, the whisper-large-v2 model has been trained for 2.5x more epochs with added regularization for improved performance.

Model inputs and outputs

Inputs

  • Audio samples: The model takes audio samples as input and performs either speech recognition or speech translation.

Outputs

  • Text transcription: The model outputs text transcriptions of the input audio. For speech recognition, the transcription is in the same language as the audio; for speech translation, the transcription is in a different language than the audio.
  • Timestamps (optional): The model can optionally output timestamps for the transcribed text.

Capabilities

The whisper-large-v2 model exhibits improved robustness to accents, background noise, and technical language compared to many existing ASR systems. It also demonstrates strong zero-shot translation capabilities, allowing it to translate speech from multiple languages into English with high accuracy.

What can I use it for?

The whisper-large-v2 model can be a useful tool for developers building speech recognition and translation applications. Its strong generalization capabilities suggest it may be particularly valuable for tasks like improving accessibility through real-time captioning, language translation, and other speech-to-text use cases. However, the model's performance can vary across languages, accents, and demographics, so users should carefully evaluate its performance in their specific domain before deployment.

Things to try

One interesting aspect of the whisper-large-v2 model is its ability to perform long-form transcription of audio samples longer than 30 seconds. By using a chunking algorithm, the model can transcribe audio of arbitrary length, making it a useful tool for transcribing podcasts, lectures, and other long-form audio content. Users can also experiment with fine-tuning the model on their own data to further improve its performance for specific use cases.
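
The long-form chunking approach mentioned above maps directly onto pipeline arguments. A minimal sketch, where lecture.mp3 is a hypothetical hour-long recording:

```python
# Minimal sketch: long-form transcription via chunking. chunk_length_s
# splits audio into windows matching Whisper's 30-second receptive field.
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,
    batch_size=8,            # decode several windows in parallel
    return_timestamps=True,  # the optional timestamps mentioned above
)

out = pipe("lecture.mp3")  # hypothetical long recording
print(out["text"])
```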



BELLE-7B-2M

BelleGroup

Total Score

186

BELLE-7B-2M is a 7 billion parameter language model fine-tuned by the BelleGroup on a dataset of 2 million Chinese and 50,000 English samples. It is based on the Bloomz-7b1-mt model and has good Chinese instruction understanding and response generation capabilities. The model can be easily loaded using AutoModelForCausalLM from Transformers. Similar models include the Llama-2-13B-GGML model created by TheBloke, which is a GGML version of Meta's Llama 2 13B model. Both models are large language models trained on internet data and optimized for instructional tasks.

Model inputs and outputs

Inputs

  • Text input in the format Human: {input} \n\nAssistant:

Outputs

  • Textual responses generated by the model, continuing the conversation from the provided input

Capabilities

The BELLE-7B-2M model demonstrates strong performance on Chinese instruction understanding and response generation tasks. It can engage in open-ended conversations, provide informative answers to questions, and assist with a variety of language-based tasks.

What can I use it for?

The BELLE-7B-2M model could be useful for building conversational AI assistants, chatbots, or language-based applications targeting Chinese and English users. Its robust performance on instructional tasks makes it well-suited for applications that require understanding and following user instructions.

Things to try

You could try prompting the BELLE-7B-2M model with open-ended questions or tasks to see the breadth of its capabilities. For example, you could ask it to summarize an article, generate creative writing, or provide step-by-step instructions for a DIY project. Experimenting with different prompts and use cases can help you better understand the model's strengths and limitations.
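
Loading the model with AutoModelForCausalLM and the Human/Assistant prompt format described above looks roughly like the sketch below, assuming a GPU with enough memory for the fp16 7B weights.

```python
# Rough sketch: prompting BELLE-7B-2M with its expected conversation format.
# Assumes transformers, torch, and a GPU able to hold the fp16 7B weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BelleGroup/BELLE-7B-2M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The model expects the "Human: {input} \n\nAssistant:" format.
prompt = "Human: 请用三句话介绍一下北京。\n\nAssistant:"  # "Introduce Beijing in three sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```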
