wav2vec2-large-960h-lv60-self

Maintainer: facebook

Total Score: 118

Last updated 5/27/2024


Run this model: Run on HuggingFace
API spec: View on HuggingFace
Github link: No Github link provided
Paper link: No paper link provided


Model overview

wav2vec2-large-960h-lv60-self is Facebook's large Wav2Vec2 model, pretrained on roughly 53k hours of unlabeled Libri-Light audio and fine-tuned on 960 hours of Librispeech, all 16kHz sampled speech, using a self-training objective. It demonstrates state-of-the-art performance on speech recognition tasks, outperforming the previous best semi-supervised methods with a simpler approach.

Similar models include wav2vec2-base-960h, a smaller base model pretrained and fine-tuned on the same 960 hours of Librispeech, and wav2vec2-xls-r-300m, a large multilingual version of Wav2Vec2 pretrained on 436k hours of speech across 128 languages.

Model inputs and outputs

Inputs

  • Audio: The model takes raw speech audio as input, which must be sampled at 16kHz.

Outputs

  • Transcription: The model outputs a text transcription of the input speech audio.
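
A minimal transcription sketch using the Transformers library is shown below; the file name speech.wav is a placeholder for your own 16kHz mono recording.

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Load the processor (feature extractor + CTC tokenizer) and the model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

# "speech.wav" is a placeholder; the model expects 16kHz mono audio
speech, sampling_rate = sf.read("speech.wav")
assert sampling_rate == 16000, "resample to 16kHz first"

inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token at each frame,
# then let the tokenizer collapse repeats and blanks into text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```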

Capabilities

The wav2vec2-large-960h-lv60-self model demonstrates state-of-the-art performance on speech recognition, achieving 1.8/3.3 WER on the Librispeech clean/other test sets when using all 960 hours of labeled data. It also performs well with limited labels: with as little as one hour of labeled data, it outperforms the previous state of the art trained on the 100-hour Librispeech subset, using 100 times less labeled data.
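
To sanity-check those numbers on your own recordings, word error rate can be computed with the third-party jiwer package. In this sketch, `transcribe` is a hypothetical wrapper around the inference snippet above, and the reference transcript is illustrative.

```python
# pip install jiwer  (third-party WER library)
import jiwer

# `transcribe` is a hypothetical wrapper around the inference code above;
# the model emits uppercase text, so normalize case before comparing.
reference = "mister quilter is the apostle of the middle classes"
hypothesis = transcribe("clip_0001.wav").lower()

print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```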

What can I use it for?

The wav2vec2-large-960h-lv60-self model is well-suited for building speech recognition systems, particularly for applications that require high accuracy on a variety of speech inputs. It can be used as a standalone acoustic model to transcribe audio files, or integrated into larger speech processing pipelines.
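
For quick integration, the high-level Transformers pipeline wraps feature extraction, inference, and CTC decoding in one call; the audio path below is a placeholder.

```python
from transformers import pipeline

# One-call ASR: the pipeline handles loading, resampling, and CTC decoding
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-large-960h-lv60-self",
)

result = asr("meeting_recording.wav")  # placeholder path
print(result["text"])
```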

Things to try

One interesting aspect of the wav2vec2-large-960h-lv60-self model is its ability to perform well with limited labeled data. Developers could experiment with fine-tuning the model on domain-specific datasets to adapt it for specialized use cases, potentially achieving strong results even when only a small amount of labeled data is available.
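
A rough sketch of such a fine-tuning setup follows. The dataset, data collator, and hyperparameters are placeholders; see Hugging Face's CTC fine-tuning guides for a complete recipe.

```python
from transformers import Wav2Vec2ForCTC, TrainingArguments, Trainer

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-960h-lv60-self",
    ctc_loss_reduction="mean",
)
# Freeze the convolutional feature encoder; with small labeled sets,
# typically only the transformer layers and CTC head are updated.
model.freeze_feature_encoder()

training_args = TrainingArguments(
    output_dir="wav2vec2-domain-finetuned",  # placeholder
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=5,
)

# `train_ds`, `eval_ds`, and `data_collator` are placeholders: a CTC data
# collator must pad the audio inputs and the label sequences separately.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=data_collator,
)
trainer.train()
```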



This summary was produced with help from an AI and may contain inaccuracies; check the links above to read the original source documents.

Related Models


wav2vec2-base-960h

Maintainer: facebook

Total Score: 241

wav2vec2-base-960h is a pre-trained speech recognition model developed by Facebook. It is based on the Wav2Vec2 architecture and was trained on 960 hours of LibriSpeech data. The model can be used for audio-to-text transcription and performs strongly on 16kHz sampled speech audio. Compared to similar models like whisper-large-v2 and whisper-large, wav2vec2-base-960h is optimized specifically for English speech recognition, while the Whisper models are more versatile, supporting both speech recognition and translation across multiple languages.

Model inputs and outputs

Inputs

  • Audio data: The model takes in 16kHz sampled speech audio.

Outputs

  • Transcribed text: The model outputs a transcription of the input audio as text.

Capabilities

The wav2vec2-base-960h model performs strongly on English speech recognition, achieving 3.4/8.6 WER on the clean/other test sets of the LibriSpeech dataset. It can handle a variety of audio conditions, including accents, background noise, and technical language.

What can I use it for?

The wav2vec2-base-960h model suits a variety of audio-to-text transcription applications, such as:

  • Generating transcripts for audio recordings, podcasts, or video content
  • Improving accessibility by providing text captions for audio-based media
  • Automating note-taking or meeting transcription
  • Enabling voice-based interfaces or virtual assistants

Companies in industries like media, education, and enterprise collaboration could monetize this model by building transcription services or integrating it into their products.

Things to try

Because the model operates on 16kHz sampled audio, it is well-suited to applications where audio quality may be lower, such as telephony or recordings made with mobile devices. Developers could experiment with transcribing a variety of real-world audio sources and comparing its performance to other speech recognition models; a chunked long-form transcription sketch follows this overview. Additionally, the model's strong performance on LibriSpeech suggests it is a good starting point for fine-tuning on domain-specific datasets or tasks.
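
For long recordings such as podcasts or meetings, the Transformers pipeline can transcribe in overlapping chunks; the file name below is a placeholder.

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

# Long audio is split into 30s windows with 5s of overlap on each side;
# the pipeline stitches the chunk transcripts back together.
result = asr("podcast_episode.wav", chunk_length_s=30, stride_length_s=5)
print(result["text"])
```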



wav2vec2-base

Maintainer: facebook

Total Score: 60

wav2vec2-base is a speech recognition model developed by Facebook's AI team. It is the base version of their Wav2Vec2 model, which learns powerful representations from speech audio alone and can outperform semi-supervised methods once fine-tuned on labeled speech data. The similar wav2vec2-base-960h model is the base model pre-trained and fine-tuned on 960 hours of LibriSpeech data, reaching 3.4/8.6 WER on the LibriSpeech clean/other test sets. The wav2vec2-large-960h-lv60-self model is a larger variant trained with a self-training objective, reaching an even lower 1.9/3.9 WER. Facebook has also released wav2vec2-xls-r-300m, a large-scale multilingual pre-trained model with 300 million parameters, trained on 436K hours of speech data across 128 languages; it can be fine-tuned for a variety of speech tasks such as automatic speech recognition, translation, and classification.

Model inputs and outputs

Inputs

  • Speech audio: The model takes in raw waveform audio, which must be sampled at 16kHz.

Outputs

  • Text transcription: After fine-tuning for ASR, the model outputs a text transcription of the input speech audio.

Capabilities

The wav2vec2-base model achieves strong speech recognition performance even when fine-tuned on small amounts of labeled data. For example, with just one hour of labeled data it can outperform previous state-of-the-art models trained on 100 hours, demonstrating that accurate speech recognition systems can be built with limited labeled data.

What can I use it for?

The wav2vec2-base model can serve as a foundation for building automatic speech recognition (ASR) systems. By fine-tuning it on domain-specific labeled data, you can create accurate transcription models for applications like voice interfaces, video captioning, or meeting transcription.

Things to try

To use wav2vec2-base for speech recognition, you'll need to create a custom tokenizer and fine-tune the model on labeled text data, as it was pre-trained on audio alone without any text labels; a sketch of building such a tokenizer follows this overview. Check out this blog post for a step-by-step guide to fine-tuning the model for English ASR. You can also explore the larger, more powerful wav2vec2-large-960h-lv60-self or multilingual wav2vec2-xls-r-300m models if you need higher accuracy or support for multiple languages.
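
As a sketch of that first step, a character-level CTC vocabulary and tokenizer might be built as below; the vocabulary here is illustrative, and in practice you would extract the character set from your own transcripts.

```python
import json
from transformers import Wav2Vec2CTCTokenizer

# Illustrative character vocabulary; "|" conventionally replaces the space
# character, marking word boundaries for CTC decoding.
chars = sorted(set("abcdefghijklmnopqrstuvwxyz'"))
vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = len(vocab)
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)

with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)
```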



w2v-bert-2.0

Maintainer: facebook

Total Score: 116

The w2v-bert-2.0 model is a Conformer-based speech encoder open-sourced by Facebook. It was pre-trained on 4.5 million hours of unlabeled audio covering more than 143 languages and can be fine-tuned for downstream tasks like automatic speech recognition (ASR) or audio classification. The model has 600 million parameters and is supported by the Transformers library. Similar models include Wav2Vec2-Base-960h, a base model pre-trained and fine-tuned on 960 hours of Librispeech, and Wav2Vec2-Base, the base model pre-trained on 16kHz speech audio. These models demonstrate the effectiveness of learning representations from speech audio alone and then fine-tuning on labeled data.

Model inputs and outputs

Inputs

  • Raw audio waveforms

Outputs

  • Audio embeddings from the top layer of the model, which can be used for downstream tasks after fine-tuning

Capabilities

The w2v-bert-2.0 model was pre-trained on a large and diverse dataset, allowing it to learn powerful representations that transfer to many speech-related tasks. By fine-tuning the model, it can be adapted to perform well on specific datasets and applications, such as automatic speech recognition.

What can I use it for?

The w2v-bert-2.0 model can be used as a speech encoder in a variety of applications, such as:

  • Automatic speech recognition (ASR): fine-tuned on a labeled speech dataset, it can transcribe audio into text.
  • Audio classification: fine-tuned on categorical labels, it can classify audio for tasks like speaker identification or emotion recognition.

As mentioned in the Transformers usage section, you can also use the model to extract audio embeddings and build your own downstream application on top of them; a short extraction sketch follows this overview.

Things to try

One interesting thing to try with w2v-bert-2.0 is to explore how it performs on low-resource languages or dialects. Since the model was pre-trained on a diverse dataset, it may leverage its learned representations to achieve good performance even with limited fine-tuning data; you could fine-tune it on different language datasets and compare the results. Another idea is to combine w2v-bert-2.0 with other speech-related models, such as text-to-speech or voice conversion models, to build more sophisticated speech systems.
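
A minimal embedding-extraction sketch is shown below; the one-second silent input is a stand-in for real 16kHz audio.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
model = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")

# One second of silence as a stand-in for real 16kHz audio samples
audio = np.zeros(16000, dtype=np.float32)

inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level embeddings from the top layer: (batch, frames, hidden_size)
print(outputs.last_hidden_state.shape)
```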


hubert-large-ls960-ft

Maintainer: facebook

Total Score: 57

The hubert-large-ls960-ft model is a large version of Facebook's HuBERT speech model that has been fine-tuned on 960 hours of Librispeech data. HuBERT is a self-supervised model for learning speech representations, proposed in the HuBERT paper. Compared to Facebook's Wav2Vec2 models such as wav2vec2-large-960h-lv60-self, which combines 960 hours of labeled fine-tuning data with roughly 53k hours of unlabeled pretraining audio, the hubert-large-ls960-ft model is trained on less total audio but still achieves strong performance on speech recognition tasks.

Model inputs and outputs

Inputs

  • Audio: The model takes in raw audio sampled at 16kHz.

Outputs

  • Transcription: The model outputs a text transcription of the input audio, in the same language as the audio.

Capabilities

The hubert-large-ls960-ft model demonstrates strong speech recognition capabilities, especially on the Librispeech benchmark. Compared to the base HuBERT model, the fine-tuned version shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets of Librispeech.

What can I use it for?

The hubert-large-ls960-ft model can be used for automatic speech recognition (ASR), particularly for transcribing English audio. Its strong Librispeech results suggest it is a good choice for high-quality, read English speech, though it may not perform as well on more diverse, spontaneous speech. A usage sketch follows this overview.

Things to try

One interesting aspect of hubert-large-ls960-ft is that it was fine-tuned on a relatively small amount of labeled data (960 hours) compared to larger speech models. This suggests the base HuBERT model learns speech representations strong enough to be fine-tuned effectively on domain-specific data; experimenting with fine-tuning the base HuBERT model on your own dataset could be a promising avenue to explore.
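
Usage mirrors the Wav2Vec2 models; a minimal sketch, with `speech` standing in for a 1-D array of 16kHz audio samples:

```python
import torch
from transformers import Wav2Vec2Processor, HubertForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

# `speech` is a placeholder 1-D array of 16kHz audio samples
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding, as with the Wav2Vec2 models above
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```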
