MelHuBERT: A simplified HuBERT on Mel spectrograms

Read original: arXiv:2211.09944 - Published 9/2/2024 by Tzu-Quan Lin, Hung-yi Lee, Hao Tang

Overview

Self-supervised speech models like HuBERT have been successful, but require significant computing power to train
This paper introduces MelHuBERT, which improves and simplifies HuBERT's key components to reduce training time and computational cost
MelHuBERT performs comparably to HuBERT on downstream tasks such as phone recognition, speaker identification, and automatic speech recognition, while reducing pre-training time by 31.2% and computational cost by 33.5%

Plain English Explanation

Speech recognition and related tasks have seen major progress thanks to self-supervised learning models, which learn useful representations from speech without needing labeled data. One such model is HuBERT, which has shown impressive results.

However, training these self-supervised models typically requires a lot of computing power, with multiple GPUs running for a long time. This makes it difficult for many researchers and developers to work with these models.

The researchers in this paper wanted to make self-supervised speech models more accessible by reducing the computational cost and training time. They introduced a new model called MelHuBERT, which builds on the success of HuBERT but with several key improvements:

A simpler loss function for training
Mel spectrograms as the input representation instead of raw waveforms
A revised multi-stage training procedure

These changes allowed MelHuBERT to achieve similar performance to HuBERT on tasks like speech recognition and speaker identification, while reducing the pre-training time by over 30% and the overall computational cost by a third. This makes MelHuBERT a more practical and accessible option for researchers and developers working on speech-related applications.

Technical Explanation

The key innovations in MelHuBERT compared to the original HuBERT model include:

Simplified Loss Function: HuBERT is trained with a masked prediction objective: parts of the input are masked, and the model predicts a discrete cluster label for each masked frame from scores based on the similarity between its outputs and learned cluster embeddings. HuBERT does not use a separate contrastive loss. MelHuBERT keeps the masked prediction objective but simplifies how the loss is computed, which reduces training complexity.
Mel-spectrogram Inputs: Instead of using raw waveform audio as input, MelHuBERT uses Mel spectrograms. This compact representation captures the key frequency information in the audio while reducing the size of the input the model has to process.
Multi-stage Training: HuBERT is itself pre-trained in more than one iteration, with the cluster targets regenerated between iterations. MelHuBERT revisits this procedure and also trains in multiple stages, first training a base model on the masked prediction task and then training that model further. This incremental approach helps stabilize training.
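
To make the masked prediction objective above concrete, here is a minimal sketch in plain Python. It assumes the simplified loss is an ordinary cross-entropy between per-frame scores and cluster labels, averaged over the masked frames only. That choice, the function names, and the toy numbers are illustrative assumptions, not details taken from the paper or its code release.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def masked_prediction_loss(frame_logits, cluster_ids, mask):
    """Mean cross-entropy over masked frames only (an assumed formulation).

    frame_logits: one list of per-cluster scores for each frame
    cluster_ids:  target cluster index for each frame (e.g. from k-means)
    mask:         True where the input frame was masked out
    """
    losses = [
        -log_softmax(logits)[target]
        for logits, target, is_masked in zip(frame_logits, cluster_ids, mask)
        if is_masked
    ]
    if not losses:
        raise ValueError("at least one frame must be masked")
    return sum(losses) / len(losses)


# Toy example: 4 frames, 3 clusters. All values are made up for illustration.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 1.0]]
targets = [0, 2, 1, 0]
mask = [False, True, True, False]

loss = masked_prediction_loss(logits, targets, mask)
print(f"masked prediction loss: {loss:.4f}")
```

Only the second and third frames contribute here, because they are the masked ones. A real implementation would operate on batched tensors in a deep learning framework, but the quantity being minimized would have the same shape.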

Through these changes, the researchers reduced the pre-training time of MelHuBERT by 31.2% compared to HuBERT, which corresponds to a 33.5% reduction in computational cost, measured in multiply-accumulate operations (MACs) per second of speech. Importantly, MelHuBERT maintained similar performance to HuBERT on downstream tasks like phone recognition, speaker identification, and automatic speech recognition.
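
The relative savings quoted above are simple ratios against the HuBERT baseline. The sketch below shows the arithmetic. The two MAC counts are hypothetical placeholders chosen only so that the ratio matches the reported 33.5%; they are not figures from the paper.

```python
def relative_saving(baseline, reduced):
    """Fraction of the baseline cost that is saved."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - reduced) / baseline


# Hypothetical costs in arbitrary units, NOT measurements from the paper.
baseline_macs = 1000.0  # stand-in for HuBERT's MACs per second of speech
reduced_macs = 665.0    # stand-in for MelHuBERT's MACs per second of speech

saving = relative_saving(baseline_macs, reduced_macs)
print(f"relative saving: {saving:.1%}")
```

The same function applies to wall-clock pre-training time: pass the two training durations instead of the MAC counts.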

Critical Analysis

The researchers provide a thorough analysis of the benefits of MelHuBERT, but there are a few potential limitations worth considering:

The paper does not extensively explore the tradeoffs between model size, training time, and performance. It's possible that further reductions in computational cost could come at the expense of model accuracy.
The experiments were conducted on English-only datasets. It's unclear how well the MelHuBERT approach would generalize to multilingual or low-resource speech tasks, which are important real-world use cases.
The paper focuses on reducing pre-training time, but does not address potential challenges in fine-tuning or deploying MelHuBERT models. Inference-time efficiency is also an important practical consideration.

Overall, the MelHuBERT work represents a promising step towards more efficient self-supervised speech models. Further research is needed to fully understand the tradeoffs and expand the model's capabilities to a broader range of speech applications.

Conclusion

This paper introduces MelHuBERT, a simplified version of the successful HuBERT self-supervised speech model. By simplifying key components like the loss function and input representation, and revising the multi-stage training procedure, the researchers reduced the pre-training time by over 30% and the computational cost by a third, while maintaining similar performance on downstream tasks.

These efficiency gains make MelHuBERT a more practical and accessible option for researchers and developers working on speech-related applications, where computational resources can be a significant constraint. The work represents an important step towards making advanced self-supervised speech models more widely usable.

Related Papers

MelHuBERT: A simplified HuBERT on Mel spectrograms

Tzu-Quan Lin, Hung-yi Lee, Hao Tang

Self-supervised models have had great success in learning speech representations that can generalize to various downstream tasks. However, most self-supervised models require a large amount of compute and multiple GPUs to train, significantly hampering the development of self-supervised learning. In an attempt to reduce the computation of training, we revisit the training of HuBERT, a highly successful self-supervised model. We improve and simplify several key components, including the loss function, input representation, and training in multiple stages. Our model, MelHuBERT, is able to achieve favorable performance on phone recognition, speaker identification, and automatic speech recognition against HuBERT, while saving 31.2% of the pre-training time, or equivalently 33.5% MACs per one second speech. The code and pre-trained models are available in https://github.com/nervjack2/MelHuBERT.

9/2/2024

Text-guided HuBERT: Self-Supervised Speech Pre-training via Generative Adversarial Networks

Duo Ma, Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li

Human language can be expressed in either written or spoken form, i.e. text or speech. Humans can acquire knowledge from text to improve speaking and listening. However, the quest for speech pre-trained models to leverage unpaired text has just started. In this paper, we investigate a new way to pre-train such a joint speech-text model to learn enhanced speech representations and benefit various speech-related downstream tasks. Specifically, we propose a novel pre-training method, text-guided HuBERT, or T-HuBERT, which performs self-supervised learning over speech to derive phoneme-like discrete representations. And these phoneme-like pseudo-label sequences are firstly derived from speech via the generative adversarial networks (GAN) to be statistically similar to those from additional unpaired textual data. In this way, we build a bridge between unpaired speech and text in an unsupervised manner. Extensive experiments demonstrate the significant superiority of our proposed method over various strong baselines, which achieves up to 15.3% relative Word Error Rate (WER) reduction on the LibriSpeech dataset.

8/6/2024

Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT

Ryota Komatsu, Takahiro Shinozaki

Self-supervised speech representation learning has become essential for extracting meaningful features from untranscribed audio. Recent advances highlight the potential of deriving discrete symbols from the features correlated with linguistic units, which enables text-less training across diverse tasks. In particular, sentence-level Self-Distillation of the pretrained HuBERT (SD-HuBERT) induces syllabic structures within latent speech frame representations extracted from an intermediate Transformer layer. In SD-HuBERT, sentence-level representation is accumulated from speech frame features through self-attention layers using a special CLS token. However, we observe that the information aggregated in the CLS token correlates more with speaker identity than with linguistic content. To address this, we propose a speech-only self-supervised fine-tuning approach that separates syllabic units from speaker information. Our method introduces speaker perturbation as data augmentation and adopts a frame-level training objective to prevent the CLS token from aggregating paralinguistic information. Experimental results show that our approach surpasses the current state-of-the-art method in most syllable segmentation and syllabic unit quality metrics on Librispeech, underscoring its effectiveness in promoting syllabic organization within speech-only models.

9/17/2024

MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations

Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah

In recent years, self-supervised pre-training methods have gained significant traction in learning high-level information from raw speech. Among these methods, HuBERT has demonstrated SOTA performance in automatic speech recognition (ASR). However, HuBERT's performance lags behind data2vec due to disparities in pre-training strategies. In this paper, we (i) propose a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) incorporate a Multicluster masked prediction loss for more effective utilization of the model's capacity. The resulting method, MS-HuBERT, is an end-to-end self-supervised pre-training method for learning robust speech representations. It beats vanilla HuBERT on the ASR Librispeech benchmark on average by a 5% margin when evaluated on different finetuning splits. Additionally, we demonstrate that the learned embeddings obtained during pre-training encode essential information for improving performance of content-based tasks such as ASR.

8/16/2024