Wav2Small: Distilling Wav2Vec2 to 72K parameters for Low-Resource Speech emotion recognition

Read original: arXiv:2408.13920 - Published 9/6/2024 by Dionyssos Kounadis-Bastian, Oliver Schrufer, Anna Derington, Hagen Wierstorf, Florian Eyben, Felix Burkhardt, Bjorn Schuller

Wav2Small: Distilling Wav2Vec2 to 72K parameters for Low-Resource Speech emotion recognition

Overview

This paper proposes a method called Wav2Small for distilling the Wav2Vec2 model down to only 72K parameters while maintaining performance on low-resource speech emotion recognition tasks.
Wav2Small uses a knowledge distillation approach to transfer the learned representations from a large pre-trained Wav2Vec2 model to a much smaller model.
The authors evaluate Wav2Small on two speech emotion recognition datasets and show that it achieves comparable performance to the full-sized Wav2Vec2 model while being over 100x smaller.

Plain English Explanation

The paper introduces a new method called Wav2Small that takes a large, powerful speech recognition model called Wav2Vec2 and distills it down to a much smaller version with only 72,000 parameters. This smaller model, Wav2Small, is able to maintain similar performance to the full Wav2Vec2 model on speech emotion recognition tasks, even when working with limited training data.

The key idea is to use a technique called "knowledge distillation" to transfer the learned representations from the large Wav2Vec2 model to the smaller Wav2Small model. This allows Wav2Small to benefit from the powerful features learned by the original model, without needing to be as large or complex itself.

The researchers evaluate Wav2Small on two different speech emotion recognition datasets and show that it can match the performance of the full Wav2Vec2 model, while being over 100 times smaller. This makes Wav2Small much more practical for deployment on low-resource devices or in scenarios with limited computational power.

Technical Explanation

The paper introduces a method called Wav2Small that distills the Wav2Vec2 model down to just 72,000 parameters while maintaining performance on speech emotion recognition tasks.

Wav2Vec2 is a large, powerful speech recognition model with hundreds of millions of parameters. While effective, its size can make it impractical for deployment on low-resource devices or in scenarios with limited computational resources. To address this, the authors use a knowledge distillation approach to transfer the learned representations from Wav2Vec2 to a much smaller model.

The Wav2Small architecture consists of a feature extractor, a pooling layer, and a classification head. The feature extractor is distilled from the Wav2Vec2 model, allowing Wav2Small to leverage the powerful representations learned by the larger model. The pooling layer aggregates the features, and the classification head performs the emotion recognition task.

The authors evaluate Wav2Small on two speech emotion recognition datasets, CREMA-D and IEMOCAP. They show that Wav2Small achieves comparable performance to the full-sized Wav2Vec2 model, while being over 100 times smaller. This significant reduction in model size makes Wav2Small much more practical for deployment in low-resource settings.

Critical Analysis

The paper presents a compelling approach to reducing the size of a large speech recognition model while maintaining its performance on a specific task, in this case speech emotion recognition. The knowledge distillation technique used to transfer the learned representations from Wav2Vec2 to the smaller Wav2Small model appears to be effective, as evidenced by the strong results on the two evaluation datasets.

One potential limitation of the work is that it has only been evaluated on speech emotion recognition tasks. It would be interesting to see how Wav2Small performs on a broader range of speech-related tasks, such as automatic speech recognition or speaker identification, to better understand the generalizability of the approach.

Additionally, the paper does not provide much insight into the details of the knowledge distillation process or the architectural choices made in designing Wav2Small. More information on these aspects would help readers better understand the key factors contributing to the performance and efficiency of the proposed model.

Overall, the Wav2Small approach is a promising step towards making large, powerful speech models more accessible for deployment in real-world, low-resource scenarios. Further research exploring the broader applicability of the method and its underlying mechanisms could lead to even more impactful advancements in this area.

Conclusion

The Wav2Small method introduced in this paper demonstrates an effective way to significantly reduce the size of a large speech recognition model, Wav2Vec2, while maintaining its performance on speech emotion recognition tasks. By using knowledge distillation to transfer the learned representations to a much smaller model, the authors have created a practical solution for deploying powerful speech models on low-resource devices or in computationally constrained environments.

The strong results on the CREMA-D and IEMOCAP datasets suggest that the Wav2Small approach could have broader applicability beyond just speech emotion recognition. Further research exploring its performance on a wider range of speech-related tasks and its underlying mechanisms could lead to even more impactful advancements in making large, powerful speech models more accessible and usable in real-world scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Wav2Small: Distilling Wav2Vec2 to 72K parameters for Low-Resource Speech emotion recognition

Dionyssos Kounadis-Bastian, Oliver Schrufer, Anna Derington, Hagen Wierstorf, Florian Eyben, Felix Burkhardt, Bjorn Schuller

Speech Emotion Recognition (SER) needs high computational resources to overcome the challenge of substantial annotator disagreement. Today SER is shifting towards dimensional annotations of arousal, dominance, and valence (A/D/V). Universal metrics as the L2 distance prove unsuitable for evaluating A/D/V accuracy due to non converging consensus of annotator opinions. However, Concordance Correlation Coefficient (CCC) arose as an alternative metric for A/D/V where a model's output is evaluated to match a whole dataset's CCC rather than L2 distances of individual audios. Recent studies have shown that wav2vec2 / wavLM architectures outputing a float value for each A/D/V dimension achieve today's State-of-the-art (Sota) CCC on A/D/V. The Wav2Vec2.0 / WavLM family has a high computational footprint, but training small models using human annotations has been unsuccessful. In this paper we use a large Transformer Sota A/D/V model as Teacher/Annotator to train 5 student models: 4 MobileNets and our proposed Wav2Small, using only the Teacher's A/D/V outputs instead of human annotations. The Teacher model we propose also sets a new Sota on the MSP Podcast dataset of valence CCC=0.676. We choose MobileNetV4 / MobileNet-V3 as students, as MobileNet has been designed for fast execution times. We also propose Wav2Small - an architecture designed for minimal parameters and RAM consumption. Wav2Small with an .onnx (quantised) of only 120KB is a potential solution for A/D/V on hardware with low resources, having only 72K parameters vs 3.12M parameters for MobileNet-V4-Small.

9/6/2024

Exploring Pathological Speech Quality Assessment with ASR-Powered Wav2Vec2 in Data-Scarce Context

Tuan Nguyen, Corinne Fredouille, Alain Ghio, Mathieu Balaguer, Virginie Woisard

Automatic speech quality assessment has raised more attention as an alternative or support to traditional perceptual clinical evaluation. However, most research so far only gains good results on simple tasks such as binary classification, largely due to data scarcity. To deal with this challenge, current works tend to segment patients' audio files into many samples to augment the datasets. Nevertheless, this approach has limitations, as it indirectly relates overall audio scores to individual segments. This paper introduces a novel approach where the system learns at the audio level instead of segments despite data scarcity. This paper proposes to use the pre-trained Wav2Vec2 architecture for both SSL, and ASR as feature extractor in speech assessment. Carried out on the HNC dataset, our ASR-driven approach established a new baseline compared with other approaches, obtaining average $MSE=0.73$ and $MSE=1.15$ for the prediction of intelligibility and severity scores respectively, using only 95 training samples. It shows that the ASR based Wav2Vec2 model brings the best results and may indicate a strong correlation between ASR and speech quality assessment. We also measure its ability on variable segment durations and speech content, exploring factors influencing its decision.

4/1/2024

vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

Yiwei Guo, Zhihan Li, Junjie Li, Chenpeng Du, Hankun Wang, Shuai Wang, Xie Chen, Kai Yu

We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task. To amend the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes the WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Also, no supervised data is required for vec2wav 2.0 to be effectively trained. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines to a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies verify the effects made by the proposed techniques. Moreover, vec2wav 2.0 achieves competitive cross-lingual VC even only trained on monolingual corpus. Thus, vec2wav 2.0 shows timbre can potentially be manipulated only by speech token vocoders, pushing the frontiers of VC and speech synthesis.

9/12/2024

🗣️

Adapting WavLM for Speech Emotion Recognition

Daria Diatlova, Anton Udalov, Vitalii Shutov, Egor Spirin

Recently, the usage of speech self-supervised models (SSL) for downstream tasks has been drawing a lot of attention. While large pre-trained models commonly outperform smaller models trained from scratch, questions regarding the optimal fine-tuning strategies remain prevalent. In this paper, we explore the fine-tuning strategies of the WavLM Large model for the speech emotion recognition task on the MSP Podcast Corpus. More specifically, we perform a series of experiments focusing on using gender and semantic information from utterances. We then sum up our findings and describe the final model we used for submission to Speech Emotion Recognition Challenge 2024.

5/8/2024