GMP-ATL: Gender-augmented Multi-scale Pseudo-label Enhanced Adaptive Transfer Learning for Speech Emotion Recognition via HuBERT

Read original: arXiv:2405.02151 - Published 9/24/2024 by Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao

Overview

This paper presents a novel adaptive transfer learning framework called GMP-ATL for improving Speech Emotion Recognition (SER) performance.
GMP-ATL builds on the pre-trained HuBERT model, leveraging multi-task learning and multi-scale k-means clustering to acquire gender-augmented multi-scale pseudo-labels.
The framework then incorporates model retraining and fine-tuning methods to optimize the use of both frame-level and utterance-level emotion labels.
Experiments on the IEMOCAP dataset show that GMP-ATL achieves state-of-the-art unimodal SER performance, with a Weighted Accuracy Rate (WAR) of 80.0% and an Unweighted Accuracy Rate (UAR) of 82.0%.

Plain English Explanation

Speech Emotion Recognition (SER) is the process of identifying the emotional state of a person based on their speech. Researchers have been continuously improving SER models by building on pre-trained speech models, such as HuBERT.

In this paper, the authors present a new approach called GMP-ATL, which stands for Gender-augmented Multi-scale Pseudo-label Enhanced Adaptive Transfer Learning. The key idea is to start from the pre-trained HuBERT model and enhance it through several techniques:

Multi-task Learning: The model is trained to not only recognize emotions but also identify the speaker's gender. This helps the model learn more nuanced features related to emotion expression.
Multi-scale Clustering: The model divides the speech signal into frames at different time scales and uses a clustering algorithm to assign pseudo-labels to each frame. This allows the model to capture emotions at different levels of granularity.
Adaptive Transfer Learning: The model is first trained on the pseudo-labels, then fine-tuned using the actual emotion labels in the dataset. This helps the model better utilize both the automatically generated and the ground-truth emotion labels.

The researchers tested GMP-ATL on the IEMOCAP dataset, a widely used benchmark for SER. They found that their approach outperformed other state-of-the-art unimodal (single-modality) SER methods, achieving a Weighted Accuracy Rate of 80.0% and an Unweighted Accuracy Rate of 82.0%. The unweighted figure gives every emotion category equal weight regardless of how many samples it has, so a high value suggests the model recognizes less frequent emotions well rather than only the most common ones.

Technical Explanation

The authors of this paper propose a novel adaptive transfer learning framework called GMP-ATL for improving Speech Emotion Recognition (SER) performance. The framework builds upon the pre-trained HuBERT model, which has shown promising results in various speech-related tasks.

The key components of GMP-ATL are:

Multi-task Learning: The model is trained to not only recognize emotions but also identify the speaker's gender. This helps the model learn more relevant features for emotion recognition, as gender is known to play a role in emotional expression.
Multi-scale Pseudo-label Acquisition: The model divides the speech signal into frames at different time scales (e.g., 100ms, 300ms, 500ms) and applies k-means clustering to assign pseudo-labels to each frame. This allows the model to capture emotions at different levels of granularity.
Adaptive Transfer Learning: The model is first trained on the pseudo-labels obtained in the previous step, then fine-tuned using the actual emotion labels in the dataset. This helps the model better leverage both the automatically generated and the ground-truth emotion labels.
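The multi-scale pseudo-label step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes frame features arrive at a 20 ms stride, so that windows of 5, 15, and 25 frames approximate the 100 ms, 300 ms, and 500 ms scales mentioned above. It uses toy 2-D vectors in place of HuBERT features and leaves out the gender augmentation. The helper names (pool_frames, kmeans_labels, multi_scale_pseudo_labels) and the cluster count k are hypothetical.

```python
import random

def pool_frames(frames, window):
    """Average consecutive frames in non-overlapping windows of `window` frames."""
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_labels(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def multi_scale_pseudo_labels(frames, windows=(5, 15, 25), k=2):
    """Cluster the pooled features at each scale; returns {window: labels}."""
    return {w: kmeans_labels(pool_frames(frames, w), k) for w in windows}

# Toy "utterance": 50 frames near (0, 0) followed by 50 frames near (5, 5).
# The small per-frame offset keeps all points distinct.
frames = [[(0.0 if i < 50 else 5.0) + 0.01 * i,
           (0.0 if i < 50 else 5.0) - 0.01 * i] for i in range(100)]
pseudo = multi_scale_pseudo_labels(frames)
for window, labels in pseudo.items():
    print(window, labels)
```

Each scale yields its own label sequence, one label per pooled window, so a model could be trained against several granularities at once. How the paper combines the scales and injects gender information is not specified in this summary.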

The authors evaluated GMP-ATL on the IEMOCAP dataset, a widely used benchmark for SER. Experiments showed that GMP-ATL achieves state-of-the-art performance among unimodal (single-modality) SER methods, with a Weighted Accuracy Rate (WAR) of 80.0% and an Unweighted Accuracy Rate (UAR) of 82.0%. These results are also comparable to those of multimodal SER approaches that utilize additional modalities, such as facial expressions or body gestures.
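To make the two reported metrics concrete, here is a small sketch of how they are commonly computed in SER work. It assumes the usual reading, which this summary does not spell out: the weighted metric (WAR) is overall accuracy, so frequent classes dominate, while the unweighted metric (UAR) is the plain mean of per-class recall. The function name war_uar and the toy labels are hypothetical.

```python
from collections import Counter

def war_uar(y_true, y_pred):
    """Return (WAR, UAR) as fractions in [0, 1].

    WAR: correct predictions over all samples (overall accuracy).
    UAR: per-class recall averaged with equal weight per class.
    """
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    war = sum(correct.values()) / len(y_true)
    uar = sum(correct[c] / total[c] for c in total) / len(total)
    return war, uar

# Imbalanced toy set: 8 "neutral" samples and 2 "angry" samples.
y_true = ["neutral"] * 8 + ["angry"] * 2
# The classifier gets every "neutral" right but only one of the two "angry".
y_pred = ["neutral"] * 8 + ["neutral", "angry"]

war, uar = war_uar(y_true, y_pred)
print(f"WAR={war:.2f} UAR={uar:.2f}")  # prints "WAR=0.90 UAR=0.75"
```

In this toy case the rare class drags UAR well below WAR. A UAR above WAR, as reported for GMP-ATL, would under this reading indicate that the less frequent classes are recognized at least as well as the frequent ones.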

Critical Analysis

The authors of this paper have presented a well-designed and comprehensive approach to improving SER performance using adaptive transfer learning. The incorporation of multi-task learning and multi-scale pseudo-label acquisition is a novel and promising direction that leverages the strengths of the pre-trained HuBERT model.

One potential limitation of the study is the reliance on the IEMOCAP dataset, which, while widely used, may not capture the full complexity and diversity of real-world emotional expression in speech. Additionally, the paper does not explore the robustness of GMP-ATL to noisy or low-quality speech data, which is an important consideration for practical applications.

Further research could investigate the performance of GMP-ATL on a wider range of datasets, including those with more diverse speakers and emotional categories. Additionally, exploring the interaction between different modalities (e.g., speech, facial expressions, body language) could lead to even more robust and accurate SER systems.

Conclusion

This paper presents a novel adaptive transfer learning framework called GMP-ATL that significantly improves the performance of Speech Emotion Recognition (SER) systems. By leveraging the pre-trained HuBERT model and incorporating multi-task learning, multi-scale pseudo-label acquisition, and adaptive fine-tuning, the authors have developed a state-of-the-art unimodal SER approach.

The results on the IEMOCAP dataset demonstrate the effectiveness of GMP-ATL, with the framework achieving a Weighted Accuracy Rate of 80.0% and an Unweighted Accuracy Rate of 82.0%. These findings highlight the potential of adaptive transfer learning techniques to enhance the recognition of emotions in speech, which has important applications in areas such as human-computer interaction, mental health monitoring, and customer service.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

GMP-ATL: Gender-augmented Multi-scale Pseudo-label Enhanced Adaptive Transfer Learning for Speech Emotion Recognition via HuBERT

Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao

The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, current research typically relies on utterance-level emotion labels, inadequately capturing the complexity of emotions within a single utterance. In this paper, we introduce GMP-TL, a novel SER framework that employs gender-augmented multi-scale pseudo-label (GMP) based transfer learning to mitigate this gap. Specifically, GMP-TL initially uses the pre-trained HuBERT, implementing multi-task learning and multi-scale k-means clustering to acquire frame-level GMPs. Subsequently, to fully leverage frame-level GMPs and utterance-level emotion labels, a two-stage model fine-tuning approach is presented to further optimize GMP-TL. Experiments on IEMOCAP show that our GMP-TL attains a WAR of 80.0% and an UAR of 82.0%, achieving superior performance compared to state-of-the-art unimodal SER methods while also yielding comparable results to multimodal SER approaches.

9/24/2024

Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture

Ohad Cohen, Gershon Hazan, Sharon Gannot

The performance of most emotion recognition systems degrades in real-life situations ('in the wild' scenarios) where the audio is contaminated by reverberation. Our study explores new methods to alleviate the performance degradation of SER algorithms and develop a more robust system for adverse conditions. We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy. We adopt a state-of-the-art transformer model, the HTS-AT, to handle multi-channel audio inputs. We evaluate two strategies: averaging mel-spectrograms across channels and summing patch-embedded representations. Our multi-microphone model achieves superior performance compared to single-channel baselines when tested on real-world reverberant environments.

9/17/2024

Audio-Guided Fusion Techniques for Multimodal Emotion Analysis

Pujin Shi, Fei Gao

In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) in MER2024. First, in order to enhance the performance of the feature extractor on sentiment classification tasks, we fine-tuned video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data. This approach effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism, which leverages the robustness of Hubert-large, showing superior effectiveness in fusing both inter-channel and intra-channel information. Third, to enhance the accuracy of the model, we iteratively apply self-supervised learning by using high-confidence unlabeled data as pseudo-labels. Finally, through black-box probing, we discovered an imbalanced data distribution between the training and test sets. Therefore, we adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our strategy, ultimately earning us third place in the MER-SEMI track.

9/10/2024

Improving Multimodal Emotion Recognition by Leveraging Acoustic Adaptation and Visual Alignment

Zhixian Zhao, Haifeng Chen, Xi Li, Dongmei Jiang, Lei Xie

Multimodal Emotion Recognition (MER) aims to automatically identify and understand human emotional states by integrating information from various modalities. However, the scarcity of annotated multimodal data significantly hinders the advancement of this research field. This paper presents our solution for the MER-SEMI sub-challenge of MER 2024. First, to better adapt acoustic modality features for the MER task, we experimentally evaluate the contributions of different layers of the pre-trained speech model HuBERT in emotion recognition. Based on these observations, we perform Parameter-Efficient Fine-Tuning (PEFT) on the layers identified as most effective for emotion recognition tasks, thereby achieving optimal adaptation for emotion recognition with a minimal number of learnable parameters. Second, leveraging the strengths of the acoustic modality, we propose a feature alignment pre-training method. This approach uses large-scale unlabeled data to train a visual encoder, thereby promoting the semantic alignment of visual features within the acoustic feature space. Finally, using the adapted acoustic features, aligned visual features, and lexical features, we employ an attention mechanism for feature fusion. On the MER2024-SEMI test set, the proposed method achieves a weighted F1 score of 88.90%, ranking fourth among all participating teams, validating the effectiveness of our approach.

9/11/2024