MSP-Podcast SER Challenge 2024: L'antenne du Ventoux Multimodal Self-Supervised Learning for Speech Emotion Recognition

Read original: arXiv:2407.05746 - Published 7/9/2024 by Jarod Duret (LIA), Mickael Rouvier (LIA), Yannick Estève (LIA)

Overview

The authors describe their submission to the 2024 MSP-Podcast Speech Emotion Recognition (SER) Challenge.
The challenge has two tasks: Categorical Emotion Recognition and Emotional Attribute Prediction.
The authors focused on Task 1, which involves classifying eight emotional states using the MSP-Podcast dataset.
Their approach uses an ensemble of models, each trained independently and then combined using a Support Vector Machine (SVM) classifier.
The models were trained using various strategies, including Self-Supervised Learning (SSL) fine-tuning across different modalities: speech alone, text alone, and a combined speech and text approach.
This joint training methodology aims to enhance the system's ability to accurately classify emotional states.

Plain English Explanation

The researchers worked on a challenge focused on recognizing emotions in podcasts. The challenge had two parts: identifying specific emotions and predicting emotional attributes. The researchers focused on the first part, which involved classifying eight different emotions using a dataset of podcast recordings.

Their approach used a combination of multiple models, each trained independently. These models were then brought together using a machine learning technique called a Support Vector Machine (SVM) classifier. The models were trained in different ways, including using self-supervised learning to extract useful information from the speech and text data. The goal of this joint training approach was to help the system better recognize the emotional states in the podcast recordings.

Technical Explanation

The authors' approach employed an ensemble of models, each trained independently and then fused at the score level using a Support Vector Machine (SVM) classifier. The models were trained using various strategies, including Self-Supervised Learning (SSL) fine-tuning across different modalities: speech alone, text alone, and a combined speech and text approach. This joint training methodology, similar to the approach in EmoBox and Emotion-Aware Speech Self-Supervised Representation Learning, aims to enhance the system's ability to accurately classify emotional states.
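To make score-level fusion concrete, here is a minimal sketch in Python with scikit-learn. It is not the authors' implementation: the per-model scores are synthetic (noisy one-hot vectors produced by the hypothetical helper `make_synthetic_scores`), the number of models is arbitrary, and the linear kernel and `C=1.0` are assumptions, since the summary does not state the SVM configuration. The idea shown is only that each model's class scores are concatenated into one feature vector per utterance, and an SVM is trained on those vectors to make the final decision.

```python
import numpy as np
from sklearn.svm import SVC

N_CLASSES = 8  # Task 1 distinguishes eight emotional states


def stack_scores(per_model_scores):
    """Concatenate each model's class scores into one fusion vector per utterance."""
    return np.concatenate(per_model_scores, axis=1)


def make_synthetic_scores(labels, n_models, noise, rng):
    """Hypothetical stand-in for real model outputs: noisy one-hot class scores."""
    one_hot = np.eye(N_CLASSES)[labels]
    return [one_hot + noise * rng.standard_normal(one_hot.shape)
            for _ in range(n_models)]


rng = np.random.default_rng(0)
train_labels = rng.integers(0, N_CLASSES, size=400)
test_labels = rng.integers(0, N_CLASSES, size=200)

# Three pretend models (e.g. speech-only, text-only, speech+text), 8 scores each.
train_x = stack_scores(make_synthetic_scores(train_labels, 3, 0.7, rng))
test_x = stack_scores(make_synthetic_scores(test_labels, 3, 0.7, rng))

# The fusion classifier sees only the stacked scores, never the raw audio or text.
fusion = SVC(kernel="linear", C=1.0)  # kernel and C are assumptions
fusion.fit(train_x, train_labels)

fused_pred = fusion.predict(test_x)
fused_acc = float((fused_pred == test_labels).mean())
print(f"fused accuracy on synthetic data: {fused_acc:.3f}")
```

In a real system the fusion classifier would be fitted on scores from data the individual models were not trained on, so that it does not learn from overconfident scores.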

Critical Analysis

The paper provides a comprehensive overview of the authors' approach to the SER challenge, but it does not delve into the specific details of the model architectures or the performance of the individual models. Additionally, the paper does not discuss any potential limitations or caveats of the proposed approach, such as the impact of dataset bias or the generalization of the models to different speech corpora.

Furthermore, the authors report an F1-macro score of 0.35% on the development set. The figure is written that way in the abstract; it most likely denotes 0.35 (that is, 35%), but the summary cannot confirm this. Either way, the result is modest, and no comparison with other systems on the same data is given here. It would be helpful if the authors could provide a more in-depth analysis of the factors contributing to this performance and suggest potential avenues for improvement.
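For readers unfamiliar with the metric, the following sketch shows how a macro-averaged F1 score is computed. The labels and predictions are made up for illustration and have nothing to do with the MSP-Podcast data. Macro F1 averages the per-class F1 scores with equal weight, so a class that is never predicted correctly pulls the score down even when overall accuracy is high. It is commonly reported either as a fraction between 0 and 1 or as a percentage.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0 by convention here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data: the single "sad" utterance is misclassified as "happy".
y_true = ["happy", "happy", "happy", "happy", "sad", "angry"]
y_pred = ["happy", "happy", "happy", "happy", "happy", "angry"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, ["happy", "sad", "angry"])
print(f"accuracy = {accuracy:.3f}, macro F1 = {score:.3f}")
```

Here five of six predictions are correct, yet the macro F1 is noticeably lower than the accuracy because the "sad" class scores zero.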

Conclusion

The authors have presented a novel approach to the MSP-Podcast SER challenge, leveraging an ensemble of models trained using various self-supervised learning strategies. While the results on the development set are modest, the proposed methodology has the potential to enhance emotion recognition capabilities in speech-based applications. Further research is needed to optimize the model architectures, explore more advanced fusion techniques, and investigate the generalization of the approach to other speech emotion recognition datasets.

Related Papers

MSP-Podcast SER Challenge 2024: L'antenne du Ventoux Multimodal Self-Supervised Learning for Speech Emotion Recognition

Jarod Duret (LIA), Mickael Rouvier (LIA), Yannick Estève (LIA)

In this work, we detail our submission to the 2024 edition of the MSP-Podcast Speech Emotion Recognition (SER) Challenge. This challenge is divided into two distinct tasks: Categorical Emotion Recognition and Emotional Attribute Prediction. We concentrated our efforts on Task 1, which involves the categorical classification of eight emotional states using data from the MSP-Podcast dataset. Our approach employs an ensemble of models, each trained independently and then fused at the score level using a Support Vector Machine (SVM) classifier. The models were trained using various strategies, including Self-Supervised Learning (SSL) fine-tuning across different modalities: speech alone, text alone, and a combined speech and text approach. This joint training methodology aims to enhance the system's ability to accurately classify emotional states. Thus, the system obtained F1-macro of 0.35% on development set.

7/9/2024

Exploring Self-Supervised Multi-view Contrastive Learning for Speech Emotion Recognition with Limited Annotations

Bulat Khaertdinov, Pedro Jeuris, Annanda Sousa, Enrique Hortal

Recent advancements in Deep and Self-Supervised Learning (SSL) have led to substantial improvements in Speech Emotion Recognition (SER) performance, reaching unprecedented levels. However, obtaining sufficient amounts of accurately labeled data for training or fine-tuning the models remains a costly and challenging task. In this paper, we propose a multi-view SSL pre-training technique that can be applied to various representations of speech, including the ones generated by large speech models, to improve SER performance in scenarios where annotations are limited. Our experiments, based on wav2vec 2.0, spectral and paralinguistic features, demonstrate that the proposed framework boosts the SER performance, by up to 10% in Unweighted Average Recall, in settings with extremely sparse data annotations.

6/13/2024

BSC-UPC at EmoSPeech-IberLEF2024: Attention Pooling for Emotion Recognition

Marc Casals-Salvador, Federico Costa, Miquel India, Javier Hernando

The domain of speech emotion recognition (SER) has persistently been a frontier within the landscape of machine learning. It is an active field that has been revolutionized in the last few decades and whose implementations are remarkable in multiple applications that could affect daily life. Consequently, the Iberian Languages Evaluation Forum (IberLEF) of 2024 held a competitive challenge to leverage the SER results with a Spanish corpus. This paper presents the approach followed with the goal of participating in this competition. The main architecture consists of different pre-trained speech and text models to extract features from both modalities, utilizing an attention pooling mechanism. The proposed system has achieved the first position in the challenge with an 86.69% in Macro F1-Score.

7/18/2024

INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition

Andreas Triantafyllopoulos, Anton Batliner, Simon Rampp, Manuel Milling, Bjorn Schuller

We revisit the INTERSPEECH 2009 Emotion Challenge -- the first ever speech emotion recognition (SER) challenge -- and evaluate a series of deep learning models that are representative of the major advances in SER research in the time since then. We start by training each model using a fixed set of hyperparameters, and further fine-tune the best-performing models of that initial setup with a grid search. Results are always reported on the official test set with a separate validation set only used for early stopping. Most models score below or close to the official baseline, while they marginally outperform the original challenge winners after hyperparameter tuning. Our work illustrates that, despite recent progress, FAU-AIBO remains a very challenging benchmark. An interesting corollary is that newer methods do not consistently outperform older ones, showing that progress towards 'solving' SER is not necessarily monotonic.

6/11/2024