EMOVOME Database: Advancing Emotion Recognition in Speech Beyond Staged Scenarios

Read original: arXiv:2403.02167 - Published 6/14/2024 by Lucía Gómez-Zaragozá, Rocío del Amor, María José Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier Marín-Morales

Overview

The paper focuses on speech emotion recognition from voice messages recorded in real-world settings, known as "in the wild" conditions.
It introduces the EMOVOME (Emotional Voice Messages) database, a publicly released dataset of 999 voice messages taken from real conversations of 100 Spanish speakers on a messaging app.
The researchers evaluated speaker-independent speech emotion recognition models on it, using both a standard set of acoustic features and pre-trained transformer-based models.

Plain English Explanation

The researchers were interested in developing a system that could recognize emotions in voice messages recorded in everyday situations, rather than in a controlled lab environment. This is a challenging task because real-world recordings can be noisy and have a lot of variation in things like the speaker's voice and the background sounds.

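To address this, the researchers created a new dataset called EMOVOME, which contains voice messages that people sent in real conversations on a messaging app. Annotators labeled each message by how positive or negative it sounds (valence) and how intense it is (arousal), and experts also assigned each message one of seven emotion categories.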

The researchers then trained and tested emotion recognition models on this dataset. The evaluation is "speaker-independent," which means the models are tested on voices they did not hear during training, not just voices they were trained on.

The goal is to create a system that can accurately recognize emotions in real-world voice recordings, which could be useful for things like customer service, mental health monitoring, or analyzing social media interactions.

Technical Explanation

The paper introduces a new dataset called EMOVOME that contains 999 voice messages from real conversations of 100 Spanish speakers on a messaging app. The messages were labeled in continuous (valence and arousal) and discrete (emotion category) terms by expert and non-expert annotators. The dataset aims to capture the challenges of real-world speech emotion recognition, such as varying noise levels, speaker diversity, and spontaneous emotional expressions.

The researchers then evaluated speech emotion recognition models on this data, using a standard set of acoustic features as a baseline alongside transformer-based models. The models predict the valence (positivity/negativity) and arousal (intensity) of the emotional state expressed in each recording, each framed as a 3-class problem, as well as the discrete emotion category.

The best results came from the pre-trained UniSpeech-SAT-Large model, which reached 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction respectively, a 10% improvement over the baseline models, and 42.58% UA for the emotion categories.
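
The paper's abstract reports its results as Unweighted Accuracy (UA). The following is a minimal sketch of that metric, assuming the definition commonly used in speech emotion recognition (the mean of per-class recalls, so every class counts equally regardless of size); the paper's exact computation is not shown here. The function names and the toy labels are hypothetical and only serve as illustration.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: each class contributes equally.

    Assumption: this is the usual SER definition of UA, not code from the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def weighted_accuracy(y_true, y_pred):
    """Plain accuracy: fraction of all samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical, imbalanced 3-class valence labels (not EMOVOME data).
y_true = ["neg"] * 8 + ["neu"] * 1 + ["pos"] * 1
# A degenerate model that always predicts the majority class.
y_pred = ["neg"] * 10

ua = unweighted_accuracy(y_true, y_pred)
wa = weighted_accuracy(y_true, y_pred)
print(f"UA = {ua:.3f}, plain accuracy = {wa:.3f}")
```

In this toy case the always-"neg" model looks good under plain accuracy but scores only one third under UA, which is why UA is a common choice when emotion classes are imbalanced. For a 3-class task, one third is also the UA of any constant prediction, which gives a reference point for reading the reported percentages.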

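Importantly, the researchers evaluated their models in a "speaker-independent" setting, where the test speakers were not included in the training data. This is a more realistic and challenging scenario than speaker-dependent evaluation. The same models were also run on reference databases of acted and elicited speech: EMOVOME scored lower than the acted RAVDESS database, and the elicited IEMOCAP database outperformed it on emotion categories while giving similar results for valence and arousal. Results also varied with the annotator labels used, with better accuracy and gender fairness when expert and non-expert annotations were combined.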
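
To make the "speaker-independent" idea concrete, here is a minimal sketch of a split in which no speaker appears on both sides. It is an illustration under stated assumptions, not the paper's protocol: the function name, the 20% test fraction, the random seed, and the toy speaker IDs are all hypothetical, and the paper may use a different partitioning scheme (for example, cross-validation over speaker groups).

```python
import random


def speaker_independent_split(speaker_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no speaker is in both sets.

    speaker_ids holds one speaker label per recording, in dataset order.
    The split is made over speakers, not over individual recordings.
    """
    speakers = sorted(set(speaker_ids))
    rng = random.Random(seed)  # local RNG keeps the split reproducible
    rng.shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    held_out = set(speakers[:n_test])
    train_idx = [i for i, s in enumerate(speaker_ids) if s not in held_out]
    test_idx = [i for i, s in enumerate(speaker_ids) if s in held_out]
    return train_idx, test_idx


# Hypothetical toy data: 10 speakers with 3 messages each.
speaker_ids = [f"spk{n:02d}" for n in range(10) for _ in range(3)]

train_idx, test_idx = speaker_independent_split(speaker_ids)
train_speakers = {speaker_ids[i] for i in train_idx}
test_speakers = {speaker_ids[i] for i in test_idx}

print(len(train_idx), "training messages from", len(train_speakers), "speakers")
print(len(test_idx), "test messages from", len(test_speakers), "speakers")
print("overlap:", train_speakers & test_speakers)
```

A random split over individual messages would instead let the same voice appear in both training and test data, which tends to inflate scores because a model can pick up on speaker identity instead of emotion.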

Critical Analysis

The researchers have made a valuable contribution by addressing the challenge of speech emotion recognition in real-world, "in the wild" conditions. The EMOVOME dataset they created provides more realistic emotional voice recordings than earlier datasets built from acted or elicited speech.

However, the dataset has clear limits. It contains 999 messages from 100 speakers, all of them Spanish speakers, so it remains open how well models trained on it would generalize to other languages, accents, or cultural contexts. The paper does analyze gender fairness and the influence of the annotators, but other demographic biases may still exist.

Additionally, the paper does not explore the potential ethical implications of deploying such a system in real-world applications, such as privacy concerns or the risk of misclassifying people's emotional states. These are important considerations that should be addressed in future research.

Overall, the work presented in this paper is a promising step towards more robust and practical speech emotion recognition systems. However, further research is needed to fully understand the capabilities and limitations of the approach, as well as its potential societal impact.

Conclusion

This paper introduces a new dataset and a set of benchmark models for speech emotion recognition in real-world, "in the wild" conditions. The EMOVOME dataset provides more realistic emotional voice recordings than staged datasets, and the best speaker-independent model, UniSpeech-SAT-Large, reaches 61.64% and 55.57% UA for valence and arousal. These scores fall below those obtained on acted speech, which highlights the gap between staged and real-life scenarios.

While this work represents an important step forward, there are still many open questions and areas for future research, such as the scalability of the approach, its generalization to different languages and cultures, and the potential ethical implications of deploying such a system. Nonetheless, the techniques presented in this paper could have valuable applications in fields like customer service, mental health monitoring, and the analysis of social media interactions.

Related Papers

EMOVOME Database: Advancing Emotion Recognition in Speech Beyond Staged Scenarios

Lucía Gómez-Zaragozá, Rocío del Amor, María José Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier Marín-Morales

Natural databases for Speech Emotion Recognition (SER) are scarce and often rely on staged scenarios, such as films or television shows, limiting their application in real-world contexts. We developed and publicly released the Emotional Voice Messages (EMOVOME) database, including 999 voice messages from real conversations of 100 Spanish speakers on a messaging app, labeled in continuous and discrete emotions by expert and non-expert annotators. We evaluated speaker-independent SER models using a standard set of acoustic features and transformer-based models. We compared the results with reference databases including acted and elicited speech, and analyzed the influence of annotators and gender fairness. The pre-trained UniSpeech-SAT-Large model achieved the highest results, 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction respectively on EMOVOME, a 10% improvement over baseline models. For the emotion categories, 42.58% UA was obtained. EMOVOME performed lower than the acted RAVDESS database. The elicited IEMOCAP database also outperformed EMOVOME in predicting emotion categories, while similar results were obtained in valence and arousal. EMOVOME outcomes varied with annotator labels, showing better results and fairness when combining expert and non-expert annotations. This study highlights the gap between staged and real-life scenarios, supporting further advancements in recognizing genuine emotions.

6/14/2024

Emotional Voice Messages (EMOVOME) database: emotion recognition in spontaneous voice messages

Lucía Gómez Zaragozá (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Rocío del Amor (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Elena Parra Vargas (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Valery Naranjo (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Mariano Alcañiz Raya (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Javier Marín-Morales (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain)

Emotional Voice Messages (EMOVOME) is a gender-balanced spontaneous speech dataset containing 999 audio messages from real conversations on a messaging app from 100 Spanish speakers. Voice messages were produced in in-the-wild conditions before participants were recruited, avoiding any conscious bias due to laboratory environment. Audios were labeled in valence and arousal dimensions by three non-experts and two experts, and these labels were then combined to obtain a final label per dimension. The experts also provided an extra label corresponding to seven emotion categories. To set a baseline for future investigations using EMOVOME, we implemented emotion recognition models using both speech and audio transcriptions. For speech, we used the standard eGeMAPS feature set and support vector machines, obtaining 49.27% and 44.71% unweighted accuracy for valence and arousal respectively. For text, we fine-tuned a multilingual BERT model and achieved 61.15% and 47.43% unweighted accuracy for valence and arousal respectively. This database will significantly contribute to research on emotion recognition in the wild, while also providing a unique natural and freely accessible resource for Spanish.

6/14/2024

EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark

Ziyang Ma, Mingjie Chen, Hezhao Zhang, Zhisheng Zheng, Wenxi Chen, Xiquan Li, Jiaxin Ye, Xie Chen, Thomas Hain

Speech emotion recognition (SER) is an important part of human-computer interaction, receiving extensive attention from both industry and academia. However, the current research field of SER has long suffered from the following problems: 1) There are few reasonable and universal splits of the datasets, making comparing different models and methods difficult. 2) No commonly used benchmark covers numerous corpora and languages for researchers to refer to, making reproduction a burden. In this paper, we propose EmoBox, an out-of-the-box multilingual multi-corpus speech emotion recognition toolkit, along with a benchmark for both intra-corpus and cross-corpus settings. For intra-corpus settings, we carefully designed the data partitioning for different datasets. For cross-corpus settings, we employ a foundation SER model, emotion2vec, to mitigate annotation errors and obtain a test set that is fully balanced in speaker and emotion distributions. Based on EmoBox, we present the intra-corpus SER results of 10 pre-trained speech models on 32 emotion datasets with 14 languages, and the cross-corpus SER results on 4 datasets with the fully balanced test sets. To the best of our knowledge, this is the largest SER benchmark, across language scopes and quantity scales. We hope that our toolkit and benchmark can facilitate the research of SER in the community.

6/12/2024

Better Spanish Emotion Recognition In-the-wild: Bringing Attention to Deep Spectrum Voice Analysis

Elena Ortega-Beltrán, Josep Cabacas-Maso, Ismael Benito-Altamirano, Carles Ventura

Within the context of creating new Socially Assistive Robots, emotion recognition has become a key development factor, as it allows the robot to adapt to the user's emotional state in the wild. In this work, we focused on the analysis of two voice recording Spanish datasets: ELRA-S0329 and EmoMatchSpanishDB. Specifically, we centered our work on the paralanguage, e.g. the vocal characteristics that go along with the message and clarify the meaning. We proposed the use of the DeepSpectrum method, which consists of extracting a visual representation of the audio tracks and feeding them to a pretrained CNN model. For the classification task, DeepSpectrum is often paired with a Support Vector Classifier (DS-SVC) or a Fully-Connected deep-learning classifier (DS-FC). We compared the results of the DS-SVC and DS-FC architectures with the state-of-the-art (SOTA) for ELRA-S0329 and EmoMatchSpanishDB. Moreover, we proposed our own classifier based upon Attention Mechanisms, namely DS-AM. We trained all models against both datasets, and we found that our DS-AM model outperforms the SOTA models for the datasets and the SOTA DeepSpectrum architectures. Finally, we trained our DS-AM model on one dataset and tested it on the other, to simulate real-world conditions and assess how biased the model is toward the dataset.

9/10/2024