Emotional Voice Messages (EMOVOME) database: emotion recognition in spontaneous voice messages

Read original: arXiv:2402.17496 - Published 6/14/2024 by Lucía Gómez Zaragozá (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Rocío del Amor (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Elena Parra Vargas (HUMAN-tech Institute, Universitat Politècnica de València and 14 others

Overview

  • This paper introduces the Emotional Voice Messages (EMOVOME) database, a dataset of spontaneous voice messages taken from real conversations on a messaging app.
  • The goal is to enable research on emotion recognition in natural, conversational speech.
  • The dataset contains 999 audio messages from 100 gender-balanced Spanish speakers, annotated for valence, arousal, and emotion category.
  • It extends existing speech emotion datasets by capturing the realism and variability of real-world voice messages.

Plain English Explanation

The researchers created a new dataset called EMOVOME that contains voice messages people sent in real conversations on a messaging app. These messages capture natural, spontaneous speech with real emotional expressions, unlike the carefully scripted recordings used in many previous speech emotion datasets like NEMO and EmoBox.

The EMOVOME dataset has 999 audio messages from 100 Spanish speakers. Each recording was rated by three non-expert and two expert annotators for valence (how positive or negative it sounds) and arousal (how calm or excited it sounds), and the experts additionally assigned one of seven emotion categories. This provides a rich dataset for training and testing emotion recognition models, which could have applications in areas like emotional speech synthesis and improving machine translation.

By using spontaneous voice messages instead of scripted recordings, the researchers aim to capture the natural variability and nuance of human emotional expression, which is crucial for developing robust and generalizable emotion recognition systems.

Technical Explanation

The EMOVOME dataset was collected from the real messaging-app conversations of 100 Spanish speakers. The voice messages were produced before the participants were recruited, which avoids the conscious bias that a laboratory environment can introduce. Three non-experts and two experts labeled each message for valence and arousal, and their ratings were combined into one final label per dimension. The experts also assigned an extra label from a set of seven emotion categories.
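
The source does not say how the individual ratings are merged into a single label. As a minimal sketch, assuming a simple majority vote (one common choice, not necessarily the authors' method), the merge could look like this. The function name, the tie-breaking rule, and the example ratings are all hypothetical.

```python
from collections import Counter


def combine_labels(ratings, tie_label="neutral"):
    """Return the most frequent label; fall back to tie_label on a tie.

    The tie-breaking rule is an assumption made for this sketch.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    counts = Counter(ratings).most_common()
    # A tie exists when the two most frequent labels have equal counts.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


# Hypothetical valence ratings from five annotators for one message.
ratings = ["positive", "neutral", "positive", "positive", "negative"]
final_valence = combine_labels(ratings)  # "positive" (3 of 5 votes)
```

With five annotators and three classes a 2-2-1 split is possible, which is why the sketch needs an explicit tie rule.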

The dataset contains 999 voice messages recorded in everyday conditions rather than in a studio, in contrast to many existing speech emotion datasets, which tend to use more controlled recordings. The researchers argue that this realism and variability will support research on emotion recognition in the wild.

Baseline emotion recognition experiments use both the speech signal and its transcription. For speech, the standard eGeMAPS feature set with support vector machines reaches 49.27% and 44.71% unweighted accuracy for valence and arousal, respectively. For text, a fine-tuned multilingual BERT model reaches 61.15% and 47.43%. The results demonstrate the challenges of working with spontaneous, naturalistic speech data, with recognition accuracies lower than those reported on more curated datasets.
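
The paper's abstract reports its baselines as unweighted accuracy (UA), which is usually computed as the mean of the per-class recalls so that rare classes count as much as frequent ones. The sketch below shows that calculation on made-up labels; the function name and the data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class carries equal weight."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# A classifier that always predicts the majority class.
y_true = ["neg", "neg", "neg", "neg", "neu", "pos"]
y_pred = ["neg", "neg", "neg", "neg", "neg", "neg"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
ua = unweighted_accuracy(y_true, y_pred)
# plain_accuracy is 4/6, while ua is (1 + 0 + 0) / 3, about 0.33
```

The gap between the two numbers shows why UA is a stricter measure on imbalanced emotion data: with three classes, always guessing one class yields a UA of about 33%.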

Critical Analysis

The EMOVOME dataset represents an important step forward in speech emotion recognition research by moving beyond the constraints of lab-recorded, scripted data. By capturing the messy realities of real-world voice messages, it highlights the significant challenges that remain in developing emotion recognition systems that can handle the full complexity of human expression.

One key limitation is the reliance on human annotation of emotional state, which can be inherently subjective. The paper acknowledges this, noting that inter-rater agreement was not perfect. This suggests the need for more rigorous and objective methods of emotional labeling, perhaps incorporating physiological or behavioral signals in addition to audio.
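
This summary does not say which agreement measure the paper uses. As an illustration only, Cohen's kappa is one standard way to quantify agreement between two annotators while correcting for agreement expected by chance. The function name and the ratings below are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical valence labels from two annotators on four messages.
rater_a = ["pos", "pos", "neg", "neg"]
rater_b = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(rater_a, rater_b)  # 0.5
```

Here the raters agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5. With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the analogous choice.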

Additionally, while the dataset covers a diverse set of speakers and emotional contexts, it is still relatively small compared to the massive datasets used to train the latest deep learning models. Scaling up the collection and annotation process will be crucial to enable further breakthroughs in this domain.

Finally, the paper does not address potential privacy and ethical concerns around the collection and use of personal voice data, which will be an important consideration as this technology continues to advance.

Conclusion

The EMOVOME dataset represents an important advancement in speech emotion recognition research by providing a more naturalistic and challenging dataset than has been available previously. By moving beyond scripted recordings to capture the spontaneity and variability of real-world voice messages, it lays the groundwork for the development of more robust and generalizable emotion recognition models.

While the current results demonstrate the difficulties of working with this type of data, the potential benefits are significant. Accurate emotion recognition in speech could enable a wide range of applications, from improved emotional expression in text-to-speech systems to enhanced understanding of emotional context in machine translation. As research in this area continues to progress, the EMOVOME dataset will serve as an important benchmark and catalyst for advancing the field.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Emotional Voice Messages (EMOVOME) database: emotion recognition in spontaneous voice messages

Lucía Gómez Zaragozá (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Rocío del Amor (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Elena Parra Vargas (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Valery Naranjo (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Mariano Alcañiz Raya (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Javier Marín-Morales (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain)

Emotional Voice Messages (EMOVOME) is a spontaneous speech dataset containing 999 audio messages from real conversations on a messaging app from 100 Spanish speakers, gender balanced. Voice messages were produced in-the-wild conditions before participants were recruited, avoiding any conscious bias due to laboratory environment. Audios were labeled in valence and arousal dimensions by three non-experts and two experts, which were then combined to obtain a final label per dimension. The experts also provided an extra label corresponding to seven emotion categories. To set a baseline for future investigations using EMOVOME, we implemented emotion recognition models using both speech and audio transcriptions. For speech, we used the standard eGeMAPS feature set and support vector machines, obtaining 49.27% and 44.71% unweighted accuracy for valence and arousal respectively. For text, we fine-tuned a multilingual BERT model and achieved 61.15% and 47.43% unweighted accuracy for valence and arousal respectively. This database will significantly contribute to research on emotion recognition in the wild, while also providing a unique natural and freely accessible resource for Spanish.

6/14/2024

EMOVOME Database: Advancing Emotion Recognition in Speech Beyond Staged Scenarios

Lucía Gómez-Zaragozá, Rocío del Amor, María José Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier Marín-Morales

Natural databases for Speech Emotion Recognition (SER) are scarce and often rely on staged scenarios, such as films or television shows, limiting their application in real-world contexts. We developed and publicly released the Emotional Voice Messages (EMOVOME) database, including 999 voice messages from real conversations of 100 Spanish speakers on a messaging app, labeled in continuous and discrete emotions by expert and non-expert annotators. We evaluated speaker-independent SER models using a standard set of acoustic features and transformer-based models. We compared the results with reference databases including acted and elicited speech, and analyzed the influence of annotators and gender fairness. The pre-trained UniSpeech-SAT-Large model achieved the highest results, 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction respectively on EMOVOME, a 10% improvement over baseline models. For the emotion categories, 42.58% UA was obtained. EMOVOME performed lower than the acted RAVDESS database. The elicited IEMOCAP database also outperformed EMOVOME in predicting emotion categories, while similar results were obtained in valence and arousal. EMOVOME outcomes varied with annotator labels, showing better results and fairness when combining expert and non-expert annotations. This study highlights the gap between staged and real-life scenarios, supporting further advancements in recognizing genuine emotions.

6/14/2024

VCEMO: Multi-Modal Emotion Recognition for Chinese Voiceprints

Jinghua Tang, Liyun Zhang, Yu Lu, Dian Ding, Lanqing Yang, YiChao Chen, Minjie Bian, Xiaoshan Li, Guangtao Xue

Emotion recognition can enhance humanized machine responses to user commands, while voiceprint-based perception systems can be easily integrated into commonly used devices like smartphones and stereos. Despite having the largest number of speakers, there is a noticeable absence of high-quality corpus datasets for emotion recognition using Chinese voiceprints. Hence, this paper introduces the VCEMO dataset to address this deficiency. The proposed dataset is constructed from everyday conversations and comprises over 100 users and 7,747 textual samples. Furthermore, this paper proposes a multimodal-based model as a benchmark, which effectively fuses speech, text, and external knowledge using a co-attention structure. The system employs contrastive learning-based regulation for the uneven distribution of the dataset and the diversity of emotional expressions. The experiments demonstrate the significant improvement of the proposed model over SOTA on the VCEMO and IEMOCAP datasets. Code and dataset will be released for research.

8/26/2024

Better Spanish Emotion Recognition In-the-wild: Bringing Attention to Deep Spectrum Voice Analysis

Elena Ortega-Beltrán, Josep Cabacas-Maso, Ismael Benito-Altamirano, Carles Ventura

Within the context of creating new Socially Assistive Robots, emotion recognition has become a key development factor, as it allows the robot to adapt to the user's emotional state in the wild. In this work, we focused on the analysis of two voice recording Spanish datasets: ELRA-S0329 and EmoMatchSpanishDB. Specifically, we centered our work in the paralanguage, e.g. the vocal characteristics that go along with the message and clarifies the meaning. We proposed the use of the DeepSpectrum method, which consists of extracting a visual representation of the audio tracks and feeding them to a pretrained CNN model. For the classification task, DeepSpectrum is often paired with a Support Vector Classifier (DS-SVC) or a Fully-Connected deep-learning classifier (DS-FC). We compared the results of the DS-SVC and DS-FC architectures with the state-of-the-art (SOTA) for ELRA-S0329 and EmoMatchSpanishDB. Moreover, we proposed our own classifier based upon Attention Mechanisms, namely DS-AM. We trained all models against both datasets, and we found that our DS-AM model outperforms the SOTA models for the datasets and the SOTA DeepSpectrum architectures. Finally, we trained our DS-AM model in one dataset and tested it in the other, to simulate real-world conditions on how biased is the model to the dataset.

9/10/2024