VCEMO: Multi-Modal Emotion Recognition for Chinese Voiceprints

Read original: arXiv:2408.13019 - Published 8/26/2024 by Jinghua Tang, Liyun Zhang, Yu Lu, Dian Ding, Lanqing Yang, YiChao Chen, Minjie Bian, Xiaoshan Li, Guangtao Xue

Overview

The paper introduces a new Chinese voiceprint dataset called VCEMO for emotion recognition.
It proposes a multimodal model that fuses speech, text, and external knowledge through a co-attention structure to recognize emotions.
The model uses contrastive learning-based regulation to address the uneven distribution of the dataset and diversity of emotional expressions.
Experiments show significant improvement over state-of-the-art methods on the VCEMO and IEMOCAP datasets.

Plain English Explanation

The paper focuses on improving how machines can understand and respond to human emotions. It does this by introducing a new dataset of Chinese voiceprints (recordings of people's voices) that can be used to train AI systems to recognize emotions.

High-quality datasets for emotion recognition from Chinese voiceprints are scarce, even though Chinese is one of the most widely spoken languages. The VCEMO dataset introduced in this paper aims to address this gap. It was built from everyday conversations and comprises over 100 users and 7,747 textual samples.

To make use of this new dataset, the paper also proposes a multimodal-based model that combines information from speech, text, and external knowledge to recognize emotions. This model uses a co-attention structure to effectively fuse these different inputs. It also employs contrastive learning-based regulation to handle the uneven distribution of the dataset and the diversity of emotional expressions.

The experiments show that this new model significantly outperforms the current state-of-the-art approaches on both the VCEMO dataset and the IEMOCAP dataset (another benchmark for emotion recognition). This suggests the proposed approach is a promising way to build AI systems that can better understand and respond to human emotions, especially for Chinese speakers.

Technical Explanation

The paper introduces the VCEMO dataset, a new Chinese voiceprint corpus for emotion recognition. This dataset was constructed from everyday conversations and contains over 100 users and 7,747 textual samples. It aims to address the lack of high-quality Chinese emotion recognition datasets.

To leverage this new dataset, the paper proposes a multimodal-based model that fuses speech, text, and external knowledge using a co-attention structure. This allows the model to effectively combine information from different modalities to recognize emotions.
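The co-attention idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' architecture: it covers only two modalities (speech and text), uses no learned projection matrices, and the helper names `attend`, `mean_pool`, and `co_attention_fuse` are hypothetical. Each modality queries the other with scaled dot-product attention, and the two attended summaries are concatenated.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention: every query row attends over all key rows
    # and returns a weighted average of the value rows.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def mean_pool(seq):
    # Average a sequence of vectors into a single vector.
    n = len(seq)
    return [sum(row[j] for row in seq) / n for j in range(len(seq[0]))]

def co_attention_fuse(speech, text):
    # Assumption: both modalities already share the same feature dimension.
    speech_ctx = attend(speech, text, text)    # speech frames query text tokens
    text_ctx = attend(text, speech, speech)    # text tokens query speech frames
    return mean_pool(speech_ctx) + mean_pool(text_ctx)  # list concatenation

# Toy inputs: 3 speech frames and 2 text tokens, each with 4 features.
speech = [[0.1, 0.3, 0.5, 0.7], [0.2, 0.1, 0.0, 0.4], [0.9, 0.8, 0.1, 0.2]]
text = [[0.5, 0.5, 0.5, 0.5], [0.0, 1.0, 0.0, 1.0]]
fused = co_attention_fuse(speech, text)
print(len(fused))
```

A third stream for external knowledge could be attended to in the same way. How the paper wires it in is not described in this summary, so it is left out of the sketch.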

The model also employs contrastive learning-based regulation to address two key challenges:

The uneven distribution of emotional samples in the dataset
The diversity of emotional expressions

By using contrastive learning, the model is better able to learn discriminative features that capture the nuances of different emotional states.
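To make the contrastive objective concrete, here is a minimal sketch of a supervised contrastive loss, in which samples sharing an emotion label are pulled together and all others are pushed apart. This summary does not give the paper's exact loss, so the formulation, the `temperature` value, and the function name `supervised_contrastive_loss` are assumptions chosen for illustration.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    # For each anchor, average -log p(positive | anchor) over its same-label
    # partners, where p is a softmax over similarities to all other samples.
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy embeddings forming two tight clusters.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(points, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(points, [0, 1, 0, 1])
print(loss_matching < loss_mismatched)
```

In this formulation each anchor's term is averaged over its own positives, so a rare emotion class is not outweighed simply because it has fewer same-label pairs. That is one plausible reading of how a contrastive term could help with an uneven label distribution.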

Experiments on the VCEMO and IEMOCAP datasets show the proposed model improving significantly over state-of-the-art approaches, which indicates that the fusion and contrastive components carry over from the new Chinese corpus to an established benchmark.

Critical Analysis

The paper makes a valuable contribution by introducing the VCEMO dataset, which helps address the lack of high-quality Chinese emotion recognition datasets. This is an important step forward, as Chinese is one of the most widely spoken languages globally.

However, the paper does not provide much detail on the process of constructing the VCEMO dataset. It would be helpful to know more about the data collection methods, the demographics of the participants, and any potential biases or limitations of the dataset.

Additionally, while the proposed multimodal model shows strong performance, the paper does not extensively compare it to other multimodal approaches. It would be interesting to see how the model's architecture and training strategy compare to other recent advances in multimodal emotion recognition.

Finally, the paper does not discuss the potential real-world applications or societal implications of this technology. As emotion recognition systems become more advanced, it will be important to consider ethical concerns, such as privacy, bias, and the impact on human-AI interaction.

Conclusion

This paper introduces a new Chinese voiceprint dataset, VCEMO, and a multimodal-based model that effectively combines speech, text, and external knowledge to recognize emotions. The model's use of contrastive learning-based regulation helps address the challenges of uneven dataset distribution and diverse emotional expressions.

Experiments demonstrate the proposed approach significantly outperforms state-of-the-art methods on both the VCEMO and IEMOCAP datasets. This suggests the model is a promising step towards building AI systems that can better understand and respond to human emotions, particularly for Chinese-speaking users.

While the paper makes valuable contributions, further research is needed to address potential limitations and explore the broader societal implications of this technology. Overall, this work represents an important advancement in the field of emotion recognition and its integration into human-machine interaction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

VCEMO: Multi-Modal Emotion Recognition for Chinese Voiceprints

Jinghua Tang, Liyun Zhang, Yu Lu, Dian Ding, Lanqing Yang, YiChao Chen, Minjie Bian, Xiaoshan Li, Guangtao Xue

Emotion recognition can enhance humanized machine responses to user commands, while voiceprint-based perception systems can be easily integrated into commonly used devices like smartphones and stereos. Despite having the largest number of speakers, there is a noticeable absence of high-quality corpus datasets for emotion recognition using Chinese voiceprints. Hence, this paper introduces the VCEMO dataset to address this deficiency. The proposed dataset is constructed from everyday conversations and comprises over 100 users and 7,747 textual samples. Furthermore, this paper proposes a multimodal-based model as a benchmark, which effectively fuses speech, text, and external knowledge using a co-attention structure. The system employs contrastive learning-based regulation for the uneven distribution of the dataset and the diversity of emotional expressions. The experiments demonstrate the significant improvement of the proposed model over SOTA on the VCEMO and IEMOCAP datasets. Code and dataset will be released for research.

8/26/2024

Multimodal Emotion Recognition with Vision-language Prompting and Modality Dropout

Anbin QI, Zhongliang Liu, Xinyong Zhou, Jinba Xiao, Fengrun Zhang, Qi Gan, Ming Tao, Gaozheng Zhang, Lu Zhang

In this paper, we present our solution for the Second Multimodal Emotion Recognition Challenge Track 1 (MER2024-SEMI). To enhance the accuracy and generalization performance of emotion recognition, we propose several methods for Multimodal Emotion Recognition. Firstly, we introduce EmoVCLIP, a model fine-tuned based on CLIP using vision-language prompt learning, designed for video-based emotion recognition tasks. By leveraging prompt learning on CLIP, EmoVCLIP improves the performance of pre-trained CLIP on emotional videos. Additionally, to address the issue of modality dependence in multimodal fusion, we employ modality dropout for robust information fusion. Furthermore, to aid Baichuan in better extracting emotional information, we suggest using GPT-4 as the prompt for Baichuan. Lastly, we utilize a self-training strategy to leverage unlabeled videos. In this process, we use unlabeled videos with high-confidence pseudo-labels generated by our model and incorporate them into the training set. Experimental results demonstrate that our model ranks 1st in the MER2024-SEMI track, achieving an accuracy of 90.15% on the test set.

9/12/2024

EMOVOME Database: Advancing Emotion Recognition in Speech Beyond Staged Scenarios

Lucía Gómez-Zaragozá, Rocío del Amor, María José Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier Marín-Morales

Natural databases for Speech Emotion Recognition (SER) are scarce and often rely on staged scenarios, such as films or television shows, limiting their application in real-world contexts. We developed and publicly released the Emotional Voice Messages (EMOVOME) database, including 999 voice messages from real conversations of 100 Spanish speakers on a messaging app, labeled in continuous and discrete emotions by expert and non-expert annotators. We evaluated speaker-independent SER models using a standard set of acoustic features and transformer-based models. We compared the results with reference databases including acted and elicited speech, and analyzed the influence of annotators and gender fairness. The pre-trained UniSpeech-SAT-Large model achieved the highest results, 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction respectively on EMOVOME, a 10% improvement over baseline models. For the emotion categories, 42.58% UA was obtained. EMOVOME performed lower than the acted RAVDESS database. The elicited IEMOCAP database also outperformed EMOVOME in predicting emotion categories, while similar results were obtained in valence and arousal. EMOVOME outcomes varied with annotator labels, showing better results and fairness when combining expert and non-expert annotations. This study highlights the gap between staged and real-life scenarios, supporting further advancements in recognizing genuine emotions.

6/14/2024

Emotional Voice Messages (EMOVOME) database: emotion recognition in spontaneous voice messages

Lucía Gómez Zaragozá (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Rocío del Amor (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Elena Parra Vargas (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Valery Naranjo (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Mariano Alcañiz Raya (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain), Javier Marín-Morales (HUMAN-tech Institute, Universitat Politècnica de València, Valencia, Spain)

Emotional Voice Messages (EMOVOME) is a spontaneous speech dataset containing 999 audio messages from real conversations on a messaging app from 100 Spanish speakers, gender balanced. Voice messages were produced in in-the-wild conditions before participants were recruited, avoiding any conscious bias due to laboratory environment. Audios were labeled in valence and arousal dimensions by three non-experts and two experts, which were then combined to obtain a final label per dimension. The experts also provided an extra label corresponding to seven emotion categories. To set a baseline for future investigations using EMOVOME, we implemented emotion recognition models using both speech and audio transcriptions. For speech, we used the standard eGeMAPS feature set and support vector machines, obtaining 49.27% and 44.71% unweighted accuracy for valence and arousal respectively. For text, we fine-tuned a multilingual BERT model and achieved 61.15% and 47.43% unweighted accuracy for valence and arousal respectively. This database will significantly contribute to research on emotion recognition in the wild, while also providing a unique natural and freely accessible resource for Spanish.

6/14/2024