Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction

2403.11879

Published 6/18/2024 by Tobias Hallmen, Fabian Deuser, Norbert Oswald, Elisabeth Andr'e

Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction

Abstract

In this research, we introduce a novel methodology for assessing Emotional Mimicry Intensity (EMI) as part of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild. Our methodology utilises the Wav2Vec 2.0 architecture, which has been pre-trained on an extensive podcast dataset, to capture a wide array of audio features that include both linguistic and paralinguistic components. We refine our feature extraction process by employing a fusion technique that combines individual features with a global mean vector, thereby embedding a broader contextual understanding into our analysis. A key aspect of our approach is the multi-task fusion strategy that not only leverages these features but also incorporates a pre-trained Valence-Arousal-Dominance (VAD) model. This integration is designed to refine emotion intensity prediction by concurrently processing multiple emotional dimensions, thereby embedding a richer contextual understanding into our framework. For the temporal analysis of audio data, our feature fusion process utilises a Long Short-Term Memory (LSTM) network. This approach, which relies solely on the provided audio data, shows marked advancements over the existing baseline, offering a more comprehensive understanding of emotional mimicry in naturalistic settings, achieving the second place in the EMI challenge.

Create account to get full access

Overview

This paper presents a novel approach for predicting emotional mimicry, a phenomenon where people unconsciously mirror the emotional expressions of others.
The authors propose a unimodal multi-task fusion model that combines information from multiple modalities (such as facial expressions, body language, and speech) to improve the accuracy of emotional mimicry prediction.
The model is trained and evaluated on a dataset of human interactions, demonstrating its effectiveness in capturing the nuances of emotional mimicry.

Plain English Explanation

When we interact with others, we often unconsciously mirror their emotional expressions. For example, if someone is smiling, we may find ourselves smiling back without even realizing it. This phenomenon is known as emotional mimicry, and it plays a crucial role in social interactions and human connection.

The authors of this paper have developed a new system that can predict emotional mimicry more accurately than previous methods. Their approach, called a unimodal multi-task fusion model, takes into account multiple sources of information, such as facial expressions, body language, and speech, to better understand the subtle cues that contribute to emotional mimicry.

By combining these different modalities, the model can capture a more holistic understanding of the emotional dynamics at play during human interactions. This improved accuracy in predicting emotional mimicry could have important applications in fields like human-computer interaction, emotion recognition, and social robotics, where understanding and responding to emotional cues is crucial.

Technical Explanation

The authors' unimodal multi-task fusion model is designed to leverage information from multiple modalities, including facial expressions, body language, and speech, to improve the accuracy of emotional mimicry prediction. The model consists of several key components:

Unimodal Feature Extraction: The model starts by extracting relevant features from each modality independently, using specialized neural network architectures tailored to the characteristics of each data type.
Multi-Task Learning: The model is trained to perform multiple related tasks simultaneously, such as predicting the intensity of different emotional expressions and the likelihood of emotional mimicry. This multi-task approach allows the model to learn more robust and generalizable representations.
Fusion Module: The extracted features from each modality are then combined using a fusion module, which learns to integrate the information from different sources effectively. This fusion step is crucial for capturing the complex interplay between various emotional cues.

The authors evaluate their model on a dataset of human interactions, where participants were recorded while engaged in conversational tasks. The results demonstrate that the unimodal multi-task fusion model outperforms previous approaches in predicting emotional mimicry, highlighting the benefits of the proposed architecture.

Critical Analysis

The authors provide a thorough evaluation of their model's performance, including comparisons to various baseline methods. However, the paper does not delve deeply into the potential limitations or caveats of the proposed approach.

One area that could benefit from further exploration is the model's generalizability to different cultural contexts or interactions with diverse populations. Emotional expression and mimicry can be heavily influenced by cultural norms and individual differences, which may not be fully captured by the current dataset.

Additionally, the paper does not address potential ethical concerns related to the use of such emotion recognition technologies, such as privacy implications or the potential for biased or discriminatory predictions. As these systems become more prevalent, it is crucial to consider the societal impact and ensure they are developed and deployed responsibly.

Conclusion

The unimodal multi-task fusion model presented in this paper represents a significant advancement in the field of emotional mimicry prediction. By leveraging multiple modalities and a multi-task learning approach, the model can capture the nuanced interplay of various emotional cues, leading to improved accuracy in predicting this fundamental aspect of human interaction.

The potential applications of this research are vast, ranging from improving human-computer interaction to enhancing emotion recognition and social robotics. As the field of multimodal fusion continues to evolve, this work sets the stage for further advancements in understanding and modeling the complex emotional dynamics that shape our social interactions.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🌐

A Multimodal Fusion Network For Student Emotion Recognition Based on Transformer and Tensor Product

Ao Xiang, Zongqing Qi, Han Wang, Qin Yang, Danqing Ma

This paper introduces a new multi-modal model based on the Transformer architecture and tensor product fusion strategy, combining BERT's text vectors and ViT's image vectors to classify students' psychological conditions, with an accuracy of 93.65%. The purpose of the study is to accurately analyze the mental health status of students from various data sources. This paper discusses modal fusion methods, including early, late and intermediate fusion, to overcome the challenges of integrating multi-modal information. Ablation studies compare the performance of different models and fusion techniques, showing that the proposed model outperforms existing methods such as CLIP and ViLBERT in terms of accuracy and inference speed. Conclusions indicate that while this model has significant advantages in emotion recognition, its potential to incorporate other data modalities provides areas for future research.

4/22/2024

cs.CV

Joint Multimodal Transformer for Emotion Recognition in the Wild

Paul Waligora, Haseeb Aslam, Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger

Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships between, e.g., visual, textual, physiological, and auditory modalities. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention. This framework can exploit the complementary nature of diverse modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, our JMT fusion architecture integrates the individual modality embeddings, allowing the model to effectively capture inter- and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks -- (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the Biovid dataset (with face and biosensors) -- indicate that our JMT fusion can provide a cost-effective solution for MMER. Empirical results show that MMER systems with our proposed fusion allow us to outperform relevant baseline and state-of-the-art methods.

4/23/2024

cs.CV cs.LG cs.SD eess.AS

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, Alexander Hauptmann

Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on MER2023 challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on DFEW dataset.

6/18/2024

cs.AI cs.MM

Graph-based multi-Feature fusion method for speech emotion recognition

Xueyu Liu, Jie Lin, Chao Wang

Exploring proper way to conduct multi-speech feature fusion for cross-corpus speech emotion recognition is crucial as different speech features could provide complementary cues reflecting human emotion status. While most previous approaches only extract a single speech feature for emotion recognition, existing fusion methods such as concatenation, parallel connection, and splicing ignore heterogeneous patterns in the interaction between features and features, resulting in performance of existing systems. In this paper, we propose a novel graph-based fusion method to explicitly model the relationships between every pair of speech features. Specifically, we propose a multi-dimensional edge features learning strategy called Graph-based multi-Feature fusion method for speech emotion recognition. It represents each speech feature as a node and learns multi-dimensional edge features to explicitly describe the relationship between each feature-feature pair in the context of emotion recognition. This way, the learned multi-dimensional edge features encode speech feature-level information from both the vertex and edge dimensions. Our Approach consists of three modules: an Audio Feature Generation(AFG)module, an Audio-Feature Multi-dimensional Edge Feature(AMEF) module and a Speech Emotion Recognition (SER) module. The proposed methodology yielded satisfactory outcomes on the SEWA dataset. Furthermore, the method demonstrated enhanced performance compared to the baseline in the AVEC 2019 Workshop and Challenge. We used data from two cultures as our training and validation sets: two cultures containing German and Hungarian on the SEWA dataset, the CCC scores for German are improved by 17.28% for arousal and 7.93% for liking. The outcomes of our methodology demonstrate a 13% improvement over alternative fusion techniques, including those employing one dimensional edge-based feature fusion approach.

6/14/2024

cs.SD eess.AS