MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition

2404.18327

Published 5/17/2024 by Peihao Xiang, Chaohao Lin, Kaida Wu, Ou Bai

👁️

Abstract

This paper presents a novel approach to processing multimodal data for dynamic emotion recognition, named as the Multimodal Masked Autoencoder for Dynamic Emotion Recognition (MultiMAE-DER). The MultiMAE-DER leverages the closely correlated representation information within spatiotemporal sequences across visual and audio modalities. By utilizing a pre-trained masked autoencoder model, the MultiMAEDER is accomplished through simple, straightforward finetuning. The performance of the MultiMAE-DER is enhanced by optimizing six fusion strategies for multimodal input sequences. These strategies address dynamic feature correlations within cross-domain data across spatial, temporal, and spatiotemporal sequences. In comparison to state-of-the-art multimodal supervised learning models for dynamic emotion recognition, MultiMAE-DER enhances the weighted average recall (WAR) by 4.41% on the RAVDESS dataset and by 2.06% on the CREMAD. Furthermore, when compared with the state-of-the-art model of multimodal self-supervised learning, MultiMAE-DER achieves a 1.86% higher WAR on the IEMOCAP dataset.

Create account to get full access

Overview

The paper proposes a novel approach called Multimodal Masked Autoencoder for Dynamic Emotion Recognition (MultiMAE-DER) for processing multimodal data to recognize dynamic emotions.
MultiMAE-DER leverages the closely correlated representation information within spatiotemporal sequences across visual and audio modalities.
The model is trained through simple, straightforward fine-tuning of a pre-trained masked autoencoder.
The performance of MultiMAE-DER is enhanced by optimizing six fusion strategies for multimodal input sequences to address dynamic feature correlations across spatial, temporal, and spatiotemporal sequences.

Plain English Explanation

The researchers have developed a new method called MultiMAE-DER to recognize emotions from a combination of visual and audio data. This method builds on the idea of a masked autoencoder, which is a type of machine learning model that can learn useful representations from data by attempting to reconstruct parts of the input that have been hidden or "masked" out.

By using a pre-trained masked autoencoder as a starting point and then fine-tuning it on the specific task of emotion recognition, the researchers were able to create a model that performs better than previous state-of-the-art methods. The key insight is that the visual and audio data used to express emotions are closely linked, and by capturing the relationships between them, the model can make more accurate predictions.

To further improve the performance, the researchers explored different ways of combining the visual and audio information, using strategies that consider the dynamic nature of the data, such as how features change over time and space. This allows the model to better understand the complex patterns in the multimodal data and make more accurate emotion predictions.

Technical Explanation

The MultiMAE-DER model is built on the idea of a masked autoencoder, which is a type of self-supervised learning approach that has been shown to be effective for learning useful representations from data. By masking out parts of the input and training the model to reconstruct the missing information, the autoencoder can capture the underlying patterns and relationships in the data.

In the case of MultiMAE-DER, the researchers leverage this approach to learn a joint representation of visual and audio data for the task of dynamic emotion recognition. They fine-tune the pre-trained masked autoencoder model on the emotion recognition task, which allows the model to adapt the learned representations to the specific problem at hand.

To further enhance the performance of the model, the researchers explore six different fusion strategies to combine the visual and audio information. These strategies address the dynamic nature of the data, considering spatial, temporal, and spatiotemporal correlations across the multimodal input sequences.

The results show that MultiMAE-DER outperforms the state-of-the-art supervised and self-supervised multimodal models for dynamic emotion recognition on several benchmark datasets, demonstrating the effectiveness of the proposed approach.

Critical Analysis

The paper presents a well-designed and thorough study, with a clear focus on addressing the challenges of dynamic emotion recognition from multimodal data. The use of a pre-trained masked autoencoder as a starting point is a clever approach, as it allows the model to leverage the general representation learning capabilities of this type of architecture.

One potential limitation of the study is that it only considers visual and audio modalities. It would be interesting to see how the MultiMAE-DER approach would perform with the addition of other modalities, such as physiological signals or text, which could provide further insights into the emotional state of the individual.

Additionally, the paper does not delve into the interpretability of the learned representations or the specific mechanisms by which the model is able to capture the dynamic relationships between the visual and audio data. Further analysis in this area could shed light on the inner workings of the MultiMAE-DER and potentially lead to additional improvements or insights.

Overall, the MultiMAE-DER approach represents a significant contribution to the field of multimodal emotion recognition and demonstrates the value of leveraging self-supervised learning techniques for this challenge.

Conclusion

The MultiMAE-DER model proposed in this paper presents a novel and effective approach to dynamic emotion recognition from multimodal data. By building on the powerful representation learning capabilities of masked autoencoders and exploring strategic fusion of visual and audio information, the researchers have developed a model that outperforms previous state-of-the-art methods.

This work highlights the potential of self-supervised learning techniques, such as masked autoencoders, to tackle complex multimodal problems. The insights gained from this research could pave the way for further advancements in emotion recognition and other areas of multimodal learning, with applications in fields like human-computer interaction, mental health monitoring, and affective computing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild

Kateryna Chumachenko, Alexandros Iosifidis, Moncef Gabbouj

Dynamic Facial Expression Recognition (DFER) has received significant interest in the recent years dictated by its pivotal role in enabling empathic and human-compatible technologies. Achieving robustness towards in-the-wild data in DFER is particularly important for real-world applications. One of the directions aimed at improving such models is multimodal emotion recognition based on audio and video data. Multimodal learning in DFER increases the model capabilities by leveraging richer, complementary data representations. Within the field of multimodal DFER, recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders. Another line of research has focused on adapting pre-trained static models for DFER. In this work, we propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders. We identify main challenges associated with this task, namely, intra-modality adaptation, cross-modal alignment, and temporal adaptation, and propose solutions to each of them. As a result, we demonstrate improvement over current state-of-the-art on two popular DFER benchmarks, namely DFEW and MFAW.

4/16/2024

cs.CV cs.LG

🗣️

A vector quantized masked autoencoder for audiovisual speech emotion recognition

Samir Sadok, Simon Leglaive, Renaud S'eguier

The limited availability of labeled data is a major challenge in audiovisual speech emotion recognition (SER). Self-supervised learning approaches have recently been proposed to mitigate the need for labeled data in various applications. This paper proposes the VQ-MAE-AV model, a vector quantized masked autoencoder (MAE) designed for audiovisual speech self-supervised representation learning and applied to SER. Unlike previous approaches, the proposed method employs a self-supervised paradigm based on discrete audio and visual speech representations learned by vector quantized variational autoencoders. A multimodal MAE with self- or cross-attention mechanisms is proposed to fuse the audio and visual speech modalities and to learn local and global representations of the audiovisual speech sequence, which are then used for an SER downstream task. Experimental results show that the proposed approach, which is pre-trained on the VoxCeleb2 database and fine-tuned on standard emotional audiovisual speech datasets, outperforms the state-of-the-art audiovisual SER methods. Extensive ablation experiments are also provided to assess the contribution of the different model components.

5/16/2024

cs.SD cs.LG cs.MM eess.AS

Social-MAE: Social Masked Autoencoder for Multi-person Motion Representation Learning

Mahsa Ehsanpour, Ian Reid, Hamid Rezatofighi

For a complete comprehension of multi-person scenes, it is essential to go beyond basic tasks like detection and tracking. Higher-level tasks, such as understanding the interactions and social activities among individuals, are also crucial. Progress towards models that can fully understand scenes involving multiple people is hindered by a lack of sufficient annotated data for such high-level tasks. To address this challenge, we introduce Social-MAE, a simple yet effective transformer-based masked autoencoder framework for multi-person human motion data. The framework uses masked modeling to pre-train the encoder to reconstruct masked human joint trajectories, enabling it to learn generalizable and data efficient representations of motion in human crowded scenes. Social-MAE comprises a transformer as the MAE encoder and a lighter-weight transformer as the MAE decoder which operates on multi-person joints' trajectory in the frequency domain. After the reconstruction task, the MAE decoder is replaced with a task-specific decoder and the model is fine-tuned end-to-end for a variety of high-level social tasks. Our proposed model combined with our pre-training approach achieves the state-of-the-art results on various high-level social tasks, including multi-person pose forecasting, social grouping, and social action understanding. These improvements are demonstrated across four popular multi-person datasets encompassing both human 2D and 3D body pose.

4/9/2024

cs.CV

Self-supervised Pre-training for Transferable Multi-modal Perception

Xiaohao Xu, Tianyi Zhang, Jinrong Yang, Matthew Johnson-Roberson, Xiaonan Huang

In autonomous driving, multi-modal perception models leveraging inputs from multiple sensors exhibit strong robustness in degraded environments. However, these models face challenges in efficiently and effectively transferring learned representations across different modalities and tasks. This paper presents NeRF-Supervised Masked Auto Encoder (NS-MAE), a self-supervised pre-training paradigm for transferable multi-modal representation learning. NS-MAE is designed to provide pre-trained model initializations for efficient and high-performance fine-tuning. Our approach uses masked multi-modal reconstruction in neural radiance fields (NeRF), training the model to reconstruct missing or corrupted input data across multiple modalities. Specifically, multi-modal embeddings are extracted from corrupted LiDAR point clouds and images, conditioned on specific view directions and locations. These embeddings are then rendered into projected multi-modal feature maps using neural rendering techniques. The original multi-modal signals serve as reconstruction targets for the rendered feature maps, facilitating self-supervised representation learning. Extensive experiments demonstrate the promising transferability of NS-MAE representations across diverse multi-modal and single-modal perception models. This transferability is evaluated on various 3D perception downstream tasks, such as 3D object detection and BEV map segmentation, using different amounts of fine-tuning labeled data. Our code will be released to support the community.

5/29/2024

cs.CV cs.AI cs.RO