MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition

2404.17113

Published 5/24/2024 by Zheng Lian, Haiyang Sun, Licai Sun, Zhuofan Wen, Siyuan Zhang, Shun Chen, Hao Gu, Jinming Zhao, Ziyang Ma, Xie Chen and 8 others

cs.LG cs.HC

MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition

Abstract

Multimodal emotion recognition is an important research topic in artificial intelligence. Over the past few decades, researchers have made remarkable progress by increasing dataset size and building more effective architectures. However, due to various reasons (such as complex environments and inaccurate annotations), current systems are hard to meet the demands of practical applications. Therefore, we organize a series of challenges around emotion recognition to further promote the development of this area. Last year, we launched MER2023, focusing on three topics: multi-label learning, noise robustness, and semi-supervised learning. This year, we continue to organize MER2024. In addition to expanding the dataset size, we introduce a new track around open-vocabulary emotion recognition. The main consideration for this track is that existing datasets often fix the label space and use majority voting to enhance annotator consistency, but this process may limit the model's ability to describe subtle emotions. In this track, we encourage participants to generate any number of labels in any category, aiming to describe the emotional state as accurately as possible. Our baseline is based on MERTools and the code is available at: https://github.com/zeroQiaoba/MERTools/tree/master/MER2024.

Create account to get full access

Overview

This paper proposes a new multimodal emotion recognition (MER) system that uses semi-supervised learning, is robust to noise, and can handle open-vocabulary emotions.
The system leverages both labeled and unlabeled data to improve performance, and is designed to work well even when the input data is noisy or incomplete.
The model can recognize a wide range of emotions, including those not seen during training, making it more versatile than previous approaches.

Plain English Explanation

The researchers have developed a new system for automatically recognizing emotions from multiple data sources, such as speech, facial expressions, and body language. Traditional emotion recognition systems often struggle when the input data is noisy or incomplete, or when they encounter new types of emotions not used in training.

This new system addresses those challenges in a few key ways:

Semi-Supervised Learning: The model uses both labeled training data (where the emotions are already identified) and unlabeled data (where the emotions are unknown). By learning patterns from both types of data, the system can become more accurate and robust.
Noise Robustness: The system is designed to work well even when the input data is imperfect or has interference. This allows it to be used in real-world scenarios where the data may not be pristine.
Open-Vocabulary Emotion Recognition: The model can recognize a wide range of emotions, including ones that were not explicitly included in the training data. This makes the system more flexible and adaptable to different contexts.

By combining these innovations, the researchers have created a multimodal emotion recognition system that is more accurate, reliable, and versatile than previous approaches. This could have important applications in areas like human-computer interaction, mental health monitoring, and customer service.

Technical Explanation

The paper introduces a new Multimodal Emotion Recognition (MER) system that leverages semi-supervised learning, noise robustness, and open-vocabulary emotion recognition.

The core architecture of the model is a Multimodal Transformer that can fuse information from different input modalities, such as audio, video, and text. To enable semi-supervised learning, the system is trained on both labeled data (where the emotions are known) and unlabeled data (where the emotions are unknown).

The noise robustness is achieved through a dynamic modality selection mechanism that can adaptively choose the most reliable input modalities based on the quality of the data. This allows the system to maintain high performance even when some of the input signals are noisy or missing.

Finally, the open-vocabulary emotion recognition is enabled by a semantic-aware emotion classifier that can recognize a wide range of emotion categories, including ones not seen during training. This is achieved by leveraging external knowledge bases and language models.

The authors evaluate their system on several benchmark datasets, including MERBENCH, and show that it outperforms previous state-of-the-art approaches in terms of accuracy, robustness, and flexibility.

Critical Analysis

The paper presents a compelling approach to addressing some of the key challenges in multimodal emotion recognition. The use of semi-supervised learning, noise robustness, and open-vocabulary recognition are all important contributions that can significantly improve the real-world applicability of emotion recognition systems.

However, the paper does not fully address the potential ethical and privacy concerns associated with such technology. While the authors mention the need for responsible development, they do not delve into the implications of deploying a system that can recognize emotions from diverse data sources. There are valid concerns around user consent, data privacy, and the potential for misuse or abuse of this technology.

Additionally, the paper does not provide a thorough analysis of the limitations of the proposed approach. For example, it is unclear how well the system would perform in highly complex or ambiguous emotional scenarios, or how it would handle cultural differences in emotion expression. Further research and evaluation in these areas would be valuable.

Overall, the technical innovations presented in this paper are impressive, but the authors should also devote more attention to the broader societal and ethical considerations surrounding the deployment of such powerful emotion recognition systems.

Conclusion

The researchers have developed a multimodal emotion recognition system that addresses several key challenges in the field, including the need for semi-supervised learning, noise robustness, and open-vocabulary emotion recognition. By leveraging a Multimodal Transformer architecture and dynamic modality selection, the system can achieve high accuracy and flexibility in recognizing a wide range of emotions, even in the presence of noisy or incomplete input data.

This research represents an important step forward in making emotion recognition systems more practical and useful in real-world applications, such as human-computer interaction, mental health monitoring, and customer service. However, the authors should also consider the potential ethical and privacy implications of deploying such powerful technology, and explore ways to ensure its responsible development and deployment.

Overall, this paper makes a valuable contribution to the field of multimodal emotion recognition, and the ideas presented here could have a significant impact on the future of this technology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MERBench: A Unified Evaluation Benchmark for Multimodal Emotion Recognition

Zheng Lian, Licai Sun, Yong Ren, Hao Gu, Haiyang Sun, Lan Chen, Bin Liu, Jianhua Tao

Multimodal emotion recognition plays a crucial role in enhancing user experience in human-computer interaction. Over the past few decades, researchers have proposed a series of algorithms and achieved impressive progress. Although each method shows its superior performance, different methods lack a fair comparison due to inconsistencies in feature extractors, evaluation manners, and experimental settings. These inconsistencies severely hinder the development of this field. Therefore, we build MERBench, a unified evaluation benchmark for multimodal emotion recognition. We aim to reveal the contribution of some important techniques employed in previous works, such as feature selection, multimodal fusion, robustness analysis, fine-tuning, pre-training, etc. We hope this benchmark can provide clear and comprehensive guidance for follow-up researchers. Based on the evaluation results of MERBench, we further point out some promising research directions. Additionally, we introduce a new emotion dataset MER2023, focusing on the Chinese language environment. This dataset can serve as a benchmark dataset for research on multi-label learning, noise robustness, and semi-supervised learning. We encourage the follow-up researchers to evaluate their algorithms under the same experimental setup as MERBench for fair comparisons. Our code is available at: https://github.com/zeroQiaoba/MERTools.

4/23/2024

cs.HC

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, Alexander Hauptmann

Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on MER2023 challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on DFEW dataset.

6/18/2024

cs.AI cs.MM

👁️

Learning Noise-Robust Joint Representation for Multimodal Emotion Recognition under Incomplete Data Scenarios

Qi Fan (Inner Mongolia University, Hohhot, China), Haolin Zuo (Inner Mongolia University, Hohhot, China), Rui Liu (Inner Mongolia University, Hohhot, China), Zheng Lian (Institute of Automation, Chinese Academy of Sciences, Beijing, China), Guanglai Gao (Inner Mongolia University, Hohhot, China)

Multimodal emotion recognition (MER) in practical scenarios is significantly challenged by the presence of missing or incomplete data across different modalities. To overcome these challenges, researchers have aimed to simulate incomplete conditions during the training phase to enhance the system's overall robustness. Traditional methods have often involved discarding data or substituting data segments with zero vectors to approximate these incompletenesses. However, such approaches neither accurately represent real-world conditions nor adequately address the issue of noisy data availability. For instance, a blurry image cannot be simply replaced with zero vectors, and still retain information. To tackle this issue and develop a more precise MER system, we introduce a novel noise-robust MER model that effectively learns robust multimodal joint representations from noisy data. This approach includes two pivotal components: firstly, a noise scheduler that adjusts the type and level of noise in the data to emulate various realistic incomplete situations. Secondly, a Variational AutoEncoder (VAE)-based module is employed to reconstruct these robust multimodal joint representations from the noisy inputs. Notably, the introduction of the noise scheduler enables the exploration of an entirely new type of incomplete data condition, which is impossible with existing methods. Extensive experimental evaluations on the benchmark datasets IEMOCAP and CMU-MOSEI demonstrate the effectiveness of the noise scheduler and the excellent performance of our proposed model.

5/8/2024

cs.CV cs.AI cs.LG

Joint Multimodal Transformer for Emotion Recognition in the Wild

Paul Waligora, Haseeb Aslam, Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger

Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships between, e.g., visual, textual, physiological, and auditory modalities. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention. This framework can exploit the complementary nature of diverse modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, our JMT fusion architecture integrates the individual modality embeddings, allowing the model to effectively capture inter- and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks -- (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the Biovid dataset (with face and biosensors) -- indicate that our JMT fusion can provide a cost-effective solution for MMER. Empirical results show that MMER systems with our proposed fusion allow us to outperform relevant baseline and state-of-the-art methods.

4/23/2024

cs.CV cs.LG cs.SD eess.AS