Multimodal Emotion Recognition with Vision-language Prompting and Modality Dropout

Read original: arXiv:2409.07078 - Published 9/12/2024 by Anbin QI, Zhongliang Liu, Xinyong Zhou, Jinba Xiao, Fengrun Zhang, Qi Gan, Ming Tao, Gaozheng Zhang, Lu Zhang

Overview

This paper proposes a multimodal emotion recognition (MER) method that combines vision-language prompting and modality dropout.
The authors introduce EmoVCLIP, a model fine-tuned from CLIP (a pre-trained vision-language model) with vision-language prompt learning, for video-based emotion recognition.
Modality dropout is used during training so that multimodal fusion does not depend too heavily on any single modality.
The system ranked 1st in the MER2024-SEMI track, with an accuracy of 90.15% on the test set.

Plain English Explanation

The paper describes a new way to recognize emotions from multiple types of information, such as video, audio, and text. The researchers started from CLIP, a vision-language model trained on a huge amount of online image-text data, and adapted it to emotional videos with a technique called prompt learning. They also used a technique called modality dropout during training.

Modality dropout means randomly hiding some of the input information (like the image or the text) during training. This helps the model learn to be robust and still make accurate predictions even when some information is missing.

The researchers showed that this approach works very well: their system ranked first in the MER2024-SEMI challenge track, with 90.15% accuracy on the test set. This suggests it could be a useful tool for building emotion AI systems that cope with messy, incomplete real-world data.

Technical Explanation

The method was developed as a solution for Track 1 (MER2024-SEMI) of the Second Multimodal Emotion Recognition Challenge. It combines vision-language prompting with modality dropout.

Vision-language prompting: The method introduces EmoVCLIP, a model fine-tuned from CLIP using vision-language prompt learning and designed for video-based emotion recognition. CLIP is a vision-language model pre-trained on a large amount of image-text data, so prompt learning lets EmoVCLIP reuse the multimodal representations learned during pre-training while improving CLIP's performance on emotional videos.
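
As a rough illustration of the prompting idea, the sketch below shows only the forward pass of text-side prompt learning: a few learnable context vectors are combined with each emotion's class embedding, and the video feature is scored against the resulting text features by cosine similarity. Everything here is an assumption made for illustration, including the toy mean-pooling `encode_prompt` (a stand-in for a frozen text encoder), the vector sizes, and the `scale` value. It is not EmoVCLIP's architecture, and it omits the vision-side prompts and the training loop.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def encode_prompt(context, class_embedding):
    """Hypothetical stand-in for a frozen text encoder: mean-pool all tokens."""
    tokens = context + [class_embedding]
    dim = len(class_embedding)
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]

def classify(video_feature, context, class_embeddings, scale=10.0):
    """Score a video feature against one prompted text feature per emotion."""
    logits = [
        scale * cosine(video_feature, encode_prompt(context, emb))
        for emb in class_embeddings.values()
    ]
    return dict(zip(class_embeddings, softmax(logits)))

# In prompt learning, `context` would be the trainable part (assumed values here).
context = [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]]
class_embeddings = {"happy": [1.0, 0.0, 0.0], "sad": [0.0, 1.0, 0.0]}
probs = classify([1.0, 0.0, 0.0], context, class_embeddings)
```

In this toy setup the video feature points in the same direction as the "happy" embedding, so `probs["happy"]` comes out larger than `probs["sad"]`. In a real system the context vectors would be updated by gradient descent rather than fixed by hand.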

Modality dropout: During training, the method randomly drops out (i.e., hides) input modalities. This addresses modality dependence in multimodal fusion: because the model cannot lean on any single modality, it is pushed to learn representations that still work when a modality is missing at inference time. The modality dropout ratio is annealed over training epochs.
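
A minimal sketch of modality dropout with an annealed ratio might look like the following. The linear schedule, the start and end values, the zero-filling of dropped features, and the rule that at least one modality is always kept are all assumptions made for illustration; the summary does not specify how the paper implements any of them.

```python
import random

def dropout_ratio(epoch, total_epochs, start=0.5, end=0.0):
    """Linearly anneal the drop probability from `start` to `end` (assumed schedule)."""
    if total_epochs <= 1:
        return end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + (end - start) * frac

def modality_dropout(features, p, rng=random):
    """Zero out each modality with probability p, always keeping at least one.

    `features` maps a modality name to a feature vector (list of floats).
    """
    names = list(features)
    dropped = [name for name in names if rng.random() < p]
    if len(dropped) == len(names):
        # Never hide every modality: restore one at random.
        dropped.remove(rng.choice(dropped))
    return {
        name: [0.0] * len(vec) if name in dropped else list(vec)
        for name, vec in features.items()
    }

# Hypothetical per-sample features for the three MER2024 signal types.
sample = {"audio": [0.2, 0.4], "video": [0.1, 0.3], "text": [0.5, 0.6]}
rng = random.Random(0)
for epoch in range(3):
    p = dropout_ratio(epoch, total_epochs=3)
    augmented = modality_dropout(sample, p, rng)
```

Dropping is applied only during training; at inference time the features would be passed through unchanged.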

The model is trained in a semi-supervised fashion, using both labeled and unlabeled data. The unlabeled videos are handled with a self-training strategy: the model generates pseudo-labels for them, and the videos whose pseudo-labels have high confidence are added to the training set.
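
The paper's abstract describes the use of unlabeled videos as self-training with high-confidence pseudo-labels. One round of that idea can be sketched as below. The 0.9 confidence threshold, the data layout, and the function names are hypothetical; the summary does not give the paper's actual selection criterion.

```python
def select_pseudo_labels(predictions, threshold=0.9):
    """Keep unlabeled samples whose top class probability meets the threshold.

    `predictions` maps a sample id to a dict of class -> probability.
    Returns a list of (sample_id, pseudo_label) pairs.
    """
    selected = []
    for sample_id, probs in predictions.items():
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            selected.append((sample_id, label))
    return selected

def self_training_round(labeled, unlabeled_predictions, threshold=0.9):
    """Return the labeled set extended with confident pseudo-labeled samples."""
    pseudo = select_pseudo_labels(unlabeled_predictions, threshold)
    return list(labeled) + pseudo

# Hypothetical model outputs on two unlabeled clips.
labeled = [("clip_001", "happy")]
predictions = {
    "clip_101": {"happy": 0.95, "sad": 0.05},
    "clip_102": {"happy": 0.55, "sad": 0.45},
}
expanded = self_training_round(labeled, predictions)
```

Here only `clip_101` clears the threshold, so `expanded` holds the original labeled clip plus `("clip_101", "happy")`. The model would then be retrained on the expanded set, and the round could be repeated.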

Experimental results show that the model ranks 1st in the MER2024-SEMI track, achieving an accuracy of 90.15% on the test set.

Critical Analysis

The paper provides a compelling approach to improving multimodal emotion recognition by leveraging powerful pre-trained vision-language models and using modality dropout to improve robustness.

However, the paper does not discuss some potential limitations or areas for further research:

The performance gain from modality dropout may be dependent on the specific dataset and distribution of missing modalities. Further analysis is needed to understand its generalization.
The semi-supervised learning approach relies on having a sufficient amount of unlabeled data, which may not always be available in real-world settings.
While CLIP provides strong multimodal representations, its performance may still be limited by the biases and limitations of the data it was trained on. Exploring alternative pre-trained models could be valuable.

Overall, the technique proposed in this paper represents an important advancement in multimodal emotion recognition, but there are still interesting avenues for future research to further improve the robustness and generalization of these models.

Conclusion

This paper introduces an MER method that combines vision-language prompting and modality dropout. By adapting CLIP through prompt learning and using modality dropout during training, the approach ranked 1st in the MER2024-SEMI track with 90.15% accuracy on the test set.

The key insights are that leveraging pre-trained multimodal representations and explicitly training for robustness to missing modalities can significantly improve the performance of MER systems. This has important implications for building practical emotion AI applications that can handle real-world data challenges.

While the paper demonstrates the effectiveness of this approach, there are still open questions around its generalization and potential limitations that warrant further investigation. Nonetheless, this work represents an important step forward in multimodal emotion recognition research.

Related Papers

Multimodal Emotion Recognition with Vision-language Prompting and Modality Dropout

Anbin QI, Zhongliang Liu, Xinyong Zhou, Jinba Xiao, Fengrun Zhang, Qi Gan, Ming Tao, Gaozheng Zhang, Lu Zhang

In this paper, we present our solution for the Second Multimodal Emotion Recognition Challenge Track 1 (MER2024-SEMI). To enhance the accuracy and generalization performance of emotion recognition, we propose several methods for Multimodal Emotion Recognition. Firstly, we introduce EmoVCLIP, a model fine-tuned based on CLIP using vision-language prompt learning, designed for video-based emotion recognition tasks. By leveraging prompt learning on CLIP, EmoVCLIP improves the performance of pre-trained CLIP on emotional videos. Additionally, to address the issue of modality dependence in multimodal fusion, we employ modality dropout for robust information fusion. Furthermore, to aid Baichuan in better extracting emotional information, we suggest using GPT-4 as the prompt for Baichuan. Lastly, we utilize a self-training strategy to leverage unlabeled videos. In this process, we use unlabeled videos with high-confidence pseudo-labels generated by our model and incorporate them into the training set. Experimental results demonstrate that our model ranks 1st in the MER2024-SEMI track, achieving an accuracy of 90.15% on the test set.

9/12/2024

Leveraging Contrastive Learning and Self-Training for Multimodal Emotion Recognition with Limited Labeled Samples

Qi Fan, Yutong Li, Yi Xin, Xinyu Cheng, Guanglai Gao, Miao Ma

The Multimodal Emotion Recognition challenge MER2024 focuses on recognizing emotions using audio, language, and visual signals. In this paper, we present our submission solutions for the Semi-Supervised Learning Sub-Challenge (MER2024-SEMI), which tackles the issue of limited annotated data in emotion recognition. Firstly, to address the class imbalance, we adopt an oversampling strategy. Secondly, we propose a modality representation combinatorial contrastive learning (MR-CCL) framework on the trimodal input data to establish robust initial models. Thirdly, we explore a self-training approach to expand the training set. Finally, we enhance prediction robustness through a multi-classifier weighted soft voting strategy. Our proposed method is validated to be effective on the MER2024-SEMI Challenge, achieving a weighted average F-score of 88.25% and ranking 6th on the leaderboard. Our project is available at https://github.com/WooyoohL/MER2024-SEMI.

9/10/2024

Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition

Zirun Guo, Tao Jin, Zhou Zhao

The development of multimodal models has significantly advanced multimodal sentiment analysis and emotion recognition. However, in real-world applications, the presence of various missing modality cases often leads to a degradation in the model's performance. In this work, we propose a novel multimodal Transformer framework using prompt learning to address the issue of missing modalities. Our method introduces three types of prompts: generative prompts, missing-signal prompts, and missing-type prompts. These prompts enable the generation of missing modality features and facilitate the learning of intra- and inter-modality information. Through prompt learning, we achieve a substantial reduction in the number of trainable parameters. Our proposed method outperforms other methods significantly across all evaluation metrics. Extensive experiments and ablation studies are conducted to demonstrate the effectiveness and robustness of our method, showcasing its ability to effectively handle missing modalities.

7/9/2024

MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition

Zheng Lian, Haiyang Sun, Licai Sun, Zhuofan Wen, Siyuan Zhang, Shun Chen, Hao Gu, Jinming Zhao, Ziyang Ma, Xie Chen, Jiangyan Yi, Rui Liu, Kele Xu, Bin Liu, Erik Cambria, Guoying Zhao, Bjorn W. Schuller, Jianhua Tao

Multimodal emotion recognition is an important research topic in artificial intelligence. Over the past few decades, researchers have made remarkable progress by increasing the dataset size and building more effective algorithms. However, due to problems such as complex environments and inaccurate annotations, current systems are hard to meet the demands of practical applications. Therefore, we organize the MER series of competitions to promote the development of this field. Last year, we launched MER2023, focusing on three interesting topics: multi-label learning, noise robustness, and semi-supervised learning. In this year's MER2024, besides expanding the dataset size, we further introduce a new track around open-vocabulary emotion recognition. The main purpose of this track is that existing datasets usually fix the label space and use majority voting to enhance the annotator consistency. However, this process may lead to inaccurate annotations, such as ignoring non-majority or non-candidate labels. In this track, we encourage participants to generate any number of labels in any category, aiming to describe emotional states as accurately as possible. Our baseline code relies on MERTools and is available at: https://github.com/zeroQiaoba/MERTools/tree/master/MER2024.

7/19/2024