Audio-Visual Compound Expression Recognition Method based on Late Modality Fusion and Rule-based Decision

2403.12687

Published 4/1/2024 by Elena Ryumina, Maxim Markitantov, Dmitry Ryumin, Heysem Kaya, Alexey Karpov

Abstract

This paper presents the results of the SUN team for the Compound Expressions Recognition Challenge of the 6th ABAW Competition. We propose a novel audio-visual method for compound expression recognition. Our method relies on emotion recognition models that fuse modalities at the emotion probability level, while decisions regarding the prediction of compound expressions are based on predefined rules. Notably, our method does not use any training data specific to the target task. Thus, the problem is treated as a zero-shot classification task. The method is evaluated in multi-corpus training and cross-corpus validation setups. The proposed method achieves an F1-score of 22.01% on the C-EXPR-DB test subset. Our findings from the challenge demonstrate that the proposed method can potentially form a basis for developing intelligent tools for annotating audio-visual data in the context of basic and compound human emotions.

Introduction

The paper presents a novel method for audio-visual Compound Expression Recognition (CER) as part of the 6th ABAW CE Recognition Challenge. CER is a crucial task in affective computing and intelligent human-computer interaction, involving the automated identification of complex emotional states that combine two or more basic emotions.

Existing methods for CER have focused primarily on the visual modality, using deep learning models and facial action units. However, the authors note the need for relevant training data comprising balanced samples for each class, collected under uncontrolled conditions and containing multimodal data.

To address the challenge of insufficient publicly available CER data, the proposed method does not utilize the CER challenge data for training. Instead, it includes models trained for basic emotion recognition and applies a rule-based decision-making process to determine the predicted CEs.

The key contributions of the paper are:

A novel audio-visual CER method based on basic emotion recognition and multimodal fusion.
A method for audio-visual emotion recognition developed with multi-corpus training and cross-corpus validation.
A rule-based decision-making method for CER that identifies the modality responsible for specific CEs.
Baseline performance measures for the recognition of seven basic emotions on the Validation subsets of the AffWild2 and AFEW corpora.

Figure 1: Pipeline of the proposed audio-visual CER method. PD refers to probability distribution.

Proposed Method

This paper proposes an audio-visual method for recognizing compound emotions (CEs). The key components are:

Video models: A static visual model based on ResNet50 and a dynamic visual model based on LSTM are used to detect basic emotions in video frames. Data augmentation techniques are applied to improve generalizability.

Audio model: A sequence-to-one acoustic model based on pre-trained Wav2vec2 is used to recognize emotions from audio. Voice activity detection is performed using audio and video cues.

Modality fusion: A hierarchical probability weighting approach is used to combine the outputs of the video and audio models. This enhances performance on both basic emotion and CE recognition.
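
To make the idea of fusing at the emotion probability level concrete, here is a minimal sketch of a weighted sum over per-model probability distributions. It does not reproduce the hierarchical weighting scheme, whose details this summary does not give. The function name `fuse_probabilities`, the emotion ordering, the model names, and all numeric values are assumptions made for illustration only.

```python
from typing import Dict, List

# Six basic emotions plus the neutral state, as described in the summary.
# The ordering is an assumption made for this example.
EMOTIONS = ["Neutral", "Anger", "Disgust", "Fear", "Happiness", "Sadness", "Surprise"]


def fuse_probabilities(
    model_probs: Dict[str, List[float]],
    model_weights: Dict[str, List[float]],
) -> List[float]:
    """Return the weighted sum of per-model class probabilities, renormalized to sum to 1."""
    fused = [0.0] * len(EMOTIONS)
    for name, probs in model_probs.items():
        weights = model_weights[name]
        if len(probs) != len(EMOTIONS) or len(weights) != len(EMOTIONS):
            raise ValueError(f"{name}: expected {len(EMOTIONS)} values per vector")
        for i, (w, p) in enumerate(zip(weights, probs)):
            fused[i] += w * p
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused scores sum to zero; check the weights")
    return [score / total for score in fused]


# Hypothetical outputs of the three models for one video segment.
probs = {
    "static": [0.10, 0.10, 0.05, 0.05, 0.50, 0.05, 0.15],
    "dynamic": [0.05, 0.05, 0.05, 0.05, 0.40, 0.05, 0.35],
    "audio": [0.30, 0.20, 0.05, 0.05, 0.20, 0.10, 0.10],
}
# Hypothetical weights: one weight per model and per class (uniform within each model here).
weights = {
    "static": [0.4] * len(EMOTIONS),
    "dynamic": [0.4] * len(EMOTIONS),
    "audio": [0.2] * len(EMOTIONS),
}

fused = fuse_probabilities(probs, weights)
print(EMOTIONS[fused.index(max(fused))])
```

Keeping one weight per model and per class, rather than one scalar per model, leaves room for a model to matter more for some emotions than for others, which is in line with the weight analysis reported in the experiments.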

Rule-based decision-making: Two rules are applied to the fused probabilities - one to mask low-probability emotions, and one to weight the importance of less frequent emotions in compound emotions.
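
The two rules can be illustrated with a short sketch. This is one possible reading of the description above, not the authors' exact rules: the threshold value, the `rarity_weights` parameter, the mapping from compound expressions to pairs of basic emotions, and the choice to score a compound expression as the sum of its two adjusted component probabilities are all assumptions.

```python
from typing import Dict, Optional, Tuple

# Assumed mapping from the seven compound expressions to their basic-emotion pairs.
COMPOUND_EXPRESSIONS = {
    "Fearfully Surprised": ("Fear", "Surprise"),
    "Happily Surprised": ("Happiness", "Surprise"),
    "Sadly Surprised": ("Sadness", "Surprise"),
    "Disgustedly Surprised": ("Disgust", "Surprise"),
    "Angrily Surprised": ("Anger", "Surprise"),
    "Sadly Fearful": ("Sadness", "Fear"),
    "Sadly Angry": ("Sadness", "Anger"),
}


def predict_compound(
    fused: Dict[str, float],
    mask_threshold: float = 0.1,
    rarity_weights: Optional[Dict[str, float]] = None,
) -> Tuple[str, Dict[str, float]]:
    """Apply two hypothetical rules to fused probabilities and pick a compound expression."""
    rarity_weights = rarity_weights or {}
    adjusted = {}
    for emotion, prob in fused.items():
        # Rule 1 (assumed form): zero out emotions below the threshold.
        masked = prob if prob >= mask_threshold else 0.0
        # Rule 2 (assumed form): scale up emotions treated as less frequent.
        adjusted[emotion] = masked * rarity_weights.get(emotion, 1.0)
    # Score each compound expression by the sum of its two components.
    scores = {
        name: adjusted.get(first, 0.0) + adjusted.get(second, 0.0)
        for name, (first, second) in COMPOUND_EXPRESSIONS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical fused probabilities for one segment.
example = {
    "Neutral": 0.11, "Anger": 0.11, "Disgust": 0.05, "Fear": 0.05,
    "Happiness": 0.40, "Sadness": 0.06, "Surprise": 0.22,
}
label, scores = predict_compound(example, rarity_weights={"Sadness": 2.0, "Fear": 2.0})
print(label)
```

Because the decision is made purely from basic-emotion probabilities, no compound-expression labels are needed at any point, which is what makes the zero-shot setup possible.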

The proposed method is evaluated on multiple datasets for basic emotion and CE recognition, demonstrating improved performance compared to using individual modalities.

Experiments

This section summarizes the research corpora and experimental results for the audio-visual compound expression recognition (CER) method.

The researchers used several corpora to train, validate, and test their emotion recognition models:

The AffectNet corpus was used to train a static video model.
Dynamic visual models were trained on the RAMAS, RAVDESS, CREMA-D, IEMOCAP, and SAVEE corpora.
Audio models were trained on the AffWild2 and MELD corpora.
Validation and optimization were done on the AffWild2 and AFEW validation sets.
Testing was performed on the non-annotated C-EXPR-DB corpus.

The experimental results showed that the dynamic visual model outperformed the static model on the AffWild2 corpus, while the reverse was true for the AFEW corpus. The hierarchical weighting fusion of the visual models outperformed Dirichlet-based weighting.

The audio models had lower performance than the visual models, so they were not tested alone on C-EXPR-DB. Fusion of audio and video models using hierarchical weighting was found to be more effective than Dirichlet-based weighting.
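
As an illustration of what a Dirichlet-based weighting search can look like, the sketch below samples candidate model weights from a Dirichlet distribution and keeps the set that scores best on validation data. It is a generic random search under stated assumptions, not the authors' procedure: the helper names, the number of trials, the symmetric concentration parameter of 1.0, one scalar weight per model, and the use of accuracy as a stand-in for the F1-score are all choices made for this example.

```python
import random
from typing import List, Sequence, Tuple


def sample_dirichlet(alpha: Sequence[float], rng: random.Random) -> List[float]:
    """Draw one Dirichlet sample by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def fuse(per_model_probs: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted sum over models for a single sample (one scalar weight per model)."""
    n_classes = len(per_model_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, per_model_probs))
        for c in range(n_classes)
    ]


def accuracy(val_probs, labels: Sequence[int], weights: Sequence[float]) -> float:
    """Fraction of validation samples whose fused argmax matches the label."""
    correct = 0
    for sample_probs, label in zip(val_probs, labels):
        fused = fuse(sample_probs, weights)
        if fused.index(max(fused)) == label:
            correct += 1
    return correct / len(labels)


def search_weights(
    val_probs, labels: Sequence[int], n_models: int, n_trials: int = 200, seed: int = 0
) -> Tuple[List[float], float]:
    """Random search over Dirichlet-sampled weights; returns (best_weights, best_score)."""
    rng = random.Random(seed)
    best_weights: List[float] = [1.0 / n_models] * n_models
    best_score = accuracy(val_probs, labels, best_weights)
    for _ in range(n_trials):
        candidate = sample_dirichlet([1.0] * n_models, rng)
        score = accuracy(val_probs, labels, candidate)
        if score > best_score:
            best_weights, best_score = candidate, score
    return best_weights, best_score


# Toy validation set: val_probs[sample][model][class]; model 0 is reliable, model 1 is not.
val_probs = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.1, 0.9], [0.9, 0.1]],
]
labels = [0, 1]
best_weights, best_score = search_weights(val_probs, labels, n_models=2)
print(best_weights, best_score)
```

Dirichlet samples are non-negative and sum to one, so every candidate is a valid convex combination of the models, which is why this distribution is a natural choice for sampling fusion weights.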

The analysis of the fusion model weights indicated that the method relied more on the dynamic visual model for certain compound emotions and the static visual model for others. The acoustic model had a smaller contribution, and hierarchical weighting reduced its influence.

An example CER result on the C-EXPR-DB corpus was presented, showing the strengths of the audio-visual model in correctly predicting a range of compound emotions.

Conclusions

The paper proposes a new audio-visual method for compound expression recognition (CER). The method integrates three models: static and dynamic visual models, as well as an audio model. Each model predicts the probabilities for six basic emotions and the neutral state. These emotional probabilities are then weighted using the Dirichlet distribution. Finally, two rules are applied to determine the compound emotion.

The paper also provides new baselines for recognizing seven emotions on the validation subsets of the AffWild2 and AFEW corpora. The experimental results show that each model is responsible for predicting specific compound emotions. The audio model predicts Angrily Surprised and Sadly Angry, the static visual model predicts Happily Surprised, and the dynamic visual model performs well on the remaining compound emotions. The proposed method has the potential to lead to intelligent software tools for faster annotation of data containing both basic and compound emotional expressions.
