Enhancing Facial Expression Recognition through Dual-Direction Attention Mixed Feature Networks: Application to 7th ABAW Challenge

Read original: arXiv:2407.12390 - Published 9/6/2024 by Josep Cabacas-Maso, Elena Ortega-Beltrán, Ismael Benito-Altamirano, Carles Ventura

Overview

This paper applies the Dual-Direction Attention Mixed Feature Network (DDAMFN), which the authors describe as a well-known architecture, to multitask facial expression recognition.
The DDAMFN model combines a dual-direction attention mechanism with a mixed feature fusion strategy, and the authors report results far beyond the proposed baseline of the 7th ABAW Challenge.

Plain English Explanation

The researchers build on a deep learning model called the Dual-Direction Attention Mixed Feature Network (DDAMFN) to improve the recognition of facial expressions. Facial expression recognition is the task of identifying a person's emotional state from their facial features. This is an important capability for applications like human-computer interaction, video analysis, and mental health monitoring.

The key innovations in the DDAMFN model are:

Dual-Direction Attention: The model uses an attention mechanism that looks at the input image from two different perspectives - one that focuses on the global context, and another that focuses on local facial features. This allows the model to better understand the relationship between the overall facial expression and the specific details that make up that expression.
Mixed Feature Fusion: The model combines different types of facial features, such as appearance-based features and geometry-based features, into a single representation. This helps the model capture a more comprehensive understanding of the facial expression.

The researchers evaluated the DDAMFN model on the 7th Affective Behavior Analysis in-the-Wild (ABAW) Challenge, a benchmark for facial expression recognition. The reported results are far beyond the baseline proposed for the challenge's multi-task track, which supports the use of the dual-direction attention and mixed feature fusion strategies.

Technical Explanation

The researchers propose a Dual-Direction Attention Mixed Feature Network (DDAMFN) to enhance facial expression recognition. The DDAMFN model consists of several key components:

Backbone Network: The model uses a pre-trained vision transformer [<a href="https://aimodels.fyi/papers/arxiv/cross-task-multi-branch-vision-transformer-facial">1</a>] as the backbone network to extract visual features from the input facial images.
Dual-Direction Attention Module: This module employs a dual-direction attention mechanism, with one attention branch focusing on the global context of the facial expression and the other branch focusing on the local facial features. The outputs of the two attention branches are then fused to obtain a refined feature representation.
Mixed Feature Fusion: The model combines different types of facial features, including appearance-based features and geometry-based features, to create a more comprehensive representation of the facial expression. This is achieved through a feature fusion module that concatenates and processes the different feature types.
Prediction Head: The final layer of the DDAMFN model is a prediction head that maps the fused feature representation to the target emotion classes, such as the seven basic emotions (anger, disgust, fear, happiness, sadness, surprise, and neutral).
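The data flow through these components can be illustrated with a minimal sketch. It follows the description in this summary, not the authors' released code: the random patch vectors stand in for backbone features, and every function name, dimension, and weight below (`global_branch`, `local_branch`, `predict`, `D`, `N`) is a hypothetical choice made only for illustration. The sketch uses the Python standard library so that it runs without a deep learning framework.

```python
import math
import random

EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "neutral"]

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def global_branch(patches):
    # Global context: average all patch vectors, then gate each channel.
    n, d = len(patches), len(patches[0])
    context = [sum(p[j] for p in patches) / n for j in range(d)]
    return [sigmoid(c) * c for c in context]

def local_branch(patches, query):
    # Local focus: softmax attention over patches, then a weighted sum.
    scores = [sum(q * x for q, x in zip(query, p)) for p in patches]
    weights = softmax(scores)
    d = len(patches[0])
    attended = [sum(w * p[j] for w, p in zip(weights, patches))
                for j in range(d)]
    return attended, weights

def linear(x, weight, bias):
    # Plain fully connected layer: one output per weight row.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def predict(patches, query, weight, bias):
    glob = global_branch(patches)
    loc, attn = local_branch(patches, query)
    fused = glob + loc  # fusion by concatenating the two branch outputs
    probs = softmax(linear(fused, weight, bias))
    return probs, attn

# Assumed toy sizes: N patches, each a D-dimensional feature vector.
rng = random.Random(0)
D, N = 4, 6
patches = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)]
query = [rng.uniform(-1, 1) for _ in range(D)]
weight = [[rng.uniform(-0.5, 0.5) for _ in range(2 * D)] for _ in EMOTIONS]
bias = [0.0] * len(EMOTIONS)

probs, attn = predict(patches, query, weight, bias)
best = EMOTIONS[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

With untrained random weights the predicted label is meaningless; the point is only the shape of the computation: two attention branches over the same features, a concatenation step, and a classification head that outputs one probability per emotion class.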

The researchers evaluated the DDAMFN model on the 7th ABAW Challenge, where the model predicts valence-arousal, emotion classes, and facial action units simultaneously. The reported results are far beyond the challenge baseline, and the authors also compare their multitask solution with independent single-task performance.

Critical Analysis

One potential limitation of the DDAMFN model is that it relies on a pre-trained vision transformer as the backbone network. While this approach can leverage the strong feature extraction capabilities of the transformer, it also means that the model's performance is dependent on the quality and coverage of the pre-training dataset. In some cases, the pre-trained features may not fully capture the nuances of facial expressions, especially for more subtle or complex emotional states.

Additionally, the paper does not provide a detailed analysis of the relative contributions of the dual-direction attention and mixed feature fusion components to the overall performance improvement. It would be interesting to see how each of these innovations impacts the model's accuracy, robustness, and generalization capabilities individually.

Finally, the researchers could have explored the transferability of the DDAMFN model to other facial expression recognition datasets or tasks, such as emotion recognition in the wild or compound emotion recognition. This would help to better understand the broader applicability and limitations of the proposed approach.

Conclusion

The Dual-Direction Attention Mixed Feature Network (DDAMFN) used in this paper combines a dual-direction attention mechanism with a mixed feature fusion strategy, and the authors report results far beyond the baseline of the 7th ABAW Challenge. This research contributes to ongoing efforts to develop more accurate and robust facial expression recognition systems, which have important applications in fields such as human-computer interaction, mental health monitoring, and video analysis.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Enhancing Facial Expression Recognition through Dual-Direction Attention Mixed Feature Networks: Application to 7th ABAW Challenge

Josep Cabacas-Maso, Elena Ortega-Beltrán, Ismael Benito-Altamirano, Carles Ventura

We present our contribution to the 7th ABAW challenge at ECCV 2024. By utilizing a Dual-Direction Attention Mixed Feature Network (DDAMFN) for multitask facial expression recognition, we achieve results far beyond the proposed baseline for the Multi-Task ABAW challenge. Our proposal uses the well-known DDAMFN architecture as a base to effectively predict valence-arousal, emotion recognition, and facial action units. We demonstrate the architecture's ability to handle these tasks simultaneously, providing insights into its architecture and the rationale behind its design. Additionally, we compare our results for a multitask solution with independent single-task performance.

9/6/2024

HSEmotion Team at the 7th ABAW Challenge: Multi-Task Learning and Compound Facial Expression Recognition

Andrey V. Savchenko

In this paper, we describe the results of the HSEmotion team in two tasks of the seventh Affective Behavior Analysis in-the-wild (ABAW) competition, namely, multi-task learning for simultaneous prediction of facial expression, valence, arousal, and detection of action units, and compound expression recognition. We propose an efficient pipeline based on frame-level facial feature extractors pre-trained in multi-task settings to estimate valence-arousal and basic facial expressions given a facial photo. We ensure the privacy-awareness of our techniques by using the lightweight architectures of neural networks, such as MT-EmotiDDAMFN, MT-EmotiEffNet, and MT-EmotiMobileFaceNet, that can run even on a mobile device without the need to send facial video to a remote server. It was demonstrated that a significant step in improving the overall accuracy is the smoothing of neural network output scores using Gaussian or box filters. It was experimentally demonstrated that such a simple post-processing of predictions from simple blending of two top visual models improves the F1-score of facial expression recognition up to 7%. At the same time, the mean Concordance Correlation Coefficient (CCC) of valence and arousal is increased by up to 1.25 times compared to each model's frame-level predictions. As a result, our final performance score on the validation set from the multi-task learning challenge is 4.5 times higher than the baseline (1.494 vs 0.32).

7/19/2024

Facial Affect Recognition based on Multi Architecture Encoder and Feature Fusion for the ABAW7 Challenge

Kang Shen, Xuxiong Liu, Boyan Wang, Jun Yao, Xin Liu, Yujie Guan, Yu Wang, Gengchen Li, Xiao Sun

In this paper, we present our approach to addressing the challenges of the 7th ABAW competition. The competition comprises three sub-challenges: Valence Arousal (VA) estimation, Expression (Expr) classification, and Action Unit (AU) detection. To tackle these challenges, we employ state-of-the-art models to extract powerful visual features. Subsequently, a Transformer Encoder is utilized to integrate these features for the VA, Expr, and AU sub-challenges. To mitigate the impact of varying feature dimensions, we introduce an affine module to align the features to a common dimension. Overall, our results significantly outperform the baselines.

7/29/2024

Emotic Masked Autoencoder with Attention Fusion for Facial Expression Recognition

Bach Nguyen-Xuan, Thien Nguyen-Hoang, Thanh-Huy Nguyen, Nhu Tai-Do

Facial Expression Recognition (FER) is a critical task within computer vision with diverse applications across various domains. Addressing the challenge of limited FER datasets, which hampers the generalization capability of expression recognition models, is imperative for enhancing performance. Our paper presents an innovative approach integrating the MAE-Face self-supervised learning (SSL) method and multi-view Fusion Attention mechanism for expression classification, particularly showcased in the 6th Affective Behavior Analysis in-the-wild (ABAW) competition. By utilizing low-level feature information from the ipsilateral view (auxiliary view) before learning the high-level feature that emphasizes the shift in the human facial expression, our work seeks to provide a straightforward yet innovative way to improve the examined view (main view). We also suggest easy-to-implement and no-training frameworks aimed at highlighting key facial features to determine if such features can serve as guides for the model, focusing on pivotal local elements. The efficacy of this method is validated by improvements in model performance on the Aff-wild2 dataset, as observed in both training and validation contexts.

5/14/2024