Cross-Task Multi-Branch Vision Transformer for Facial Expression and Mask Wearing Classification

Read original: arXiv:2404.14606 - Published 5/1/2024 by Armando Zhu, Keqin Li, Tong Wu, Peng Zhao, Bo Hong

Overview

Researchers propose a unified multi-branch vision transformer for facial expression recognition and mask wearing classification tasks.
The approach extracts shared features for both tasks using a dual-branch architecture to obtain multi-scale feature representations.
The researchers introduce a cross-task fusion phase that processes tokens for each task with separate branches, while exchanging information using a cross attention module.
The proposed framework reduces overall complexity compared with using a separate network for each task.
Experiments show the model performs better than or on par with state-of-the-art methods on both tasks.

Plain English Explanation

As wearing masks has become more common, recognizing facial expressions while taking masks into account has become a significant challenge. The researchers have developed a new approach to address this problem.

Their method uses a unified multi-branch vision transformer that can perform two tasks at once: recognizing facial expressions and detecting whether a person is wearing a mask.

The key idea is to have the model extract shared features that are useful for both tasks. It does this using a dual-branch architecture that can capture information at different scales. Additionally, the researchers introduced a "cross-task fusion" step, where the model processes the information for each task separately, but also exchanges relevant details between the two tasks using a special attention mechanism.

This unified approach is more efficient than using a separate model for each task. The experiments also show that it performs as well as or better than other state-of-the-art methods on both facial expression recognition and mask wearing classification.

Technical Explanation

The researchers propose a unified multi-branch vision transformer for jointly tackling facial expression recognition and mask wearing classification. Their approach extracts shared features for both tasks using a dual-branch architecture that obtains multi-scale feature representations.
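To make the multi-scale idea concrete, here is a minimal NumPy sketch of how one image could be tokenized at two patch sizes, one per branch. This summary does not state the paper's actual configuration, so the patch sizes (8 and 16), the embedding widths (32 and 64), the 64x64 input, and the helper names `patchify` and `embed` are all assumptions chosen for illustration, and the projection weights are random rather than learned.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

def embed(patches, dim, rng):
    """Project flattened patches to dim-sized tokens (random, untrained weights)."""
    scale = np.sqrt(patches.shape[1])
    weights = rng.standard_normal((patches.shape[1], dim)) / scale
    return patches @ weights

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))  # stand-in for a face crop

# Assumed branch settings: small patches for fine detail, large for context.
fine_tokens = embed(patchify(image, 8), dim=32, rng=rng)
coarse_tokens = embed(patchify(image, 16), dim=64, rng=rng)

print(fine_tokens.shape, coarse_tokens.shape)  # (64, 32) (16, 64)
```

The fine branch yields many short-range tokens and the coarse branch yields fewer tokens that each cover a larger region, which is one common way to obtain features at more than one scale.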

Furthermore, the researchers introduce a cross-task fusion phase that processes tokens for each task with separate branches, while exchanging information using a cross attention module. This simple yet effective fusion mechanism reduces the overall complexity compared to using separate networks for both tasks.
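The exchange between task branches can be sketched with plain single-head cross attention, where the tokens of one task act as queries and the tokens of the other task supply keys and values. This is an illustrative sketch only: the token counts, the width of 16, the single head, the residual update, and the names `cross_attention`, `expr_tokens`, and `mask_tokens` are assumptions, and the summary does not say which tokens the paper actually exchanges.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, wq, wk, wv):
    """Let query_tokens attend to context_tokens (single head)."""
    q = query_tokens @ wq
    k = context_tokens @ wk
    v = context_tokens @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
dim = 16  # assumed token width

def init_projections():
    """Random query/key/value matrices standing in for learned weights."""
    return [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)]

expr_tokens = rng.standard_normal((5, dim))  # expression-branch tokens
mask_tokens = rng.standard_normal((7, dim))  # mask-branch tokens

# Each branch keeps its own tokens and adds what it reads from the other.
expr_update, expr_weights = cross_attention(expr_tokens, mask_tokens, *init_projections())
mask_update, mask_weights = cross_attention(mask_tokens, expr_tokens, *init_projections())
expr_fused = expr_tokens + expr_update
mask_fused = mask_tokens + mask_update

print(expr_fused.shape, mask_fused.shape)  # (5, 16) (7, 16)
```

Because each branch keeps its own token sequence and only adds an attention-weighted summary of the other, the two tasks stay separate while still sharing information, which matches the description of the fusion phase above.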

Extensive experiments demonstrate that the proposed model performs better than or on par with different state-of-the-art methods on both facial expression recognition and facial mask wearing classification tasks.

Critical Analysis

The paper presents a novel and effective approach to jointly tackle facial expression recognition and mask wearing classification. The cross-task fusion mechanism is a clever way to share relevant information between the two tasks without significantly increasing the model complexity.

However, the paper says little about the limitations of the proposed framework. For example, it would be useful to know how the model performs on more diverse and challenging real-world datasets, beyond the commonly used benchmarks.

Additionally, the researchers could explore whether the shared features learned by the model can be effectively transferred to other related tasks, such as multi-feature reconstruction for masked faces. This could further demonstrate the versatility and wider applicability of the unified transformer architecture.

Overall, the research makes a valuable contribution to the field of facial analysis in the context of mask-wearing. But as with any research, there are opportunities for continued exploration and improvement.

Conclusion

The researchers have developed a unified multi-branch vision transformer that can effectively perform facial expression recognition and mask wearing classification simultaneously. By extracting shared features and using a cross-task fusion mechanism, their approach is more efficient and effective than using separate models.

This work highlights the importance of adapting computer vision techniques to settings where mask-wearing is widespread. The proposed framework could have applications in areas like human-computer interaction, security, and healthcare, where accurate facial analysis is crucial. Wherever mask-wearing remains common, solutions like this can help keep computer vision systems robust and inclusive.

Related Papers

Cross-Task Multi-Branch Vision Transformer for Facial Expression and Mask Wearing Classification

Armando Zhu, Keqin Li, Tong Wu, Peng Zhao, Bo Hong

With wearing masks becoming a new cultural norm, facial expression recognition (FER) while taking masks into account has become a significant challenge. In this paper, we propose a unified multi-branch vision transformer for facial expression recognition and mask wearing classification tasks. Our approach extracts shared features for both tasks using a dual-branch architecture that obtains multi-scale feature representations. Furthermore, we propose a cross-task fusion phase that processes tokens for each task with separate branches, while exchanging information using a cross attention module. Our proposed framework reduces the overall complexity compared with using separate networks for both tasks by the simple yet effective cross-task fusion phase. Extensive experiments demonstrate that our proposed model performs better than or on par with different state-of-the-art methods on both facial expression recognition and facial mask wearing classification tasks.

5/1/2024

Emotic Masked Autoencoder with Attention Fusion for Facial Expression Recognition

Bach Nguyen-Xuan, Thien Nguyen-Hoang, Thanh-Huy Nguyen, Nhu Tai-Do

Facial Expression Recognition (FER) is a critical task within computer vision with diverse applications across various domains. Addressing the challenge of limited FER datasets, which hampers the generalization capability of expression recognition models, is imperative for enhancing performance. Our paper presents an innovative approach integrating the MAE-Face self-supervised learning (SSL) method and multi-view Fusion Attention mechanism for expression classification, particularly showcased in the 6th Affective Behavior Analysis in-the-wild (ABAW) competition. By utilizing low-level feature information from the ipsilateral view (auxiliary view) before learning the high-level feature that emphasizes the shift in the human facial expression, our work seeks to provide a straightforward yet innovative way to improve the examined view (main view). We also suggest easy-to-implement and no-training frameworks aimed at highlighting key facial features to determine if such features can serve as guides for the model, focusing on pivotal local elements. The efficacy of this method is validated by improvements in model performance on the Aff-wild2 dataset, as observed in both training and validation contexts.

5/14/2024

Task-adaptive Q-Face

Haomiao Sun, Mingjie He, Shiguang Shan, Hu Han, Xilin Chen

Although face analysis has achieved remarkable improvements in the past few years, designing a multi-task face analysis model is still challenging. Most face analysis tasks are studied as separate problems and do not benefit from the synergy among related tasks. In this work, we propose a novel task-adaptive multi-task face analysis method named as Q-Face, which simultaneously performs multiple face analysis tasks with a unified model. We fuse the features from multiple layers of a large-scale pre-trained model so that the whole model can use both local and global facial information to support multiple tasks. Furthermore, we design a task-adaptive module that performs cross-attention between a set of query vectors and the fused multi-stage features and finally adaptively extracts desired features for each face analysis task. Extensive experiments show that our method can perform multiple tasks simultaneously and achieves state-of-the-art performance on face expression recognition, action unit detection, face attribute analysis, age estimation, and face pose estimation. Compared to conventional methods, our method opens up new possibilities for multi-task face analysis and shows the potential for both accuracy and efficiency.

5/16/2024

Facial Affect Recognition based on Multi Architecture Encoder and Feature Fusion for the ABAW7 Challenge

Kang Shen, Xuxiong Liu, Boyan Wang, Jun Yao, Xin Liu, Yujie Guan, Yu Wang, Gengchen Li, Xiao Sun

In this paper, we present our approach to addressing the challenges of the 7th ABAW competition. The competition comprises three sub-challenges: Valence Arousal (VA) estimation, Expression (Expr) classification, and Action Unit (AU) detection. To tackle these challenges, we employ state-of-the-art models to extract powerful visual features. Subsequently, a Transformer Encoder is utilized to integrate these features for the VA, Expr, and AU sub-challenges. To mitigate the impact of varying feature dimensions, we introduce an affine module to align the features to a common dimension. Overall, our results significantly outperform the baselines.

7/29/2024