Learning Contrastive Feature Representations for Facial Action Unit Detection

Read original: arXiv:2402.06165 - Published 7/15/2024 by Ziqiao Shang, Bin Liu, Fengmao Lv, Fei Teng, Tianrui Li

Overview

The paper proposes a contrastive learning framework for facial action unit (AU) detection, the task of recognizing the individual facial muscle movements that make up expressions.
The key contributions are a positive sample sampling strategy, which counters noisy and false AU labels, and a negative sample re-weighting strategy, which counters class imbalance.
The authors report better performance than state-of-the-art methods on four benchmark datasets: BP4D, DISFA, GFT and Aff-Wild2.

Plain English Explanation

The paper presents a new way to train AI models to recognize different facial expressions by looking at specific muscle movements in the face, known as facial action units. Traditional methods for this task can struggle to capture the nuanced differences between similar expressions.

The researchers introduce a "contrastive learning" approach, which tries to teach the model to identify the unique features that distinguish one facial expression from another. This involves showing the model many examples of the same expression, as well as examples of different expressions, and having it learn what makes them distinct.

To make this contrastive learning process more effective, the paper proposes two key innovations. First, it describes a method for intelligently selecting the "positive" examples - the ones that are most helpful for the model to learn from. Second, it presents a way to adjust the relative importance of different training examples, putting more focus on the ones that are most informative.

By incorporating these techniques, the model is able to learn more powerful and discriminative features for facial action unit detection, leading to better performance on standard benchmarks compared to prior state-of-the-art methods. This could enable more accurate and robust facial expression recognition in applications like human-robot interaction, emotion analysis, and assistive technologies.

Technical Explanation

The paper proposes a novel contrastive learning framework for facial action unit detection. Contrastive learning aims to learn feature representations that can effectively distinguish between different classes by maximizing the similarity between samples of the same class and minimizing the similarity between samples of different classes.
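As a rough illustration of this objective, the sketch below computes an InfoNCE-style loss for a single anchor embedding using NumPy. InfoNCE is a common contrastive loss used here as a stand-in; this summary does not confirm the exact loss the paper uses. The function name `info_nce_loss`, its arguments, and the temperature value are assumptions made for the example.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (illustrative, not the paper's code).

    anchor, positive: 1-D arrays of shape (d,)
    negatives: 2-D array of shape (n, d)
    """
    def normalize(x):
        # Scale to unit length so dot products are cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positive), normalize(negatives)
    pos_logit = (a @ p) / temperature      # similarity to the positive
    neg_logits = (n @ a) / temperature     # similarities to the negatives
    logits = np.concatenate(([pos_logit], neg_logits))

    # Numerically stable log-sum-exp over the positive and all negatives.
    m = logits.max()
    log_denominator = m + np.log(np.exp(logits - m).sum())
    return float(log_denominator - pos_logit)

rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
negatives = rng.normal(size=(5, 8))
loss_close = info_nce_loss(anchor, anchor, negatives)   # positive identical to anchor
loss_far = info_nce_loss(anchor, -anchor, negatives)    # positive opposite to anchor
print(loss_close < loss_far)
```

The loss falls as the positive moves closer to the anchor and as the negatives move further away, which is the behavior the paragraph above describes.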

The key technical contributions of the paper are:

Positive Sample Sampling: The researchers introduce a method for selecting which positive examples to use during contrastive learning. According to the abstract, the sampling covers three distinct types of positive sample pairs, which injects self-supervised signals into the supervised signal and mitigates the effect of noisy and false AU labels.
Importance Re-weighting Strategy: The paper also presents a technique to adjust the relative importance of different training examples during contrastive learning. The abstract describes this as a negative sample re-weighting strategy that adjusts the step size of parameter updates for minority and majority class samples, in order to tackle class imbalance.
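One generic way to write a re-weighted contrastive objective is shown below. This is a sketch under assumptions, not the paper's formula: the symbols $z_i$ (anchor embedding), $z_p$ (sampled positive), $N(i)$ (negative set), $w_j$ (per-negative weight), $\tau$ (temperature), and cosine similarity $\operatorname{sim}$ are all introduced here for illustration.

```latex
\mathcal{L}_i = -\log
\frac{\exp\big(\operatorname{sim}(z_i, z_p)/\tau\big)}
     {\exp\big(\operatorname{sim}(z_i, z_p)/\tau\big)
      + \sum_{j \in N(i)} w_j \,\exp\big(\operatorname{sim}(z_i, z_j)/\tau\big)}
```

Setting every $w_j = 1$ recovers the unweighted loss. Raising $w_j$ for a negative increases its share of the denominator and therefore the size of the gradient it contributes.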

The overall training pipeline consists of three steps:

Feature Extraction: A backbone neural network is used to extract visual features from the input facial images.
Positive Sample Sampling: A sampling strategy is employed to select the most useful positive samples for contrastive learning.
Contrastive Learning with Importance Re-weighting: The contrastive loss is computed, with the positive and negative samples weighted according to their estimated importance.
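The three steps above can be sketched end to end on toy data. Everything in this example is an assumption made for illustration: the "backbone" is a random linear projection, the positive sampler chooses between an augmented view of the anchor and a same-label sample, and the importance weights are a hypothetical hardness score (similarity to the anchor). None of these are confirmed as the paper's actual components.

```python
import numpy as np

def extract_features(images, projection):
    """Step 1 (stand-in backbone): flatten, project, and L2-normalize."""
    flat = images.reshape(len(images), -1)
    feats = flat @ projection
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def sample_positive(index, labels, rng, p_self=0.5):
    """Step 2: return the anchor itself (to be augmented) or a same-label sample."""
    others = np.arange(len(labels)) != index
    same = np.flatnonzero((labels == labels[index]) & others)
    if len(same) == 0 or rng.random() < p_self:
        return index
    return int(rng.choice(same))

def importance_weights(anchor, negatives, beta=1.0):
    """Hypothetical importance: negatives more similar to the anchor weigh more."""
    scores = np.exp(beta * (negatives @ anchor))
    return scores / scores.mean()  # normalize so the mean weight is 1

def weighted_contrastive_loss(anchor, positive, negatives, weights, temperature=0.1):
    """Step 3: contrastive loss with a weight on each negative term."""
    pos = np.exp(anchor @ positive / temperature)
    neg = weights * np.exp(negatives @ anchor / temperature)
    return float(-np.log(pos / (pos + neg.sum())))

rng = np.random.default_rng(0)
images = rng.normal(size=(6, 4, 4))        # six toy 4x4 "face crops"
labels = np.array([1, 1, 0, 0, 0, 1])      # one AU: active (1) or inactive (0)
projection = rng.normal(size=(16, 8))      # random weights for the toy backbone

feats = extract_features(images, projection)
anchor_idx = 0
pos_idx = sample_positive(anchor_idx, labels, rng)
positive = feats[pos_idx]
if pos_idx == anchor_idx:
    # Self-supervised pair: perturb the anchor to mimic an augmented view.
    positive = positive + 0.05 * rng.normal(size=positive.shape)
    positive = positive / np.linalg.norm(positive)

negatives = feats[labels != labels[anchor_idx]]
weights = importance_weights(feats[anchor_idx], negatives)
loss = weighted_contrastive_loss(feats[anchor_idx], positive, negatives, weights)
print(f"positive index: {pos_idx}, loss: {loss:.4f}")
```

In a real system the projection would be a trained network and the loss would be averaged over a batch and backpropagated; the sketch only shows how the three stages hand data to one another.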

The authors evaluate their approach on four widely used facial action unit detection benchmarks: BP4D, DISFA, GFT and Aff-Wild2. They report that the method outperforms existing state-of-the-art techniques.

Critical Analysis

The paper presents a well-designed and thorough evaluation of the proposed contrastive learning approach for facial action unit detection. The positive sample sampling and importance re-weighting strategies appear to be effective innovations that contribute to the improved performance.

However, one potential limitation is that the method may be sensitive to the quality and diversity of the training data. If the dataset does not contain a representative sample of facial expressions, the contrastive learning process may struggle to learn robust and generalizable features. Aff-Wild2, one of the four benchmarks, is collected in the wild, but it would still be interesting to see how the method performs on further datasets with greater variability in lighting, occlusions, and subject demographics.

Additionally, the paper does not provide much insight into the types of errors the model makes or the specific facial features it learns to focus on. A more detailed analysis of the model's failures and the interpretability of its learned representations could help researchers better understand the strengths and limitations of the approach.

Finally, the paper does not address the computational efficiency of the proposed method, which is an important consideration for real-world deployment, especially in applications that require fast inference times, such as human-robot interaction or emotion analysis. Investigating the trade-offs between model accuracy and efficiency would be a valuable direction for future research.

Conclusion

The paper presents a novel contrastive learning framework for facial action unit detection that outperforms existing state-of-the-art methods. The key innovations, including positive sample sampling and importance re-weighting, demonstrate the potential of contrastive learning techniques to capture more discriminative features for this task.

The improved facial action unit detection performance could have significant implications for a wide range of applications, from human-robot interaction to emotion analysis and assistive technologies. By better recognizing subtle facial movements, the proposed approach could enable more natural and personalized interactions with technology, as well as more accurate detection and understanding of human emotional states.

However, further research is needed to address potential limitations such as dataset bias, model interpretability, and computational efficiency. Continued advancements in this area could pave the way for more robust and versatile facial expression recognition systems that benefit a wide range of users and applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Learning Contrastive Feature Representations for Facial Action Unit Detection

Ziqiao Shang, Bin Liu, Fengmao Lv, Fei Teng, Tianrui Li

Facial action unit (AU) detection has long encountered the challenge of detecting subtle feature differences when AUs activate. Existing methods often rely on encoding pixel-level information of AUs, which not only encodes additional redundant information but also leads to increased model complexity and limited generalizability. Additionally, the accuracy of AU detection is negatively impacted by the class imbalance issue of each AU type, and the presence of noisy and false AU labels. In this paper, we introduce a novel contrastive learning framework aimed for AU detection that incorporates both self-supervised and supervised signals, thereby enhancing the learning of discriminative features for accurate AU detection. To tackle the class imbalance issue, we employ a negative sample re-weighting strategy that adjusts the step size of updating parameters for minority and majority class samples. Moreover, to address the challenges posed by noisy and false AU labels, we employ a sampling technique that encompasses three distinct types of positive sample pairs. This enables us to inject self-supervised signals into the supervised signal, effectively mitigating the adverse effects of noisy labels. Our experimental assessments, conducted on four widely-utilized benchmark datasets (BP4D, DISFA, GFT and Aff-Wild2), underscore the superior performance of our approach compared to state-of-the-art methods of AU detection. Our code is available at https://github.com/Ziqiao-Shang/AUNCE.

7/15/2024

Towards Unified Facial Action Unit Recognition Framework by Large Language Models

Guohong Hu, Xing Lan, Hanyu Jiang, Jiayi Lyu, Jian Xue

Facial Action Units (AUs) are of great significance in the realm of affective computing. In this paper, we propose AU-LLaVA, the first unified AU recognition framework based on the Large Language Model (LLM). AU-LLaVA consists of a visual encoder, a linear projector layer, and a pre-trained LLM. We meticulously craft the text descriptions and fine-tune the model on various AU datasets, allowing it to generate different formats of AU recognition results for the same input image. On the BP4D and DISFA datasets, AU-LLaVA delivers the most accurate recognition results for nearly half of the AUs. Our model achieves improvements of F1-score up to 11.4% in specific AU recognition compared to previous benchmark results. On the FEAFA dataset, our method achieves significant improvements over all 24 AUs compared to previous benchmark results. AU-LLaVA demonstrates exceptional performance and versatility in AU recognition.

9/16/2024

AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors

Kaishen Yuan, Zitong Yu, Xin Liu, Weicheng Xie, Huanjing Yue, Jingyu Yang

Facial Action Units (AU) is a vital concept in the realm of affective computing, and AU detection has always been a hot research topic. Existing methods suffer from overfitting issues due to the utilization of a large number of learnable parameters on scarce AU-annotated datasets or heavy reliance on substantial additional relevant data. Parameter-Efficient Transfer Learning (PETL) provides a promising paradigm to address these challenges, whereas its existing methods lack design for AU characteristics. Therefore, we innovatively investigate PETL paradigm to AU detection, introducing AUFormer and proposing a novel Mixture-of-Knowledge Expert (MoKE) collaboration mechanism. An individual MoKE specific to a certain AU with minimal learnable parameters first integrates personalized multi-scale and correlation knowledge. Then the MoKE collaborates with other MoKEs in the expert group to obtain aggregated information and inject it into the frozen Vision Transformer (ViT) to achieve parameter-efficient AU detection. Additionally, we design a Margin-truncated Difficulty-aware Weighted Asymmetric Loss (MDWA-Loss), which can encourage the model to focus more on activated AUs, differentiate the difficulty of unactivated AUs, and discard potential mislabeled samples. Extensive experiments from various perspectives, including within-domain, cross-domain, data efficiency, and micro-expression domain, demonstrate AUFormer's state-of-the-art performance and robust generalization abilities without relying on additional relevant data. The code for AUFormer is available at https://github.com/yuankaishen2001/AUFormer.

7/10/2024

Representation Learning and Identity Adversarial Training for Facial Behavior Understanding

Mang Ning, Albert Ali Salah, Itir Onal Ertugrul

Facial Action Unit (AU) detection has gained significant research attention as AUs contain complex expression information. In this paper, we unpack two fundamental factors in AU detection: data and subject identity regularization, respectively. Motivated by recent advances in foundation models, we highlight the importance of data and collect a diverse dataset Face9M, comprising 9 million facial images, from multiple public resources. Pretraining a masked autoencoder on Face9M yields strong performance in AU detection and facial expression tasks. We then show that subject identity in AU datasets provides a shortcut learning for the model and leads to sub-optimal solutions to AU predictions. To tackle this generic issue of AU tasks, we propose Identity Adversarial Training (IAT) and demonstrate that a strong IAT regularization is necessary to learn identity-invariant features. Furthermore, we elucidate the design space of IAT and empirically show that IAT circumvents the identity shortcut learning and results in a better solution. Our proposed methods, Facial Masked Autoencoder (FMAE) and IAT, are simple, generic and effective. Remarkably, the proposed FMAE-IAT approach achieves new state-of-the-art F1 scores on BP4D (67.1%), BP4D+ (66.8%), and DISFA (70.1%) databases, significantly outperforming previous work. We release the code and model at https://github.com/forever208/FMAE-IAT, the first open-sourced facial model pretrained on 9 million diverse images.

7/17/2024