Causal Intervention for Subject-Deconfounded Facial Action Unit Recognition

Read original: arXiv:2204.07935 - Published 4/4/2024 by Yingjie Chen, Diqi Chen, Tao Wang, Yizhou Wang, Yun Liang


Overview

This paper proposes a causal inference framework to address the challenge of subject-invariant facial action unit (AU) recognition.
Facial AU recognition is difficult because the data distribution varies across different individuals.
The paper formulates the causal relationships among facial images, subjects, latent AU semantic relations, and estimated AU occurrence probabilities using a structural causal model.
The paper proposes a plug-in causal intervention module, CIS, to deconfound the confounder "Subject" in the causal diagram.
Experiments on two commonly used AU benchmark datasets, BP4D and DISFA, show the effectiveness of the CIS module and the state-of-the-art performance of the CISNet model.

Plain English Explanation

Facial action unit (AU) recognition is the task of identifying specific movements or expressions in a person's face, such as raising an eyebrow or pursing the lips. This is an important task for applications like emotion recognition and human-computer interaction. However, subject-invariant AU recognition (recognizing AUs regardless of an individual's facial features) remains a challenging problem because the data distribution can vary significantly between different people.

To address this, the researchers in this paper developed a causal inference framework. They created a structural causal model that depicts the relationships between the facial images, the individual subjects, the underlying semantic information about the AUs, and the estimated probabilities of the AUs occurring. By understanding these causal relationships, they were able to design a causal intervention module (CIS) that could remove the confounding effect of the "Subject" variable in the model.

Through experiments on two popular AU recognition datasets, the researchers showed that their CISNet model, which incorporates the CIS module, achieves state-of-the-art performance in subject-invariant AU recognition. This is an important advance that could improve the robustness and reliability of facial analysis systems in real-world applications.

Technical Explanation

The paper proposes a causal inference framework to address the challenge of subject-invariant facial action unit (AU) recognition. The authors first formulate the causal relationships among facial images, subjects, latent AU semantic relations, and estimated AU occurrence probabilities using a structural causal model. This causal diagram helps clarify the causal effect among these variables.
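To make the notion of a confounder concrete, the causal diagram can be written down as a small directed graph. The summary names the four variables but does not list the arrows between them, so the edge set below is an assumption chosen for illustration (with Subject influencing both the image and the latent AU relations). The names `EDGES`, `descendants`, and `confounders` are hypothetical and do not come from the paper.

```python
# Hypothetical edge set: the source names the four variables but not the arrows.
EDGES = {
    "Subject": ["Image", "AURelations"],
    "Image": ["AURelations"],
    "AURelations": ["AUProbabilities"],
    "AUProbabilities": [],
}


def descendants(graph, node):
    """Return every node reachable from `node` by following directed edges."""
    seen = set()
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


def confounders(graph, treatment, outcome):
    """Find nodes that cause the treatment and also reach the outcome
    along a path that does not pass through the treatment."""
    # Copy of the graph with the treatment node removed entirely.
    pruned = {
        node: [child for child in children if child != treatment]
        for node, children in graph.items()
        if node != treatment
    }
    found = []
    for node in graph:
        if node in (treatment, outcome):
            continue
        causes_treatment = treatment in descendants(graph, node)
        reaches_outcome_directly = outcome in descendants(pruned, node)
        if causes_treatment and reaches_outcome_directly:
            found.append(node)
    return sorted(found)


if __name__ == "__main__":
    print(confounders(EDGES, "Image", "AUProbabilities"))
```

Under this assumed edge set, Subject is the only variable that both shapes the image and affects the AU estimate by a route that bypasses the image, which is what makes it a confounder.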

The researchers then introduce a causal intervention module (CIS) to deconfound the confounder "Subject" in the causal diagram. By applying the CIS module, the model can learn representations that are less dependent on the subject-specific information, improving its ability to generalize to new individuals.
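A standard way to remove a confounder in causal inference is backdoor adjustment, which replaces the observational quantity P(Y | X) with P(Y | do(X)) = sum over s of P(Y | X, s) P(s). The summary does not say how CIS implements its intervention, so the following is only a toy numerical sketch of the general idea on a made-up discrete distribution. Every probability and name in it is an illustrative assumption, not a value or component from the paper.

```python
# Toy setup: subject s, image feature x, and AU label y are all binary.
# All numbers are invented for illustration.
P_S = {0: 0.5, 1: 0.5}                      # P(s)
P_X1_GIVEN_S = {0: 0.2, 1: 0.8}             # P(x=1 | s)
P_Y1_GIVEN_XS = {                           # P(y=1 | x, s), keyed by (x, s)
    (0, 0): 0.1, (1, 0): 0.3,
    (0, 1): 0.5, (1, 1): 0.7,
}


def p_x_given_s(x, s):
    """P(x | s) for binary x."""
    p1 = P_X1_GIVEN_S[s]
    return p1 if x == 1 else 1.0 - p1


def observational(x):
    """P(y=1 | x): subjects are weighted by P(s | x), so the subject leaks in."""
    joint = {s: p_x_given_s(x, s) * P_S[s] for s in P_S}
    total = sum(joint.values())
    return sum(P_Y1_GIVEN_XS[(x, s)] * joint[s] / total for s in P_S)


def interventional(x):
    """P(y=1 | do(x)): backdoor adjustment weights subjects by the prior P(s)."""
    return sum(P_Y1_GIVEN_XS[(x, s)] * P_S[s] for s in P_S)


if __name__ == "__main__":
    print("observational contrast:", observational(1) - observational(0))
    print("adjusted contrast:", interventional(1) - interventional(0))
```

In this toy distribution the feature raises the AU probability by 0.2 within each subject. The adjusted contrast recovers that value, while the unadjusted contrast is about 0.44, because the subject drives both the feature and the label.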

The proposed CISNet model, which integrates the CIS module, is evaluated on two widely used AU recognition datasets: BP4D and DISFA. The experimental results demonstrate the effectiveness of the CIS module and show that CISNet achieves state-of-the-art performance in subject-invariant AU recognition tasks.

Critical Analysis

The paper presents a well-designed causal inference framework to address the challenging problem of subject-invariant facial AU recognition. The authors' approach of explicitly modeling the causal relationships in the task is a thoughtful and principled way to tackle the confounding effects of subject-specific factors.

However, the paper does not discuss the potential limitations or caveats of the proposed method. For example, it is unclear how the CIS module would perform in scenarios with limited training data or a large number of subjects. Additionally, the paper does not explore the interpretability or explainability of the learned representations, which could be an important consideration for real-world applications of facial analysis systems.

Furthermore, the paper could have benefited from a more thorough comparison to related work in the field of subject-invariant or domain-adaptive facial analysis. Discussing how the proposed approach differs from or builds upon existing methods would help readers better understand the novelty and contributions of this research.

Conclusion

This paper presents a causal inference framework for subject-invariant facial action unit recognition, a challenging problem in facial analysis. By modeling the causal relationships among relevant variables, the researchers developed a causal intervention module (CIS) that can effectively remove the confounding effects of subject-specific factors. The extensive experiments on benchmark datasets demonstrate the effectiveness of the CIS module and the state-of-the-art performance of the CISNet model.

This work represents an important step forward in improving the robustness and generalization capabilities of facial analysis systems, which have numerous applications in areas such as human-computer interaction, emotion recognition, and video action reasoning. Further research is needed to address the potential limitations and explore the interpretability of the learned representations, but this paper lays a strong foundation for advancing the field of subject-invariant facial analysis.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Causal Intervention for Subject-Deconfounded Facial Action Unit Recognition

Yingjie Chen, Diqi Chen, Tao Wang, Yizhou Wang, Yun Liang

Subject-invariant facial action unit (AU) recognition remains challenging for the reason that the data distribution varies among subjects. In this paper, we propose a causal inference framework for subject-invariant facial action unit recognition. To illustrate the causal effect existing in AU recognition task, we formulate the causalities among facial images, subjects, latent AU semantic relations, and estimated AU occurrence probabilities via a structural causal model. By constructing such a causal diagram, we clarify the causal effect among variables and propose a plug-in causal intervention module, CIS, to deconfound the confounder Subject in the causal diagram. Extensive experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, show the effectiveness of our CIS, and the model with CIS inserted, CISNet, has achieved state-of-the-art performance.

4/4/2024

Learning Contrastive Feature Representations for Facial Action Unit Detection

Ziqiao Shang, Bin Liu, Fengmao Lv, Fei Teng, Tianrui Li

Facial action unit (AU) detection has long encountered the challenge of detecting subtle feature differences when AUs activate. Existing methods often rely on encoding pixel-level information of AUs, which not only encodes additional redundant information but also leads to increased model complexity and limited generalizability. Additionally, the accuracy of AU detection is negatively impacted by the class imbalance issue of each AU type, and the presence of noisy and false AU labels. In this paper, we introduce a novel contrastive learning framework aimed for AU detection that incorporates both self-supervised and supervised signals, thereby enhancing the learning of discriminative features for accurate AU detection. To tackle the class imbalance issue, we employ a negative sample re-weighting strategy that adjusts the step size of updating parameters for minority and majority class samples. Moreover, to address the challenges posed by noisy and false AU labels, we employ a sampling technique that encompasses three distinct types of positive sample pairs. This enables us to inject self-supervised signals into the supervised signal, effectively mitigating the adverse effects of noisy labels. Our experimental assessments, conducted on four widely-utilized benchmark datasets (BP4D, DISFA, GFT and Aff-Wild2), underscore the superior performance of our approach compared to state-of-the-art methods of AU detection. Our code is available at https://github.com/Ziqiao-Shang/AUNCE.

9/24/2024

One-Frame Calibration with Siamese Network in Facial Action Unit Recognition

Shuangquan Feng, Virginia R. de Sa

Automatic facial action unit (AU) recognition is used widely in facial expression analysis. Most existing AU recognition systems aim for cross-participant non-calibrated generalization (NCG) to unseen faces without further calibration. However, due to the diversity of facial attributes across different identities, accurately inferring AU activation from single images of an unseen face is sometimes infeasible, even for human experts -- it is crucial to first understand how the face appears in its neutral expression, or significant bias may be incurred. Therefore, we propose to perform one-frame calibration (OFC) in AU recognition: for each face, a single image of its neutral expression is used as the reference image for calibration. With this strategy, we develop a Calibrating Siamese Network (CSN) for AU recognition and demonstrate its remarkable effectiveness with a simple iResNet-50 (IR50) backbone. On the DISFA, DISFA+, and UNBC-McMaster datasets, we show that our OFC CSN-IR50 model (a) substantially improves the performance of IR50 by mitigating facial attribute biases (including biases due to wrinkles, eyebrow positions, facial hair, etc.), (b) substantially outperforms the naive OFC method of baseline subtraction as well as (c) a fine-tuned version of this naive OFC method, and (d) also outperforms state-of-the-art NCG models for both AU intensity estimation and AU detection.

9/4/2024

Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues

Soufiane Belharbi, Marco Pedersoli, Alessandro Lameiras Koerich, Simon Bacon, Eric Granger

Although state-of-the-art classifiers for facial expression recognition (FER) can achieve a high level of accuracy, they lack interpretability, an important feature for end-users. Experts typically associate spatial action units (AUs) from a codebook to facial regions for the visual interpretation of expressions. In this paper, the same expert steps are followed. A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, allowing to train deep interpretable models. During training, this AU codebook is used, along with the input image expression label, and facial landmarks, to construct an AU heatmap that indicates the most discriminative image regions of interest w.r.t the facial expression. This valuable spatial cue is leveraged to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of a classifier to be correlated with AU heatmaps. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with AU maps, simulating the expert decision process. Our strategy only relies on image class expression for supervision, without additional manual annotations. Our new strategy is generic, and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time. Our extensive evaluation on two public benchmarks, the RAF-DB and AffectNet datasets, shows that our proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifiers that rely on class activation mapping (CAM) methods, and show that our approach can also improve CAM interpretability.

5/15/2024