AU-vMAE: Knowledge-Guide Action Units Detection via Video Masked Autoencoder

Read original: arXiv:2407.11468 - Published 7/17/2024 by Qiaoqiao Jin, Rui Shi, Yishun Dou, Bingbing Ni

Overview

This paper presents AU-vMAE, a novel method for detecting facial Action Units (AUs) in video data using a Video Masked Autoencoder (vMAE) architecture.
The key idea is knowledge guidance: the model is trained not only to reconstruct the original video but also to predict the ground-truth AU labels. According to the paper's abstract, inter-frame and intra-frame AU pair state matrices additionally serve as prior knowledge during training.
The authors report that this approach outperforms state-of-the-art AU detection methods on the BP4D and DISFA benchmark datasets.

Plain English Explanation

The paper describes a new way to automatically detect facial expressions in video. Facial expressions are often described in terms of Action Units (AUs), which are the basic movements of different facial muscles.

The proposed method, called AU-vMAE, uses a Video Masked Autoencoder (vMAE) to analyze the video frames. An autoencoder is a type of neural network that learns to compress and then reconstruct its input. In this case, the network is trained not only to reconstruct the original video, but also to predict the correct AU labels for each frame.

The key insight is that guiding the network with the ground-truth AU labels during training lets it learn more robust and accurate features for detecting AUs. The authors show that this "knowledge-guided" approach outperforms other state-of-the-art AU detection methods on the BP4D and DISFA benchmark datasets.

Technical Explanation

The core of the AU-vMAE model is a Video Masked Autoencoder (vMAE) architecture. The encoder takes in a sequence of video frames and learns a compressed representation. The decoder then tries to reconstruct the original video frames from this representation.
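
To make the masked-autoencoder idea concrete, here is a minimal pure-Python sketch. It assumes tube-style masking (one spatial mask repeated across all frames), as in the original VideoMAE; this summary does not say which masking scheme or ratio AU-vMAE uses. The names tube_mask and masked_mse, the 4x4 patch grid, and the 0.75 mask ratio are illustrative assumptions, and each patch is reduced to a single number for brevity.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, seed=0):
    """Return a [num_frames][grid_h * grid_w] boolean mask (True = hidden).

    Tube masking (an assumption here): one spatial mask is sampled once
    and repeated for every frame of the clip.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    spatial = [i in masked_ids for i in range(num_patches)]
    return [list(spatial) for _ in range(num_frames)]


def masked_mse(pred, target, mask):
    """Mean squared error computed over the hidden patches only."""
    total, count = 0.0, 0
    for f in range(len(mask)):
        for p in range(len(mask[f])):
            if mask[f][p]:
                total += (pred[f][p] - target[f][p]) ** 2
                count += 1
    return total / count if count else 0.0


# Toy clip: 8 frames, 16 patches per frame, every patch value is 1.0.
mask = tube_mask(num_frames=8, grid_h=4, grid_w=4, mask_ratio=0.75)
target = [[1.0] * 16 for _ in range(8)]
pred = [[0.0] * 16 for _ in range(8)]  # a deliberately poor reconstruction
loss = masked_mse(pred, target, mask)
```

In a real model the encoder would see only the visible patches and the decoder would produce pred; the toy values above just exercise the masking and the loss.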

However, unlike a standard autoencoder, AU-vMAE is also trained to predict the ground truth AU labels for each video frame. This is achieved by adding a parallel AU prediction head to the decoder. The loss function for training the model combines the reconstruction loss and the AU prediction loss, encouraging the network to learn features that are useful for both tasks.
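
The combined objective described above can be sketched as a weighted sum of the two terms. This is a hedged illustration, not the paper's formulation: mean squared error for reconstruction, binary cross-entropy for multi-label AU prediction, and the au_weight balancing factor are all assumptions, as are the function names.

```python
import math


def bce_with_logits(logits, labels):
    """Mean binary cross-entropy over independent AU labels (multi-label)."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form: max(z, 0) - z * y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def mse(pred, target):
    """Mean squared error between reconstructed and original values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(recon, pixels, au_logits, au_labels, au_weight=1.0):
    """Reconstruction loss plus weighted AU prediction loss (assumed form)."""
    return mse(recon, pixels) + au_weight * bce_with_logits(au_logits, au_labels)


# One frame, three pixel values, two AUs (hypothetical numbers).
example = joint_loss(
    recon=[0.5, 0.5, 0.5],
    pixels=[0.5, 0.5, 0.5],
    au_logits=[0.0, 0.0],
    au_labels=[1.0, 0.0],
)
```

Setting au_weight to 0 recovers a plain reconstruction objective, which is the baseline the knowledge-guided training is contrasted with.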

The authors hypothesize that this "knowledge-guided" training approach helps the model learn more robust and discriminative features for AU detection, compared to training solely for video reconstruction. They evaluate AU-vMAE on several benchmark datasets, including BP4D and DISFA, and show that it outperforms state-of-the-art methods in terms of AU detection accuracy.
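
AU detection results on BP4D and DISFA are commonly reported as a per-AU F1 score averaged over AUs, so "accuracy" above most likely refers to that kind of metric; this is an assumption, since the summary does not name the metric. The sketch below shows how such a score could be computed from binary frame-level predictions. The function names and the toy data are illustrative.

```python
def f1_per_au(preds, labels):
    """Return one F1 score per AU.

    preds and labels are lists of frames; each frame is a list of 0/1
    values, one per AU.
    """
    num_aus = len(labels[0])
    scores = []
    for k in range(num_aus):
        tp = sum(1 for p, y in zip(preds, labels) if p[k] == 1 and y[k] == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p[k] == 1 and y[k] == 0)
        fn = sum(1 for p, y in zip(preds, labels) if p[k] == 0 and y[k] == 1)
        denom = 2 * tp + fp + fn
        # Convention used here: F1 is 0.0 when an AU never occurs and is never predicted.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def mean_f1(preds, labels):
    """Average the per-AU F1 scores."""
    scores = f1_per_au(preds, labels)
    return sum(scores) / len(scores)


# Four frames, two AUs (toy data).
labels = [[1, 0], [1, 1], [0, 1], [0, 0]]
preds = [[1, 0], [0, 1], [0, 1], [1, 0]]
per_au = f1_per_au(preds, labels)
average = mean_f1(preds, labels)
```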

Critical Analysis

The paper presents a novel and promising approach to AU detection in video data. The key idea of leveraging AU label information during training is well-motivated and the experimental results are compelling. However, there are a few potential limitations and areas for further research:

AU-vMAE likely still leaves room for improvement, especially on more challenging datasets and for subtle AU activations.
The paper does not provide much insight into the internal representations learned by the model or how the knowledge guidance affects the learned features. More analysis in this direction could help understand the model's strengths and weaknesses.
The experiments are conducted on relatively small datasets, and it would be important to validate the approach on larger, more diverse datasets to ensure its robustness and generalization capability.
The computational complexity and training time of the AU-vMAE model are not discussed, which could be an important practical consideration for real-world applications.

Overall, the AU-vMAE method represents a valuable contribution to the field of facial expression analysis, and the knowledge-guided approach is an interesting direction for further research and development.

Conclusion

The AU-vMAE model proposed in this paper demonstrates a novel and effective way to leverage knowledge guidance for improved facial Action Unit detection in video data. By training the model to not only reconstruct the input video but also predict the ground truth AU labels, the authors show that it can learn more robust and discriminative features for this task.

The promising results on benchmark datasets suggest that the knowledge-guided approach could be a valuable addition to the toolbox of facial expression analysis techniques. Further research to address the potential limitations and explore the model's internal representations could lead to even more accurate and insightful AU detection systems.



Related Papers

AU-vMAE: Knowledge-Guide Action Units Detection via Video Masked Autoencoder

Qiaoqiao Jin, Rui Shi, Yishun Dou, Bingbing Ni

Current Facial Action Unit (FAU) detection methods generally encounter difficulties due to the scarcity of labeled video training data and the limited number of training face IDs, which renders the trained feature extractor insufficient coverage for modeling the large diversity of inter-person facial structures and movements. To explicitly address the above challenges, we propose a novel video-level pre-training scheme by fully exploring the multi-label property of FAUs in the video as well as the temporal label consistency. At the heart of our design is a pre-trained video feature extractor based on the video-masked autoencoder together with a fine-tuning network that jointly completes the multi-level video FAUs analysis tasks, i.e., integrating both video-level and frame-level FAU detections, thus dramatically expanding the supervision set from sparse FAUs annotations to ALL video frames including masked ones. Moreover, we utilize inter-frame and intra-frame AU pair state matrices as prior knowledge to guide network training instead of traditional Graph Neural Networks, for better temporal supervision. Our approach demonstrates substantial enhancement in performance compared to the existing state-of-the-art methods used in BP4D and DISFA FAUs datasets.

7/17/2024

Representation Learning and Identity Adversarial Training for Facial Behavior Understanding

Mang Ning, Albert Ali Salah, Itir Onal Ertugrul

Facial Action Unit (AU) detection has gained significant research attention as AUs contain complex expression information. In this paper, we unpack two fundamental factors in AU detection: data and subject identity regularization, respectively. Motivated by recent advances in foundation models, we highlight the importance of data and collect a diverse dataset Face9M, comprising 9 million facial images, from multiple public resources. Pretraining a masked autoencoder on Face9M yields strong performance in AU detection and facial expression tasks. We then show that subject identity in AU datasets provides a shortcut learning for the model and leads to sub-optimal solutions to AU predictions. To tackle this generic issue of AU tasks, we propose Identity Adversarial Training (IAT) and demonstrate that a strong IAT regularization is necessary to learn identity-invariant features. Furthermore, we elucidate the design space of IAT and empirically show that IAT circumvents the identity shortcut learning and results in a better solution. Our proposed methods, Facial Masked Autoencoder (FMAE) and IAT, are simple, generic and effective. Remarkably, the proposed FMAE-IAT approach achieves new state-of-the-art F1 scores on BP4D (67.1%), BP4D+ (66.8%), and DISFA (70.1%) databases, significantly outperforming previous work. We release the code and model at https://github.com/forever208/FMAE-IAT, the first open-sourced facial model pretrained on 9 million diverse images.

7/17/2024

Towards End-to-End Explainable Facial Action Unit Recognition via Vision-Language Joint Learning

Xuri Ge, Junchen Fu, Fuhai Chen, Shan An, Nicu Sebe, Joemon M. Jose

Facial action units (AUs), as defined in the Facial Action Coding System (FACS), have received significant research interest owing to their diverse range of applications in facial state analysis. Current mainstream FAU recognition models have a notable limitation, i.e., focusing only on the accuracy of AU recognition and overlooking explanations of corresponding AU states. In this paper, we propose an end-to-end Vision-Language joint learning network for explainable FAU recognition (termed VL-FAU), which aims to reinforce AU representation capability and language interpretability through the integration of joint multimodal tasks. Specifically, VL-FAU brings together language models to generate fine-grained local muscle descriptions and distinguishable global face description when optimising FAU recognition. Through this, the global facial representation and its local AU representations will achieve higher distinguishability among different AUs and different subjects. In addition, multi-level AU representation learning is utilised to improve AU individual attention-aware representation capabilities based on multi-scale combined facial stem feature. Extensive experiments on DISFA and BP4D AU datasets show that the proposed approach achieves superior performance over the state-of-the-art methods on most of the metrics. In addition, compared with mainstream FAU recognition methods, VL-FAU can provide local- and global-level interpretability language descriptions with the AUs' predictions.

8/2/2024

Towards Unified Facial Action Unit Recognition Framework by Large Language Models

Guohong Hu, Xing Lan, Hanyu Jiang, Jiayi Lyu, Jian Xue

Facial Action Units (AUs) are of great significance in the realm of affective computing. In this paper, we propose AU-LLaVA, the first unified AU recognition framework based on the Large Language Model (LLM). AU-LLaVA consists of a visual encoder, a linear projector layer, and a pre-trained LLM. We meticulously craft the text descriptions and fine-tune the model on various AU datasets, allowing it to generate different formats of AU recognition results for the same input image. On the BP4D and DISFA datasets, AU-LLaVA delivers the most accurate recognition results for nearly half of the AUs. Our model achieves improvements of F1-score up to 11.4% in specific AU recognition compared to previous benchmark results. On the FEAFA dataset, our method achieves significant improvements over all 24 AUs compared to previous benchmark results. AU-LLaVA demonstrates exceptional performance and versatility in AU recognition.

9/16/2024