CAGE: Circumplex Affect Guided Expression Inference

2404.14975

Published 4/24/2024 by Niklas Wagner, Felix Matzler, Samed R. Vossberg, Helen Schneider, Svetlana Pavlitska, J. Marius Zollner

cs.CV


Abstract

Understanding emotions and expressions is a task of interest across multiple disciplines, especially for improving user experiences. Contrary to the common perception, it has been shown that emotions are not discrete entities but instead exist along a continuum. People understand discrete emotions differently due to a variety of factors, including cultural background, individual experiences, and cognitive biases. Therefore, most approaches to expression understanding, particularly those relying on discrete categories, are inherently biased. In this paper, we present a comparative in-depth analysis of two common datasets (AffectNet and EMOTIC) equipped with the components of the circumplex model of affect. Further, we propose a model for the prediction of facial expressions tailored for lightweight applications. Using a small-scaled MaxViT-based model architecture, we evaluate the impact of discrete expression category labels in training with the continuous valence and arousal labels. We show that considering valence and arousal in addition to discrete category labels helps to significantly improve expression inference. The proposed model outperforms the current state-of-the-art models on AffectNet, establishing it as the best-performing model for inferring valence and arousal achieving a 7% lower RMSE. Training scripts and trained weights to reproduce our results can be found here: https://github.com/wagner-niklas/CAGE_expression_inference.
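The headline result above is stated as a 7% lower RMSE. The following is a minimal sketch, not the authors' evaluation code. It shows how to compute the root mean squared error for one continuous dimension (for example valence), and the relative reduction of one model's RMSE against a baseline. The function names and all numbers in the example are hypothetical and are not results from the paper.

```python
import math
from typing import Sequence


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Root mean squared error between predicted and annotated values."""
    if len(predictions) != len(targets) or len(targets) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(new_error: float, baseline_error: float) -> float:
    """Fraction by which new_error is lower than baseline_error (0.07 means 7% lower)."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical valence predictions and annotations on a [-1, 1] scale.
valence_pred = [0.5, -0.2, 0.1, 0.8]
valence_true = [0.6, -0.4, 0.1, 0.5]
print(round(rmse(valence_pred, valence_true), 4))  # 0.1871
# Hypothetical RMSE values: 0.372 against a baseline of 0.400 is a 7% reduction.
print(round(relative_reduction(0.372, 0.400), 2))  # 0.07
```

A relative figure such as "7% lower" depends on the baseline's absolute RMSE, so the absolute values are needed to judge the size of the gain.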


Overview

Emotions exist on a continuum, not in discrete categories
People understand emotions differently due to factors like culture and experiences
Most current approaches to understanding expressions are biased by relying on discrete categories
This paper presents a comparative analysis of two emotion datasets (AffectNet and EMOTIC) and proposes a model for predicting facial expressions that considers both discrete categories and continuous dimensions of emotion (valence and arousal)

Plain English Explanation

Emotions are not simple, black-and-white things: they exist on a spectrum. How people perceive and understand emotions can vary considerably based on their cultural background, personal experiences, and even cognitive biases. Most current approaches to recognizing facial expressions and emotions sort them into discrete "boxes," which can be an oversimplification.

This research paper takes a closer look at two common datasets used for studying emotions and expressions, AffectNet and EMOTIC. It then proposes a new model for predicting facial expressions that considers both the traditional discrete emotion categories and the continuous dimensions of valence (how pleasant or unpleasant an emotion is) and arousal (how calm or excited it is). The researchers found that combining both types of information significantly improves the accuracy of expression recognition.

Technical Explanation

The paper begins by highlighting the limitations of the common perception that emotions are discrete, rather than existing on a continuum. It notes that people's understanding of emotions is influenced by factors like cultural background, individual experiences, and cognitive biases. Therefore, approaches relying solely on discrete emotion categories are inherently biased.

The researchers then present a comparative analysis of two popular emotion datasets, AffectNet and EMOTIC, which are annotated with components of the circumplex model of affect. This model represents emotions along the dimensions of valence (positive to negative) and arousal (calm to excited).
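Because the circumplex model places every emotion at a point in the valence/arousal plane, a label can equally be described by its angle around the neutral origin and its distance from it (intensity). The sketch below assumes both dimensions are scaled to [-1, 1], as AffectNet's valence and arousal labels are. It also includes a hypothetical helper that linearly rescales ratings on a 1 to 10 scale, which is how EMOTIC's continuous dimensions are commonly reported. The helper names and quadrant labels are illustrative assumptions, not taken from the paper.

```python
import math
from typing import Tuple


def rescale_1_to_10(value: float) -> float:
    """Hypothetical linear map from a 1..10 rating scale onto [-1, 1]."""
    return (value - 5.5) / 4.5


def to_polar(valence: float, arousal: float) -> Tuple[float, float]:
    """Return (angle in degrees within [0, 360), intensity) for a circumplex point."""
    angle = math.degrees(math.atan2(arousal, valence)) % 360.0
    intensity = math.hypot(valence, arousal)
    return angle, intensity


def quadrant(valence: float, arousal: float) -> str:
    """Coarse circumplex quadrant; zero counts as positive/high. Labels are illustrative."""
    if valence >= 0 and arousal >= 0:
        return "positive valence, high arousal"  # e.g. excited
    if valence < 0 and arousal >= 0:
        return "negative valence, high arousal"  # e.g. angry, afraid
    if valence < 0:
        return "negative valence, low arousal"  # e.g. sad, bored
    return "positive valence, low arousal"  # e.g. calm, content


print(rescale_1_to_10(10.0))  # 1.0
print(quadrant(0.7, 0.6))  # positive valence, high arousal
```

Putting labels from both datasets on a common [-1, 1] scale is what makes a side-by-side comparison of their valence/arousal distributions meaningful; the exact normalization the authors used is not stated in this summary.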

Next, the paper proposes a new model for predicting facial expressions, designed for lightweight applications. The model uses a small MaxViT-based architecture, and the authors evaluate the impact of training with discrete expression category labels alongside the continuous valence and arousal labels. The results show that considering valence and arousal in addition to discrete categories significantly improves expression inference. On AffectNet, the model outperforms the current state-of-the-art models for inferring valence and arousal, achieving a 7% lower RMSE.
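The summary does not spell out the exact training objective. One common way to learn from discrete labels and continuous valence/arousal at the same time is to attach two heads to a shared backbone and sum a classification loss with a regression loss. The pure-Python sketch below computes such a combined loss for a single sample. The choice of cross-entropy plus mean squared error, the weighting factor `va_weight`, and all function names are assumptions for illustration, not the paper's confirmed configuration.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target_class: int) -> float:
    """Softmax cross-entropy for one sample, computed stably via log-sum-exp."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_class]


def mse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error over the continuous outputs (here: valence, arousal)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def combined_loss(
    class_logits: Sequence[float],
    target_class: int,
    va_pred: Sequence[float],
    va_target: Sequence[float],
    va_weight: float = 1.0,  # hypothetical balance between the two tasks
) -> float:
    """Classification loss plus weighted valence/arousal regression loss."""
    return cross_entropy(class_logits, target_class) + va_weight * mse(va_pred, va_target)


# Hypothetical outputs of a shared backbone with two heads for one face image:
# three category logits and one (valence, arousal) pair.
loss = combined_loss([2.0, 0.5, -1.0], 0, [0.4, 0.2], [0.5, 0.0])
print(round(loss, 4))
```

Setting `va_weight=0.0` reduces this to training on discrete categories alone, which is the kind of baseline the comparison in the paper is framed against.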

Critical Analysis

The paper acknowledges that emotions are complex and multifaceted, and that relying solely on discrete categories can be an oversimplification. By incorporating the continuous dimensions of valence and arousal, the proposed model appears to provide a more nuanced and accurate representation of emotional expressions.

However, the paper does not discuss the potential limitations or challenges of this approach in detail. For example, it is unclear how well the model would generalize to more diverse or culturally specific emotional expressions, or how it would perform in real-world applications with noisy or partial data.

Additionally, although the model is built on a small MaxViT variant for lightweight applications, the paper does not explore in depth how accuracy trades off against computational cost when deploying it in resource-constrained settings. Further research may be needed to understand the practical implications and feasibility of this approach.

Conclusion

This research paper challenges the common perception of emotions as discrete categories and presents a novel model for predicting facial expressions that considers both categorical and dimensional information. By incorporating the continuous dimensions of valence and arousal, the proposed model demonstrates improved performance over approaches relying only on discrete emotion labels.

The findings suggest that a more nuanced, context-aware understanding of emotions can lead to advancements in areas like user experience and emotion recognition. As the field continues to evolve, further research exploring the practical implications and limitations of this approach could help pave the way for more robust and inclusive emotion-based technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Modeling Emotional Trajectories in Written Stories Utilizing Transformers and Weakly-Supervised Learning

Lukas Christ, Shahin Amiriparian, Manuel Milling, Ilhan Aslan, Bjorn W. Schuller

Telling stories is an integral part of human communication which can evoke emotions and influence the affective states of the audience. Automatically modeling emotional trajectories in stories has thus attracted considerable scholarly interest. However, as most existing works have been limited to unsupervised dictionary-based approaches, there is no benchmark for this task. We address this gap by introducing continuous valence and arousal labels for an existing dataset of children's stories originally annotated with discrete emotion categories. We collect additional annotations for this data and map the categorical labels to the continuous valence and arousal space. For predicting the thus obtained emotionality signals, we fine-tune a DeBERTa model and improve upon this baseline via a weakly supervised learning approach. The best configuration achieves a Concordance Correlation Coefficient (CCC) of 0.8221 for valence and 0.7125 for arousal on the test set, demonstrating the efficacy of our proposed approach. A detailed analysis shows the extent to which the results vary depending on factors such as the author, the individual story, or the section within the story. In addition, we uncover the weaknesses of our approach by investigating examples that prove to be difficult to predict.

6/5/2024

cs.CL cs.AI
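The story-emotion paper above reports its results as the Concordance Correlation Coefficient (CCC), a metric widely used for continuous valence and arousal prediction. What follows is a minimal sketch of Lin's CCC using population (1/n) moments. The function name is illustrative, and this is not that paper's evaluation code.

```python
from typing import Sequence


def concordance_correlation_coefficient(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's CCC with population moments:
    2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y)) ** 2)
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2 * cov_xy / denominator


# A constant offset of +1 keeps Pearson correlation at 1 but lowers CCC to 4/7.
print(round(concordance_correlation_coefficient([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]), 4))  # 0.5714
```

Unlike Pearson correlation, CCC penalizes systematic offsets and scale mismatches between predictions and labels. This is why it is often preferred over plain correlation for continuous affect.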

Cluster-to-Predict Affect Contours from Speech

Gokhan Kuşçu, Engin Erzin

Continuous emotion recognition (CER) aims to track the dynamic changes in a person's emotional state over time. This paper proposes a novel approach to translating CER into a prediction problem of dynamic affect-contour clusters from speech, where the affect-contour is defined as the contour of annotated affect attributes in a temporal window. Our approach defines a cluster-to-predict (C2P) framework that learns affect-contour clusters, which are predicted from speech with higher precision. To achieve this, C2P runs an unsupervised iterative optimization process to learn affect-contour clusters by minimizing both clustering loss and speech-driven affect-contour prediction loss. Our objective findings demonstrate the value of speech-driven clustering for both arousal and valence attributes. Experiments conducted on the RECOLA dataset yielded promising classification results, with F1 scores of 0.84 for arousal and 0.75 for valence in our four-class speech-driven affect-contour prediction model.

6/6/2024

eess.AS cs.HC

Robust Emotion Recognition in Context Debiasing

Dingkang Yang, Kun Yang, Mingcheng Li, Shunli Wang, Shuaibing Wang, Lihua Zhang

Context-aware emotion recognition (CAER) has recently boosted the practical applications of affective computing techniques in unconstrained environments. Mainstream CAER methods invariably extract ensemble representations from diverse contexts and subject-centred characteristics to perceive the target person's emotional state. Despite advancements, the biggest challenge remains due to context bias interference. The harmful bias forces the models to rely on spurious correlations between background contexts and emotion labels in likelihood estimation, causing severe performance bottlenecks and confounding valuable context priors. In this paper, we propose a counterfactual emotion inference (CLEF) framework to address the above issue. Specifically, we first formulate a generalized causal graph to decouple the causal relationships among the variables in CAER. Following the causal graph, CLEF introduces a non-invasive context branch to capture the adverse direct effect caused by the context bias. During the inference, we eliminate the direct context effect from the total causal effect by comparing factual and counterfactual outcomes, resulting in bias mitigation and robust prediction. As a model-agnostic framework, CLEF can be readily integrated into existing methods, bringing consistent performance gains.

6/4/2024

cs.CV cs.LG

Improved Text Emotion Prediction Using Combined Valence and Arousal Ordinal Classification

Michael Mitsios, Georgios Vamvoukakis, Georgia Maniati, Nikolaos Ellinas, Georgios Dimitriou, Konstantinos Markopoulos, Panos Kakoulidis, Alexandra Vioni, Myrsini Christidou, Junkwang Oh, Gunu Jho, Inchul Hwang, Georgios Vardaxoglou, Aimilios Chalamandaris, Pirros Tsiakoulis, Spyros Raptis

Emotion detection in textual data has received growing interest in recent years, as it is pivotal for developing empathetic human-computer interaction systems. This paper introduces a method for categorizing emotions from text, which acknowledges and differentiates between the diversified similarities and distinctions of various emotions. Initially, we establish a baseline by training a transformer-based model for standard emotion classification, achieving state-of-the-art performance. We argue that not all misclassifications are of the same importance, as there are perceptual similarities among emotional classes. We thus redefine the emotion labeling problem by shifting it from a traditional classification model to an ordinal classification one, where discrete emotions are arranged in a sequential order according to their valence levels. Finally, we propose a method that performs ordinal classification in the two-dimensional emotion space, considering both valence and arousal scales. The results show that our approach not only preserves high accuracy in emotion prediction but also significantly reduces the magnitude of errors in cases of misclassification.

4/3/2024

cs.LG