Decoding Emotions in Abstract Art: Cognitive Plausibility of CLIP in Recognizing Color-Emotion Associations

Read original: arXiv:2405.06319 - Published 5/13/2024 by Hanna-Sophia Widhoelzl, Ece Takmaz

Overview

This study investigates how well a pretrained multimodal model, CLIP, can recognize emotions evoked by abstract visual art.
The researchers use a dataset of images with emotion labels and textual explanations provided by human annotators.
They perform linguistic analysis of the emotion explanations, test CLIP's ability to classify emotions in images and texts, and explore color-emotion associations.
The results suggest that CLIP's emotion recognition does not fully align with human cognitive processes when it comes to abstract art.

Plain English Explanation

The researchers wanted to see how well a powerful AI model called CLIP can understand the emotions that people feel when looking at abstract art. CLIP is trained on a huge amount of image and text data, and it's really good at recognizing objects and understanding language. But the researchers wondered if CLIP would be able to capture the more complex, subjective emotions that humans experience when looking at abstract, non-representational art.

To test this, the researchers used a dataset of abstract art images that had been labeled by humans with different emotion categories, like "anger" or "joy." The humans also provided short explanations for why they chose those emotion labels. The researchers then looked at how CLIP performed on two tasks: 1) Classifying the emotions in the images, and 2) Classifying the emotions in the human-written explanations.

The results showed that CLIP could identify the emotions to some degree: its accuracy was relatively low, though above baseline. This suggests that CLIP's way of processing emotions doesn't fully match how humans respond to the emotional complexities of abstract art. The researchers also found connections between the colors used in the art and the emotions associated with them. Expected links, such as red with "anger," appeared in both the human annotations and CLIP's predictions, and they were stronger for CLIP.

Overall, this study highlights the differences between how AI models and human brains perceive and process the emotional content of abstract visual art. It's an important step in understanding the limitations of current AI systems when it comes to navigating the nuances of human cognition and experience.

Technical Explanation

The researchers employed a dataset of abstract art images with associated emotion labels and textual rationales provided by human annotators. They conducted linguistic analyses of the rationales, performed zero-shot emotion classification of the images and rationales using the CLIP model, applied similarity-based prediction of emotion, and investigated color-emotion associations.
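As a rough illustration of the zero-shot step, the sketch below picks the emotion whose text-prompt embedding is most similar to an image (or rationale) embedding. It is a minimal sketch, not the authors' code: the 3-dimensional vectors are toy stand-ins for CLIP embeddings, and the label set and the names `cosine`, `softmax`, and `zero_shot_classify` are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Turn similarity scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(item_embedding, label_embeddings):
    """Return the best-matching label and the probability of each label."""
    labels = list(label_embeddings)
    sims = [cosine(item_embedding, label_embeddings[name]) for name in labels]
    probs = softmax(sims)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


# Toy embeddings (assumed). In practice these would come from a CLIP text
# encoder applied to one prompt per emotion label.
LABEL_EMBEDDINGS = {
    "anger": [0.9, 0.1, 0.0],
    "joy": [0.1, 0.9, 0.1],
    "sadness": [0.0, 0.2, 0.9],
}

# Toy embedding of one image or rationale (assumed).
item_embedding = [0.8, 0.3, 0.1]

predicted, probabilities = zero_shot_classify(item_embedding, LABEL_EMBEDDINGS)
print(predicted, probabilities)
```

No training on the emotion labels is involved, which is what makes the setup zero-shot: the only inputs are the embeddings of the item and of the candidate label prompts.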

The results showed relatively low, yet above-baseline, accuracy in CLIP's recognition of emotions for the abstract images and rationales. This suggests that CLIP does not decode the emotional complexities of abstract art in a manner that aligns well with human cognitive processes. The researchers also found that expected color-emotion associations, such as red relating to "anger," appeared under both human and CLIP labels, and were stronger under CLIP's labels.
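To make "above baseline" concrete, the sketch below compares classification accuracy with two simple reference points: uniform chance and always predicting the most frequent label. This summary does not say which baseline the paper uses, so both are assumptions, and the labels are invented toy data rather than results from the study.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def chance_baseline(gold):
    """Expected accuracy of guessing uniformly among the observed labels."""
    return 1 / len(set(gold))


def majority_baseline(gold):
    """Accuracy of always predicting the most frequent gold label."""
    most_common_count = Counter(gold).most_common(1)[0][1]
    return most_common_count / len(gold)


# Invented toy labels (assumed), not data from the paper.
gold_labels = ["anger", "joy", "joy", "sadness", "fear", "joy", "anger", "fear"]
model_labels = ["anger", "joy", "sadness", "sadness", "joy", "joy", "fear", "fear"]

model_accuracy = accuracy(gold_labels, model_labels)
chance = chance_baseline(gold_labels)
majority = majority_baseline(gold_labels)
print(model_accuracy, chance, majority)
```

A model can clear both reference points and still be far from agreeing with human annotators, which is the situation the summary describes.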

These findings highlight the disparity between human and machine processing when it comes to connecting visual features, like color, with emotional responses. The researchers note that this work contributes to the ongoing exploration of the cognitive plausibility of large, pretrained multimodal models like CLIP and the need to further understand their limitations in capturing nuanced human experiences.
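One simple way to quantify a color-emotion association is to ask how the rationales that mention a given color term are distributed over the emotion labels, once with human labels and once with model labels. The sketch below does this for "red". It is an illustrative measure under stated assumptions, not necessarily the paper's method: the function name `emotion_share_for_color`, the rationales, and both label lists are invented for the example.

```python
import re
from collections import Counter


def emotion_share_for_color(rationales, labels, color):
    """Among rationales mentioning `color`, return the share of each label."""
    counts = Counter()
    for text, label in zip(rationales, labels):
        words = set(re.findall(r"[a-z]+", text.lower()))
        if color in words:
            counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return {}  # the color is never mentioned
    return {label: n / total for label, n in counts.items()}


# Invented toy rationales and labels (assumed), not data from the paper.
rationales = [
    "The harsh red strokes feel violent",
    "Deep red and black make it look aggressive",
    "Soft yellow shapes seem cheerful",
    "The red circle feels warm and lively",
    "Cold blue tones look lonely",
]
human_labels = ["anger", "anger", "joy", "joy", "sadness"]
model_labels = ["anger", "anger", "joy", "anger", "sadness"]

human_red = emotion_share_for_color(rationales, human_labels, "red")
model_red = emotion_share_for_color(rationales, model_labels, "red")
print(human_red)
print(model_red)
```

In this toy data the model assigns every rationale that mentions red to "anger", while the human labels split them between "anger" and "joy". The data were constructed to show what a stronger association under model labels would look like; they say nothing about the actual effect sizes in the study.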

Critical Analysis

The researchers acknowledge several caveats and limitations in their study. First, the dataset they used, while comprehensive, may not fully capture the breadth of emotional responses to abstract art. Additionally, the textual rationales provided by human annotators could be influenced by individual biases and may not perfectly reflect the subjective emotional experience.

Furthermore, the researchers note that the zero-shot emotion classification task, while informative, does not necessarily reflect the broader capabilities of the CLIP model. CLIP may perform better on emotion recognition when fine-tuned or used in different contexts. The researchers also suggest that exploring the internal representations and decision-making processes of CLIP could provide further insights into its cognitive plausibility.

It's important to consider that the disparity between human and machine processing of emotions in abstract art may not be unique to CLIP, but may be a general limitation of current AI systems. Advancing the field of affective computing and developing more sophisticated models for understanding human emotions and experiences remains an ongoing challenge.

Conclusion

This study provides valuable insights into the cognitive plausibility of the CLIP model in recognizing emotions evoked by abstract visual art. The findings suggest that while CLIP can provide some level of emotion recognition, its processing does not fully align with human cognitive processes when it comes to the nuanced and complex emotional responses to abstract art.

The research highlights the need for continued exploration of the limitations and capabilities of large, pretrained multimodal models like CLIP, particularly in the realm of capturing subjective human experiences. By understanding these gaps, researchers can work towards developing AI systems that can more effectively engage with and understand the emotional complexities of the human experience.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Decoding Emotions in Abstract Art: Cognitive Plausibility of CLIP in Recognizing Color-Emotion Associations

Hanna-Sophia Widhoelzl, Ece Takmaz

This study investigates the cognitive plausibility of a pretrained multimodal model, CLIP, in recognizing emotions evoked by abstract visual art. We employ a dataset comprising images with associated emotion labels and textual rationales of these labels provided by human annotators. We perform linguistic analyses of rationales, zero-shot emotion classification of images and rationales, apply similarity-based prediction of emotion, and investigate color-emotion associations. The relatively low, yet above baseline, accuracy in recognizing emotion for abstract images and rationales suggests that CLIP decodes emotional complexities in a manner not well aligned with human cognitive processes. Furthermore, we explore color-emotion interactions in images and rationales. Expected color-emotion associations, such as red relating to anger, are identified in images and texts annotated with emotion labels by both humans and CLIP, with the latter showing even stronger interactions. Our results highlight the disparity between human processing and machine processing when connecting image features and emotions.

5/13/2024

Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI

Robert Wolfe, Aayushi Dangol, Alexis Hiniker, Bill Howe

Multimodal AI models capable of associating images and text hold promise for numerous domains, ranging from automated image captioning to accessibility applications for blind and low-vision users. However, uncertainty about bias has in some cases limited their adoption and availability. In the present work, we study 43 CLIP vision-language models to determine whether they learn human-like facial impression biases, and we find evidence that such biases are reflected across three distinct CLIP model families. We show for the first time that the degree to which a bias is shared across a society predicts the degree to which it is reflected in a CLIP model. Human-like impressions of visually unobservable attributes, like trustworthiness and sexuality, emerge only in models trained on the largest dataset, indicating that a better fit to uncurated cultural data results in the reproduction of increasingly subtle social biases. Moreover, we use a hierarchical clustering approach to show that dataset size predicts the extent to which the underlying structure of facial impression bias resembles that of facial impression bias in humans. Finally, we show that Stable Diffusion models employing CLIP as a text encoder learn facial impression biases, and that these biases intersect with racial biases in Stable Diffusion XL-Turbo. While pretrained CLIP models may prove useful for scientific studies of bias, they will also require significant dataset curation when intended for use as general-purpose models in a zero-shot setting.

8/29/2024

Robust Light-Weight Facial Affective Behavior Recognition with CLIP

Li Lin, Sarah Papabathini, Xin Wang, Shu Hu

Human affective behavior analysis aims to delve into human expressions and behaviors to deepen our understanding of human emotions. Basic expression categories (EXPR) and Action Units (AUs) are two essential components in this analysis, which categorize emotions and break down facial movements into elemental units, respectively. Despite advancements, existing approaches in expression classification and AU detection often necessitate complex models and substantial computational resources, limiting their applicability in everyday settings. In this work, we introduce the first lightweight framework adept at efficiently tackling both expression classification and AU detection. This framework employs a frozen CLIP image encoder alongside a trainable multilayer perceptron (MLP), enhanced with Conditional Value at Risk (CVaR) for robustness and a loss landscape flattening strategy for improved generalization. Experimental results on the Aff-wild2 dataset demonstrate superior performance in comparison to the baseline while maintaining minimal computational demands, offering a practical solution for affective behavior analysis. The code is available at https://github.com/Purdue-M2/Affective_Behavior_Analysis_M2_PURDUE

9/10/2024

Does CLIP Bind Concepts? Probing Compositionality in Large Image Models

Martha Lewis, Nihal V. Nayak, Peilin Yu, Qinan Yu, Jack Merullo, Stephen H. Bach, Ellie Pavlick

Large-scale neural network models combining text and images have made incredible progress in recent years. However, it remains an open question to what extent such models encode compositional representations of the concepts over which they operate, such as correctly identifying red cube by reasoning over the constituents red and cube. In this work, we focus on the ability of a large pretrained vision and language model (CLIP) to encode compositional concepts and to bind variables in a structure-sensitive way (e.g., differentiating cube behind sphere from sphere behind cube). To inspect the performance of CLIP, we compare several architectures from research on compositional distributional semantics models (CDSMs), a line of research that attempts to implement traditional compositional linguistic structures within embedding spaces. We benchmark them on three synthetic datasets - single-object, two-object, and relational - designed to test concept binding. We find that CLIP can compose concepts in a single-object setting, but in situations where concept binding is needed, performance drops dramatically. At the same time, CDSMs also perform poorly, with best performance at chance level.

9/2/2024