What makes a face looks like a hat: Decoupling low-level and high-level Visual Properties with Image Triplets

Read original: arXiv:2409.02241 - Published 9/14/2024 by Maytus Piriyajitakonkij, Sirawaj Itthipuripat, Ian Ballard, Ioannis Pappas

Overview

The paper examines how low-level visual features influence visual decision-making, an effect that is hard to isolate because low-level and high-level features (such as object category) are usually highly correlated in the stimuli.
The researchers generated "image triplets" (a root image plus two candidate images) that decouple low-level and high-level similarity, using two convolutional neural networks, CORnet-S and VGG-16, as candidate models of the ventral visual stream. Human participants then chose which candidate looked most similar to the root.
CORnet-S better explained choices driven by high-level similarity, while VGG-16 better explained choices driven by low-level similarity, which offers insights into how the ventral visual stream supports visual decisions, with implications for deep learning and neuroscience.

Plain English Explanation

The researchers were interested in understanding how the low-level visual properties of an image, such as its shape and texture, can influence our high-level perception of what the image depicts, even when the high-level properties (like whether it's a face or a hat) are the same.

To explore this, they built "image triplets." Each triplet contains a root image and two candidate images, and the candidates are chosen so that their low-level similarity and high-level similarity to the root vary independently. For example, one candidate might share the root's category (both are faces) but look quite different, while the other belongs to a different category (a hat) yet shares low-level features with the root.

People were then shown each triplet and asked which candidate looked most similar to the root. Comparing their choices with the similarities computed from two deep learning models let the researchers separate the influence of low-level and high-level visual properties. This provides insights into how the ventral visual stream, the part of the brain that processes visual information and supports decisions about what we see, handles and integrates different types of visual information.
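
One way to picture the decoupling step is as a sampling problem. The sketch below is a minimal illustration in Python, not the authors' algorithm. It assumes each candidate image already has a low-level and a high-level similarity score to a reference image in the range 0 to 1 (simulated here with random numbers), and it picks items evenly from a grid of (low, high) similarity bins so that the two scores are less correlated in the selected set. The function names `pearson` and `balanced_subset`, the bin count, and the simulated scores are all hypothetical.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def balanced_subset(low_sim, high_sim, n_bins=4, per_cell=20, seed=0):
    """Sample up to `per_cell` items from every (low, high) similarity bin.

    Hypothetical helper: assumes similarity scores lie in [0, 1].
    Returns the sorted indices of the selected items.
    """
    rng = np.random.default_rng(seed)
    low_sim = np.asarray(low_sim, dtype=float)
    high_sim = np.asarray(high_sim, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]  # interior bin edges
    low_bin = np.digitize(low_sim, edges)
    high_bin = np.digitize(high_sim, edges)
    chosen = []
    for i in range(n_bins):
        for j in range(n_bins):
            idx = np.flatnonzero((low_bin == i) & (high_bin == j))
            if idx.size:
                k = min(per_cell, idx.size)
                chosen.extend(rng.choice(idx, size=k, replace=False).tolist())
    return np.array(sorted(chosen), dtype=int)

# Simulated scores in which high-level similarity tracks low-level similarity,
# mimicking the correlation found in ordinary stimuli (assumed data).
rng = np.random.default_rng(1)
low = rng.uniform(0.0, 1.0, size=2000)
high = np.clip(low + rng.normal(0.0, 0.25, size=2000), 0.0, 1.0)

keep = balanced_subset(low, high)
r_before = pearson(low, high)
r_after = pearson(low[keep], high[keep])
print(f"correlation before: {r_before:.2f}, after: {r_after:.2f}")
```

Sampling evenly across the grid keeps combinations that are rare in ordinary image sets, such as high low-level similarity paired with low high-level similarity, which is what allows the two influences to be told apart.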

The findings have implications for understanding visual decision-making in both artificial and biological systems, and could contribute to the development of more robust and interpretable deep learning models.

Technical Explanation

The researchers built a stimulus set of "image triplets" using two Convolutional Neural Networks (CNNs) as candidate models of the ventral visual stream: CORnet-S, which has high neural predictivity for high-level, IT-like responses, and VGG-16, which has high neural predictivity for low-level responses. Each triplet consisted of three images:

A "root" image, which serves as the reference.
"image1", a first candidate image compared against the root.
"image2", a second candidate image compared against the root.

Triplets were parametrized by the low-level and high-level similarity of the images, computed from features extracted at different network layers, so that the two kinds of similarity were de-correlated across the stimulus set. Participants then performed a decision-making task in which they chose the candidate most similar to the root, which allowed the researchers to assess how low-level and high-level similarity each influenced choices.
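
To make the idea of scoring a triplet at two feature levels concrete, the following sketch compares one triplet using toy feature vectors. It is an illustration under stated assumptions rather than the paper's implementation: the three images are labeled `root`, `image1`, and `image2`, following the naming in the paper's abstract; the vectors are made-up values standing in for activations from an early and a late network layer; cosine similarity is an assumed choice of similarity measure; and the helper names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_similarities(root, image1, image2):
    """Return (root-image1 similarity, root-image2 similarity) for one layer."""
    return cosine_similarity(root, image1), cosine_similarity(root, image2)

# Toy features (assumed): one dictionary per feature level.
low = {"root": [1.0, 0.0, 0.0], "image1": [0.9, 0.1, 0.0], "image2": [0.0, 1.0, 0.0]}
high = {"root": [0.0, 1.0, 1.0], "image1": [1.0, 0.0, 0.0], "image2": [0.0, 1.0, 0.9]}

low_sims = triplet_similarities(low["root"], low["image1"], low["image2"])
high_sims = triplet_similarities(high["root"], high["image1"], high["image2"])

# Which candidate does each feature level favor?
low_choice = "image1" if low_sims[0] > low_sims[1] else "image2"
high_choice = "image1" if high_sims[0] > high_sims[1] else "image2"

# The triplet separates the two levels when they favor different candidates.
decoupled = low_choice != high_choice
print(low_choice, high_choice, decoupled)
```

In this toy triplet the low-level features favor `image1` and the high-level features favor `image2`, so a participant's choice indicates which level of similarity drove the decision.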

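The results showed that the two networks differ in how well they predict the effects of low-level versus high-level similarity: CORnet-S outperformed VGG-16 in explaining human choices based on high-level similarity, while VGG-16 outperformed CORnet-S in explaining human choices based on low-level similarity. Using Brain-Score, the researchers also observed that the behavioral prediction ability of different network layers qualitatively corresponded to how well those layers explain neural activity at different levels of the visual hierarchy. Together, this suggests that both low-level and high-level visual information shape decisions about what we see.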
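
How well a model "explains" human choices can be illustrated with a simple agreement score. The sketch below assumes that a model predicts whichever candidate has the higher similarity to the root, and it measures the fraction of trials on which that prediction matches the participant's choice. This is an assumed metric chosen for illustration; the source does not specify the exact analysis, the function names are hypothetical, and all numbers are invented.

```python
import numpy as np

def predicted_choices(sim_to_image1, sim_to_image2):
    """Predict 0 (image1) or 1 (image2): the candidate more similar to the root."""
    s1 = np.asarray(sim_to_image1, dtype=float)
    s2 = np.asarray(sim_to_image2, dtype=float)
    return (s2 > s1).astype(int)

def choice_agreement(model_choices, human_choices):
    """Fraction of trials where the model picks the same image as the participant."""
    model_choices = np.asarray(model_choices)
    human_choices = np.asarray(human_choices)
    return float(np.mean(model_choices == human_choices))

# Invented data for five trials: 0 means image1 was chosen, 1 means image2.
human = np.array([0, 1, 1, 0, 1])

# Root-to-candidate similarities from two hypothetical models.
model_a_s1 = [0.8, 0.2, 0.4, 0.9, 0.3]
model_a_s2 = [0.1, 0.7, 0.6, 0.2, 0.5]
model_b_s1 = [0.8, 0.6, 0.4, 0.1, 0.7]
model_b_s2 = [0.1, 0.3, 0.6, 0.5, 0.2]

agreement_a = choice_agreement(predicted_choices(model_a_s1, model_a_s2), human)
agreement_b = choice_agreement(predicted_choices(model_b_s1, model_b_s2), human)
print(agreement_a, agreement_b)
```

Computing such a score separately for trials that vary mainly in low-level similarity and trials that vary mainly in high-level similarity would be one way to compare networks at each level.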

Critical Analysis

The researchers acknowledge several limitations of their study. First, the dataset of image triplets was relatively small, which may limit the generalizability of the findings. Additionally, the study only explored a single task, a two-alternative similarity judgment against a root image, and it's unclear whether the results would extend to more complex visual decision problems.

Furthermore, the study considered only two network architectures, CORnet-S and VGG-16, and other architectures or training methods may reveal additional insights into the interplay between low-level and high-level visual processing.

Despite these limitations, the study provides an important proof-of-concept for the use of image triplets to decouple low-level and high-level visual properties, and the findings have significant implications for our understanding of how the ventral visual stream processes and integrates visual information. Additional research in this area could lead to more robust and interpretable deep learning models and better insights into the neural mechanisms underlying visual decision-making.

Conclusion

This paper presents a method for studying the interplay between low-level and high-level visual properties, combining image triplets, similarity measures derived from CNNs, and human similarity judgments. The findings suggest that both kinds of similarity shape choices and that different networks capture them to different degrees: CORnet-S better explains choices based on high-level similarity, and VGG-16 better explains choices based on low-level similarity.

These insights bear on how the ventral visual stream processes and integrates visual information. They may also inform the development of more robust and interpretable deep learning models and better models of visual decision-making in both artificial and biological systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

What makes a face looks like a hat: Decoupling low-level and high-level Visual Properties with Image Triplets

Maytus Piriyajitakonkij, Sirawaj Itthipuripat, Ian Ballard, Ioannis Pappas

In visual decision making, high-level features, such as object categories, have a strong influence on choice. However, the impact of low-level features on behavior is less understood partly due to the high correlation between high- and low-level features in the stimuli presented (e.g., objects of the same category are more likely to share low-level features). To disentangle these effects, we propose a method that de-correlates low- and high-level visual properties in a novel set of stimuli. Our method uses two Convolutional Neural Networks (CNNs) as candidate models of the ventral visual stream: the CORnet-S that has high neural predictivity in high-level, IT-like responses and the VGG-16 that has high neural predictivity in low-level responses. Triplets (root, image1, image2) of stimuli are parametrized by the level of low- and high-level similarity of images extracted from the different layers. These stimuli are then used in a decision-making task where participants are tasked to choose the most similar-to-the-root image. We found that different networks show differing abilities to predict the effects of low-versus-high-level similarity: while CORnet-S outperforms VGG-16 in explaining human choices based on high-level similarity, VGG-16 outperforms CORnet-S in explaining human choices based on low-level similarity. Using Brain-Score, we observed that the behavioral prediction abilities of different layers of these networks qualitatively corresponded to their ability to explain neural activity at different levels of the visual hierarchy. In summary, our algorithm for stimulus set generation enables the study of how different representations in the visual stream affect high-level cognitive behaviors.

9/14/2024

Parallel Backpropagation for Shared-Feature Visualization

Alexander Lappe, Anna Bognár, Ghazaleh Ghamkhari Nejad, Albert Mukovskiy, Lucas Martini, Martin A. Giese, Rufin Vogels

High-level visual brain regions contain subareas in which neurons appear to respond more strongly to examples of a particular semantic category, like faces or bodies, rather than objects. However, recent work has shown that while this finding holds on average, some out-of-category stimuli also activate neurons in these regions. This may be due to visual features common among the preferred class also being present in other images. Here, we propose a deep-learning-based approach for visualizing these features. For each neuron, we identify relevant visual features driving its selectivity by modelling responses to images based on latent activations of a deep neural network. Given an out-of-category image which strongly activates the neuron, our method first identifies a reference image from the preferred category yielding a similar feature activation pattern. We then backpropagate latent activations of both images to the pixel level, while enhancing the identified shared dimensions and attenuating non-shared features. The procedure highlights image regions containing shared features driving responses of the model neuron. We apply the algorithm to novel recordings from body-selective regions in macaque IT cortex in order to understand why some images of objects excite these neurons. Visualizations reveal object parts which resemble parts of a macaque body, shedding light on neural preference of these objects.

5/17/2024

Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition

Fernando Alonso-Fernandez, Kevin Hernandez-Diaz, Prayag Tiwari, Josef Bigun

We apply pre-trained architectures, originally developed for the ImageNet Large Scale Visual Recognition Challenge, for periocular recognition. These architectures have demonstrated significant success in various computer vision tasks beyond the ones for which they were designed. This work builds on our previous study using off-the-shelf Convolutional Neural Network (CNN) and extends it to include the more recently proposed Vision Transformers (ViT). Despite being trained for generic object classification, middle-layer features from CNNs and ViTs are a suitable way to recognize individuals based on periocular images. We also demonstrate that CNNs and ViTs are highly complementary since their combination results in boosted accuracy. In addition, we show that a small portion of these pre-trained models can achieve good accuracy, resulting in thinner models with fewer parameters, suitable for resource-limited environments such as mobiles. This efficiency improves if traditional handcrafted features are added as well.

7/30/2024

Layerwise complexity-matched learning yields an improved model of cortical area V2

Nikhil Parthasarathy, Olivier J. Hénaff, Eero P. Simoncelli

Human ability to recognize complex visual patterns arises through transformations performed by successive areas in the ventral visual cortex. Deep neural networks trained end-to-end for object recognition approach human capabilities, and offer the best descriptions to date of neural responses in the late stages of the hierarchy. But these networks provide a poor account of the early stages, compared to traditional hand-engineered models, or models optimized for coding efficiency or prediction. Moreover, the gradient backpropagation used in end-to-end learning is generally considered to be biologically implausible. Here, we overcome both of these limitations by developing a bottom-up self-supervised training methodology that operates independently on successive layers. Specifically, we maximize feature similarity between pairs of locally-deformed natural image patches, while decorrelating features across patches sampled from other images. Crucially, the deformation amplitudes are adjusted proportionally to receptive field sizes in each layer, thus matching the task complexity to the capacity at each stage of processing. In comparison with architecture-matched versions of previous models, we demonstrate that our layerwise complexity-matched learning (LCL) formulation produces a two-stage model (LCL-V2) that is better aligned with selectivity properties and neural activity in primate area V2. We demonstrate that the complexity-matched learning paradigm is responsible for much of the emergence of the improved biological alignment. Finally, when the two-stage model is used as a fixed front-end for a deep network trained to perform object recognition, the resultant model (LCL-V2Net) is significantly better than standard end-to-end self-supervised, supervised, and adversarially-trained models in terms of generalization to out-of-distribution tasks and alignment with human behavior.

7/22/2024