Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods

2212.06872

Published 6/26/2024 by Mingqi Jiang, Saeed Khorram, Li Fuxin

Abstract

In order to gain insights about the decision-making of different visual recognition backbones, we propose two methodologies, sub-explanation counting and cross-testing, that systematically apply deep explanation algorithms on a dataset-wide basis and compare the statistics generated from the amount and nature of the explanations. These methodologies reveal the differences among networks in terms of two properties called compositionality and disjunctivism. Transformers and ConvNeXt are found to be more compositional, in the sense that they jointly consider multiple parts of the image in building their decisions, whereas traditional CNNs and distilled transformers are less compositional and more disjunctive, which means that they use multiple diverse but smaller sets of parts to achieve a confident prediction. Through further experiments, we pinpointed the choice of normalization to be especially important in the compositionality of a model, in that batch normalization leads to less compositionality while group and layer normalization lead to more. Finally, we also analyze the features shared by different backbones and plot a landscape of different models based on their feature-use similarity.

Overview

The paper proposes two methodologies, sub-explanation counting and cross-testing, to systematically analyze the decision-making of different visual recognition models.
These methodologies reveal differences in the way models like Transformers, ConvNeXt, traditional CNNs, and distilled Transformers make decisions.
The key findings include:
- Transformers and ConvNeXt are more compositional, considering multiple parts of the image jointly.
- Traditional CNNs and distilled Transformers are less compositional and more disjunctive, using diverse smaller sets of parts.
- The choice of normalization technique (batch, group, or layer) significantly impacts a model's compositionality.
The paper also analyzes the feature similarities between different model backbones.

Plain English Explanation

The researchers wanted to understand how different computer vision models make decisions when analyzing images. They developed two new methods to systematically examine the inner workings of these models.

One method, called "sub-explanation counting," looks at the specific image regions that a model focuses on to make its prediction. The other method, "cross-testing," compares the explanations generated by different models for the same images.

Using these techniques, the researchers found some interesting differences between model types. Transformer and ConvNeXt models tend to consider multiple parts of the image together when making a decision. This is called being "compositional." In contrast, traditional convolutional neural networks (CNNs) and distilled Transformers are more "disjunctive": they use diverse but smaller sets of image parts to reach their conclusions.

The researchers also discovered that the choice of "normalization" technique used in the model architecture (batch, group, or layer normalization) has a big impact on how compositional the model is. Batch normalization leads to less compositionality, while group and layer normalization result in more.

Finally, the paper analyzes which features different model backbones share and which they do not, and uses this to map out a "landscape" of how the models relate to each other.

Technical Explanation

The paper introduces two new methodologies to gain insights into the decision-making of various visual recognition models:

Sub-Explanation Counting: This involves applying deep explanation algorithms across a dataset and counting the sub-explanations (sets of image regions) that the model uses to make its predictions. These counts provide a measure of the model's compositionality.
Cross-Testing: This compares the sub-explanations generated by different models on the same set of images. It reveals the degree of overlap or disjointness in the parts of the image each model focuses on.
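The counting idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the region indices, the two stand-in scoring functions, and the confidence threshold are hypothetical, and the summary does not state the explanation algorithm or thresholds the paper actually uses. The sketch only shows how counting confident subsets of an explanation separates a scorer that needs its parts jointly from one that is satisfied by any single part.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable

# Hypothetical explanation: four image regions, identified by index.
ALL_PARTS: FrozenSet[int] = frozenset({0, 1, 2, 3})

Score = Callable[[FrozenSet[int]], float]


def count_sub_explanations(parts: Iterable[int], score: Score, threshold: float) -> int:
    """Count the non-empty subsets of `parts` whose score reaches `threshold`."""
    unique_parts = sorted(set(parts))
    count = 0
    for size in range(1, len(unique_parts) + 1):
        for subset in combinations(unique_parts, size):
            if score(frozenset(subset)) >= threshold:
                count += 1
    return count


def joint_score(visible: FrozenSet[int]) -> float:
    """Stand-in for a model that needs its parts together:
    confidence is the fraction of ALL_PARTS that is visible."""
    return len(visible & ALL_PARTS) / len(ALL_PARTS)


def any_part_score(visible: FrozenSet[int]) -> float:
    """Stand-in for a model that is fully confident as soon as
    any one of ALL_PARTS is visible."""
    return 1.0 if visible & ALL_PARTS else 0.0


if __name__ == "__main__":
    for name, scorer in [("joint", joint_score), ("any-part", any_part_score)]:
        n = count_sub_explanations(ALL_PARTS, scorer, threshold=0.9)
        print(f"{name}: {n} confident subsets out of 15")
    # joint: 1 confident subsets out of 15
    # any-part: 15 confident subsets out of 15
```

Exhaustive enumeration is only feasible for a handful of regions; with more regions the number of subsets grows exponentially, so a real analysis would need some form of search or sampling.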

Using these methodologies, the researchers found that Transformers and ConvNeXt models tend to be more compositional, considering multiple image parts jointly. Conversely, traditional CNNs and distilled Transformers are more disjunctive, relying on smaller, more diverse sets of image parts.

Further experiments showed that the choice of normalization technique is a key factor influencing compositionality. Batch normalization leads to less compositional decision-making, while group and layer normalization result in more compositional models.
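For readers unfamiliar with the three normalization schemes, the sketch below shows the one structural difference between them that is certain: which values are pooled to compute the mean and variance. It is a simplified illustration, not the paper's experimental setup. Activations are assumed to have shape `[batch][channel]` with spatial dimensions omitted, batch normalization is shown in training mode without running statistics, and the learned scale and shift are left out.

```python
from statistics import fmean, pstdev
from typing import List

EPS = 1e-5  # small constant for numerical stability

Matrix = List[List[float]]  # activations indexed as [sample][channel]


def _normalize(values: List[float]) -> List[float]:
    """Shift and scale a group of values to zero mean and (almost) unit variance."""
    mean = fmean(values)
    std = (pstdev(values) ** 2 + EPS) ** 0.5
    return [(v - mean) / std for v in values]


def batch_norm(batch: Matrix) -> Matrix:
    """Statistics are pooled per channel, across all samples in the batch."""
    n_channels = len(batch[0])
    columns = [_normalize([row[c] for row in batch]) for c in range(n_channels)]
    return [[columns[c][n] for c in range(n_channels)] for n in range(len(batch))]


def layer_norm(batch: Matrix) -> Matrix:
    """Statistics are pooled per sample, across all of its channels."""
    return [_normalize(row) for row in batch]


def group_norm(batch: Matrix, num_groups: int) -> Matrix:
    """Statistics are pooled per sample, within each group of channels."""
    n_channels = len(batch[0])
    if n_channels % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    size = n_channels // num_groups
    out = []
    for row in batch:
        normed: List[float] = []
        for g in range(num_groups):
            normed.extend(_normalize(row[g * size:(g + 1) * size]))
        out.append(normed)
    return out


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0]
    batch_a = [sample, [0.0, 0.0, 0.0, 0.0]]
    batch_b = [sample, [10.0, 10.0, 10.0, 10.0]]
    # Layer and group norm give `sample` the same output in both batches;
    # batch norm does not, because its statistics include the other sample.
    print(layer_norm(batch_a)[0] == layer_norm(batch_b)[0])
    print(group_norm(batch_a, 2)[0] == group_norm(batch_b, 2)[0])
    print(batch_norm(batch_a)[0] == batch_norm(batch_b)[0])
```

The sketch does not explain why the choice affects compositionality; it only makes concrete what differs between the schemes being compared.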

The paper also analyzes the shared features across different model backbones and visualizes a "landscape" of their feature-use similarities.
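One way such a similarity landscape could be assembled is sketched below. This is an assumed construction, not the paper's procedure: each toy "model" is reduced to the set of image regions it relies on, a cross-test reveals only the regions one model's explanation highlights and asks how confident another model remains, and the two directions are averaged into a symmetric similarity. The model names, region sets, and averaging rule are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet

Regions = FrozenSet[int]

# Hypothetical models, each described by the image regions it relies on.
MODELS: Dict[str, Regions] = {
    "model_a": frozenset({0, 1, 2, 3}),
    "model_b": frozenset({0, 1}),
    "model_c": frozenset({6, 7}),
}


def make_scorer(relied_on: Regions) -> Callable[[Regions], float]:
    """Toy confidence: fraction of the model's relied-on regions that are visible."""
    def score(visible: Regions) -> float:
        return len(visible & relied_on) / len(relied_on)
    return score


def cross_test_matrix(models: Dict[str, Regions]) -> Dict[str, Dict[str, float]]:
    """cross[src][dst]: confidence of `dst` when only the regions highlighted
    by `src`'s explanation are visible. In this toy, a model's explanation
    is simply its own relied-on region set."""
    scorers = {name: make_scorer(regions) for name, regions in models.items()}
    return {
        src: {dst: scorers[dst](models[src]) for dst in models}
        for src in models
    }


def feature_use_similarity(cross: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Average both cross-test directions to get a symmetric similarity."""
    names = list(cross)
    return {a: {b: (cross[a][b] + cross[b][a]) / 2 for b in names} for a in names}


if __name__ == "__main__":
    cross = cross_test_matrix(MODELS)
    sim = feature_use_similarity(cross)
    print(cross["model_a"]["model_b"])  # 1.0: model_a's regions cover all of model_b's
    print(cross["model_b"]["model_a"])  # 0.5: model_b's regions cover half of model_a's
    print(sim["model_a"]["model_b"])    # 0.75
    print(sim["model_a"]["model_c"])    # 0.0
```

Note that the cross-test itself is asymmetric in this toy: a model that relies on few regions is easily satisfied by another model's explanation, while the reverse need not hold. A 2D "landscape" plot would then require an additional embedding step over the resulting similarity matrix, which is not shown here.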

Critical Analysis

The paper provides a novel and insightful analysis of the decision-making processes underlying different visual recognition models. The proposed methodologies offer a systematic way to unpack model behavior and reveal key architectural differences.

However, the paper does not address potential limitations of the sub-explanation counting and cross-testing approaches. For example, the choice of explanation algorithm and the granularity at which sub-explanations are defined could impact the findings. Additionally, the dataset used for the analysis may not be fully representative of real-world visual recognition challenges.

Further research is needed to understand how these insights around compositionality and disjunctivism translate to model performance and robustness in practical applications. The paper's conclusions could also be strengthened by exploring the connection between this explanation-based analysis and other explainability techniques.

Conclusion

This paper presents a novel approach to analyzing the decision-making processes of different visual recognition models. By systematically applying deep explanation algorithms and comparing the generated sub-explanations, the researchers were able to uncover key differences in the way Transformers, ConvNeXt, CNNs, and distilled Transformers make predictions.

The findings on compositionality and disjunctivism provide valuable insights into the inner workings of these models and suggest that architectural choices, such as the normalization technique, can significantly impact how a model arrives at its decisions. This understanding could inform the design of more transparent and interpretable computer vision systems in the future.

Overall, this paper contributes an important step towards demystifying the black box of deep learning and opens up new avenues for further research in this direction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CNN-based explanation ensembling for dataset, representation and explanations evaluation

Weronika Hryniewska-Guzik, Luca Longo, Przemysław Biecek

Explainable Artificial Intelligence has gained significant attention due to the widespread use of complex deep learning models in high-stake domains such as medicine, finance, and autonomous cars. However, different explanations often present different aspects of the model's behavior. In this research manuscript, we explore the potential of ensembling explanations generated by deep classification models using a convolutional model. Through experimentation and analysis, we aim to investigate the implications of combining explanations to uncover more coherent and reliable patterns of the model's behavior, leading to the possibility of evaluating the representation learned by the model. With our method, we can uncover problems of under-representation of images in a certain class. Moreover, we discuss other side benefits like features' reduction by replacing the original image with its explanations resulting in the removal of some sensitive information. Through the use of carefully selected evaluation metrics from the Quantus library, we demonstrated the method's superior performance in terms of Localisation and Faithfulness, compared to individual explanations.

4/17/2024

cs.AI cs.CV

Attention as a Hypernetwork

Simon Schug, Seijin Kobayashi, Yassir Akram, João Sacramento, Razvan Pascanu

Transformers can under some circumstances generalize to novel problem instances whose constituent parts might have been encountered during training but whose compositions have not. What mechanisms underlie this ability for compositional generalization? By reformulating multi-head attention as a hypernetwork, we reveal that a low-dimensional latent code specifies key-query specific operations. We find empirically that this latent code is highly structured, capturing information about the subtasks performed by the network. Using the framework of attention as a hypernetwork we further propose a simple modification of multi-head linear attention that strengthens the ability for compositional generalization on a range of abstract reasoning tasks. In particular, we introduce a symbolic version of the Raven Progressive Matrices human intelligence test on which we demonstrate how scaling model size and data enables compositional generalization and gives rise to a functionally structured latent code in the transformer.

6/24/2024

cs.LG

Convolutional Neural Networks and Vision Transformers for Fashion MNIST Classification: A Literature Review

Sonia Bbouzidi, Ghazala Hcini, Imen Jdey, Fadoua Drira

Our review explores the comparative analysis between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in the domain of image classification, with a particular focus on clothing classification within the e-commerce sector. Utilizing the Fashion MNIST dataset, we delve into the unique attributes of CNNs and ViTs. While CNNs have long been the cornerstone of image classification, ViTs introduce an innovative self-attention mechanism enabling nuanced weighting of different input data components. Historically, transformers have primarily been associated with Natural Language Processing (NLP) tasks. Through a comprehensive examination of existing literature, our aim is to unveil the distinctions between ViTs and CNNs in the context of image classification. Our analysis meticulously scrutinizes state-of-the-art methodologies employing both architectures, striving to identify the factors influencing their performance. These factors encompass dataset characteristics, image dimensions, the number of target classes, hardware infrastructure, and the specific architectures along with their respective top results. Our key goal is to determine the most appropriate architecture between ViT and CNN for classifying images in the Fashion MNIST dataset within the e-commerce industry, while taking into account specific conditions and needs. We highlight the importance of combining these two architectures with different forms to enhance overall performance. By uniting these architectures, we can take advantage of their unique strengths, which may lead to more precise and reliable models for e-commerce applications. CNNs are skilled at recognizing local patterns, while ViTs are effective at grasping overall context, making their combination a promising strategy for boosting image classification performance.

6/6/2024

cs.CV cs.LG

Less is More: Discovering Concise Network Explanations

Neehar Kondapaneni, Markus Marks, Oisin MacAodha, Pietro Perona

We introduce Discovering Conceptual Network Explanations (DCNE), a new approach for generating human-comprehensible visual explanations to enhance the interpretability of deep neural image classifiers. Our method automatically finds visual explanations that are critical for discriminating between classes. This is achieved by simultaneously optimizing three criteria: the explanations should be few, diverse, and human-interpretable. Our approach builds on the recently introduced Concept Relevance Propagation (CRP) explainability method. While CRP is effective at describing individual neuronal activations, it generates too many concepts, which impacts human comprehension. Instead, DCNE selects the few most important explanations. We introduce a new evaluation dataset centered on the challenging task of classifying birds, enabling us to compare the alignment of DCNE's explanations to those of human expert-defined ones. Compared to existing eXplainable Artificial Intelligence (XAI) methods, DCNE has a desirable trade-off between conciseness and completeness when summarizing network explanations. It produces 1/30 of CRP's explanations while only resulting in a slight reduction in explanation quality. DCNE represents a step forward in making neural network decisions accessible and interpretable to humans, providing a valuable tool for both researchers and practitioners in XAI and model alignment.

6/17/2024

cs.CV