Attention Is Not What You Need: Revisiting Multi-Instance Learning for Whole Slide Image Classification

Read original: arXiv:2408.09449 - Published 8/20/2024 by Xin Liu, Weijia Zhang, Min-Ling Zhang

Overview

Examines the use of attention mechanisms in multi-instance learning for whole slide image classification
Finds that attention is not necessary for strong performance, challenging the prevailing view
Proposes FocusMIL, a simple instance-based method built on max-pooling and variational inference, which is reported to outperform the baselines on patch-level classification

Plain English Explanation

The paper "Attention Is Not What You Need: Revisiting Multi-Instance Learning for Whole Slide Image Classification" looks at a type of machine learning problem called whole slide image classification. In this task, the goal is to classify an entire medical slide image, such as a tissue sample, into different disease categories.

The researchers noticed that many recent approaches to this problem have relied on "attention" mechanisms - techniques that try to identify the most important parts of the image. However, the paper argues that attention may not actually be necessary to achieve strong performance on whole slide image classification.

Instead, the researchers propose a simpler model, FocusMIL, that doesn't use attention but is still reported to outperform the baselines. It scores individual patches and combines those scores with max-pooling, so the slide-level decision rests on the highest-scoring patch rather than on learned attention weights.

This finding challenges the prevailing view in the field and suggests that attention may not always be the best solution, even for complex visual tasks like whole slide image classification. The paper demonstrates that simpler models can sometimes perform just as well, or even better, than more sophisticated attention-based approaches.

Technical Explanation

The paper proposes FocusMIL, an instance-based multi-instance learning (MIL) method for whole slide image classification that does not use attention mechanisms. According to the abstract, it is derived from the standard MIL assumptions and combines max-pooling with forward amortized variational inference, which the authors argue encourages the model to focus on tumour morphology instead of spurious correlations.

The researchers compare their proposed model to several attention-based approaches, including multi-head attention and attention-based deep multiple instance learning. They evaluate the models on the Camelyon16 and TCGA-NSCLC benchmarks and report that their simpler, non-attention-based model significantly outperforms the baselines on patch-level classification.
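For context, the kind of attention-based pooling used by such baselines can be sketched as follows. This is a minimal pure-Python illustration of the commonly used formulation in which each patch feature h_k gets a weight a_k = softmax(w^T tanh(V h_k)) and the slide representation is the weighted sum of patch features. The function names, the toy patch features, and the parameters `V` and `w` are assumptions made for illustration; they are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(instances, V, w):
    """Attention pooling over a bag of patch features.

    instances: list of patch feature vectors (the bag)
    V: projection matrix (list of rows), hypothetical learned parameter
    w: scoring vector, hypothetical learned parameter
    Returns the pooled bag feature and the attention weights.
    """
    scores = []
    for h in instances:
        # Project the patch feature and squash with tanh.
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        # One scalar attention score per patch.
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))
    weights = softmax(scores)
    dim = len(instances[0])
    # Bag feature = attention-weighted average of the patch features.
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return bag, weights

# Toy bag of three 2-dimensional patch features (made-up values).
patches = [[0.1, 0.2], [2.0, 1.5], [0.0, -0.1]]
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
bag, weights = attention_pool(patches, V, w)
print(weights)
print(bag)
```

Because the weights come from a softmax, they always sum to one and tend to concentrate on the most salient patches, which is the behaviour the paper's abstract criticises.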

The key component is max-pooling over instance-level predictions, a global pooling step that aggregates information from the slide without relying on attention weights. The abstract pairs this with forward amortized variational inference, arguing that the combination steers the model toward tumour morphology rather than spurious correlations such as staining conditions. The researchers also incorporate techniques like self-supervised pretraining to further improve performance.
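The non-attention alternative can be sketched in a few lines: score every patch independently, then let the highest-scoring patch decide the slide-level prediction. This follows the standard MIL assumption that a bag (slide) is positive if at least one instance (patch) is positive. The sketch below is only an illustration under stated assumptions: the linear patch scorer, the names `instance_probs` and `max_pool_bag`, and the toy values are hypothetical, and the variational inference part of the method is not shown.

```python
import math

def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def instance_probs(instances, weights, bias):
    """Score each patch independently with a (hypothetical) linear classifier."""
    return [sigmoid(sum(w * x for w, x in zip(weights, h)) + bias)
            for h in instances]

def max_pool_bag(instances, weights, bias):
    """Max-pooling MIL: the bag probability is the largest patch probability.

    Returns the bag probability, the index of the deciding patch,
    and all patch-level probabilities.
    """
    probs = instance_probs(instances, weights, bias)
    top = max(range(len(probs)), key=probs.__getitem__)
    return probs[top], top, probs

# Toy slide with three 2-dimensional patch features (made-up values).
slide = [[-1.0, 0.0], [3.0, 1.0], [0.0, -2.0]]
clf_weights = [1.0, 1.0]
clf_bias = -1.0
bag_prob, top_idx, patch_probs = max_pool_bag(slide, clf_weights, clf_bias)
print(top_idx, round(bag_prob, 4))
```

One consequence of this design is that patch-level predictions come for free: the same classifier that produces the slide-level score also labels every patch, with no attention scores to reinterpret.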

Through extensive experimentation, the paper demonstrates that attention is not a necessary component for achieving strong results on whole slide image classification tasks. This challenges the prevailing view in the field and suggests that simpler, more efficient models may be preferable in certain scenarios.

Critical Analysis

The paper provides a valuable contribution to the field of whole slide image classification by challenging the widespread use of attention mechanisms. The researchers make a compelling case that attention may not always be necessary, and that simpler models can sometimes outperform more sophisticated attention-based approaches.

However, the paper does not fully address the limitations of its proposed model. For example, it is unclear how the model would perform on more complex or nuanced classification tasks, where attention may be more crucial for identifying the relevant features. Additionally, the paper does not explore the potential trade-offs, such as computational efficiency or interpretability, between the attention-based and non-attention-based models.

Further research is needed to fully understand the scope and limitations of the proposed approach. It would be valuable to see the model tested on a wider range of datasets and tasks, and to investigate the factors that determine when attention is or is not beneficial for whole slide image classification.

Conclusion

This paper challenges the prevailing view in the field of whole slide image classification by demonstrating that attention mechanisms are not always necessary for strong performance. The researchers propose a simple, non-attention-based model, FocusMIL, that is reported to outperform the baselines on the Camelyon16 and TCGA-NSCLC benchmarks.

The findings suggest that simpler models can be effective for complex visual tasks, and that attention may not always be the best solution. This has important implications for the design of machine learning models in the medical imaging domain, as it highlights the potential value of exploring alternative architectures beyond the attention-based approaches that have dominated the field.

Overall, the paper provides a thought-provoking perspective on the role of attention in whole slide image classification, and encourages researchers to think critically about the appropriate use of sophisticated techniques like attention in complex visual recognition tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Attention Is Not What You Need: Revisiting Multi-Instance Learning for Whole Slide Image Classification

Xin Liu, Weijia Zhang, Min-Ling Zhang

Although attention-based multi-instance learning algorithms have achieved impressive performances on slide-level whole slide image (WSI) classification tasks, they are prone to mistakenly focus on irrelevant patterns such as staining conditions and tissue morphology, leading to incorrect patch-level predictions and unreliable interpretability. Moreover, these attention-based MIL algorithms tend to focus on salient instances and struggle to recognize hard-to-classify instances. In this paper, we first demonstrate that attention-based WSI classification methods do not adhere to the standard MIL assumptions. From the standard MIL assumptions, we propose a surprisingly simple yet effective instance-based MIL method for WSI classification (FocusMIL) based on max-pooling and forward amortized variational inference. We argue that synergizing the standard MIL assumption with variational inference encourages the model to focus on tumour morphology instead of spurious correlations. Our experimental evaluations show that FocusMIL significantly outperforms the baselines in patch-level classification tasks on the Camelyon16 and TCGA-NSCLC benchmarks. Visualization results show that our method also achieves better classification boundaries for identifying hard instances and mitigates the effect of spurious correlations between bags and labels.

8/20/2024

Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification

Yunlong Zhang, Honglin Li, Yuxuan Sun, Sunyi Zheng, Chenglu Zhu, Lin Yang

In the application of Multiple Instance Learning (MIL) methods for Whole Slide Image (WSI) classification, attention mechanisms often focus on a subset of discriminative instances, which are closely linked to overfitting. To mitigate overfitting, we present Attention-Challenging MIL (ACMIL). ACMIL combines two techniques based on separate analyses for attention value concentration. Firstly, UMAP of instance features reveals various patterns among discriminative instances, with existing attention mechanisms capturing only some of them. To remedy this, we introduce Multiple Branch Attention (MBA) to capture more discriminative instances using multiple attention branches. Secondly, the examination of the cumulative value of Top-K attention scores indicates that a tiny number of instances dominate the majority of attention. In response, we present Stochastic Top-K Instance Masking (STKIM), which masks out a portion of instances with Top-K attention values and allocates their attention values to the remaining instances. The extensive experimental results on three WSI datasets with two pre-trained backbones reveal that our ACMIL outperforms state-of-the-art methods. Additionally, through heatmap visualization and UMAP visualization, this paper extensively illustrates ACMIL's effectiveness in suppressing attention value concentration and overcoming the overfitting challenge. The source code is available at https://github.com/dazhangyu123/ACMIL.

7/8/2024

Rethinking Multiple Instance Learning for Whole Slide Image Classification: A Good Instance Classifier is All You Need

Linhao Qu, Yingfan Ma, Xiaoyuan Luo, Manning Wang, Zhijian Song

Weakly supervised whole slide image classification is usually formulated as a multiple instance learning (MIL) problem, where each slide is treated as a bag, and the patches cut out of it are treated as instances. Existing methods either train an instance classifier through pseudo-labeling or aggregate instance features into a bag feature through attention mechanisms and then train a bag classifier, where the attention scores can be used for instance-level classification. However, the pseudo instance labels constructed by the former usually contain a lot of noise, and the attention scores constructed by the latter are not accurate enough, both of which affect their performance. In this paper, we propose an instance-level MIL framework based on contrastive learning and prototype learning to effectively accomplish both instance classification and bag classification tasks. To this end, we propose an instance-level weakly supervised contrastive learning algorithm for the first time under the MIL setting to effectively learn instance feature representation. We also propose an accurate pseudo label generation method through prototype learning. We then develop a joint training strategy for weakly supervised contrastive learning, prototype learning, and instance classifier training. Extensive experiments and visualizations on four datasets demonstrate the powerful performance of our method. Codes are available at https://github.com/miccaiif/INS.

5/14/2024

Distilling High Diagnostic Value Patches for Whole Slide Image Classification Using Attention Mechanism

Tianhang Nan, Hao Quan, Yong Ding, Xingyu Li, Kai Yang, Xiaoyu Cui

Multiple Instance Learning (MIL) has garnered widespread attention in the field of Whole Slide Image (WSI) classification as it replaces pixel-level manual annotation with diagnostic reports as labels, significantly reducing labor costs. Recent research has shown that bag-level MIL methods often yield better results because they can consider all patches of the WSI as a whole. However, a drawback of such methods is the incorporation of more redundant patches, leading to interference. To extract patches with high diagnostic value while excluding interfering patches to address this issue, we developed an attention-based feature distillation multi-instance learning (AFD-MIL) approach. This approach proposed the exclusion of redundant patches as a preprocessing operation in weakly supervised learning, directly mitigating interference from extensive noise. It also pioneers the use of attention mechanisms to distill features with high diagnostic value, as opposed to the traditional practice of indiscriminately and forcibly integrating all patches. Additionally, we introduced global loss optimization to finely control the feature distillation module. AFD-MIL is orthogonal to many existing MIL methods, leading to consistent performance improvements. This approach has surpassed the current state-of-the-art method, achieving 91.47% ACC (accuracy) and 94.29% AUC (area under the curve) on the Camelyon16 (Camelyon Challenge 2016, breast cancer), while 93.33% ACC and 98.17% AUC on the TCGA-NSCLC (The Cancer Genome Atlas Program: non-small cell lung cancer). Different feature distillation methods were used for the two datasets, tailored to the specific diseases, thereby improving performance and interpretability.

8/19/2024