Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection

Read original: arXiv:2312.07495 - Published 8/13/2024 by Jiangning Zhang, Xuhai Chen, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Ming-Hsuan Yang, Dacheng Tao


Overview

This paper focuses on a challenging problem called multi-class unsupervised anomaly detection (MUAD).
MUAD requires training on only normal images while testing on both normal and anomalous images across multiple classes.
Existing methods often use complex encoder-decoder architectures, while this paper explores the use of a simpler Vision Transformer (ViT) approach.

Plain English Explanation

The paper explores a problem called multi-class unsupervised anomaly detection (MUAD). In this problem, the goal is to detect when an image is "abnormal" or different from the normal images used to train the system. However, the system needs to do this across multiple classes of images, not just one.

Existing approaches to this problem often use complex neural network architectures, with specialized modules and handcrafted features. In contrast, the authors of this paper investigate using a simpler model - a Vision Transformer (ViT) - which has proven effective in other computer vision tasks.

The key idea is to see if the basic ViT features, without any additional complexity, can provide a strong baseline for multi-class anomaly detection. This could help drive future research in this area by providing a strong starting point.

Technical Explanation

The paper first abstracts a general "Meta-AD" concept by synthesizing current reconstruction-based anomaly detection methods. It then proposes a novel ViT-based ViTAD structure, designed from both global and local perspectives. This ViTAD model serves as a strong baseline for future work on MUAD.

The authors comprehensively benchmark various approaches using eight different evaluation metrics. Notably, using only a basic mean squared error (MSE) loss function, the ViTAD model achieves state-of-the-art results on several datasets, including MVTec AD, VisA, and Uni-Medical. For example, on the MVTec AD dataset, ViTAD achieves an 85.4 mAD score, surpassing the previous UniAD method by 3.0 points.
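To make the role of the MSE loss concrete, here is a minimal sketch of how a reconstruction-based detector can turn reconstruction error into anomaly scores. It is not the authors' ViTAD code: the function names, the (C, H, W) feature layout, the max-pooled image score, and the synthetic data are all assumptions made for illustration, and the paper's actual scoring procedure may differ.

```python
import numpy as np


def mse_loss(target_feats, recon_feats):
    """Mean squared error over every feature element (a typical training objective)."""
    return float(np.mean((target_feats - recon_feats) ** 2))


def anomaly_map(target_feats, recon_feats):
    """Per-location anomaly score for features shaped (C, H, W).

    The squared error is averaged over the channel axis, giving an (H, W) map
    where larger values mean the location was reconstructed poorly.
    """
    return np.mean((target_feats - recon_feats) ** 2, axis=0)


def image_score(amap):
    """Image-level score: the largest per-location error (an assumed pooling choice)."""
    return float(amap.max())


# Synthetic stand-ins for encoder features and their reconstruction (hypothetical data).
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
target = rng.normal(size=(C, H, W))
recon = target.copy()      # reconstructed perfectly everywhere...
recon[:, 1, 2] += 1.0      # ...except at one spatial location

amap = anomaly_map(target, recon)
peak = np.unravel_index(amap.argmax(), amap.shape)
print("training-style MSE:", mse_loss(target, recon))
print("most anomalous location:", peak)
print("image-level score:", image_score(amap))
```

The intuition is that a model trained only on normal images reconstructs normal regions well and anomalous regions poorly, so the error map highlights the locations that deviate from what the model has learned.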

Importantly, ViTAD is also highly efficient, requiring only 1.1 hours and 2.3G of GPU memory to complete model training on a single V100 GPU. This efficiency makes ViTAD a strong baseline that can facilitate future research in this area.

Critical Analysis

The paper provides a thoughtful and rigorous exploration of the MUAD problem, offering valuable insights and a strong baseline model. However, a few potential areas for further investigation are mentioned:

The authors note that their ViTAD model, while effective, may still have room for improvement, especially in terms of accounting for local visual cues more explicitly.
The paper does not delve deeply into the reasons behind the strong performance of the ViT-based approach. Further analysis of the model's internal workings could yield additional insights.
The paper focuses on static image datasets, so extending the ViTAD approach to video or other dynamic data sources could be an interesting future direction.

Overall, the paper presents a compelling case for the effectiveness of a simple ViT-based model in addressing the challenging MUAD problem. The work serves as a valuable contribution to the field and provides a solid foundation for future research.

Conclusion

This paper explores a challenging problem called multi-class unsupervised anomaly detection (MUAD), where the goal is to detect abnormal images across multiple classes using only normal images for training. The authors propose a novel ViT-based ViTAD model that achieves state-of-the-art results on several benchmark datasets, while being highly efficient.

The key contribution of this work is demonstrating the effectiveness of a straightforward ViT-based approach, which serves as a strong baseline to facilitate future research in this area. The paper also uncovers several interesting directions for further investigation, such as enhancing the model's ability to capture local visual cues and exploring its applicability to dynamic data sources.

Overall, this research provides valuable insights and a practical solution to the MUAD problem, paving the way for continued advancements in the field of anomaly detection.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection

Jiangning Zhang, Xuhai Chen, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Ming-Hsuan Yang, Dacheng Tao

This work studies a challenging and practical issue known as multi-class unsupervised anomaly detection (MUAD). This problem requires only normal images for training while simultaneously testing both normal and anomalous images across multiple classes. Existing reconstruction-based methods typically adopt pyramidal networks as encoders and decoders to obtain multi-resolution features, often involving complex sub-modules with extensive handcrafted engineering. In contrast, a plain Vision Transformer (ViT) showcasing a more straightforward architecture has proven effective in multiple domains, including detection and segmentation tasks. It is simpler, more effective, and elegant. Following this spirit, we explore the use of only plain ViT features for MUAD. We first abstract a Meta-AD concept by synthesizing current reconstruction-based methods. Subsequently, we instantiate a novel ViT-based ViTAD structure, designed incrementally from both global and local perspectives. This model provides a strong baseline to facilitate future research. Additionally, this paper uncovers several intriguing findings for further investigation. Finally, we comprehensively and fairly benchmark various approaches using eight metrics. Utilizing a basic training regimen with only an MSE loss, ViTAD achieves state-of-the-art results and efficiency on MVTec AD, VisA, and Uni-Medical datasets. E.g., it achieves 85.4 mAD on the MVTec AD dataset, surpassing UniAD by +3.0, and it requires only 1.1 hours and 2.3G GPU memory to complete model training on a single V100, so it can serve as a strong baseline to facilitate the development of future research. Full code is available at https://zhangzjn.github.io/projects/ViTAD/.

8/13/2024

Learning Multi-view Anomaly Detection

Haoyang He, Jiangning Zhang, Guanzhong Tian, Chengjie Wang, Lei Xie

This study explores the recently proposed challenging multi-view Anomaly Detection (AD) task. Single-view tasks would encounter blind spots from other perspectives, resulting in inaccuracies in sample-level prediction. Therefore, we introduce the Multi-View Anomaly Detection (MVAD) framework, which learns and integrates features from multi-views. Specifically, we propose a Multi-View Adaptive Selection (MVAS) algorithm for feature learning and fusion across multiple views. The feature maps are divided into neighbourhood attention windows to calculate a semantic correlation matrix between single-view windows and all other views; an attention mechanism is then conducted between each single-view window and the top-K most correlated multi-view windows. Adjusting the window sizes and top-K can minimise the computational complexity to linear. Extensive experiments on the Real-IAD dataset for cross-setting (multi/single-class) validate the effectiveness of our approach, achieving state-of-the-art performance at the sample (4.1%↑), image (5.6%↑), and pixel (6.7%↑) levels across a total of ten metrics, with only 18M parameters and less GPU memory and training time.

7/17/2024


Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection

Jia Guo, Shuai Lu, Weihang Zhang, Huiqi Li

Recent studies highlighted a practical setting of unsupervised anomaly detection (UAD) that builds a unified model for multi-class images, serving as an alternative to the conventional one-class-one-model setup. Despite various advancements addressing this challenging task, the detection performance under the multi-class setting still lags far behind state-of-the-art class-separated models. Our research aims to bridge this substantial performance gap. In this paper, we introduce a minimalistic reconstruction-based anomaly detection framework, namely Dinomaly, which leverages pure Transformer architectures without relying on complex designs, additional modules, or specialized tricks. Given this powerful framework consisting of only Attentions and MLPs, we found four simple components that are essential to multi-class anomaly detection: (1) Foundation Transformers that extract universal and discriminative features, (2) Noisy Bottleneck where pre-existing Dropouts do all the noise injection tricks, (3) Linear Attention that naturally cannot focus, and (4) Loose Reconstruction that does not force layer-to-layer and point-by-point reconstruction. Extensive experiments are conducted across three popular anomaly detection benchmarks including MVTec-AD, VisA, and the recently released Real-IAD. Our proposed Dinomaly achieves impressive image AUROC of 99.6%, 98.7%, and 89.3% on the three datasets respectively, which is not only superior to state-of-the-art multi-class UAD methods, but also surpasses the most advanced class-separated UAD records.

5/30/2024


Learning Feature Inversion for Multi-class Anomaly Detection under General-purpose COCO-AD Benchmark

Jiangning Zhang, Chengjie Wang, Xiangtai Li, Guanzhong Tian, Zhucun Xue, Yong Liu, Guansong Pang, Dacheng Tao

Anomaly detection (AD) is often focused on detecting anomaly areas for industrial quality inspection and medical lesion examination. However, due to the specific scenario targets, the data scale for AD is relatively small, and evaluation metrics are still deficient compared to classic vision tasks, such as object detection and semantic segmentation. To fill these gaps, this work first constructs a large-scale and general-purpose COCO-AD dataset by extending COCO to the AD field. This enables fair evaluation and sustainable development for different methods on this challenging benchmark. Moreover, current metrics such as AU-ROC have nearly reached saturation on simple datasets, which prevents a comprehensive evaluation of different methods. Inspired by the metrics in the segmentation field, we further propose several more practical threshold-dependent AD-specific metrics, i.e., m$F_1$$^{.2}_{.8}$, mAcc$^{.2}_{.8}$, mIoU$^{.2}_{.8}$, and mIoU-max. Motivated by GAN inversion's high-quality reconstruction capability, we propose a simple but more powerful InvAD framework to achieve high-quality feature reconstruction. Our method improves the effectiveness of reconstruction-based methods on popular MVTec AD, VisA, and our newly proposed COCO-AD datasets under a multi-class unsupervised setting, where only a single detection model is trained to detect anomalies from different classes. Extensive ablation experiments have demonstrated the effectiveness of each component of our InvAD. Full codes and models are available at https://github.com/zhangzjn/ader.

4/17/2024