AnomalyDINO: Boosting Patch-based Few-shot Anomaly Detection with DINOv2

2405.14529

Published 5/24/2024 by Simon Damm, Mike Laszkiewicz, Johannes Lederer, Asja Fischer

Abstract

Recent advances in multimodal foundation models have set new standards in few-shot anomaly detection. This paper explores whether high-quality visual features alone are sufficient to rival existing state-of-the-art vision-language models. We affirm this by adapting DINOv2 for one-shot and few-shot anomaly detection, with a focus on industrial applications. We show that this approach does not only rival existing techniques but can even outmatch them in many settings. Our proposed vision-only approach, AnomalyDINO, is based on patch similarities and enables both image-level anomaly prediction and pixel-level anomaly segmentation. The approach is methodologically simple and training-free and, thus, does not require any additional data for fine-tuning or meta-learning. Despite its simplicity, AnomalyDINO achieves state-of-the-art results in one- and few-shot anomaly detection (e.g., pushing the one-shot performance on MVTec-AD from an AUROC of 93.1% to 96.6%). The reduced overhead, coupled with its outstanding few-shot performance, makes AnomalyDINO a strong candidate for fast deployment, for example, in industrial contexts.

Overview

Recent advancements in multimodal foundation models have set new standards for few-shot anomaly detection.
This paper explores whether high-quality visual features alone can rival existing state-of-the-art vision-language models.
The researchers affirm this by adapting DINOv2 for one-shot and few-shot anomaly detection, focusing on industrial applications.
Their proposed vision-only approach, AnomalyDINO, is based on patch similarities and enables both image-level anomaly prediction and pixel-level anomaly segmentation.
AnomalyDINO is methodologically simple, training-free, and does not require additional data for fine-tuning or meta-learning.

Plain English Explanation

The paper looks at whether using just visual information, without any additional language data, can match or even outperform existing methods for detecting anomalies in images. The researchers take a popular computer vision model called DINOv2 and adapt it for the task of one-shot and few-shot anomaly detection, which means the model can detect anomalies after seeing just one or a few examples.

Their approach, called AnomalyDINO, works by comparing small patches of a test image with patches taken from the few normal reference images. It can not only predict whether an entire image contains an anomaly, but also pinpoint where that anomaly is located. Importantly, AnomalyDINO doesn't require any additional training or fine-tuning, so it can be used right away without needing more data. This makes it a strong candidate for fast deployment, especially in industrial settings like factory quality control.

Technical Explanation

The researchers adapted the DINOv2 model, which was originally designed for self-supervised visual representation learning, to perform one-shot and few-shot anomaly detection. DINOv2 learns high-quality visual features without any explicit supervision.

AnomalyDINO, the proposed approach, uses these visual features to detect and segment anomalies. It compares the patch features of a test image with patch features extracted from the few available normal reference images, and treats patches that have no close match among the references as anomalous. Because nothing is trained, the "normal" appearance is defined entirely by these reference patches. This allows AnomalyDINO to not only classify an entire image as anomalous or not, but also pinpoint the specific locations of any anomalies.
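The patch comparison described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: random vectors stand in for DINOv2 patch embeddings, all function names are hypothetical, and the image-level aggregation (mean of the highest 1% of patch distances) is an assumption made for the example rather than something this summary specifies.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each feature vector to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def build_memory_bank(reference_patch_features):
    """Stack patch features from k normal reference images into one (N, D) bank.
    No training happens here; few-shot just means a larger bank."""
    return l2_normalize(np.concatenate(reference_patch_features, axis=0))

def patch_anomaly_scores(test_patches, memory_bank):
    """Cosine distance from each test patch to its nearest reference patch."""
    sims = l2_normalize(test_patches) @ memory_bank.T  # shape (P, N)
    return 1.0 - sims.max(axis=1)                      # shape (P,)

def image_anomaly_score(patch_scores, top_fraction=0.01):
    """Assumed aggregation: mean of the highest-scoring fraction of patches."""
    k = max(1, int(round(top_fraction * patch_scores.size)))
    return float(np.sort(patch_scores)[-k:].mean())

def anomaly_map(patch_scores, grid_hw):
    """Arrange per-patch scores on the patch grid (a coarse segmentation map)."""
    return patch_scores.reshape(grid_hw)

# Synthetic stand-ins for patch embeddings (assumption: 16x16 patches, 32 dims).
rng = np.random.default_rng(0)
grid_hw, dim = (16, 16), 32
reference = rng.normal(size=(grid_hw[0] * grid_hw[1], dim))
bank = build_memory_bank([reference])  # one-shot: a single reference image

# A normal test image is the reference plus small noise.
normal_test = reference + 0.01 * rng.normal(size=reference.shape)

# A defective test image has a 2x2 block of patches with unrelated features.
defect_test = normal_test.copy().reshape(*grid_hw, dim)
defect_test[5:7, 8:10] = rng.normal(size=(2, 2, dim))
defect_test = defect_test.reshape(-1, dim)

normal_score = image_anomaly_score(patch_anomaly_scores(normal_test, bank))
defect_scores = patch_anomaly_scores(defect_test, bank)
defect_score = image_anomaly_score(defect_scores)
defect_map = anomaly_map(defect_scores, grid_hw)

print(f"normal image score: {normal_score:.4f}")
print(f"defect image score: {defect_score:.4f}")
print("most anomalous patch (row, col):",
      np.unravel_index(defect_map.argmax(), defect_map.shape))
```

In a real pipeline the per-patch map would typically be upsampled to the input resolution to obtain a pixel-level segmentation, and the random arrays would be replaced by patch embeddings from a pretrained backbone.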

Importantly, AnomalyDINO requires no additional training or fine-tuning. It can be directly applied to new tasks and datasets without the need for any extra data or learning. This is in contrast to many existing few-shot anomaly detection and few-shot object detection techniques that rely on meta-learning or other specialized training procedures.

Critical Analysis

The paper provides a compelling demonstration of how high-quality visual features alone can rival and even outperform existing vision-language models for anomaly detection. The simplicity and training-free nature of AnomalyDINO are particularly noteworthy, as they enable fast deployment in real-world industrial settings.

However, the paper does not thoroughly explore the limitations of this approach. For example, it is unclear how AnomalyDINO would perform on more complex or subtle anomalies that may require additional contextual information beyond just visual appearance. The researchers also do not discuss potential biases or failure modes of their patch-based anomaly detection technique.

Additionally, while the results on the MVTec-AD dataset are impressive, further evaluation on a broader range of anomaly detection benchmarks would help strengthen the claims about AnomalyDINO's generalization capabilities. Prompt-based anomaly detection approaches could also be an interesting area for comparison.

Overall, the paper makes a strong case for the effectiveness of visual-only anomaly detection methods, but further research is needed to fully understand the strengths, limitations, and real-world applicability of the AnomalyDINO approach.

Conclusion

This paper demonstrates that high-quality visual features alone, as captured by the DINOv2 model, can rival and even outperform existing state-of-the-art vision-language models for one-shot and few-shot anomaly detection. The researchers' proposed approach, AnomalyDINO, is methodologically simple, training-free, and enables both image-level anomaly prediction and pixel-level anomaly segmentation.

The reduced overhead and outstanding few-shot performance of AnomalyDINO make it a strong candidate for fast deployment in industrial contexts, where anomaly detection is crucial for quality control and process optimization. While the paper highlights the impressive capabilities of visual-only anomaly detection, further research is needed to fully understand the limitations and generalization potential of this approach across diverse anomaly detection scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection

Jia Guo, Shuai Lu, Weihang Zhang, Huiqi Li

Recent studies highlighted a practical setting of unsupervised anomaly detection (UAD) that builds a unified model for multi-class images, serving as an alternative to the conventional one-class-one-model setup. Despite various advancements addressing this challenging task, the detection performance under the multi-class setting still lags far behind state-of-the-art class-separated models. Our research aims to bridge this substantial performance gap. In this paper, we introduce a minimalistic reconstruction-based anomaly detection framework, namely Dinomaly, which leverages pure Transformer architectures without relying on complex designs, additional modules, or specialized tricks. Given this powerful framework consisting of only Attentions and MLPs, we found four simple components that are essential to multi-class anomaly detection: (1) Foundation Transformers that extract universal and discriminative features, (2) Noisy Bottleneck where pre-existing Dropouts do all the noise injection tricks, (3) Linear Attention that naturally cannot focus, and (4) Loose Reconstruction that does not force layer-to-layer and point-by-point reconstruction. Extensive experiments are conducted across three popular anomaly detection benchmarks including MVTec-AD, VisA, and the recently released Real-IAD. Our proposed Dinomaly achieves impressive image AUROC of 99.6%, 98.7%, and 89.3% on the three datasets respectively, which is not only superior to state-of-the-art multi-class UAD methods, but also surpasses the most advanced class-separated UAD records.

5/30/2024

cs.CV

A Novel Benchmark for Few-Shot Semantic Segmentation in the Era of Foundation Models

Reda Bensaid, Vincent Gripon, François Leduc-Primeau, Lukas Mauch, Ghouthi Boukli Hacene, Fabien Cardinaux

In recent years, the rapid evolution of computer vision has seen the emergence of various foundation models, each tailored to specific data types and tasks. In this study, we explore the adaptation of these models for few-shot semantic segmentation. Specifically, we conduct a comprehensive comparative analysis of four prominent foundation models: DINO V2, Segment Anything, CLIP, Masked AutoEncoders, and of a straightforward ResNet50 pre-trained on the COCO dataset. We also include 5 adaptation methods, ranging from linear probing to fine-tuning. Our findings show that DINO V2 outperforms other models by a large margin, across various datasets and adaptation methods. On the other hand, adaptation methods provide little discrepancy in the obtained results, suggesting that a simple linear probing can compete with advanced, more computationally intensive, alternatives.

4/4/2024

cs.CV

Anomaly Multi-classification in Industrial Scenarios: Transferring Few-shot Learning to a New Task

Jie Liu, Yao Wu, Xiaotong Luo, Zongze Wu

In industrial scenarios, it is crucial not only to identify anomalous items but also to classify the type of anomaly. However, research on anomaly multi-classification remains largely unexplored. This paper proposes a novel and valuable research task called anomaly multi-classification. Given the challenges in applying few-shot learning to this task, due to limited training data and unique characteristics of anomaly images, we introduce a baseline model that combines RelationNet and PatchCore. We propose a data generation method that creates pseudo classes and a corresponding proxy task, aiming to bridge the gap in transferring few-shot learning to industrial scenarios. Furthermore, we utilize contrastive learning to improve the vanilla baseline, achieving much better performance than directly fine-tuning a ResNet. Experiments conducted on MvTec AD and MvTec3D AD demonstrate that our approach shows superior performance in this novel task.

6/18/2024

cs.CV cs.AI cs.LG

Revisiting Few-Shot Object Detection with Vision-Language Models

Anish Madan, Neehar Peri, Shu Kong, Deva Ramanan

The era of vision-language models (VLMs) trained on large web-scale datasets challenges conventional formulations of open-world perception. In this work, we revisit the task of few-shot object detection (FSOD) in the context of recent foundational VLMs. First, we point out that zero-shot VLMs such as GroundingDINO significantly outperform state-of-the-art few-shot detectors (48 vs. 33 AP) on COCO. Despite their strong zero-shot performance, such foundational models may still be sub-optimal. For example, trucks on the web may be defined differently from trucks for a target application such as autonomous vehicle perception. We argue that the task of few-shot recognition can be reformulated as aligning foundation models to target concepts using a few examples. Interestingly, such examples can be multi-modal, using both text and visual cues, mimicking instructions that are often given to human annotators when defining a target concept of interest. Concretely, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external datasets and fine-tuned on multi-modal (text and visual) K-shot examples per target class. We repurpose nuImages for Foundational FSOD, benchmark several popular open-source VLMs, and provide an empirical analysis of state-of-the-art methods. Lastly, we discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 23.9 mAP!

6/17/2024

cs.CV