FiLo: Zero-Shot Anomaly Detection by Fine-Grained Description and High-Quality Localization

Read original: arXiv:2404.13671 - Published 7/29/2024 by Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Hao Li, Ming Tang, Jinqiao Wang

Overview

This paper introduces FiLo, a zero-shot anomaly detection (ZSAD) method built from two components: adaptively learned Fine-Grained Description (FG-Des) and position-enhanced High-Quality Localization (HQ-Loc).
FiLo uses Large Language Models (LLMs) to generate fine-grained anomaly descriptions for each object category, so it can detect and localize anomalies without access to normal or abnormal samples from the target categories.
Experiments on datasets such as MVTec and VisA show that FiLo improves on existing zero-shot methods in both detection and localization, reaching an image-level AUC of 83.9% and a pixel-level AUC of 95.9% on VisA.

Plain English Explanation

FiLo is a new way to automatically find unusual or defective regions in images of an object category without having seen any examples, normal or abnormal, from that category. It does this by comparing the image against detailed text descriptions of the kinds of defects that category can show.

Typically, anomaly detection systems are trained on many normal examples of each object they inspect. Zero-shot methods avoid this by relying on pretrained vision-language models, which can score how well a piece of text matches an image or an image patch. Earlier zero-shot methods used generic text along the lines of "normal" or "abnormal", which often fails to match the many different defects that different objects can have.

The key insight is that more specific text works better. FiLo uses a large language model to produce fine-grained anomaly descriptions for each category and matches image regions against them. It also improves localization by using a detector (Grounding DINO) for a first rough location, adding position information to the text, and comparing text with image regions of several sizes and shapes instead of single patches.

The paper reports that FiLo outperforms other zero-shot anomaly detection approaches on benchmarks such as MVTec and VisA. This is a step towards more flexible anomaly detection systems that can handle new object categories and anomaly types without costly retraining.

Technical Explanation

FiLo builds on multimodal pretrained vision-language models. Existing ZSAD methods compute similarities between image features and manually crafted text features that represent normal or abnormal semantics, then use those similarities to detect anomalies and localize anomalous patches. FiLo keeps this matching framework but replaces the generic text with fine-grained anomaly descriptions, so it can identify and localize anomalies without any normal or abnormal samples from the target categories.
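
As a rough illustration of the similarity-based scoring that the paper's abstract attributes to this family of methods, the sketch below scores each image patch against a normal and an abnormal text feature and turns the two similarities into an anomaly probability. It is a minimal pure-Python sketch, not FiLo's implementation: the function names, the two-way softmax, the temperature value of 0.07, and the toy 2-D features are all assumptions made for illustration, and a real system would take its features from a pretrained vision-language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def anomaly_map(patch_feats, normal_text, abnormal_text, temperature=0.07):
    """Map an H x W grid of patch features to an H x W grid of P(abnormal).

    The temperature is an assumed value, not one taken from the paper.
    """
    scores = []
    for row in patch_feats:
        out_row = []
        for feat in row:
            s_n = cosine(feat, normal_text) / temperature
            s_a = cosine(feat, abnormal_text) / temperature
            # Two-way softmax over {normal, abnormal}, shifted for stability.
            m = max(s_n, s_a)
            e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
            out_row.append(e_a / (e_n + e_a))
        scores.append(out_row)
    return scores


def image_score(score_map):
    """Image-level score: the highest patch-level anomaly probability."""
    return max(max(row) for row in score_map)


# Toy example: 2-D stand-ins for real text and patch embeddings.
normal_text, abnormal_text = [1.0, 0.0], [0.0, 1.0]
grid = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],  # bottom-right patch resembles the abnormal text
]
score_map = anomaly_map(grid, normal_text, abnormal_text)
print(image_score(score_map))
```

Taking the maximum over patches as the image-level score is one common choice, assumed here for simplicity.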

The FiLo system has two components. The first, FG-Des, uses an LLM to generate fine-grained anomaly descriptions for each object category and combines them with adaptively learned textual templates, which improves both the accuracy and the interpretability of detection. The second, HQ-Loc, uses Grounding DINO for preliminary localization, adds position information to the text prompts, and applies a Multi-scale Multi-shape Cross-modal Interaction (MMCI) module to localize anomalies of different sizes and shapes.
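
To make the idea of fine-grained, position-aware text prompts concrete, here is a small sketch of how such prompts could be assembled. Everything specific in it is hypothetical: the anomaly lists stand in for LLM output, the template strings are fixed placeholders (the abstract describes FiLo's templates as adaptively learned, not hand-written), and the position phrases are made up for the example.

```python
from itertools import product

# Hypothetical fine-grained anomaly types per category.
# In FiLo these descriptions are generated by an LLM.
ANOMALY_TYPES = {
    "bottle": ["broken glass", "contamination"],
    "cable": ["bent wire", "missing cable", "cut insulation"],
}

# Hypothetical fixed templates standing in for learned ones.
NORMAL_TEMPLATE = "a photo of a flawless {category}"
ABNORMAL_TEMPLATE = "a photo of a {category} with {anomaly} at the {position}"


def build_prompts(category, positions=("center",)):
    """Return (normal_prompts, abnormal_prompts) for one object category."""
    normal = [NORMAL_TEMPLATE.format(category=category)]
    abnormal = [
        ABNORMAL_TEMPLATE.format(category=category, anomaly=a, position=p)
        for a, p in product(ANOMALY_TYPES[category], positions)
    ]
    return normal, abnormal


normal, abnormal = build_prompts("bottle", positions=("top left",))
for prompt in normal + abnormal:
    print(prompt)
```

Each prompt would then be encoded by the text encoder of a vision-language model and compared with image features.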

The key technical innovation on the localization side is HQ-Loc. Computing feature similarity for single patches struggles to pinpoint anomalies of varying sizes and scales. HQ-Loc addresses this in three ways: Grounding DINO provides a preliminary localization, position-enhanced text prompts add location information to the descriptions, and the MMCI module lets text features interact with image features at multiple scales and shapes.
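
The following sketch illustrates, under heavy simplification, two ideas the abstract associates with HQ-Loc: aggregating evidence over windows of several sizes and shapes instead of single patches, and using a preliminary bounding box to suppress scores elsewhere. It operates on a plain 2D score map with average pooling, whereas MMCI is described as a cross-modal interaction module, so this should be read as an analogy and not as the paper's method. The window shapes, the function names, and the box format are assumptions.

```python
def pool(score_map, kh, kw):
    """Average-pool a 2D map with a kh x kw window (stride 1, clipped at edges)."""
    h, w = len(score_map), len(score_map[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            rows = range(max(0, i - kh // 2), min(h, i + kh // 2 + 1))
            cols = range(max(0, j - kw // 2), min(w, j + kw // 2 + 1))
            vals = [score_map[r][c] for r in rows for c in cols]
            out[i][j] = sum(vals) / len(vals)
    return out


def multi_shape_fuse(score_map, shapes=((1, 1), (3, 3), (1, 5), (5, 1))):
    """Average the maps obtained with square, wide, and tall windows."""
    pooled = [pool(score_map, kh, kw) for kh, kw in shapes]
    h, w = len(score_map), len(score_map[0])
    return [
        [sum(p[i][j] for p in pooled) / len(pooled) for j in range(w)]
        for i in range(h)
    ]


def gate_with_box(score_map, box, outside_weight=0.0):
    """Scale scores outside box = (top, left, bottom, right), inclusive indices.

    The box stands in for a preliminary localization from an external detector.
    """
    top, left, bottom, right = box
    return [
        [
            s if top <= i <= bottom and left <= j <= right else s * outside_weight
            for j, s in enumerate(row)
        ]
        for i, row in enumerate(score_map)
    ]


raw = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
fused = multi_shape_fuse(raw)
gated = gate_with_box(fused, box=(1, 1, 2, 2))
```

Elongated windows such as 1 x 5 and 5 x 1 are included because thin defects like scratches are poorly covered by square windows; that motivation is an interpretation, not a claim from the source.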

The paper reports experiments comparing FiLo to prior zero-shot anomaly detection methods on datasets such as MVTec and VisA. FiLo improves both detection and localization, achieving an image-level AUC of 83.9% and a pixel-level AUC of 95.9% on VisA, which the authors describe as state-of-the-art performance.
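
The abstract reports results as image-level and pixel-level AUC. For readers unfamiliar with the metric, here is a minimal sketch of the area under the ROC curve, computed through its pairwise-ranking definition. The helper names are hypothetical, the quadratic pairwise loop is only suitable for small inputs, and benchmark evaluations would normally use an optimized library routine.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative.

    labels: iterable of 0 (normal) or 1 (anomalous); ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pixel_auroc(masks, score_maps):
    """Pixel-level AUROC: flatten all ground-truth masks and score maps first."""
    labels = [y for mask in masks for row in mask for y in row]
    scores = [s for smap in score_maps for row in smap for s in row]
    return auroc(labels, scores)


# Image-level: one label and one score per image.
image_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Pixel-level: one label and one score per pixel, pooled over all images.
masks = [[[0, 1], [0, 0]]]
maps = [[[0.2, 0.9], [0.1, 0.3]]]
pixel_auc = pixel_auroc(masks, maps)
```

Image-level AUC measures how well whole images are ranked as normal or anomalous, while pixel-level AUC measures how well individual pixels are ranked, which is why the two numbers can differ substantially on the same dataset.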

Critical Analysis

The FiLo approach advances zero-shot anomaly detection by combining LLM-generated descriptions with pretrained vision-language models to improve generalization and performance. However, several limitations and open questions remain.

One key limitation is the reliance on the underlying pretrained models: the LLM that generates anomaly descriptions, the vision-language model that matches them to image features, and Grounding DINO for preliminary localization. If the generated descriptions miss the relevant anomaly types, or the models align text and image features poorly for a given category, FiLo's performance will suffer. Continued improvements in multimodal models should therefore carry over to FiLo.

Additionally, while FiLo demonstrates strong results on the evaluated datasets, its performance may be influenced by dataset bias and the specific types of anomalies present. Further research is needed to understand how FiLo will generalize to a wider range of anomaly types and real-world deployment scenarios.

Finally, the localization pipeline adds components beyond a single vision-language model, including Grounding DINO and the MMCI module. This may increase computational cost and limit scalability, particularly for real-time applications. Exploring more efficient localization approaches could help address this limitation.

Overall, the FiLo paper represents an exciting step forward in anomaly detection, showcasing the power of vision-language models to enable flexible and high-performing systems. However, continued research will be necessary to fully realize the potential of this approach and address its current limitations.

Conclusion

The FiLo paper presents a zero-shot anomaly detection method that combines LLM-generated fine-grained anomaly descriptions (FG-Des) with position-enhanced high-quality localization (HQ-Loc). Without access to normal or abnormal samples from the target categories, FiLo outperforms existing zero-shot approaches on benchmarks such as MVTec and VisA in both detection and localization.

This work highlights the potential of multi-modal AI systems to enable more flexible and generalizable anomaly detection solutions, which could have significant impacts in areas like industrial inspection, medical imaging, and safety-critical applications. As vision-language models continue to advance, the FiLo approach may pave the way for increasingly robust and adaptable anomaly detection systems that can better keep pace with the evolving needs of real-world environments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FiLo: Zero-Shot Anomaly Detection by Fine-Grained Description and High-Quality Localization

Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Hao Li, Ming Tang, Jinqiao Wang

Zero-shot anomaly detection (ZSAD) methods entail detecting anomalies directly without access to any known normal or abnormal samples within the target item categories. Existing approaches typically rely on the robust generalization capabilities of multimodal pretrained models, computing similarities between manually crafted textual features representing normal or abnormal semantics and image features to detect anomalies and localize anomalous patches. However, the generic descriptions of "abnormal" often fail to precisely match diverse types of anomalies across different object categories. Additionally, computing feature similarities for single patches struggles to pinpoint specific locations of anomalies with various sizes and scales. To address these issues, we propose a novel ZSAD method called FiLo, comprising two components: adaptively learned Fine-Grained Description (FG-Des) and position-enhanced High-Quality Localization (HQ-Loc). FG-Des introduces fine-grained anomaly descriptions for each category using Large Language Models (LLMs) and employs adaptively learned textual templates to enhance the accuracy and interpretability of anomaly detection. HQ-Loc, utilizing Grounding DINO for preliminary localization, position-enhanced text prompts, and a Multi-scale Multi-shape Cross-modal Interaction (MMCI) module, facilitates more accurate localization of anomalies of different sizes and shapes. Experimental results on datasets like MVTec and VisA demonstrate that FiLo significantly improves the performance of ZSAD in both detection and localization, achieving state-of-the-art performance with an image-level AUC of 83.9% and a pixel-level AUC of 95.9% on the VisA dataset. Code is available at https://github.com/CASIA-IVA-Lab/FiLo.

7/29/2024

FADE: Few-shot/zero-shot Anomaly Detection Engine using Large Vision-Language Model

Yuanwei Li, Elizaveta Ivanova, Martins Bruveris

Automatic image anomaly detection is important for quality inspection in the manufacturing industry. The usual unsupervised anomaly detection approach is to train a model for each object class using a dataset of normal samples. However, a more realistic problem is zero-/few-shot anomaly detection where zero or only a few normal samples are available. This makes the training of object-specific models challenging. Recently, large foundation vision-language models have shown strong zero-shot performance in various downstream tasks. While these models have learned complex relationships between vision and language, they are not specifically designed for the tasks of anomaly detection. In this paper, we propose the Few-shot/zero-shot Anomaly Detection Engine (FADE) which leverages the vision-language CLIP model and adjusts it for the purpose of industrial anomaly detection. Specifically, we improve language-guided anomaly segmentation 1) by adapting CLIP to extract multi-scale image patch embeddings that are better aligned with language and 2) by automatically generating an ensemble of text prompts related to industrial anomaly detection. 3) We use additional vision-based guidance from the query and reference images to further improve both zero-shot and few-shot anomaly detection. On the MVTec-AD (and VisA) dataset, FADE outperforms other state-of-the-art methods in anomaly segmentation with pixel-AUROC of 89.6% (91.5%) in zero-shot and 95.4% (97.5%) in 1-normal-shot. Code is available at https://github.com/BMVC-FADE/BMVC-FADE.

9/4/2024

AnoPLe: Few-Shot Anomaly Detection via Bi-directional Prompt Learning with Only Normal Samples

Yujin Lee, Seoyoon Jang, Hyunsoo Yoon

Few-shot Anomaly Detection (FAD) poses significant challenges due to the limited availability of training samples and the frequent absence of abnormal samples. Previous approaches often rely on annotations or true abnormal samples to improve detection, but such textual or visual cues are not always accessible. To address this, we introduce AnoPLe, a multi-modal prompt learning method designed for anomaly detection without prior knowledge of anomalies. AnoPLe simulates anomalies and employs bidirectional coupling of textual and visual prompts to facilitate deep interaction between the two modalities. Additionally, we integrate a lightweight decoder with a learnable multi-view signal, trained on multi-scale images to enhance local semantic comprehension. To further improve performance, we align global and local semantics, enriching the image-level understanding of anomalies. The experimental results demonstrate that AnoPLe achieves strong FAD performance, recording 94.1% and 86.2% Image AUROC on MVTec-AD and VisA respectively, with only around a 1% gap compared to the SoTA, despite not being exposed to true anomalies. Code is available at https://github.com/YoojLee/AnoPLe.

8/27/2024

CLIP3D-AD: Extending CLIP for 3D Few-Shot Anomaly Detection with Multi-View Images Generation

Zuo Zuo, Jiahao Dong, Yao Wu, Yanyun Qu, Zongze Wu

Few-shot anomaly detection methods can effectively address data collecting difficulty in industrial scenarios. Compared to 2D few-shot anomaly detection (2D-FSAD), 3D few-shot anomaly detection (3D-FSAD) is still an unexplored but essential task. In this paper, we propose CLIP3D-AD, an efficient 3D-FSAD method extended on CLIP. We successfully transfer strong generalization ability of CLIP into 3D-FSAD. Specifically, we synthesize anomalous images on given normal images as sample pairs to adapt CLIP for 3D anomaly classification and segmentation. For classification, we introduce an image adapter and a text adapter to fine-tune global visual features and text features. Meanwhile, we propose a coarse-to-fine decoder to fuse and facilitate intermediate multi-layer visual representations of CLIP. To benefit from geometry information of point cloud and eliminate modality and data discrepancy when processed by CLIP, we project and render point cloud to multi-view normal and anomalous images. Then we design multi-view fusion module to fuse features of multi-view images extracted by CLIP which are used to facilitate visual representations for further enhancing vision-language correlation. Extensive experiments demonstrate that our method has a competitive performance of 3D few-shot anomaly classification and segmentation on MVTec-3D AD dataset.

6/28/2024