CLIPScope: Enhancing Zero-Shot OOD Detection with Bayesian Scoring

2405.14737

Published 5/24/2024 by Hao Fu, Naman Patel, Prashanth Krishnamurthy, Farshad Khorrami

Abstract

Detection of out-of-distribution (OOD) samples is crucial for safe real-world deployment of machine learning models. Recent advances in vision-language foundation models have made them capable of detecting OOD samples without requiring in-distribution (ID) images. However, these zero-shot methods often underperform as they do not adequately consider ID class likelihoods in their detection confidence scoring. Hence, we introduce CLIPScope, a zero-shot OOD detection approach that normalizes the confidence score of a sample by class likelihoods, akin to a Bayesian posterior update. Furthermore, CLIPScope incorporates a novel strategy to mine OOD classes from a large lexical database. It selects class labels that are farthest and nearest to ID classes in terms of CLIP embedding distance to maximize coverage of OOD samples. We conduct extensive ablation studies and empirical evaluations, demonstrating state-of-the-art performance of CLIPScope across various OOD detection benchmarks.

Overview

Detecting out-of-distribution (OOD) samples is crucial for safely deploying machine learning models in the real world.
Recent advances in vision-language foundation models have enabled zero-shot OOD detection without the need for in-distribution (ID) images.
However, these zero-shot methods often underperform because they do not adequately consider ID class likelihoods in their detection confidence scoring.
The paper introduces CLIPScope, a zero-shot OOD detection approach that normalizes the confidence score of a sample by class likelihoods, similar to a Bayesian posterior update.
CLIPScope also incorporates a novel strategy to mine OOD classes from a large lexical database, selecting class labels that are farthest and nearest to ID classes in terms of CLIP embedding distance to maximize coverage of OOD samples.

Plain English Explanation

Machine learning models are increasingly being used in real-world applications, such as self-driving cars or medical diagnosis. However, these models can sometimes encounter situations that are very different from the data they were trained on, known as out-of-distribution (OOD) samples. If a model is not able to detect these OOD samples, it can make unreliable or dangerous decisions.

Recent advances in a type of AI model called vision-language foundation models have made it possible to detect OOD samples without needing any images of the normal, in-distribution (ID) data. These zero-shot OOD detection methods work by comparing the input image with the text labels of the known ID classes and measuring how similar they are.

However, these zero-shot methods often don't perform as well as they could, because they don't fully consider the likelihood of the input belonging to each known class. The CLIPScope approach introduced in this paper tries to address this by normalizing the model's confidence score based on the likelihoods of the known classes, similar to how a Bayesian statistical model would update its beliefs.

Additionally, CLIPScope uses a novel strategy to find OOD class labels in a large lexical database. It picks the labels that are farthest from the ID classes and the labels that are nearest to them, measured by distance in CLIP's embedding space. This helps the model cover a wider range of possible OOD samples, from ones that look nothing like the ID classes to ones that closely resemble them.

By incorporating these improvements, the researchers show that CLIPScope can achieve state-of-the-art performance on standard OOD detection benchmarks.

Technical Explanation

The paper introduces CLIPScope, a zero-shot out-of-distribution (OOD) detection approach that builds on recent advances in vision-language foundation models like CLIP.

Unlike previous zero-shot OOD detection methods that directly use the CLIP model's classification confidence as the OOD score, CLIPScope normalizes this confidence by the class likelihoods, akin to a Bayesian posterior update. This helps the model better distinguish OOD samples from in-distribution (ID) samples.
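The idea of normalizing a confidence score by a class likelihood can be illustrated with a minimal Python sketch. This is not the paper's exact formula: the function names, the temperature value, the use of the maximum softmax probability as the raw confidence, and the way `class_likelihood` is supplied are all assumptions made for illustration. The inputs are assumed to be image-to-label similarities, with the ID labels listed first and any mined OOD labels after them.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def raw_confidence(similarities, num_id, temperature=0.01):
    """Return (max softmax probability over ID labels, index of that label).

    `similarities` holds one score per label: the first `num_id` entries
    belong to ID labels, the rest to OOD labels (assumed layout).
    """
    probs = softmax(similarities, temperature)
    best = max(range(num_id), key=lambda c: probs[c])
    return probs[best], best


def normalized_confidence(similarities, num_id, class_likelihood,
                          temperature=0.01):
    """Divide the raw confidence by the likelihood of the predicted class.

    `class_likelihood[c]` is an assumed, externally supplied estimate of
    how likely ID class c is; how to estimate it is left open here.
    """
    conf, best = raw_confidence(similarities, num_id, temperature)
    if class_likelihood[best] <= 0:
        raise ValueError("class likelihood must be positive")
    return conf / class_likelihood[best], best
```

Under these assumptions, two samples with the same raw confidence receive different scores when their predicted classes have different likelihoods: a match to a class with a lower likelihood value is scaled up more than a match to a class with a higher one. A sample would then be flagged as OOD when its normalized score falls below a chosen threshold.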

Additionally, the paper proposes a novel strategy to mine OOD class labels from a large lexical database. It selects classes that are both farthest and nearest to the ID classes in CLIP's embedding space. This maximizes the coverage of potential OOD samples, as the model can detect samples that are very different from or very similar to the known ID classes.
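A rough Python sketch of the farthest-and-nearest selection follows. It assumes each candidate label from the lexical database already has a text embedding, measures a candidate's distance to the ID set as its cosine distance to the closest ID class embedding, and keeps a fixed number of labels from each end of the ranking. The function names, the `k_near` and `k_far` parameters, and the min-over-ID-classes distance are illustrative assumptions; the paper's actual selection rule may differ, and filtering out synonyms of ID labels is omitted.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mine_ood_labels(candidates, id_embeddings, k_near, k_far):
    """Pick the k_near closest and k_far farthest candidate labels.

    candidates: dict mapping label -> embedding (list of floats).
    id_embeddings: list of embeddings, one per ID class.
    Returns (nearest_labels, farthest_labels).
    """
    if k_near + k_far > len(candidates):
        raise ValueError("k_near + k_far exceeds the number of candidates")

    def distance_to_id(label):
        # Distance to the closest ID class embedding.
        return min(cosine_distance(candidates[label], e)
                   for e in id_embeddings)

    ranked = sorted(candidates, key=distance_to_id)  # closest first
    nearest = ranked[:k_near]
    farthest = ranked[len(ranked) - k_far:]
    return nearest, farthest
```

With toy 2-D embeddings, an ID class at `[1.0, 0.0]`, and candidates `{"a": [1.0, 0.1], "b": [0.0, 1.0], "c": [-1.0, 0.0], "d": [1.0, 1.0]}`, calling `mine_ood_labels(candidates, [[1.0, 0.0]], 1, 1)` selects `"a"` as the nearest label and `"c"` as the farthest.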

The researchers conduct extensive ablation studies and evaluations on various OOD detection benchmarks, demonstrating that CLIPScope achieves state-of-the-art performance. They compare their approach to recent related work and highlight the importance of the proposed normalization and OOD class mining strategies.

Critical Analysis

The paper presents a well-designed and thoroughly evaluated approach to zero-shot OOD detection using vision-language models. The key innovations, such as the normalization of confidence scores and the novel OOD class mining strategy, are well-justified and show significant performance improvements over prior work.

However, the paper does not extensively discuss the limitations or potential issues with the proposed method. For example, it would be interesting to understand how CLIPScope performs on more challenging or adversarial OOD samples, or how sensitive the approach is to the choice of the lexical database used for OOD class mining.

Additionally, the paper could have provided more insights into the interpretability and explainability of the OOD detection process. Understanding why the model flags certain samples as OOD could be valuable for building trust and ensuring the safety of these systems in real-world deployments.

Overall, the research presented in the paper is of high quality and makes a valuable contribution to the field of OOD detection. However, further exploration of the method's limitations and potential improvements would strengthen the work and give a more complete picture of its capabilities.

Conclusion

The paper introduces CLIPScope, a novel zero-shot OOD detection approach that leverages the strengths of vision-language foundation models while addressing their shortcomings. By normalizing the confidence scores and strategically mining OOD class labels, CLIPScope achieves state-of-the-art performance on standard OOD detection benchmarks.

This research is a significant step forward in making machine learning models more robust and reliable for real-world deployment. The ability to accurately detect OOD samples without requiring extensive in-distribution data is crucial for the safe and responsible use of these technologies in high-stakes applications. The insights and techniques presented in this paper can inspire further advancements in the field of OOD detection and contribute to the development of more trustworthy and transparent AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Zero-Shot Out-of-Distribution Detection with Outlier Label Exposure

Choubo Ding, Guansong Pang

As vision-language models like CLIP are widely applied to zero-shot tasks and gain remarkable performance on in-distribution (ID) data, detecting and rejecting out-of-distribution (OOD) inputs in the zero-shot setting have become crucial for ensuring the safety of using such models on the fly. Most existing zero-shot OOD detectors rely on ID class label-based prompts to guide CLIP in classifying ID images and rejecting OOD images. In this work we instead propose to leverage a large set of diverse auxiliary outlier class labels as pseudo OOD class text prompts to CLIP for enhancing zero-shot OOD detection, an approach we call Outlier Label Exposure (OLE). The key intuition is that ID images are expected to have lower similarity to these outlier class prompts than OOD images. One issue is that raw class labels often include noise labels, e.g., synonyms of ID labels, rendering raw OLE-based detection ineffective. To address this issue, we introduce an outlier prototype learning module that utilizes the prompt embeddings of the outlier labels to learn a small set of pivotal outlier prototypes for an embedding similarity-based OOD scoring. Additionally, the outlier classes and their prototypes can be loosely coupled with the ID classes, leading to an inseparable decision region between them. Thus, we also introduce an outlier label generation module that synthesizes our outlier prototypes and ID class embeddings to generate in-between outlier prototypes to further calibrate the detection in OLE. Despite its simplicity, extensive experiments show that OLE substantially improves detection performance and achieves new state-of-the-art performance in large-scale OOD and hard OOD detection benchmarks.

6/4/2024

cs.CV

Envisioning Outlier Exposure by Large Language Models for Out-of-Distribution Detection

Chentao Cao, Zhun Zhong, Zhanke Zhou, Yang Liu, Tongliang Liu, Bo Han

Detecting out-of-distribution (OOD) samples is essential when deploying machine learning models in open-world scenarios. Zero-shot OOD detection, requiring no training on in-distribution (ID) data, has been possible with the advent of vision-language models like CLIP. Existing methods build a text-based classifier with only closed-set labels. However, this largely restricts the inherent capability of CLIP to recognize samples from large and open label space. In this paper, we propose to tackle this constraint by leveraging the expert knowledge and reasoning capability of large language models (LLM) to Envision potential Outlier Exposure, termed EOE, without access to any actual OOD data. Owing to better adaptation to open-world scenarios, EOE can be generalized to different tasks, including far, near, and fine-grained OOD detection. Technically, we design (1) LLM prompts based on visual similarity to generate potential outlier class labels specialized for OOD detection, as well as (2) a new score function based on potential outlier penalty to distinguish hard OOD samples effectively. Empirically, EOE achieves state-of-the-art performance across different OOD tasks and can be effectively scaled to the ImageNet-1K dataset. The code is publicly available at: https://github.com/tmlr-group/EOE.

6/4/2024

cs.LG

Enhancing Near OOD Detection in Prompt Learning: Maximum Gains, Minimal Costs

Myong Chol Jung, He Zhao, Joanna Dipnall, Belinda Gabbe, Lan Du

Prompt learning has been shown to be an efficient and effective fine-tuning method for vision-language models like CLIP. While numerous studies have focused on the generalisation of these models in few-shot classification, their capability in near out-of-distribution (OOD) detection has been overlooked. A few recent works have highlighted the promising performance of prompt learning in far OOD detection. However, the more challenging task of few-shot near OOD detection has not yet been addressed. In this study, we investigate the near OOD detection capabilities of prompt learning models and observe that commonly used OOD scores have limited performance in near OOD detection. To enhance the performance, we propose a fast and simple post-hoc method that complements existing logit-based scores, improving near OOD detection AUROC by up to 11.67% with minimal computational cost. Our method can be easily applied to any prompt learning model without change in architecture or re-training the models. Comprehensive empirical evaluations across 13 datasets and 8 models demonstrate the effectiveness and adaptability of our method.

5/28/2024

cs.CV

Exploiting Diffusion Prior for Out-of-Distribution Detection

Armando Zhu, Jiabei Liu, Keqin Li, Shuying Dai, Bo Hong, Peng Zhao, Changsong Wei

Out-of-distribution (OOD) detection is crucial for deploying robust machine learning models, especially in areas where security is critical. However, traditional OOD detection methods often fail to capture complex data distributions from large-scale data. In this paper, we present a novel approach for OOD detection that leverages the generative ability of diffusion models and the powerful feature extraction capabilities of CLIP. By using these features as conditional inputs to a diffusion model, we can reconstruct the images after encoding them with CLIP. The difference between the original and reconstructed images is used as a signal for OOD identification. The practicality and scalability of our method are increased by the fact that it does not require class-specific labeled ID data, as is the case with many other methods. Extensive experiments on several benchmark datasets demonstrate the robustness and effectiveness of our method, which significantly improves detection accuracy.

6/18/2024

cs.CV cs.AI