Embracing Diversity: Interpretable Zero-shot classification beyond one vector per class

2404.16717

Published 4/26/2024 by Mazda Moayeri, Michael Rabbat, Mark Ibrahim, Diane Bouchacourt

🏷️

Abstract

Vision-language models enable open-world classification of objects without the need for any retraining. While this zero-shot paradigm marks a significant advance, even today's best models exhibit skewed performance when objects are dissimilar from their typical depiction. Real world objects such as pears appear in a variety of forms -- from diced to whole, on a table or in a bowl -- yet standard VLM classifiers map all instances of a class to a it{single vector based on the class label}. We argue that to represent this rich diversity within a class, zero-shot classification should move beyond a single vector. We propose a method to encode and account for diversity within a class using inferred attributes, still in the zero-shot setting without retraining. We find our method consistently outperforms standard zero-shot classification over a large suite of datasets encompassing hierarchies, diverse object states, and real-world geographic diversity, as well finer-grained datasets where intra-class diversity may be less prevalent. Importantly, our method is inherently interpretable, offering faithful explanations for each inference to facilitate model debugging and enhance transparency. We also find our method scales efficiently to a large number of attributes to account for diversity -- leading to more accurate predictions for atypical instances. Finally, we characterize a principled trade-off between overall and worst class accuracy, which can be tuned via a hyperparameter of our method. We hope this work spurs further research into the promise of zero-shot classification beyond a single class vector for capturing diversity in the world, and building transparent AI systems without compromising performance.

Create account to get full access

Overview

Vision-language models (VLMs) can classify objects in images without any retraining, a significant advance known as zero-shot learning.
However, even the best VLMs struggle when objects appear in forms that differ from their typical depiction, such as a pear that is diced or in a bowl.
Standard VLM classifiers map all instances of a class to a single vector based on the class label, failing to capture the rich diversity within a class.

Plain English Explanation

Vision-language models are AI systems that can look at an image and identify the objects in it, without needing any additional training on those specific objects. This zero-shot learning approach is a big step forward, as it means these models can be used to classify a huge variety of objects.

However, even the best vision-language models today struggle when the objects don't look exactly like the typical examples they were trained on. For instance, a model might be great at recognizing whole pears, but have trouble with pears that are diced up or sitting in a bowl. The reason is that these models try to represent each object class with a single, average vector, but real-world objects can come in many different forms within the same class.

Technical Explanation

The researchers argue that to better represent this intra-class diversity, zero-shot classification should move beyond a single vector per class. They propose a method to encode and account for diversity within a class using inferred attributes, still in the zero-shot setting without any retraining.

Their experiments show this approach consistently outperforms standard zero-shot classification across a wide range of datasets, including those with hierarchies, diverse object states, and real-world geographic diversity. It also works well on finer-grained datasets where intra-class diversity may be less prevalent.

Importantly, the method is inherently interpretable, providing faithful explanations for each inference to facilitate model debugging and enhance transparency. The researchers also find their method scales efficiently to a large number of attributes to account for diversity, leading to more accurate predictions for atypical instances.

Critical Analysis

The paper acknowledges that there is a trade-off between overall accuracy and worst-class accuracy that can be tuned via a hyperparameter. This suggests the method may not perform equally well across all classes, and users may need to experiment to find the right balance for their needs.

Additionally, while the method is shown to work well on a wide range of datasets, the extent to which it can generalize to truly open-world, unconstrained object recognition remains an open question. Further research may be needed to understand the limits of the approach.

Conclusion

This work represents an important step towards building vision-language models that can truly capture the rich diversity of the real world, beyond just the typical examples they were trained on. By moving beyond a single class vector, the proposed method demonstrates the potential for more interpretable and transparent AI systems that maintain high performance, even on atypical object instances. Continued research in this direction could lead to significantly more capable and trustworthy computer vision applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Evolving Interpretable Visual Classifiers with Large Language Models

Mia Chiquier, Utkarsh Mall, Carl Vondrick

Multimodal pre-trained models, such as CLIP, are popular for zero-shot classification due to their open-vocabulary flexibility and high performance. However, vision-language models, which compute similarity scores between images and class labels, are largely black-box, with limited interpretability, risk for bias, and inability to discover new visual concepts not written down. Moreover, in practical settings, the vocabulary for class names and attributes of specialized concepts will not be known, preventing these methods from performing well on images uncommon in large-scale vision-language datasets. To address these limitations, we present a novel method that discovers interpretable yet discriminative sets of attributes for visual recognition. We introduce an evolutionary search algorithm that uses a large language model and its in-context learning abilities to iteratively mutate a concept bottleneck of attributes for classification. Our method produces state-of-the-art, interpretable fine-grained classifiers. We outperform the latest baselines by 18.4% on five fine-grained iNaturalist datasets and by 22.2% on two KikiBouba datasets, despite the baselines having access to privileged information about class names.

4/16/2024

cs.CV cs.AI

🏷️

LLM meets Vision-Language Models for Zero-Shot One-Class Classification

Yassir Bendou, Giulia Lioi, Bastien Pasdeloup, Lukas Mauch, Ghouthi Boukli Hacene, Fabien Cardinaux, Vincent Gripon

We consider the problem of zero-shot one-class visual classification, extending traditional one-class classification to scenarios where only the label of the target class is available. This method aims to discriminate between positive and negative query samples without requiring examples from the target class. We propose a two-step solution that first queries large language models for visually confusing objects and then relies on vision-language pre-trained models (e.g., CLIP) to perform classification. By adapting large-scale vision benchmarks, we demonstrate the ability of the proposed method to outperform adapted off-the-shelf alternatives in this setting. Namely, we propose a realistic benchmark where negative query samples are drawn from the same original dataset as positive ones, including a granularity-controlled version of iNaturalist, where negative samples are at a fixed distance in the taxonomy tree from the positive ones. To our knowledge, we are the first to demonstrate the ability to discriminate a single category from other semantically related ones using only its label.

5/28/2024

cs.CV cs.AI

Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions

Oindrila Saha, Grant Van Horn, Subhransu Maji

The zero-shot performance of existing vision-language models (VLMs) such as CLIP is limited by the availability of large-scale, aligned image and text datasets in specific domains. In this work, we leverage two complementary sources of information -- descriptions of categories generated by large language models (LLMs) and abundant, fine-grained image classification datasets -- to improve the zero-shot classification performance of VLMs across fine-grained domains. On the technical side, we develop methods to train VLMs with this bag-level image-text supervision. We find that simply using these attributes at test-time does not improve performance, but our training strategy, for example, on the iNaturalist dataset, leads to an average improvement of 4-5% in zero-shot classification accuracy for novel categories of birds and flowers. Similar improvements are observed in domains where a subset of the categories was used to fine-tune the model. By prompting LLMs in various ways, we generate descriptions that capture visual appearance, habitat, and geographic regions and pair them with existing attributes such as the taxonomic structure of the categories. We systematically evaluate their ability to improve zero-shot categorization in natural domains. Our findings suggest that geographic priors can be just as effective and are complementary to visual appearance. Our method also outperforms prior work on prompt-based tuning of VLMs. We release the benchmark, consisting of 14 datasets at https://github.com/cvl-umass/AdaptCLIPZS , which will contribute to future research in zero-shot recognition.

4/5/2024

cs.CV

What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models

Abdelrahman Abdelhamed, Mahmoud Afifi, Alec Go

Large language models (LLMs) has been effectively used for many computer vision tasks, including image classification. In this paper, we present a simple yet effective approach for zero-shot image classification using multimodal LLMs. By employing multimodal LLMs, we generate comprehensive textual representations from input images. These textual representations are then utilized to generate fixed-dimensional features in a cross-modal embedding space. Subsequently, these features are fused together to perform zero-shot classification using a linear classifier. Our method does not require prompt engineering for each dataset; instead, we use a single, straightforward, set of prompts across all datasets. We evaluated our method on several datasets, and our results demonstrate its remarkable effectiveness, surpassing benchmark accuracy on multiple datasets. On average over ten benchmarks, our method achieved an accuracy gain of 4.1 percentage points, with an increase of 6.8 percentage points on the ImageNet dataset, compared to prior methods. Our findings highlight the potential of multimodal LLMs to enhance computer vision tasks such as zero-shot image classification, offering a significant improvement over traditional methods.

5/27/2024

cs.CV