AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models

Read original: arXiv:2309.16414 - Published 8/15/2024 by Jan Hendrik Metzen, Piyapat Saranrittichai, Chaithanya Kumar Mummadi

Overview

Classifiers built on vision-language models like CLIP have shown strong zero-shot performance on image classification tasks.
Prior work has explored different ways to create class descriptors for zero-shot classification, from manual templates to templates generated by language models.
However, the approach to deriving zero-shot classifiers from these descriptors has remained largely unchanged: an image is assigned to the class whose averaged encoded descriptors have the highest cosine similarity with the image encoding.
This paper proposes AutoCLIP, a method that auto-tunes zero-shot classifiers by adjusting per-image weights for each prompt template at inference time.

Plain English Explanation

Vision-language models like CLIP have shown impressive ability to classify images without any training on those specific classes. This is called "zero-shot" classification. The key idea is to represent each class with a short text descriptor, like "a photo of a dog," and then match the image to the class whose descriptor is most similar.

Prior research has explored different ways to generate these class descriptors, from writing them by hand to producing them automatically with language models. However, the classification step itself has stayed the same: pick the class whose descriptors are most similar to the image, with every descriptor counted equally.

The new AutoCLIP method proposed in this paper aims to improve on this by automatically learning weights for each class descriptor, based on how well that descriptor matches the visual clues in the image. This allows the system to focus on the most relevant descriptors for each image, rather than treating them all equally.

The key advantages of AutoCLIP are that it's fully automatic, has minimal additional computational cost, and can provide consistent improvements in accuracy across different vision-language models, datasets, and descriptor templates.

Technical Explanation

The core of the AutoCLIP method is a simple post-processing step applied after obtaining the initial zero-shot classification from a vision-language model. For each image, AutoCLIP learns a set of weights that are applied to the class descriptors before computing the final classification.
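
As a point of reference, the baseline classifier that AutoCLIP starts from can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names `l2_normalize` and `zero_shot_predict`, the array layout (classes x descriptors x embedding dimension), and the toy 3-dimensional embeddings are assumptions made for the example. Real embeddings would come from the image and text encoders of a vision-language model such as CLIP.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_predict(image_emb, descriptor_embs):
    """Baseline zero-shot classification with equally weighted descriptors.

    image_emb:       (d,) image encoding
    descriptor_embs: (C, K, d) encodings of K descriptors for each of C classes
    Returns (predicted class index, per-class cosine similarities).
    """
    image_emb = l2_normalize(image_emb)
    # Average the unit-length descriptors of each class, then renormalize.
    class_embs = l2_normalize(l2_normalize(descriptor_embs).mean(axis=1))
    sims = class_embs @ image_emb  # shape (C,)
    return int(np.argmax(sims)), sims

# Toy data (hypothetical): 2 classes, 2 descriptors each, 3-dim embeddings.
descriptor_embs = np.array([
    [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]],   # class 0
    [[0.0, 0.0, 1.0], [0.0, 0.6, 0.8]],   # class 1
])
image_emb = np.array([0.9, 0.3, 0.1])

pred, sims = zero_shot_predict(image_emb, descriptor_embs)
print(pred, sims)
```

Every descriptor contributes equally to its class embedding here, which is the uniform weighting that AutoCLIP replaces.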

Specifically, AutoCLIP first computes the cosine similarity between the image encoding and every encoded class descriptor. From the statistics of these similarities it derives a weight for each prompt template, normalized with a softmax so that the weights are positive and sum to one. Each class is then scored by the weighted average of its descriptor similarities, and the image is assigned to the highest-scoring class.
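
The following sketch shows one way such per-image weights could be computed. It rests on several assumptions that this summary does not confirm: the weights are shared across classes with one weight per prompt template, they start out uniform, and they are updated by a single gradient-ascent step on the log-sum-exp of the class scores. The function name `auto_weighted_scores`, the step size `alpha`, the logit `scale`, and the toy similarity matrix are all hypothetical, and the authors' exact procedure may differ.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def auto_weighted_scores(sims, alpha=0.3, scale=100.0):
    """Re-weight prompt templates for a single image (illustrative sketch).

    sims:  (C, K) cosine similarity between the image and the descriptor
           of class c built from template k
    alpha: step size of the single gradient step (hypothetical value)
    scale: multiplier applied to similarities before the softmax (assumption)
    Returns (per-class scores (C,), per-template weights (K,)).
    """
    n_templates = sims.shape[1]
    w = np.full(n_templates, 1.0 / n_templates)  # uniform start = plain averaging
    s = sims @ w                                  # class scores under uniform weights
    p = softmax(scale * s)                        # soft class assignment
    # Gradient of logsumexp(scale * s) w.r.t. the template logits at the
    # uniform starting point: scale * w_k * sum_c p_c * (sims[c, k] - s[c]).
    grad = scale * w * (p @ (sims - s[:, None]))
    w = softmax(alpha * grad)                     # tuned weights, still sum to 1
    return sims @ w, w

# Toy similarities (hypothetical): 2 classes x 3 templates.
sims = np.array([
    [0.30, 0.22, 0.10],
    [0.20, 0.21, 0.12],
])
scores, weights = auto_weighted_scores(sims)
print(weights, scores)
```

Setting `alpha=0` leaves the weights uniform, so the function then reduces to the plain average of descriptor similarities used by the baseline classifier.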

The key innovation is that these weights are learned in a fully unsupervised manner, based solely on the statistics of the descriptor-image similarities for that particular image. This allows AutoCLIP to focus the classification on the most relevant descriptors, rather than treating them all equally.

The authors show that AutoCLIP provides consistent improvements of up to 3 percentage points in zero-shot classification accuracy across a range of vision-language models, datasets, and prompt templates. The improvements are particularly notable on datasets with high visual diversity where the relevance of different descriptors can vary greatly across images.

Critical Analysis

The AutoCLIP method is a straightforward and elegant approach to improving zero-shot classification from vision-language models. The key strengths are its simplicity, generality, and consistent empirical performance gains.

That said, the paper does not provide much insight into the types of images or classes where AutoCLIP is most beneficial. Further research could explore the characteristics of datasets and models where the weighted descriptor approach is most impactful.

Additionally, the authors note that AutoCLIP assumes the initial zero-shot classifier is reasonably accurate. If the base classifier performs poorly, AutoCLIP may not be able to effectively re-weight the descriptors. Exploring strategies to make AutoCLIP more robust to weak base classifiers could be an interesting area for future work.

Overall, AutoCLIP represents a practical and effective method for enhancing zero-shot classification from vision-language models. The simplicity and generality of the approach make it a promising direction for further research and real-world applications.

Conclusion

This paper introduces AutoCLIP, a method for automatically tuning zero-shot image classifiers built on vision-language models. By learning per-image weights for the class descriptors, AutoCLIP can focus the classification on the most relevant visual cues, leading to consistent accuracy improvements of up to 3 percentage points.

The key advantages of AutoCLIP are its simplicity, generality, and minimal computational overhead. This makes it a promising technique for enhancing the practical performance of zero-shot classification in real-world applications. Further research could explore the characteristics of datasets and models where AutoCLIP is most beneficial, as well as strategies to make it more robust to weak base classifiers.

Overall, the AutoCLIP method represents an elegant and effective approach to improving zero-shot image classification, with the potential to unlock new applications and use cases for this powerful computer vision technology.

Related Papers

AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models

Jan Hendrik Metzen, Piyapat Saranrittichai, Chaithanya Kumar Mummadi

Classifiers built upon vision-language models such as CLIP have shown remarkable zero-shot performance across a broad range of image classification tasks. Prior work has studied different ways of automatically creating descriptor sets for every class based on prompt templates, ranging from manually engineered templates over templates obtained from a large language model to templates built from random words and characters. Up until now, deriving zero-shot classifiers from the respective encoded class descriptors has remained nearly unchanged, i.e., classify to the class that maximizes cosine similarity between its averaged encoded class descriptors and the image encoding. However, weighing all class descriptors equally can be suboptimal when certain descriptors match visual clues on a given image better than others. In this work, we propose AutoCLIP, a method for auto-tuning zero-shot classifiers. AutoCLIP tunes per-image weights to each prompt template at inference time, based on statistics of class descriptor-image similarities. AutoCLIP is fully unsupervised, has only a minor additional computation overhead, and can be easily implemented in few lines of code. We show that AutoCLIP outperforms baselines across a broad range of vision-language models, datasets, and prompt templates consistently and by up to 3 percent point accuracy.

8/15/2024

Online Zero-Shot Classification with CLIP

Qi Qian, Juhua Hu

Vision-language pre-training such as CLIP enables zero-shot transfer that can classify images according to the candidate class names. While CLIP demonstrates an impressive zero-shot performance on diverse downstream tasks, the distribution from the target data has not been leveraged sufficiently. In this work, we study a novel online zero-shot transfer scenario, where each image arrives in a random order for classification and is visited only once to obtain prediction immediately without storing its representation. Compared with the vanilla zero-shot classification, the proposed framework preserves its flexibility for online service while considering the statistics of the arrived images as the side information to capture the distribution of target data, which can help improve the performance of real-world applications. To tackle the challenge of effective online optimization, we first develop online label learning to model the target data distribution. Then, the proxy of each class in the vision space is further optimized with the proposed online proxy learning method to mitigate the modality gap between images and text. The convergence of both online strategies can be theoretically guaranteed. By combining the predicted label from the online label learning and proxy learning, our online zero-shot transfer method (OnZeta) achieves 78.94% accuracy on ImageNet without accessing the entire data set. Moreover, extensive experiments on other 13 downstream tasks with different vision encoders show a more than 3% improvement on average, which demonstrates the effectiveness of our proposal. Code is available at https://github.com/idstcv/OnZeta.

8/27/2024

Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion

Philipp Allgeuer, Kyra Ahrens, Stefan Wermter

We introduce NOVIC, an innovative uNconstrained Open Vocabulary Image Classifier that uses an autoregressive transformer to generatively output classification labels as language. Leveraging the extensive knowledge of CLIP models, NOVIC harnesses the embedding space to enable zero-shot transfer from pure text to images. Traditional CLIP models, despite their ability for open vocabulary classification, require an exhaustive prompt of potential class labels, restricting their application to images of known content or context. To address this, we propose an object decoder model that is trained on a large-scale 92M-target dataset of templated object noun sets and LLM-generated captions to always output the object noun in question. This effectively inverts the CLIP text encoder and allows textual object labels to be generated directly from image-derived embedding vectors, without requiring any a priori knowledge of the potential content of an image. The trained decoders are tested on a mix of manually and web-curated datasets, as well as standard image classification benchmarks, and achieve fine-grained prompt-free prediction scores of up to 87.5%, a strong result considering the model must work for any conceivable image and without any contextual clues.

7/19/2024

Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

Yunheng Li, ZhongYu Li, Quansheng Zeng, Qibin Hou, Ming-Ming Cheng

Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while they neglect the crucial information in intermediate layers that contain rich object details. However, we find that directly aggregating the multi-level visual features weakens the zero-shot ability for novel classes. The large differences between the visual features from different layers make these features hard to align well with the text embeddings. We resolve this problem by introducing a series of independent decoders to align the multi-level visual features with the text embeddings in a cascaded way, forming a novel but simple framework named Cascade-CLIP. Our Cascade-CLIP is flexible and can be easily applied to existing zero-shot semantic segmentation methods. Experimental results show that our simple Cascade-CLIP achieves superior zero-shot performance on segmentation benchmarks, like COCO-Stuff, Pascal-VOC, and Pascal-Context. Our code is available at: https://github.com/HVision-NKU/Cascade-CLIP

6/7/2024