Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion

Read original: arXiv:2407.11211 - Published 7/19/2024 by Philipp Allgeuer, Kyra Ahrens, Stefan Wermter

Overview

This paper introduces NOVIC, an unconstrained open vocabulary image classifier that enables zero-shot transfer from text to images using the embedding space of CLIP (Contrastive Language-Image Pre-training) models.
The method trains an object decoder, an autoregressive transformer, on text alone so that it effectively inverts the CLIP text encoder. At inference it generates an object label directly from an image embedding, with no list of candidate labels and no labeled training images.
The paper reports fine-grained prompt-free prediction scores of up to 87.5% on a mix of manually and web-curated datasets and standard image classification benchmarks.

Plain English Explanation

The paper presents a new way to classify images without training on labeled image data and without being handed a list of possible answers. Instead, a text-generating decoder is trained purely on text and then used to produce the name of the object in an image.

This is possible because of pre-trained CLIP models, which learn to match visual and textual information by placing images and their descriptions close together in a shared embedding space. The researchers train a decoder to recover the object noun (like "golden retriever") from the CLIP embedding of text that mentions it. Because image embeddings live in the same space, the same decoder can be applied to the embedding of an image to name what it shows.

With this decoder, the system can label new images without any image training on those specific visual categories. It can output "golden retriever" for a photo even though the decoder never saw an image during training.

This zero-shot transfer capability is powerful because ordinary CLIP classification can only pick the best match from a list of candidate labels supplied in advance, which restricts it to images whose content or context is already known. NOVIC removes that restriction: the trained decoders reach fine-grained prompt-free prediction scores of up to 87.5% while having to work for any conceivable image and without contextual clues.

Technical Explanation

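The core of the proposed approach is an object decoder that leverages the multimodal representations learned by CLIP models to output classification labels as language. CLIP is a large neural network trained on a huge dataset of image-text pairs, which gives it a joint embedding space that aligns visual and linguistic information. Used in the conventional way, however, CLIP requires an exhaustive prompt of potential class labels to compare each image against, which restricts it to images of known content or context.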
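
For reference, conventional CLIP-style zero-shot classification compares an image embedding against the embeddings of a fixed list of candidate labels and picks the closest one. The following is a minimal sketch of that matching step only. The embeddings are tiny made-up vectors standing in for real CLIP outputs, and the function name `classify_with_candidates` is hypothetical; this is an illustration, not code from the paper.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def classify_with_candidates(image_emb, label_embs, labels):
    """Pick the candidate label with the highest cosine similarity to the image."""
    sims = normalize(label_embs) @ normalize(image_emb)
    return labels[int(np.argmax(sims))], sims

# Made-up 3-d vectors standing in for CLIP text and image embeddings.
labels = ["golden retriever", "tabby cat", "fire truck"]
label_embs = np.array([[0.9, 0.1, 0.0],
                       [0.1, 0.9, 0.1],
                       [0.0, 0.2, 0.9]])
image_emb = np.array([0.8, 0.3, 0.1])

best, sims = classify_with_candidates(image_emb, label_embs, labels)
print(best)
```

The prediction can only ever be one of the strings in `labels`. An image of anything else is still forced into one of them, which is the restriction the paper sets out to remove.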

To avoid requiring a list of candidate labels, the researchers train an autoregressive transformer, the object decoder, on a large-scale 92M-target dataset of templated object noun sets and LLM-generated captions, so that it always outputs the object noun in question. This effectively inverts the CLIP text encoder. At inference time the decoder is given an image-derived embedding vector instead and generates the textual object label directly, without any a priori knowledge of what the image might contain.
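
According to the paper's abstract, the decoder is trained on text only and then applied to image-derived embeddings. A toy sketch of that text-to-image transfer idea is given below under heavy assumptions: the embeddings are synthetic (a shared "concept" vector plus noise stands in for both CLIP encoders), the five-noun vocabulary is hypothetical, and a linear least-squares decoder over that closed vocabulary is swapped in for the paper's autoregressive transformer, which generates free-form labels. It shows only why a decoder fitted on one modality can work on the other when both share an embedding space.

```python
import numpy as np

rng = np.random.default_rng(0)
nouns = ["dog", "cat", "car", "tree", "cup"]  # hypothetical toy vocabulary
dim = 16                                      # toy embedding size

# One unit-length "concept" direction per noun in the shared embedding space.
concepts = rng.normal(size=(len(nouns), dim))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

def embed(noun_ids, noise):
    """Synthetic stand-in for a CLIP encoder: concept plus noise, unit-normalized."""
    x = concepts[noun_ids] + rng.normal(scale=noise, size=(len(noun_ids), dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Training uses text-side embeddings only (many "captions" per noun).
train_ids = np.repeat(np.arange(len(nouns)), 40)
text_embs = embed(train_ids, noise=0.05)
targets = np.eye(len(nouns))[train_ids]       # one-hot noun targets

# Fit a linear decoder by least squares: embedding -> noun scores.
weights, *_ = np.linalg.lstsq(text_embs, targets, rcond=None)

def decode(embs):
    """Map embeddings to nouns with the text-trained decoder."""
    return [nouns[i] for i in np.argmax(embs @ weights, axis=1)]

# Inference uses image-side embeddings, which were never seen in training.
test_ids = np.repeat(np.arange(len(nouns)), 20)
image_embs = embed(test_ids, noise=0.10)
predictions = decode(image_embs)
accuracy = np.mean([p == nouns[i] for p, i in zip(predictions, test_ids)])
print(f"text-to-image transfer accuracy on toy data: {accuracy:.2f}")
```

The design point the sketch isolates is that no image embedding appears before the `decode(image_embs)` call. Real CLIP text and image embeddings are not as neatly aligned as this synthetic data, so the toy accuracy says nothing about the paper's reported results.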

The trained decoders are tested on a mix of manually and web-curated datasets as well as standard image classification benchmarks. They achieve fine-grained prompt-free prediction scores of up to 87.5%, which the authors describe as a strong result given that the model must work for any conceivable image and without any contextual clues.

Critical Analysis

The paper presents a compelling approach for unconstrained open vocabulary image classification, but some practical considerations and avenues for future work remain.

One consideration is computational cost. The decoder is trained on a 92M-target dataset, and at inference it generates each label autoregressively on top of the CLIP image encoder. This may matter for scalability, particularly in real-time applications.

Additionally, because the decoder learns only from text, the quality of its predictions plausibly depends on how well the templated noun sets and LLM-generated captions cover the objects it will encounter, and on how closely CLIP's image embeddings align with its text embeddings. Highly specific or unusual concepts may be harder to name accurately, potentially impacting classification performance.

Further research could explore methods to improve the efficiency and robustness of the decoder, as well as investigate ways to combine it with other zero-shot or few-shot learning techniques to enhance its capabilities. Addressing these challenges could unlock even greater potential for unconstrained open vocabulary image classification.

Conclusion

The paper introduces NOVIC, an unconstrained open vocabulary image classifier that enables zero-shot transfer from text to images. By leveraging the multimodal representations learned by CLIP models, the object decoder generates textual labels directly from image embeddings, allowing recognition of a much broader range of visual concepts than would be possible with traditional supervised learning or with a fixed list of candidate labels.

The reported prompt-free prediction scores of up to 87.5% highlight the decoder's ability to generalize to images of unknown content. While the approach has some limitations, the findings suggest that NOVIC could be a valuable tool for a wide range of computer vision applications, particularly where labeled image data is scarce or where a dynamic, ever-expanding set of visual concepts must be recognized.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion

Philipp Allgeuer, Kyra Ahrens, Stefan Wermter

We introduce NOVIC, an innovative uNconstrained Open Vocabulary Image Classifier that uses an autoregressive transformer to generatively output classification labels as language. Leveraging the extensive knowledge of CLIP models, NOVIC harnesses the embedding space to enable zero-shot transfer from pure text to images. Traditional CLIP models, despite their ability for open vocabulary classification, require an exhaustive prompt of potential class labels, restricting their application to images of known content or context. To address this, we propose an object decoder model that is trained on a large-scale 92M-target dataset of templated object noun sets and LLM-generated captions to always output the object noun in question. This effectively inverts the CLIP text encoder and allows textual object labels to be generated directly from image-derived embedding vectors, without requiring any a priori knowledge of the potential content of an image. The trained decoders are tested on a mix of manually and web-curated datasets, as well as standard image classification benchmarks, and achieve fine-grained prompt-free prediction scores of up to 87.5%, a strong result considering the model must work for any conceivable image and without any contextual clues.

7/19/2024

AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models

Jan Hendrik Metzen, Piyapat Saranrittichai, Chaithanya Kumar Mummadi

Classifiers built upon vision-language models such as CLIP have shown remarkable zero-shot performance across a broad range of image classification tasks. Prior work has studied different ways of automatically creating descriptor sets for every class based on prompt templates, ranging from manually engineered templates over templates obtained from a large language model to templates built from random words and characters. Up until now, deriving zero-shot classifiers from the respective encoded class descriptors has remained nearly unchanged, i.e., classify to the class that maximizes cosine similarity between its averaged encoded class descriptors and the image encoding. However, weighing all class descriptors equally can be suboptimal when certain descriptors match visual clues on a given image better than others. In this work, we propose AutoCLIP, a method for auto-tuning zero-shot classifiers. AutoCLIP tunes per-image weights to each prompt template at inference time, based on statistics of class descriptor-image similarities. AutoCLIP is fully unsupervised, has only a minor additional computation overhead, and can be easily implemented in few lines of code. We show that AutoCLIP outperforms baselines across a broad range of vision-language models, datasets, and prompt templates consistently and by up to 3 percent point accuracy.

8/15/2024

Online Zero-Shot Classification with CLIP

Qi Qian, Juhua Hu

Vision-language pre-training such as CLIP enables zero-shot transfer that can classify images according to the candidate class names. While CLIP demonstrates an impressive zero-shot performance on diverse downstream tasks, the distribution from the target data has not been leveraged sufficiently. In this work, we study a novel online zero-shot transfer scenario, where each image arrives in a random order for classification and is visited only once to obtain prediction immediately without storing its representation. Compared with the vanilla zero-shot classification, the proposed framework preserves its flexibility for online service while considering the statistics of the arrived images as the side information to capture the distribution of target data, which can help improve the performance of real-world applications. To tackle the challenge of effective online optimization, we first develop online label learning to model the target data distribution. Then, the proxy of each class in the vision space is further optimized with the proposed online proxy learning method to mitigate the modality gap between images and text. The convergence of both online strategies can be theoretically guaranteed. By combining the predicted label from the online label learning and proxy learning, our online zero-shot transfer method (OnZeta) achieves 78.94% accuracy on ImageNet without accessing the entire data set. Moreover, extensive experiments on other 13 downstream tasks with different vision encoders show a more than 3% improvement on average, which demonstrates the effectiveness of our proposal. Code is available at https://github.com/idstcv/OnZeta.

8/27/2024

Transductive Zero-Shot and Few-Shot CLIP

Ségolène Martin (OPIS, CVN), Yunshi Huang (ETS), Fereshteh Shakeri (ETS), Jean-Christophe Pesquet (OPIS, CVN), Ismail Ben Ayed (ETS)

Transductive inference has been widely investigated in few-shot image classification, but completely overlooked in the recent, fast-growing literature on adapting vision-language models like CLIP. This paper addresses the transductive zero-shot and few-shot CLIP classification challenge, in which inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently. We initially construct informative vision-text probability features, leading to a classification problem on the unit simplex set. Inspired by Expectation-Maximization (EM), our optimization-based classification objective models the data probability distribution for each class using a Dirichlet law. The minimization problem is then tackled with a novel block Majorization-Minimization algorithm, which simultaneously estimates the distribution parameters and class assignments. Extensive numerical experiments on 11 datasets underscore the benefits and efficacy of our batch inference approach. On zero-shot tasks with test batches of 75 samples, our approach yields near 20% improvement in ImageNet accuracy over CLIP's zero-shot performance. Additionally, we outperform state-of-the-art methods in the few-shot setting. The code is available at: https://github.com/SegoleneMartin/transductive-CLIP.

5/30/2024