Interpreting CLIP's Image Representation via Text-Based Decomposition

Read original: arXiv:2310.05916 - Published 4/1/2024 by Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt

Introduction

Figure 1: CLIP-ViT image representation decomposition. By decomposing CLIP’s image representation as a sum across individual image patches, model layers, and attention heads, we can (a) characterize each head’s role by automatically finding text-interpretable directions that span its output space, (b) highlight the image regions that contribute to the similarity score between image and text, and (c) present what regions contribute towards a found text direction at a specific head.
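The decomposition in the Figure 1 caption relies on the dot product being linear: if the image representation is a sum of contributions, its similarity to a text embedding is the sum of each contribution's similarity. The sketch below illustrates this with synthetic arrays only. The sizes, the array `c`, and the vector `text` are hypothetical stand-ins, not outputs of an actual CLIP model, and any terms a real model adds beyond the per-head contributions are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: layers, heads per layer, image tokens, embedding dim.
L, H, N, d = 4, 3, 5, 8

# Synthetic stand-in for the contribution of each (layer, head, token),
# assumed to be already projected into the joint image-text space.
c = rng.normal(size=(L, H, N, d))

# Synthetic stand-in for a text embedding in the same space.
text = rng.normal(size=d)

# Image representation as the sum over layers, heads, and tokens.
image_rep = c.sum(axis=(0, 1, 2))

# Similarity of every individual contribution to the text direction.
per_component = c @ text                    # shape (L, H, N)

# Aggregate along different axes to attribute the score.
per_head = per_component.sum(axis=2)        # shape (L, H): one score per head
per_token = per_component.sum(axis=(0, 1))  # shape (N,): one score per image token

# The total similarity equals the sum of the per-component similarities.
total = image_rep @ text
assert np.isclose(total, per_component.sum())
```

Reshaping `per_token` to the patch grid would give a heatmap in the spirit of item (b) of the caption, and a single slice `per_component[l, h]` would correspond to item (c).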

Related Work

Decomposing CLIP Image Representation into Layers

Decomposition into Attention Heads

Decomposition into Image Tokens

Figure 7: Joint decomposition examples. For each head $(l,h)$, the left heatmap (green border) corresponds to the description that is most similar to $c^{l,h}_{\text{head}}$ among the TextSpan output set. The right heatmap (red border) corresponds to the least similar text in this set (for $m=60$). See Figure 9 for more results.

Limitations and Discussion

Appendix A

Related Papers

Interpreting CLIP's Image Representation via Text-Based Decomposition

Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt

We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models.

4/1/2024

Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP

Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi

Recent works have explored how individual components of the CLIP-ViT model contribute to the final representation by leveraging the shared image-text representation space of CLIP. These components, such as attention heads and MLPs, have been shown to capture distinct image features like shape, color or texture. However, understanding the role of these components in arbitrary vision transformers (ViTs) is challenging. To this end, we introduce a general framework which can identify the roles of various components in ViTs beyond CLIP. Specifically, we (a) automate the decomposition of the final representation into contributions from different model components, and (b) linearly map these contributions to CLIP space to interpret them via text. Additionally, we introduce a novel scoring function to rank components by their importance with respect to specific features. Applying our framework to various ViT variants (e.g. DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the roles of different components concerning particular image features. These insights facilitate applications such as image retrieval using text descriptions or reference images, visualizing token importance heatmaps, and mitigating spurious correlations.

6/4/2024

Quantifying and Enabling the Interpretability of CLIP-like Models

Avinash Madasu, Yossi Gandelsman, Vasudev Lal, Phillip Howard

CLIP is one of the most popular foundational models and is heavily used for many vision-language tasks. However, little is known about the inner workings of CLIP. To bridge this gap, we propose a study to quantify the interpretability of CLIP-like models. We conduct this study on six different CLIP models from OpenAI and OpenCLIP which vary by size, type of pre-training data and patch size. Our approach begins with using the TEXTSPAN algorithm and in-context learning to break down individual attention heads into specific properties. We then evaluate how easily these heads can be interpreted using new metrics which measure property consistency within heads and property disentanglement across heads. Our findings reveal that larger CLIP models are generally more interpretable than their smaller counterparts. To further assist users in understanding the inner workings of CLIP models, we introduce CLIP-InterpreT, a tool designed for interpretability analysis. CLIP-InterpreT offers five types of analyses: property-based nearest neighbor search, per-head topic segmentation, contrastive segmentation, per-head nearest neighbors of an image, and per-head nearest neighbors of text.

9/11/2024

Understanding and Mitigating Compositional Issues in Text-to-Image Generative Models

Arman Zarei, Keivan Rezaei, Samyadeep Basu, Mehrdad Saberi, Mazda Moayeri, Priyatham Kattakinda, Soheil Feizi

Recent text-to-image diffusion-based generative models have the stunning ability to generate highly detailed and photo-realistic images and achieve state-of-the-art low FID scores on challenging image generation benchmarks. However, one of the primary failure modes of these text-to-image generative models is in composing attributes, objects, and their associated relationships accurately into an image. In our paper, we investigate this compositionality-based failure mode and highlight that imperfect text conditioning with the CLIP text-encoder is one of the primary reasons behind the inability of these models to generate high-fidelity compositional scenes. In particular, we show that (i) there exists an optimal text-embedding space that can generate highly coherent compositional scenes, which shows that the output space of the CLIP text-encoder is sub-optimal, and (ii) the final token embeddings in CLIP are erroneous, as they often include attention contributions from unrelated tokens in compositional prompts. Our main finding shows that the best compositional improvements can be achieved (without harming the model's FID scores) by fine-tuning only a simple linear projection on CLIP's representation space in Stable-Diffusion variants using a small set of compositional image-text pairs. This result demonstrates that the sub-optimality of CLIP's output space is a major error source. We also show that re-weighting the erroneous attention contributions in CLIP can lead to improved compositional performance; however, these improvements are often less significant than those achieved by solely learning a linear projection head, highlighting erroneous attentions to be only a minor error source.

6/13/2024