Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval

2405.03190

Published 5/7/2024 by Jiacheng Cheng, Hijung Valentina Shin, Nuno Vasconcelos, Bryan Russell, Fabian Caba Heilbron

Abstract

In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually return very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.

Overview

The paper explores the issue of dual-encoder vision-language models, such as CLIP, producing very different results for paraphrased queries during text-to-image retrieval.
The authors hypothesize that this behavior stems from the text towers of these models, which are trained on image-sentence pairs and so lack the ability to capture the semantic similarity between paraphrased queries.
To address this, the paper investigates strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus.

Plain English Explanation

Dual-encoder vision-language models, like CLIP, have become very good at retrieving relevant images when given a text description. However, the authors of this paper found that these models often return very different results when given a pair of paraphrased queries - that is, queries that say the same thing in slightly different words.
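To make the retrieval step concrete, here is a minimal sketch of how a dual-encoder system ranks images for a text query: both sides are mapped to vectors, and images are sorted by cosine similarity to the query vector. The embeddings below are made-up toy values standing in for the outputs of real text and image towers, and the function names are hypothetical. The two queries play the role of a paraphrase pair whose embeddings ended up far apart, so their rankings disagree.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def rank_images(text_emb, image_embs):
    """Return image indices sorted from most to least similar to the query."""
    query = l2_normalize(np.asarray(text_emb, dtype=float))
    gallery = l2_normalize(np.asarray(image_embs, dtype=float))
    scores = gallery @ query  # cosine similarity of each image to the query
    return np.argsort(-scores, kind="stable")

# Toy 2-D embeddings (assumed values, not produced by any real model).
image_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query_a = np.array([0.9, 0.1])  # embedding of one phrasing
query_b = np.array([0.6, 0.8])  # embedding of a paraphrase that landed elsewhere

ranking_a = rank_images(query_a, image_embs)
ranking_b = rank_images(query_b, image_embs)
print(ranking_a.tolist(), ranking_b.tolist())
```

If the text tower mapped both phrasings to nearby vectors, the two rankings would largely coincide, which is the behavior the paper aims for.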

This behavior can make the retrieval system less predictable and potentially frustrate users. The authors believe the problem is that the text parts of these dual-encoder models are trained on pairs of images and sentences, but don't learn to recognize the semantic similarity between paraphrased queries.

To address this, the authors try training a dual-encoder model in a different way, starting with a language model that has been pretrained on a large amount of general text data. They hypothesize that this will help the text part of the model better understand the meaning behind different phrasings of a query.

Technical Explanation

The paper starts by collecting a dataset of paraphrased image descriptions to use for quantitative evaluation of the paraphrased text-to-image retrieval task.

The authors then investigate several training strategies for a dual-encoder model, all starting from a language model pretrained on a large text corpus. These strategies include:

Directly fine-tuning the pretrained language model on image-text pairs.
Introducing an additional "paraphrase detection" objective during training to help the model recognize semantic similarity.
Leveraging contrastive learning techniques, as used in models like CLIP and ALIGN, to jointly learn visual and textual representations.
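The contrastive objective mentioned in the list above can be sketched as a symmetric cross-entropy over an image-text similarity matrix, in the style commonly associated with CLIP. This is an illustrative NumPy version, not the authors' training code. The temperature value, the function names, and the extra paraphrase term (one minus the mean cosine similarity between paraphrase embeddings) are assumptions chosen for the example; the summary does not specify how the paraphrase objective is defined.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def unit_rows(x):
    """L2-normalize each row of a 2-D array."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    logits = unit_rows(image_embs) @ unit_rows(text_embs).T / temperature
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def paraphrase_consistency_loss(text_embs_a, text_embs_b):
    """Hypothetical auxiliary term: 1 - mean cosine similarity of paraphrase pairs."""
    cos = (unit_rows(text_embs_a) * unit_rows(text_embs_b)).sum(axis=1)
    return 1.0 - cos.mean()

# Toy batch: four perfectly aligned pairs versus the same texts shifted by one.
images = np.eye(4)
aligned = clip_style_loss(images, np.eye(4))
misaligned = clip_style_loss(images, np.roll(np.eye(4), 1, axis=0))
print(aligned, misaligned)
```

In a real training loop the two terms would be combined with a weighting factor and minimized with respect to the encoder parameters; that weighting is another free choice not described in the summary.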

The paper finds that the model trained with the authors' best adaptation strategy achieves significantly higher ranking similarity for paraphrased queries, while still maintaining similar performance on zero-shot classification and retrieval tasks compared to public dual-encoder models like CLIP and OpenCLIP.
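One way to quantify "ranking similarity" for a pair of paraphrased queries is to compare the top-k results each query retrieves. The sketch below shows two simple overlap measures, top-k Jaccard similarity and average overlap. Both are illustrative choices and may differ from the exact metrics used in the paper; the function names and the toy rankings are assumptions.

```python
def topk_jaccard(ranking_a, ranking_b, k):
    """Jaccard similarity between the top-k items of two rankings."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0

def average_overlap(ranking_a, ranking_b, k):
    """Mean over depths d = 1..k of the fraction of shared items in the top d."""
    total = 0.0
    for d in range(1, k + 1):
        shared = set(ranking_a[:d]) & set(ranking_b[:d])
        total += len(shared) / d
    return total / k

# Toy rankings of image ids returned for two paraphrased queries.
ranking_a = [3, 1, 4, 0, 2]
ranking_b = [1, 3, 0, 4, 2]

print(topk_jaccard(ranking_a, ranking_b, k=3))
print(average_overlap(ranking_a, ranking_b, k=3))
```

Average overlap weights agreement near the top of the list more heavily than Jaccard at a single cutoff, which is often closer to what a user browsing the first few results would notice.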

Critical Analysis

The paper provides a thoughtful analysis of an important issue with current dual-encoder vision-language models - their tendency to return very different results for paraphrased queries. This behavior can undermine the usability and predictability of these models in real-world applications.

The authors' proposed solutions, building on a pretrained language model and incorporating additional training objectives, seem promising. However, the paper does not explore the generalization of these techniques to other dual-encoder architectures or tasks beyond text-to-image retrieval.

Additionally, the paper does not delve into potential limitations or failure cases of the proposed approach. It would be valuable to understand the scenarios where the model may still struggle with paraphrased queries, and what factors contribute to its success or failure.

Although the paper reports similar zero-shot classification and retrieval accuracy for its best model, further research could examine in more detail the tradeoffs between paraphrase handling and these other performance metrics. Understanding these tradeoffs would help guide the development of dual-encoder models that balance various user-centric and task-specific requirements.

Conclusion

This paper tackles an important challenge in dual-encoder vision-language models - their sensitivity to paraphrased queries during text-to-image retrieval. By starting with a pretrained language model and incorporating additional training objectives, the authors demonstrate a path to improving the model's ability to recognize semantic similarity across paraphrased inputs.

These techniques could make multimodal retrieval systems built on pretrained language models more usable and predictable, and more robust to variation in how users phrase a query. Further research in this direction could lead to more reliable and user-friendly vision-language AI applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Cross-lingual paraphrase identification

Inessa Fedorova, Aleksei Musatow

The paraphrase identification task involves measuring semantic similarity between two short sentences. It is a tricky task, and multilingual paraphrase identification is even more challenging. In this work, we train a bi-encoder model in a contrastive manner to detect hard paraphrases across multiple languages. This approach allows us to use model-produced embeddings for various tasks, such as semantic search. We evaluate our model on downstream tasks and also assess embedding space quality. Our performance is comparable to state-of-the-art cross-encoders, with only a minimal relative drop of 7-10% on the chosen dataset, while keeping decent quality of embeddings.

6/24/2024

cs.CL

RWKV-CLIP: A Robust Vision-Language Representation Learner

Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng

Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust and efficient vision-language representation learner, it achieves state-of-the-art performance in several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/RWKV-CLIP

6/12/2024

cs.CV

Refining Skewed Perceptions in Vision-Language Models through Visual Representations

Haocheng Dai, Sarang Joshi

Large vision-language models (VLMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations, inherited in the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embedding, is more practical to refine the skewed perceptions in VLMs, emphasizing the superior utility of visual representations in overcoming embedded biases.

5/24/2024

cs.CV cs.CL

Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Andreas Koukounas, Georgios Mastrapas, Michael Gunther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao

Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.

6/27/2024

cs.CL cs.AI cs.CV cs.IR