Linking Representations with Multimodal Contrastive Learning

2304.03464

Published 6/26/2024 by Abhishek Arora, Xinmei Yang, Shao-Yu Jheng, Melissa Dell

Abstract

Many applications require linking individuals, firms, or locations across datasets. Most widely used methods, especially in social science, do not employ deep learning, with record linkage commonly approached using string matching techniques. Moreover, existing methods do not exploit the inherently multimodal nature of documents. In historical record linkage applications, documents are typically noisily transcribed by optical character recognition (OCR). Linkage with just OCR'ed texts may fail due to noise, whereas linkage with just image crops may also fail because vision models lack language understanding (e.g., of abbreviations or other different ways of writing firm names). To leverage multimodal learning, this study develops CLIPPINGS (Contrastively LInking Pooled Pre-trained Embeddings). CLIPPINGS aligns symmetric vision and language bi-encoders, through contrastive language-image pre-training on document images and their corresponding OCR'ed texts. It then contrastively learns a metric space where the pooled image-text embedding for a given instance is close to embeddings in the same class (e.g., the same firm or location) and distant from embeddings of a different class. Data are linked by treating linkage as a nearest neighbor retrieval problem with the multimodal embeddings. CLIPPINGS outperforms widely used string matching methods by a wide margin in linking mid-20th century Japanese firms across financial documents. A purely self-supervised model - trained only by aligning the embeddings for the image crop of a firm name and its corresponding OCR'ed text - also outperforms popular string matching methods. Fascinatingly, a multimodally pre-trained vision-only encoder outperforms a unimodally pre-trained vision-only encoder, illustrating the power of multimodal pre-training even if only one modality is available for linking at inference time.

Overview

Many applications require linking individuals, firms, or locations across datasets
Most widely used record linkage methods, especially in social science, do not use deep learning and rely on string matching techniques
Documents used for linkage may be noisy due to optical character recognition (OCR) transcription errors
Using just OCR text or just image crops can lead to linkage failures
CLIPPINGS addresses this by linking records with pooled image-text embeddings learned through multimodal contrastive training

Plain English Explanation

Many real-world applications need to connect information about the same people, companies, or places across different datasets. The most common methods for doing this, especially in fields like social science, don't use advanced machine learning techniques. Instead, they rely on string matching - comparing text strings to see how similar they are.

However, the documents used for this linkage process are often messy and imperfect. They may have been digitized using optical character recognition (OCR), which can introduce errors and noise into the text. Trying to link records using just the OCR'd text may fail due to these inaccuracies. But using just the image crops of the text could also be problematic, because computer vision models may struggle to understand things like abbreviations or alternative ways of writing company names.
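
As a rough illustration of the string matching baseline, and of why OCR noise and abbreviations hurt it, the sketch below computes a normalized edit-distance similarity in Python. It is a generic example, not the specific string matching methods evaluated in the paper, and the function names and firm names are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Under this measure, `similarity("Tokyo Electric Co", "Tokyo Electrlc Co")` stays high because a single OCR error changes only one character, while an abbreviation such as `"Tokyo Elec"` scores much lower even though it names the same hypothetical firm.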

To overcome these challenges, the researchers developed a new method called CLIPPINGS that combines a vision model and a language model. CLIPPINGS is first pre-trained on document images and their corresponding OCR'd text so that the visual and textual representations of the same item line up. It then learns to place records that refer to the same firm or location close together, which allows it to link records even when the text is noisy.

Technical Explanation

The CLIPPINGS model uses symmetric vision and language bi-encoders that are aligned through contrastive language-image pre-training on document images and their corresponding OCR'd texts. This multimodal pre-training is self-supervised: the training signal is that an image crop and its own OCR'd text should embed close together in a shared space, while mismatched crop-text pairs should embed far apart.
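
The image-text alignment objective can be sketched as a symmetric contrastive (CLIP-style) loss. The version below is a minimal pure-Python illustration, assuming precomputed embeddings and a temperature of 0.07; the function names, the temperature value, and the toy vectors in the usage note are assumptions, and the paper's exact loss and hyperparameters may differ.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_partition = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style loss: entry i of each list is the same instance in a different modality."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # logits[i][j] = temperature-scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = diagonal_cross_entropy(logits)
    text_to_image = diagonal_cross_entropy(transposed)
    return 0.5 * (image_to_text + text_to_image)
```

With toy inputs such as `images = [[1.0, 0.0], [0.0, 1.0]]`, the loss is near zero when each text embedding points in the same direction as its image embedding and becomes large when the texts are swapped, which is the behavior the pre-training stage optimizes for.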

After this alignment, CLIPPINGS contrastively learns a metric space in which the pooled image-text embedding for a given instance is close to embeddings of the same class (e.g., the same firm or location) and distant from embeddings of different classes. This step relies on knowing which instances belong to the same class. With such a metric space, linkage becomes a nearest neighbor retrieval problem: each record is linked to the record whose multimodal embedding is closest.
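
The pooling and retrieval step can be sketched as follows. This is a minimal illustration that assumes mean pooling of unit-normalized image and text embeddings and cosine similarity for the nearest neighbor search; the pooling choice, the function names, and the toy data in the usage note are assumptions rather than details confirmed by this summary.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def pool(image_emb, text_emb):
    """Average the two unit-normalized modality embeddings, then renormalize."""
    image_unit = normalize(image_emb)
    text_unit = normalize(text_emb)
    return normalize([(a + b) / 2 for a, b in zip(image_unit, text_unit)])


def nearest_neighbor(query, candidates):
    """Return (key, cosine similarity) for the candidate embedding closest to the query.

    `candidates` maps a record key to an already pooled, unit-length embedding.
    """
    best_key, best_sim = None, -math.inf
    for key, emb in candidates.items():
        sim = sum(a * b for a, b in zip(query, emb))
        if sim > best_sim:
            best_key, best_sim = key, sim
    return best_key, best_sim
```

For example, with `candidates = {"firm_a": pool([1, 0, 0], [0.9, 0.1, 0]), "firm_b": pool([0, 1, 0], [0, 0.8, 0.2])}`, a query built as `pool([0.8, 0.2, 0], [1, 0, 0.1])` is linked to `"firm_a"`. A linear scan is enough for a sketch; at scale, an approximate nearest neighbor index would typically replace the loop.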

The researchers show that CLIPPINGS outperforms widely used string matching methods by a wide margin when linking mid-20th century Japanese firms across financial documents. A purely self-supervised variant, trained only by aligning the image crop of each firm name with its OCR'd text, also outperforms popular string matching methods. Separately, a multimodally pre-trained vision-only encoder outperforms a unimodally pre-trained vision-only encoder, which suggests that multimodal pre-training helps even when only one modality is available for linking at inference time.

Critical Analysis

The paper presents a compelling approach to the challenge of record linkage, particularly in the context of historical documents with noisy OCR. By leveraging multimodal learning, CLIPPINGS is able to overcome the limitations of using just text or just image data for linkage.

That said, the paper does not explore the performance of CLIPPINGS on more modern, born-digital documents, where OCR errors may be less prevalent. Additionally, the evaluation is limited to a single dataset of Japanese firms, so further research would be needed to understand how well the approach generalizes to other domains and languages.

The authors also do not discuss the computational cost of CLIPPINGS relative to alternative methods, which could be an important consideration for real-world applications. Related work such as Gentle-CLIP, which targets alignment when few matched pairs are available, and RankCLIP, which adds ranking consistency to the contrastive objective, may suggest ways to refine the pre-training stage of approaches like CLIPPINGS.

Nonetheless, the core idea of contrastively learning multimodal representations for record linkage is compelling and could have broader applicability beyond the specific use case explored in this paper. Further research into the limitations and tradeoffs of multimodal approaches would be valuable for understanding when and how to effectively deploy them.

Conclusion

The CLIPPINGS model presents a novel approach to the challenge of record linkage, particularly in the context of historical documents with noisy OCR. By leveraging multimodal learning to align visual and textual representations, CLIPPINGS is able to outperform traditional string matching methods and handle the inherent challenges of working with imperfect data.

While the current evaluation is limited to a single dataset, the core idea of contrastively learning discriminative multimodal embeddings for nearest neighbor retrieval is a promising direction for further research. Techniques like CLIPPINGS could be useful in other applications that require linking records across noisy, document-based datasets.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents

Fuxiao Liu, Hao Tan, Chris Tensmeyer

Vision-language pretraining models have achieved great success in supporting multimedia applications by understanding the alignments between images and text. While existing vision-language pretraining models primarily focus on understanding single image associated with a single piece of text, they often ignore the alignment at the intra-document level, consisting of multiple sentences with multiple images. In this work, we propose DocumentCLIP, a salience-aware contrastive learning framework to enforce vision-language pretraining models to comprehend the interaction between images and longer text within documents. Our model is beneficial for the real-world multimodal document understanding like news article, magazines, product descriptions, which contain linguistically and visually richer content. To the best of our knowledge, we are the first to explore multimodal intra-document links by contrastive learning. In addition, we collect a large Wikipedia dataset for pretraining, which provides various topics and structures. Experiments show DocumentCLIP not only outperforms the state-of-the-art baselines in the supervised setting, but also achieves the best zero-shot performance in the wild after human evaluation. Our code is available at https://github.com/FuxiaoLiu/DocumentCLIP.

4/29/2024

cs.CV cs.AI

RankCLIP: Ranking-Consistent Language-Image Pretraining

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, Yining Sun

Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to list-wise, and leveraging both in-modal and cross-modal ranking consistency, RANKCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RANKCLIP in various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art methods, underscoring the importance of this enhanced learning process.

6/21/2024

cs.CV cs.AI cs.LG

Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data With Soft Alignment

Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Jiangbin Zheng, Kaicheng Yu, Wanyu Chen, Stan Z. Li

Multimodal fusion breaks through the barriers between diverse modalities and has already yielded numerous impressive performances. However, in various specialized fields, it is struggling to obtain sufficient alignment data for the training process, which seriously limits the use of previously elegant models. Thus, semi-supervised learning attempts to achieve multimodal alignment with fewer matched pairs but traditional methods like pseudo-labeling are difficult to apply in domains with no label information. To address these problems, we transform semi-supervised multimodal alignment into a manifold matching problem and propose a new method based on CLIP, named Gentle-CLIP. Specifically, we design a novel semantic density distribution loss to explore implicit semantic alignment information from unpaired multimodal data by constraining the latent representation distribution with fine granularity, thus eliminating the need for numerous strictly matched pairs. Meanwhile, we introduce multi-kernel maximum mean discrepancy as well as self-supervised contrastive loss to pull separate modality distributions closer and enhance the stability of the representation distribution. In addition, the contrastive loss used in CLIP is employed on the supervised matched data to prevent negative optimization. Extensive experiments conducted on a range of tasks in various fields, including protein, remote sensing, and the general vision-language field, demonstrate the effectiveness of our proposed Gentle-CLIP.

6/11/2024

cs.LG cs.AI cs.CL cs.CV

Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Le Zhang, Rabiul Awal, Aishwarya Agrawal

Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remains subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in bag-of-words representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at https://github.com/lezhang7/Enhance-FineGrained.

4/26/2024

cs.CV