Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval

Read original: arXiv:2401.11248 - Published 4/23/2024 by Guangyuan Ma, Xing Wu, Zijia Lin, Songlin Hu

Overview

This paper proposes a new pre-training method called "Bag-of-Word (BoW) Prediction" for improving the performance of dense passage retrieval models.
The key idea is to pre-train the model to predict a bag of words from the input text, rather than trying to reconstruct the entire input through a decoder-based approach like traditional masked auto-encoders.
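The authors report that BoW prediction pre-training reaches state-of-the-art retrieval performance on several large-scale benchmarks without a decoder module or any additional parameters, and that it provides a 67% training speed-up compared to standard masked auto-encoder pre-training with enhanced decoding.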

Plain English Explanation

The paper introduces a new way to pre-train dense retrieval models, which are used to quickly find relevant information in large text databases. A prevalent approach is masked auto-encoder pre-training, in which an encoder compresses a partially masked passage into a dense representation and additional decoder blocks try to reconstruct the original text from it.
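To make the retrieval setting concrete, here is a minimal sketch of how a dense retriever might rank passages once the query and the passages have been encoded as vectors. The three-dimensional vectors and the `rank_passages` helper are illustrative assumptions, not taken from the paper; a real system would produce the vectors with a pre-trained encoder and typically use an approximate nearest-neighbor index instead of an exhaustive loop.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted by descending dot-product score."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (hypothetical values, chosen only for illustration).
query = [1.0, 0.0, 1.0]
passages = [
    [0.9, 0.1, 0.8],  # close to the query direction
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.5, 0.5, 0.5],  # partially aligned
]

order = rank_passages(query, passages)
print(order)
```

Pre-training matters here because the quality of the ranking depends entirely on how much of each passage's content its single vector captures.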

However, the authors argue that this decoder-based approach is not necessary. Instead, they propose pre-training the model to simply predict a "bag of words": the set of words that appear in the text, regardless of their order. This is simpler and more efficient, yet still teaches the model useful representations for retrieval.

The authors show that this Bag-of-Word (BoW) Prediction pre-training leads to better retrieval performance compared to other methods, like using a traditional auto-encoder or no pre-training at all. It's like teaching a search engine to understand the key concepts in a document, rather than trying to memorize the entire text.

Technical Explanation

The key technical innovation in this paper is the Bag-of-Word (BoW) Prediction pre-training approach. Instead of using a decoder to reconstruct the full input text, as in standard masked auto-encoder pre-training, the authors train the model to predict the set of words that appear in the input.

Specifically, they randomly mask out some tokens in the input text, and then train the model to predict a binary vector indicating the presence or absence of each word in the vocabulary. This BoW prediction task encourages the model to learn representations that capture the key semantic concepts in the text, without needing to memorize the exact wording.
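The objective described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `bow_target` and `bow_bce_loss` are hypothetical, the vocabulary has only 8 entries, token id 0 is assumed to be a special token that is excluded from the target, and binary cross-entropy is assumed as the loss because the summary describes a binary presence vector. The loss used in the paper may differ.

```python
import math


def bow_target(token_ids, vocab_size, ignore_ids=()):
    """Multi-hot vector: 1.0 at every vocabulary id that occurs in the passage."""
    target = [0.0] * vocab_size
    for tid in set(token_ids) - set(ignore_ids):
        target[tid] = 1.0
    return target


def bow_bce_loss(logits, target):
    """Mean binary cross-entropy between per-word logits and a multi-hot target."""
    total = 0.0
    for z, y in zip(logits, target):
        # Numerically stable form of BCE-with-logits:
        # max(z, 0) - y*z + log(1 + exp(-|z|))
        total += max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Toy passage: repeated ids collapse, because only presence matters.
token_ids = [0, 2, 5, 5, 7]
target = bow_target(token_ids, vocab_size=8, ignore_ids=(0,))

# Logits would come from projecting the passage representation onto the
# vocabulary; here they are hand-written stand-ins.
uniform_loss = bow_bce_loss([0.0] * 8, target)
confident_loss = bow_bce_loss([10.0 if y else -10.0 for y in target], target)
print(target, uniform_loss, confident_loss)
```

Because the target ignores word order and repetition, no decoder is needed to produce it: a single projection from the passage representation to vocabulary-sized logits is enough to compute the loss.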

The authors evaluate BoW Prediction pre-training on several dense passage retrieval benchmarks and report that it outperforms other pre-training approaches, including BERT-style masked language modeling and contrastive learning. They argue that the simpler BoW objective is more effective at learning representations that are useful for retrieval.

Critical Analysis

The authors provide a thorough evaluation of their BoW Prediction pre-training approach, considering its performance on multiple retrieval datasets and comparing to strong baselines. The results are impressive, showing clear benefits over other pre-training methods.

However, the paper does not explore the limitations of this approach in depth. For example, it's not clear how the BoW Prediction pre-training would perform on more complex language understanding tasks beyond retrieval, where the order and structure of the text may be more important.

Additionally, the authors do not investigate the interpretability or explainability of the learned representations, which could be an interesting area for further research. Understanding how the BoW Prediction objective shapes the model's internal representations could provide insights into its strengths and weaknesses.

Overall, this is a well-executed study that makes a compelling case for the BoW Prediction pre-training approach. Its simplicity and strong empirical results stand out, and the work opens up interesting directions for future research on efficient pre-training methods for retrieval and beyond.

Conclusion

This paper presents a novel pre-training method called Bag-of-Word (BoW) Prediction that improves the performance of dense passage retrieval models. By training the model to predict the set of words present in the input, rather than trying to reconstruct the full text, the authors show significant gains in retrieval accuracy compared to other pre-training approaches.

The key insight is that the BoW Prediction objective is simpler and more efficient than traditional decoder-based pre-training, yet still allows the model to learn useful representations for finding relevant information in large text databases. This work demonstrates the potential benefits of rethinking pre-training objectives to better suit the target task, and could inspire similar approaches for other language understanding problems.

Related Papers

Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval

Guangyuan Ma, Xing Wu, Zijia Lin, Songlin Hu

Masked auto-encoder pre-training has emerged as a prevalent technique for initializing and enhancing dense retrieval systems. It generally utilizes additional Transformer decoder blocks to provide sustainable supervision signals and compress contextual information into dense representations. However, the underlying reasons for the effectiveness of such a pre-training technique remain unclear. The usage of additional Transformer-based decoders also incurs significant computational costs. In this study, we aim to shed light on this issue by revealing that masked auto-encoder (MAE) pre-training with enhanced decoding significantly improves the term coverage of input tokens in dense representations, compared to vanilla BERT checkpoints. Building upon this observation, we propose a modification to the traditional MAE by replacing the decoder of a masked auto-encoder with a completely simplified Bag-of-Word prediction task. This modification enables the efficient compression of lexical signals into dense representations through unsupervised pre-training. Remarkably, our proposed method achieves state-of-the-art retrieval performance on several large-scale retrieval benchmarks without requiring any additional parameters, which provides a 67% training speed-up compared to standard masked auto-encoder pre-training with enhanced decoding.

4/23/2024

BPDec: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining

Wen Liang, Youzhi Liang

BERT (Bidirectional Encoder Representations from Transformers) has revolutionized the field of natural language processing through its exceptional performance on numerous tasks. Yet, the majority of researchers have mainly concentrated on enhancements related to the model structure, such as relative position embedding and more efficient attention mechanisms. Others have delved into pretraining tricks associated with Masked Language Modeling, including whole word masking. DeBERTa introduced an enhanced decoder adapted for BERT's encoder model for pretraining, proving to be highly effective. We argue that the design and research around enhanced masked language modeling decoders have been underappreciated. In this paper, we propose several designs of enhanced decoders and introduce BPDec (BERT Pretraining Decoder), a novel method for modeling training. Typically, a pretrained BERT model is fine-tuned for specific Natural Language Understanding (NLU) tasks. In our approach, we utilize the original BERT model as the encoder, making only changes to the decoder without altering the encoder. This approach does not necessitate extensive modifications to the encoder architecture and can be seamlessly integrated into existing fine-tuning pipelines and services, offering an efficient and effective enhancement strategy. Compared to other methods, while we also incur a moderate training cost for the decoder during the pretraining process, our approach does not introduce additional training costs during the fine-tuning phase. We test multiple enhanced decoder structures after pretraining and evaluate their performance on the GLUE tasks and SQuAD tasks. Our results demonstrate that BPDec, having only undergone subtle refinements to the model structure during pretraining, significantly enhances model performance without escalating the finetuning cost, inference time and serving budget.

6/11/2024

3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining

Siming Yan, Yuqi Yang, Yuxiao Guo, Hao Pan, Peng-shuai Wang, Xin Tong, Yang Liu, Qixing Huang

Masked autoencoders (MAE) have recently been introduced to 3D self-supervised pretraining for point clouds due to their great success in NLP and computer vision. Unlike MAEs used in the image domain, where the pretext task is to restore features at the masked pixels, such as colors, the existing 3D MAE works reconstruct the missing geometry only, i.e., the location of the masked points. In contrast to previous studies, we advocate that point location recovery is inessential and restoring intrinsic point features is much superior. To this end, we propose to ignore point position reconstruction and recover high-order features at masked points including surface normals and surface variations, through a novel attention-based decoder which is independent of the encoder design. We validate the effectiveness of our pretext task and decoder design using different encoder structures for 3D training and demonstrate the advantages of our pretrained networks on various point cloud analysis tasks.

4/30/2024

CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder

Tangfei Liao, Xiaoqin Zhang, Guobao Xiao, Min Li, Tao Wang, Mang Ye

Pre-training has emerged as a simple yet powerful methodology for representation learning across various domains. However, due to the expensive training cost and limited data, pre-training has not yet been extensively studied in correspondence pruning. To tackle these challenges, we propose a pre-training method to acquire a generic inliers-consistent representation by reconstructing masked correspondences, providing a strong initial representation for downstream tasks. Toward this objective, a modicum of true correspondences naturally serve as input, thus significantly reducing pre-training overhead. In practice, we introduce CorrMAE, an extension of the masked autoencoder framework tailored for the pre-training of correspondence pruning. CorrMAE involves two main phases, i.e., correspondence learning and matching point reconstruction, guiding the reconstruction of masked correspondences through learning visible correspondence consistency. Herein, we employ a dual-branch structure with an ingenious positional encoding to reconstruct unordered and irregular correspondences. Also, a bi-level designed encoder is proposed for correspondence learning, which offers enhanced consistency learning capability and transferability. Extensive experiments have shown that the model pre-trained with our CorrMAE outperforms prior work on multiple challenging benchmarks. Meanwhile, our CorrMAE is primarily a task-driven pre-training method, and can achieve notable improvements for downstream tasks by pre-training on the targeted dataset. We hope this work can provide a starting point for correspondence pruning pre-training.

6/11/2024