Efficient Multi-Vector Dense Retrieval Using Bit Vectors

2404.02805

Published 4/4/2024 by Franco Maria Nardini, Cosimo Rulli, Rossano Venturini

Efficient Multi-Vector Dense Retrieval Using Bit Vectors

Abstract

Dense retrieval techniques employ pre-trained large language models to build a high-dimensional representation of queries and passages. These representations compute the relevance of a passage w.r.t. to a query using efficient similarity measures. In this line, multi-vector representations show improved effectiveness at the expense of a one-order-of-magnitude increase in memory footprint and query latency by encoding queries and documents on a per-token level. Recently, PLAID has tackled these problems by introducing a centroid-based term representation to reduce the memory impact of multi-vector systems. By exploiting a centroid interaction mechanism, PLAID filters out non-relevant documents, thus reducing the cost of the successive ranking stages. This paper proposes ``Efficient Multi-Vector dense retrieval with Bit vectors'' (EMVB), a novel framework for efficient query processing in multi-vector dense retrieval. First, EMVB employs a highly efficient pre-filtering step of passages using optimized bit vectors. Second, the computation of the centroid interaction happens column-wise, exploiting SIMD instructions, thus reducing its latency. Third, EMVB leverages Product Quantization (PQ) to reduce the memory footprint of storing vector representations while jointly allowing for fast late interaction. Fourth, we introduce a per-document term filtering method that further improves the efficiency of the last step. Experiments on MS MARCO and LoTTE show that EMVB is up to 2.8x faster while reducing the memory footprint by 1.8x with no loss in retrieval accuracy compared to PLAID.

Get summaries of the top AI research delivered straight to your inbox:

Overview

This paper proposes an efficient multi-vector dense retrieval system using bit vectors.
The system aims to improve the speed and scalability of dense retrieval tasks, which are critical for many natural language processing applications.
The key ideas include using multiple vectors to represent each document and efficiently storing and comparing these vectors using bit-level operations.

Plain English Explanation

The paper describes a new way to quickly find relevant documents or pieces of text in a large database. This is an important task for many AI and language applications, like search engines, question answering, and summarization.

The researchers' approach uses multiple short bit vectors to represent each document, instead of a single long vector. Bit vectors are a compact way to store vector data, where each element is just a 1 or a 0. By using multiple bit vectors per document, the system can capture more nuanced information about the document's content.

To search the database, the system quickly compares the query's bit vectors against all the document bit vectors using efficient bitwise operations. This is faster than the traditional approach of comparing long, dense vectors. The researchers also describe ways to further optimize the storage and comparison of these bit vectors to improve speed and scalability.

Technical Explanation

The paper introduces a multi-vector dense retrieval system that represents each document with multiple short bit vectors, rather than a single long dense vector.

Each document's content is first encoded into multiple word embeddings using a pre-trained language model. These word vectors are then aggregated into multiple short bit vectors, where each bit indicates the presence or absence of certain semantic properties.

To retrieve relevant documents for a query, the system compares the query's bit vectors against all the document bit vectors using efficient bitwise operations. This is faster than the traditional approach of computing cosine similarity between long, dense vectors.

The paper also describes techniques to further optimize the storage and comparison of these bit vectors. For example, the bit vectors can be stored in a balanced way to improve memory access patterns. Additionally, the bit vectors can be compressed and processed in a way that takes advantage of the inherent structure in natural language.

Critical Analysis

The paper presents a novel and promising approach for efficient multi-vector dense retrieval. The use of bit vectors and bitwise operations to quickly compare document representations is a clever optimization that could significantly improve the speed and scalability of dense retrieval tasks.

However, the paper does not provide a comprehensive evaluation of the system's performance compared to state-of-the-art dense retrieval methods. The experiments are limited to a few datasets and do not explore the system's robustness to different types of queries or data distributions.

Additionally, the paper does not discuss potential limitations or drawbacks of the multi-vector approach. For example, it's unclear how the system would handle cases where the semantic properties captured by the bit vectors do not align well with the user's information needs.

Further research is needed to better understand the trade-offs and limitations of this approach, as well as to explore ways to make it more adaptive and flexible for a broader range of retrieval tasks and applications.

Conclusion

This paper introduces an efficient multi-vector dense retrieval system that uses bit vectors to represent and compare document content. The key ideas, including the use of multiple bit vectors per document and optimized bitwise operations for retrieval, have the potential to significantly improve the speed and scalability of dense retrieval tasks.

While the paper presents promising initial results, more comprehensive evaluation and analysis are needed to fully understand the capabilities and limitations of this approach. Nonetheless, the work represents an interesting and innovative contribution to the field of efficient text retrieval, and it could have important implications for a wide range of natural language processing applications.

Related Papers

LeanVec: Searching vectors faster by making them fit

Mariano Tepper, Ishwar Singh Bhati, Cecilia Aguerrebere, Mark Hildebrand, Ted Willke

Modern deep learning models have the ability to generate high-dimensional vectors whose similarity reflects semantic resemblance. Thus, similarity search, i.e., the operation of retrieving those vectors in a large collection that are similar to a given query, has become a critical component of a wide range of applications that demand highly accurate and timely answers. In this setting, the high vector dimensionality puts similarity search systems under compute and memory pressure, leading to subpar performance. Additionally, cross-modal retrieval tasks have become increasingly common, e.g., where a user inputs a text query to find the most relevant images for that query. However, these queries often have different distributions than the database embeddings, making it challenging to achieve high accuracy. In this work, we present LeanVec, a framework that combines linear dimensionality reduction with vector quantization to accelerate similarity search on high-dimensional vectors while maintaining accuracy. We present LeanVec variants for in-distribution (ID) and out-of-distribution (OOD) queries. LeanVec-ID yields accuracies on par with those from recently introduced deep learning alternatives whose computational overhead precludes their usage in practice. LeanVec-OOD uses two novel techniques for dimensionality reduction that consider the query and database distributions to simultaneously boost the accuracy and the performance of the framework even further (even presenting competitive results when the query and database distributions match). All in all, our extensive and varied experimental results show that LeanVec produces state-of-the-art results, with up to 3.7x improvement in search throughput and up to 4.9x faster index build time over the state of the art.

4/4/2024

cs.LG cs.DB

🤔

Semi-Parametric Retrieval via Binary Token Index

Jiawei Zhou, Li Dong, Furu Wei, Lei Chen

The landscape of information retrieval has broadened from search services to a critical component in various advanced applications, where indexing efficiency, cost-effectiveness, and freshness are increasingly important yet remain less explored. To address these demands, we introduce Semi-parametric Vocabulary Disentangled Retrieval (SVDR). SVDR is a novel semi-parametric retrieval framework that supports two types of indexes: an embedding-based index for high effectiveness, akin to existing neural retrieval methods; and a binary token index that allows for quick and cost-effective setup, resembling traditional term-based retrieval. In our evaluation on three open-domain question answering benchmarks with the entire Wikipedia as the retrieval corpus, SVDR consistently demonstrates superiority. It achieves a 3% higher top-1 retrieval accuracy compared to the dense retriever DPR when using an embedding-based index and an 9% higher top-1 accuracy compared to BM25 when using a binary token index. Specifically, the adoption of a binary token index reduces index preparation time from 30 GPU hours to just 2 CPU hours and storage size from 31 GB to 2 GB, achieving a 90% reduction compared to an embedding-based index.

5/6/2024

cs.CL cs.AI cs.IR

🔍

PLAID SHIRTTT for Large-Scale Streaming Dense Retrieval

Dawn Lawrie, Efsun Kayi, Eugene Yang, James Mayfield, Douglas W. Oard

PLAID, an efficient implementation of the ColBERT late interaction bi-encoder using pretrained language models for ranking, consistently achieves state-of-the-art performance in monolingual, cross-language, and multilingual retrieval. PLAID differs from ColBERT by assigning terms to clusters and representing those terms as cluster centroids plus compressed residual vectors. While PLAID is effective in batch experiments, its performance degrades in streaming settings where documents arrive over time because representations of new tokens may be poorly modeled by the earlier tokens used to select cluster centroids. PLAID Streaming Hierarchical Indexing that Runs on Terabytes of Temporal Text (PLAID SHIRTTT) addresses this concern using multi-phase incremental indexing based on hierarchical sharding. Experiments on ClueWeb09 and the multilingual NeuCLIR collection demonstrate the effectiveness of this approach both for the largest collection indexed to date by the ColBERT architecture and in the multilingual setting, respectively.

5/3/2024

cs.IR cs.CL

A Reproducibility Study of PLAID

Sean MacAvaney, Nicola Tonellotto

The PLAID (Performance-optimized Late Interaction Driver) algorithm for ColBERTv2 uses clustered term representations to retrieve and progressively prune documents for final (exact) document scoring. In this paper, we reproduce and fill in missing gaps from the original work. By studying the parameters PLAID introduces, we find that its Pareto frontier is formed of a careful balance among its three parameters; deviations beyond the suggested settings can substantially increase latency without necessarily improving its effectiveness. We then compare PLAID with an important baseline missing from the paper: re-ranking a lexical system. We find that applying ColBERTv2 as a re-ranker atop an initial pool of BM25 results provides better efficiency-effectiveness trade-offs in low-latency settings. However, re-ranking cannot reach peak effectiveness at higher latency settings due to limitations in recall of lexical matching and provides a poor approximation of an exhaustive ColBERTv2 search. We find that recently proposed modifications to re-ranking that pull in the neighbors of top-scoring documents overcome this limitation, providing a Pareto frontier across all operational points for ColBERTv2 when evaluated using a well-annotated dataset. Curious about why re-ranking methods are highly competitive with PLAID, we analyze the token representation clusters PLAID uses for retrieval and find that most clusters are predominantly aligned with a single token and vice versa. Given the competitive trade-offs that re-ranking baselines exhibit, this work highlights the importance of carefully selecting pertinent baselines when evaluating the efficiency of retrieval engines.

4/24/2024

cs.IR cs.CL