Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders

2405.03651

Published 5/7/2024 by Nishant Yadav, Nicholas Monath, Manzil Zaheer, Rob Fergus, Andrew McCallum

🏷️

Abstract

Cross-encoder (CE) models which compute similarity by jointly encoding a query-item pair perform better than embedding-based models (dual-encoders) at estimating query-item relevance. Existing approaches perform k-NN search with CE by approximating the CE similarity with a vector embedding space fit either with dual-encoders (DE) or CUR matrix factorization. DE-based retrieve-and-rerank approaches suffer from poor recall on new domains and the retrieval with DE is decoupled from the CE. While CUR-based approaches can be more accurate than the DE-based approach, they require a prohibitively large number of CE calls to compute item embeddings, thus making it impractical for deployment at scale. In this paper, we address these shortcomings with our proposed sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity. We compute item embeddings offline by factorizing a sparse matrix containing query-item CE scores for a set of train queries. Our method produces a high-quality approximation while requiring only a fraction of CE calls as compared to CUR-based methods, and allows for leveraging DE to initialize the embedding space while avoiding compute- and resource-intensive finetuning of DE via distillation. At test time, the item embeddings remain fixed and retrieval occurs over rounds, alternating between a) estimating the test query embedding by minimizing error in approximating CE scores of items retrieved thus far, and b) using the updated test query embedding for retrieving more items. Our k-NN search method improves recall by up to 5% (k=1) and 54% (k=100) over DE-based approaches. Additionally, our indexing approach achieves a speedup of up to 100x over CUR-based and 5x over DE distillation methods, while matching or improving k-NN search recall over baselines.

Get summaries of the top AI research delivered straight to your inbox:

Overview

Cross-encoder (CE) models perform better than embedding-based (dual-encoder) models at estimating query-item relevance, but existing approaches have limitations.
Dual-encoder-based "retrieve-and-rerank" approaches suffer from poor recall on new domains, and the retrieval is decoupled from the CE.
CUR-based approaches can be more accurate than dual-encoder-based ones, but require too many CE calls, making them impractical at scale.

Plain English Explanation

The paper proposes a new method to efficiently approximate the performance of cross-encoder (CE) models, which are better than embedding-based (dual-encoder) models at determining how relevant a query is to an item. Existing approaches have drawbacks:

Dual-encoder-based "retrieve-and-rerank" methods don't work well on new types of data, and the initial retrieval step is separate from the final ranking step.
CUR-based approaches can be more accurate, but require a huge number of CE model evaluations, making them impractical to use at a large scale.

The new method computes item embeddings offline by factorizing a sparse matrix of CE scores for training queries. This allows for efficient retrieval using approximate CE scores, without needing to run the expensive CE model many times. The approach also leverages dual-encoder models to initialize the embedding space, avoiding the need for resource-intensive fine-tuning.

Technical Explanation

The paper introduces a sparse-matrix factorization approach to efficiently approximate cross-encoder (CE) similarity scores for efficient retrieval. Existing methods have limitations:

Dual-encoder (DE) based "retrieve-and-rerank" approaches suffer from poor recall on new domains, as the retrieval with DE is decoupled from the final CE-based ranking.
CUR-based approaches can be more accurate than DE-based methods, but require a prohibitively large number of CE calls to compute item embeddings, making them impractical at scale.

The proposed method computes offline item embeddings by factorizing a sparse matrix of query-item CE scores for training queries. This allows for efficient k-NN search using the approximate CE similarity, requiring only a fraction of the CE calls compared to CUR-based methods. The approach also leverages DE models to initialize the embedding space, avoiding the need for expensive fine-tuning via distillation.

At test time, the method alternates between: 1) estimating the test query embedding by minimizing error in approximating CE scores of items retrieved so far, and 2) using the updated query embedding to retrieve more items. This k-NN search method improves recall by up to 5% (k=1) and 54% (k=100) over DE-based baselines. The indexing approach also achieves a speedup of up to 100x over CUR-based and 5x over DE distillation methods, while matching or improving k-NN search recall.

Critical Analysis

The paper addresses important limitations of existing approaches for approximating cross-encoder (CE) similarity scores for efficient retrieval. The proposed sparse-matrix factorization method seems promising, as it achieves significant speedups over CUR-based and distillation-based methods while maintaining or improving retrieval performance.

However, the paper does not discuss the quality of the approximation to the true CE scores, nor does it explore the impact of the approximation error on downstream applications. Additionally, the method relies on having a sufficient number of training queries with CE scores, which may not always be available in practice.

Further research could investigate the robustness of the approach to different query and item distributions, as well as the potential for incorporating knowledge graph context to improve the quality of the approximation.

Conclusion

The paper presents a novel sparse-matrix factorization approach to efficiently approximate cross-encoder (CE) similarity scores for low-latency retrieval. This method addresses the limitations of existing dual-encoder-based and CUR-based approaches, enabling more accurate retrieval while requiring significantly fewer CE model evaluations.

The proposed technique has the potential to improve the performance and scalability of retrieval systems that rely on CE models, which are known to outperform embedding-based models in estimating query-item relevance. The authors demonstrate promising results in terms of retrieval recall and computational efficiency, suggesting that this approach could have a meaningful impact on real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Shallow Cross-Encoders for Low-Latency Retrieval

Aleksandr V. Petrov, Sean MacAvaney, Craig Macdonald

Transformer-based Cross-Encoders achieve state-of-the-art effectiveness in text retrieval. However, Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window. However, keeping search latencies low is important for user satisfaction and energy usage. In this paper, we show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings since they can estimate the relevance of more documents in the same time budget. We further show that shallow transformers may benefit from the generalized Binary Cross-Entropy (gBCE) training scheme, which has recently demonstrated success for recommendation tasks. Our experiments with TREC Deep Learning passage ranking query sets demonstrate significant improvements in shallow and full-scale models in low-latency scenarios. For example, when the latency limit is 25ms per query, MonoBERT-Large (a cross-encoder based on a full-scale BERT model) is only able to achieve NDCG@10 of 0.431 on TREC DL 2019, while TinyBERT-gBCE (a cross-encoder based on TinyBERT trained with gBCE) reaches NDCG@10 of 0.652, a +51% gain over MonoBERT-Large. We also show that shallow Cross-Encoders are effective even when used without a GPU (e.g., with CPU inference, NDCG@10 decreases only by 3% compared to GPU inference with 50ms latency), which makes Cross-Encoders practical to run even without specialized hardware acceleration.

4/1/2024

cs.IR cs.CL

Event-enhanced Retrieval in Real-time Search

Yanan Zhang, Xiaoling Bai, Tianhua Zhou

The embedding-based retrieval (EBR) approach is widely used in mainstream search engine retrieval systems and is crucial in recent retrieval-augmented methods for eliminating LLM illusions. However, existing EBR models often face the semantic drift problem and insufficient focus on key information, leading to a low adoption rate of retrieval results in subsequent steps. This issue is especially noticeable in real-time search scenarios, where the various expressions of popular events on the Internet make real-time retrieval heavily reliant on crucial event information. To tackle this problem, this paper proposes a novel approach called EER, which enhances real-time retrieval performance by improving the dual-encoder model of traditional EBR. We incorporate contrastive learning to accompany pairwise learning for encoder optimization. Furthermore, to strengthen the focus on critical event information in events, we include a decoder module after the document encoder, introduce a generative event triplet extraction scheme based on prompt-tuning, and correlate the events with query encoder optimization through comparative learning. This decoder module can be removed during inference. Extensive experiments demonstrate that EER can significantly improve the real-time search retrieval performance. We believe that this approach will provide new perspectives in the field of information retrieval. The codes and dataset are available at https://github.com/open-event-hub/Event-enhanced_Retrieval .

4/10/2024

cs.CL cs.IR

🚀

Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations

Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini

Learned sparse representations form an attractive class of contextual embeddings for text retrieval. That is so because they are effective models of relevance and are interpretable by design. Despite their apparent compatibility with inverted indexes, however, retrieval over sparse embeddings remains challenging. That is due to the distributional differences between learned embeddings and term frequency-based lexical models of relevance such as BM25. Recognizing this challenge, a great deal of research has gone into, among other things, designing retrieval algorithms tailored to the properties of learned sparse representations, including approximate retrieval systems. In fact, this task featured prominently in the latest BigANN Challenge at NeurIPS 2023, where approximate algorithms were evaluated on a large benchmark dataset by throughput and recall. In this work, we propose a novel organization of the inverted index that enables fast yet effective approximate retrieval over learned sparse embeddings. Our approach organizes inverted lists into geometrically-cohesive blocks, each equipped with a summary vector. During query processing, we quickly determine if a block must be evaluated using the summaries. As we show experimentally, single-threaded query processing using our method, Seismic, reaches sub-millisecond per-query latency on various sparse embeddings of the MS MARCO dataset while maintaining high recall. Our results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions and further outperforms the winning (graph-based) submissions to the BigANN Challenge by a significant margin.

4/30/2024

cs.IR

🛸

Can Query Expansion Improve Generalization of Strong Cross-Encoder Rankers?

Minghan Li, Honglei Zhuang, Kai Hui, Zhen Qin, Jimmy Lin, Rolf Jagerman, Xuanhui Wang, Michael Bendersky

Query expansion has been widely used to improve the search results of first-stage retrievers, yet its influence on second-stage, cross-encoder rankers remains under-explored. A recent work of Weller et al. [44] shows that current expansion techniques benefit weaker models such as DPR and BM25 but harm stronger rankers such as MonoT5. In this paper, we re-examine this conclusion and raise the following question: Can query expansion improve generalization of strong cross-encoder rankers? To answer this question, we first apply popular query expansion methods to state-of-the-art cross-encoder rankers and verify the deteriorated zero-shot performance. We identify two vital steps for cross-encoders in the experiment: high-quality keyword generation and minimal-disruptive query modification. We show that it is possible to improve the generalization of a strong neural ranker, by prompt engineering and aggregating the ranking results of each expanded query via fusion. Specifically, we first call an instruction-following language model to generate keywords through a reasoning chain. Leveraging self-consistency and reciprocal rank weighting, we further combine the ranking results of each expanded query dynamically. Experiments on BEIR and TREC Deep Learning 2019/2020 show that the nDCG@10 scores of both MonoT5 and RankT5 following these steps are improved, which points out a direction for applying query expansion to strong cross-encoder rankers.

5/1/2024

cs.IR cs.AI