Efficient Retrieval with Learned Similarities

Read original: arXiv:2407.15462 - Published 8/15/2024 by Bailu Ding, Jiaqi Zhai

Efficient Retrieval with Learned Similarities

Overview

This paper explores efficient ways to perform information retrieval tasks using learned similarities.
The key ideas are:
- Expressing complex similarity functions as a mixture of simple logistic functions.
- Learning these mixture parameters from data to improve retrieval accuracy.
- Leveraging this learned similarity for efficient nearest neighbor search.

Plain English Explanation

The paper introduces a new approach for information retrieval - the process of finding relevant information (documents, images, etc.) given a query. The core idea is to learn a similarity function that can accurately compare the query to the available items and retrieve the most relevant ones.

Traditionally, similarity functions used in retrieval systems are simple, like cosine similarity. However, the authors argue that real-world retrieval tasks often require more expressive, complex similarity functions. Their key insight is to represent these complex functions as a mixture of simple logistic functions.

The paper then shows how to learn the parameters of this mixture model from data, allowing the similarity function to be tailored to the specific retrieval task. This learned similarity can then be used to perform efficient nearest neighbor search to quickly identify the most relevant items for a given query.

The benefits of this approach are twofold:

Improved accuracy: The learned, more expressive similarity function can better capture the nuances of the retrieval task compared to simple, predefined functions.
Efficient retrieval: The mixture form allows for fast nearest neighbor search, enabling real-time retrieval even for large datasets.

Overall, this work provides a principled way to learn powerful yet efficient similarity functions for information retrieval, with potential applications in areas like text similarity, image search, and recommendation systems.

Technical Explanation

The paper introduces a mixture of logits representation for similarity functions used in information retrieval. Formally, the similarity between a query q and an item x is defined as:

s(q, x) = Σ w_i * σ(a_i^T q + b_i^T x + c_i)

where σ is the logistic sigmoid function, and the w_i, a_i, b_i, and c_i are learnable parameters.

This mixture form is shown to be highly expressive, able to capture complex similarity patterns that go beyond simple metrics like cosine similarity. The parameters of the mixture model are learned from data using gradient-based optimization.

The learned similarity function s(q, x) is then used to perform efficient nearest neighbor search for retrieval. The authors propose a novel algorithm that leverages the mixture structure to quickly identify the top-k most similar items for a given query, even in large-scale settings.

Experiments on several retrieval benchmarks demonstrate the effectiveness of the proposed approach. Compared to baselines like BERT-based retrieval, the learned mixture model achieves significant improvements in accuracy while maintaining efficient retrieval times.

Critical Analysis

The paper makes a compelling case for using more expressive similarity functions in information retrieval tasks. The mixture of logits representation is a flexible and theoretically grounded approach, with strong empirical results.

One potential limitation is the complexity of the learned similarity function, which may make it challenging to interpret or explain the model's decisions. The authors do not address this aspect in depth, though interpretability is an important consideration for many real-world applications.

Additionally, the paper focuses on the retrieval task itself, but does not explore how the learned similarities could be used in a broader system, such as for citation retrieval or question-answering. Investigating these downstream use cases could further demonstrate the practical value of the proposed approach.

Overall, this work presents a promising direction for enhancing the expressiveness and efficiency of information retrieval systems. Future research could explore ways to balance model complexity with interpretability, as well as integration with other tasks and applications.

Conclusion

This paper introduces a novel approach for efficient information retrieval using learned similarity functions. By representing complex similarities as a mixture of simple logistic functions, the authors show how to learn expressive yet efficient similarity models from data.

The key contributions are:

Expressive Similarity: The mixture of logits formulation allows for flexible and powerful similarity functions that go beyond simple metrics.
Efficient Retrieval: The mixture structure enables fast nearest neighbor search, enabling real-time retrieval even at large scale.
Improved Accuracy: Experiments demonstrate significant performance gains over baseline retrieval methods.

This work has the potential to enhance a wide range of information retrieval applications, from text similarity to image search to recommendation systems. By advancing both the expressiveness and efficiency of retrieval models, the proposed approach represents an important step forward in this fundamental area of machine learning and information processing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Efficient Retrieval with Learned Similarities

Bailu Ding, Jiaqi Zhai

Retrieval plays a fundamental role in recommendation systems, search, and natural language processing by efficiently finding relevant items from a large corpus given a query. Dot products have been widely used as the similarity function in such retrieval tasks, thanks to Maximum Inner Product Search (MIPS) that enabled efficient retrieval based on dot products. However, state-of-the-art retrieval algorithms have migrated to learned similarities. Such algorithms vary in form; the queries can be represented with multiple embeddings, complex neural networks can be deployed, the item ids can be decoded directly from queries using beam search, and multiple approaches can be combined in hybrid solutions. Unfortunately, we lack efficient solutions for retrieval in these state-of-the-art setups. Our work investigates techniques for approximate nearest neighbor search with learned similarity functions. We first prove that Mixture-of-Logits (MoL) is a universal approximator, and can express all learned similarity functions. We next propose techniques to retrieve the approximate top K results using MoL with a tight bound. We finally compare our techniques with existing approaches, showing that MoL sets new state-of-the-art results on recommendation retrieval tasks, and our approximate top-k retrieval with learned similarities outperforms baselines by up to two orders of magnitude in latency, while achieving > .99 recall rate of exact algorithms.

8/15/2024

Relevance Filtering for Embedding-based Retrieval

Nicholas Rossi, Juexin Lin, Feng Liu, Zhen Yang, Tony Lee, Alessandro Magnani, Ciya Liao

In embedding-based retrieval, Approximate Nearest Neighbor (ANN) search enables efficient retrieval of similar items from large-scale datasets. While maximizing recall of relevant items is usually the goal of retrieval systems, a low precision may lead to a poor search experience. Unlike lexical retrieval, which inherently limits the size of the retrieved set through keyword matching, dense retrieval via ANN search has no natural cutoff. Moreover, the cosine similarity scores of embedding vectors are often optimized via contrastive or ranking losses, which make them difficult to interpret. Consequently, relying on top-K or cosine-similarity cutoff is often insufficient to filter out irrelevant results effectively. This issue is prominent in product search, where the number of relevant products is often small. This paper introduces a novel relevance filtering component (called Cosine Adapter) for embedding-based retrieval to address this challenge. Our approach maps raw cosine similarity scores to interpretable scores using a query-dependent mapping function. We then apply a global threshold on the mapped scores to filter out irrelevant results. We are able to significantly increase the precision of the retrieved set, at the expense of a small loss of recall. The effectiveness of our approach is demonstrated through experiments on both public MS MARCO dataset and internal Walmart product search data. Furthermore, online A/B testing on the Walmart site validates the practical value of our approach in real-world e-commerce settings.

8/12/2024

🔎

Description-Based Text Similarity

Shauli Ravfogel, Valentina Pyatkin, Amir DN Cohen, Avshalom Manevich, Yoav Goldberg

Identifying texts with a given semantics is central for many information seeking scenarios. Similarity search over vector embeddings appear to be central to this ability, yet the similarity reflected in current text embeddings is corpus-driven, and is inconsistent and sub-optimal for many use cases. What, then, is a good notion of similarity for effective retrieval of text? We identify the need to search for texts based on abstract descriptions of their content, and the corresponding notion of emph{description based similarity}. We demonstrate the inadequacy of current text embeddings and propose an alternative model that significantly improves when used in standard nearest neighbor search. The model is trained using positive and negative pairs sourced through prompting a LLM, demonstrating how data from LLMs can be used for creating new capabilities not immediately possible using the original model.

7/25/2024

COS-Mix: Cosine Similarity and Distance Fusion for Improved Information Retrieval

Kush Juvekar, Anupam Purwar

This study proposes a novel hybrid retrieval strategy for Retrieval-Augmented Generation (RAG) that integrates cosine similarity and cosine distance measures to improve retrieval performance, particularly for sparse data. The traditional cosine similarity measure is widely used to capture the similarity between vectors in high-dimensional spaces. However, it has been shown that this measure can yield arbitrary results in certain scenarios. To address this limitation, we incorporate cosine distance measures to provide a complementary perspective by quantifying the dissimilarity between vectors. Our approach is experimented on proprietary data, unlike recent publications that have used open-source datasets. The proposed method demonstrates enhanced retrieval performance and provides a more comprehensive understanding of the semantic relationships between documents or items. This hybrid strategy offers a promising solution for efficiently and accurately retrieving relevant information in knowledge-intensive applications, leveraging techniques such as BM25 (sparse) retrieval , vector (Dense) retrieval, and cosine distance based retrieval to facilitate efficient information retrieval.

6/4/2024