A Learning-to-Rank Formulation of Clustering-Based Approximate Nearest Neighbor Search
0
Sign in to get full access
Overview
- This paper proposes a learning-to-rank formulation for clustering-based approximate nearest neighbor (ANN) search.
- The approach aims to improve the accuracy of ANN search by learning a ranking function that can effectively identify the most relevant neighbors.
- The proposed method is evaluated on several benchmark datasets and shows promising results compared to traditional ANN search techniques.
Plain English Explanation
Searching for the nearest neighbors of a given data point is a fundamental task in many machine learning and data analysis applications. However, as the size of the data grows, performing an exhaustive search becomes computationally expensive. Approximate nearest neighbor (ANN) search techniques have been developed to address this issue by finding "good enough" neighbors quickly.
The paper introduces a new approach to ANN search that uses
The key insight is that by
Technical Explanation
The paper proposes a
The approach starts by
During the ANN search process, the system first identifies the most relevant clusters based on the learned ranking function, and then performs a more refined search within those clusters to find the nearest neighbors. This
The authors evaluate their approach on several benchmark datasets, including SIFT1M, Deep1M, and GLOVE, and compare it to traditional ANN search methods, such as
Critical Analysis
The paper presents a novel and promising approach to improving the accuracy of ANN search by
However, the paper also acknowledges some limitations of the proposed approach. For example, the
Additionally, the paper does not provide a deep analysis of the computational complexity of the proposed method, which could be an important factor in real-world applications where efficiency is crucial.
Overall, the paper presents an interesting and promising
Conclusion
The paper introduces a
The proposed method
The paper's experimental results on several benchmark datasets are promising, and the
This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
Related Papers
0
A Learning-to-Rank Formulation of Clustering-Based Approximate Nearest Neighbor Search
Thomas Vecchiato, Claudio Lucchese, Franco Maria Nardini, Sebastian Bruch
A critical piece of the modern information retrieval puzzle is approximate nearest neighbor search. Its objective is to return a set of $k$ data points that are closest to a query point, with its accuracy measured by the proportion of exact nearest neighbors captured in the returned set. One popular approach to this question is clustering: The indexing algorithm partitions data points into non-overlapping subsets and represents each partition by a point such as its centroid. The query processing algorithm first identifies the nearest clusters -- a process known as routing -- then performs a nearest neighbor search over those clusters only. In this work, we make a simple observation: The routing function solves a ranking problem. Its quality can therefore be assessed with a ranking metric, making the function amenable to learning-to-rank. Interestingly, ground-truth is often freely available: Given a query distribution in a top-$k$ configuration, the ground-truth is the set of clusters that contain the exact top-$k$ vectors. We develop this insight and apply it to Maximum Inner Product Search (MIPS). As we demonstrate empirically on various datasets, learning a simple linear function consistently improves the accuracy of clustering-based MIPS.
Read more4/19/2024
0
Optimistic Query Routing in Clustering-based Approximate Maximum Inner Product Search
Sebastian Bruch, Aditya Krishnan, Franco Maria Nardini
Clustering-based nearest neighbor search is a simple yet effective method in which data points are partitioned into geometric shards to form an index, and only a few shards are searched during query processing to find an approximate set of top-$k$ vectors. Even though the search efficacy is heavily influenced by the algorithm that identifies the set of shards to probe, it has received little attention in the literature. This work attempts to bridge that gap by studying the problem of routing in clustering-based maximum inner product search (MIPS). We begin by unpacking existing routing protocols and notice the surprising contribution of optimism. We then take a page from the sequential decision making literature and formalize that insight following the principle of ``optimism in the face of uncertainty.'' In particular, we present a new framework that incorporates the moments of the distribution of inner products within each shard to optimistically estimate the maximum inner product. We then present a simple instance of our algorithm that uses only the first two moments to reach the same accuracy as state-of-the-art routers such as scann by probing up to $50%$ fewer points on a suite of benchmark MIPS datasets. Our algorithm is also space-efficient: we design a sketch of the second moment whose size is independent of the number of points and in practice requires storing only $O(1)$ additional vectors per shard.
Read more5/21/2024
0
Efficient Retrieval with Learned Similarities
Bailu Ding, Jiaqi Zhai
Retrieval plays a fundamental role in recommendation systems, search, and natural language processing by efficiently finding relevant items from a large corpus given a query. Dot products have been widely used as the similarity function in such retrieval tasks, thanks to Maximum Inner Product Search (MIPS) that enabled efficient retrieval based on dot products. However, state-of-the-art retrieval algorithms have migrated to learned similarities. Such algorithms vary in form; the queries can be represented with multiple embeddings, complex neural networks can be deployed, the item ids can be decoded directly from queries using beam search, and multiple approaches can be combined in hybrid solutions. Unfortunately, we lack efficient solutions for retrieval in these state-of-the-art setups. Our work investigates techniques for approximate nearest neighbor search with learned similarity functions. We first prove that Mixture-of-Logits (MoL) is a universal approximator, and can express all learned similarity functions. We next propose techniques to retrieve the approximate top K results using MoL with a tight bound. We finally compare our techniques with existing approaches, showing that MoL sets new state-of-the-art results on recommendation retrieval tasks, and our approximate top-k retrieval with learned similarities outperforms baselines by up to two orders of magnitude in latency, while achieving > .99 recall rate of exact algorithms.
Read more8/15/2024
2
Approximate Nearest Neighbor Search with Window Filters
Joshua Engels, Benjamin Landrum, Shangdi Yu, Laxman Dhulipala, Julian Shun
We define and investigate the problem of $textit{c-approximate window search}$: approximate nearest neighbor search where each point in the dataset has a numeric label, and the goal is to find nearest neighbors to queries within arbitrary label ranges. Many semantic search problems, such as image and document search with timestamp filters, or product search with cost filters, are natural examples of this problem. We propose and theoretically analyze a modular tree-based framework for transforming an index that solves the traditional c-approximate nearest neighbor problem into a data structure that solves window search. On standard nearest neighbor benchmark datasets equipped with random label values, adversarially constructed embeddings, and image search embeddings with real timestamps, we obtain up to a $75times$ speedup over existing solutions at the same level of recall.
Read more6/5/2024