Approximate Nearest Neighbor Search with Window Filters

Read original: arXiv:2402.00943 - Published 6/5/2024 by Joshua Engels, Benjamin Landrum, Shangdi Yu, Laxman Dhulipala, Julian Shun

Approximate Nearest Neighbor Search with Window Filters

Overview

This paper proposes a novel approximate nearest neighbor search algorithm that uses window filters to improve search efficiency.
The algorithm is designed to work with large-scale vector databases, which are commonly used in various applications like image retrieval and recommendation systems.
The key idea is to use a two-stage filtering process to quickly identify a set of candidate nearest neighbors, and then perform a more accurate search within this reduced set.

Plain English Explanation

The paper describes a new way to quickly find the nearest matching items in a large database of vector data. This is a common problem in many applications, like searching for similar images or recommending products.

The main innovation is a "two-stage filtering" approach. First, the algorithm uses a fast but imprecise way to identify a small set of possible matches. Then, it does a more thorough search just within that reduced set to find the actual nearest neighbor. This is more efficient than searching the entire database each time.

The authors test their algorithm on large-scale vector datasets and show it can provide approximate nearest neighbor search results much faster than previous methods, while still maintaining good accuracy.

Technical Explanation

The paper introduces a new approximate nearest neighbor (ANN) search algorithm called "Window Filters". The key idea is to use a two-stage filtering process to quickly identify a set of candidate nearest neighbors, and then perform a more accurate search within this reduced set.

In the first stage, the algorithm uses a "window filter" to construct a set of candidate neighbors. This filter defines a rectangular region around the query vector and selects all database vectors that fall within that region. This can be done very efficiently using specialized data structures like k-d trees.

In the second stage, the algorithm performs a more precise nearest neighbor search, but only within the set of candidates identified in the first stage. This allows it to avoid searching the entire database, which is computationally expensive.

The authors evaluate their Window Filters algorithm on several large-scale vector datasets, including SIFT features and word embeddings. They show it can achieve significant speedups over traditional ANN search methods, while maintaining competitive accuracy.

Critical Analysis

The Window Filters algorithm represents a clever way to improve the efficiency of approximate nearest neighbor search, but it does have some limitations.

One key assumption is that the query vector and its nearest neighbors will be clustered together in the vector space. If this is not the case, for example if the nearest neighbors are spread out, then the window filter may not be effective at reducing the search space.

Additionally, the performance of the algorithm is sensitive to the choice of the window size parameter. If the window is too small, it may miss some relevant neighbors; if it's too large, the efficiency gains will be reduced. The paper does not provide clear guidance on how to set this parameter optimally.

Finally, the algorithm is designed for static datasets. It may not work as well for dynamic datasets where vectors are being continuously added or removed. Adapting the approach to handle such cases could be an interesting area for future research.

Conclusion

Overall, the Window Filters algorithm represents a promising approach for improving the efficiency of approximate nearest neighbor search in large-scale vector databases. By using a two-stage filtering process, it can achieve significant speedups over traditional methods, while still maintaining good accuracy.

While the algorithm has some limitations, it demonstrates the value of leveraging specialized data structures and multi-stage search strategies to tackle the computational challenges of working with massive high-dimensional datasets. As vector-based applications continue to grow, techniques like this will likely become increasingly important.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Approximate Nearest Neighbor Search with Window Filters

Joshua Engels, Benjamin Landrum, Shangdi Yu, Laxman Dhulipala, Julian Shun

We define and investigate the problem of $textit{c-approximate window search}$: approximate nearest neighbor search where each point in the dataset has a numeric label, and the goal is to find nearest neighbors to queries within arbitrary label ranges. Many semantic search problems, such as image and document search with timestamp filters, or product search with cost filters, are natural examples of this problem. We propose and theoretically analyze a modular tree-based framework for transforming an index that solves the traditional c-approximate nearest neighbor problem into a data structure that solves window search. On standard nearest neighbor benchmark datasets equipped with random label values, adversarially constructed embeddings, and image search embeddings with real timestamps, we obtain up to a $75times$ speedup over existing solutions at the same level of recall.

6/5/2024

A Learning-to-Rank Formulation of Clustering-Based Approximate Nearest Neighbor Search

Thomas Vecchiato, Claudio Lucchese, Franco Maria Nardini, Sebastian Bruch

A critical piece of the modern information retrieval puzzle is approximate nearest neighbor search. Its objective is to return a set of $k$ data points that are closest to a query point, with its accuracy measured by the proportion of exact nearest neighbors captured in the returned set. One popular approach to this question is clustering: The indexing algorithm partitions data points into non-overlapping subsets and represents each partition by a point such as its centroid. The query processing algorithm first identifies the nearest clusters -- a process known as routing -- then performs a nearest neighbor search over those clusters only. In this work, we make a simple observation: The routing function solves a ranking problem. Its quality can therefore be assessed with a ranking metric, making the function amenable to learning-to-rank. Interestingly, ground-truth is often freely available: Given a query distribution in a top-$k$ configuration, the ground-truth is the set of clusters that contain the exact top-$k$ vectors. We develop this insight and apply it to Maximum Inner Product Search (MIPS). As we demonstrate empirically on various datasets, learning a simple linear function consistently improves the accuracy of clustering-based MIPS.

4/19/2024

Learning-Augmented Frequency Estimation in Sliding Windows

Rana Shahout, Ibrahim Sabek, Michael Mitzenmacher

We show how to utilize machine learning approaches to improve sliding window algorithms for approximate frequency estimation problems, under the ``algorithms with predictions'' framework. In this dynamic environment, previous learning-augmented algorithms are less effective, since properties in sliding window resolution can differ significantly from the properties of the entire stream. Our focus is on the benefits of predicting and filtering out items with large next arrival times -- that is, there is a large gap until their next appearance -- from the stream, which we show improves the memory-accuracy tradeoffs significantly. We provide theorems that provide insight into how and by how much our technique can improve the sliding window algorithm, as well as experimental results using real-world data sets. Our work demonstrates that predictors can be useful in the challenging sliding window setting.

9/19/2024

🧪

Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection

Harsh Shah, Kashish Mittal, Ajit Rajwade

This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem. Our method efficiently marks each item in a database as neighbor or non-neighbor of a query point, based on a cosine distance threshold without exhaustive search. Like other methods for large scale retrieval, our approach exploits the assumption that most of the items in the database are unrelated to the query. However, it does not assume a large difference between the cosine similarity of the query vector with the least related neighbor and that with the least unrelated non-neighbor. Following a multi-stage adaptive group testing algorithm based on binary splitting, we divide the set of items to be searched into half at each step, and perform dot product tests on smaller and smaller subsets, many of which we are able to prune away. We show that, using softmax-based features, our method achieves a more than ten-fold speed-up over exhaustive search with no loss of accuracy, on a variety of large datasets. Based on empirically verified models for the distribution of cosine distances, we present a theoretical analysis of the expected number of distance computations per query and the probability that a pool will be pruned. Our method has the following features: (i) It implicitly exploits useful distributional properties of cosine distances unlike other methods; (ii) All required data structures are created purely offline; (iii) It does not impose any strong assumptions on the number of true near neighbors; (iv) It is adaptable to streaming settings where new vectors are dynamically added to the database; and (v) It does not require any parameter tuning. The high recall of our technique makes it particularly suited to plagiarism detection scenarios where it is important to report every database item that is sufficiently similar item to the query.

9/10/2024