Exact Trajectory Similarity Search With N-tree: An Efficient Metric Index for kNN and Range Queries

Read original: arXiv:2408.07650 - Published 8/15/2024 by Ralf Hartmut Guting (Fernuniversitat in Hagen, Germany), Suvam Kumar Das (University of New Brunswick, Fredericton, Canada), Fabio Vald'es (Fernuniversitat in Hagen, Germany), Suprio Ray (University of New Brunswick, Fredericton, Canada)

🔄

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🔄

Exact Trajectory Similarity Search With N-tree: An Efficient Metric Index for kNN and Range Queries

Ralf Hartmut Guting (Fernuniversitat in Hagen, Germany), Suvam Kumar Das (University of New Brunswick, Fredericton, Canada), Fabio Vald'es (Fernuniversitat in Hagen, Germany), Suprio Ray (University of New Brunswick, Fredericton, Canada)

Similarity search is the problem of finding in a collection of objects those that are similar to a given query object. It is a fundamental problem in modern applications and the objects considered may be as diverse as locations in space, text documents, images, twitter messages, or trajectories of moving objects. In this paper we are motivated by the latter application. Trajectories are recorded movements of mobile objects such as vehicles, animals, public transportation, or parts of the human body. We propose a novel distance function called DistanceAvg to capture the similarity of such movements. To be practical, it is necessary to provide indexing for this distance measure. Fortunately we do not need to start from scratch. A generic and unifying approach is metric space, which organizes the set of objects solely by a distance (similarity) function with certain natural properties. Our function DistanceAvg is a metric. Although metric indexes have been studied for decades and many such structures are available, they do not offer the best performance with trajectories. In this paper we propose a new design, which outperforms the best existing indexes for kNN queries and is equally good for range queries. It is especially suitable for expensive distance functions as they occur in trajectory similarity search. In many applications, kNN queries are more practical than range queries as it may be difficult to determine an appropriate search radius. Our index provides exact result sets for the given distance function.

8/15/2024

A Bi-metric Framework for Fast Similarity Search

Haike Xu, Sandeep Silwal, Piotr Indyk

We propose a new bi-metric framework for designing nearest neighbor data structures. Our framework assumes two dissimilarity functions: a ground-truth metric that is accurate but expensive to compute, and a proxy metric that is cheaper but less accurate. In both theory and practice, we show how to construct data structures using only the proxy metric such that the query procedure achieves the accuracy of the expensive metric, while only using a limited number of calls to both metrics. Our theoretical results instantiate this framework for two popular nearest neighbor search algorithms: DiskANN and Cover Tree. In both cases we show that, as long as the proxy metric used to construct the data structure approximates the ground-truth metric up to a bounded factor, our data structure achieves arbitrarily good approximation guarantees with respect to the ground-truth metric. On the empirical side, we apply the framework to the text retrieval problem with two dissimilarity functions evaluated by ML models with vastly different computational costs. We observe that for almost all data sets in the MTEB benchmark, our approach achieves a considerably better accuracy-efficiency tradeoff than the alternatives, such as re-ranking.

6/6/2024

🧪

Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection

Harsh Shah, Kashish Mittal, Ajit Rajwade

This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem. Our method efficiently marks each item in a database as neighbor or non-neighbor of a query point, based on a cosine distance threshold without exhaustive search. Like other methods for large scale retrieval, our approach exploits the assumption that most of the items in the database are unrelated to the query. However, it does not assume a large difference between the cosine similarity of the query vector with the least related neighbor and that with the least unrelated non-neighbor. Following a multi-stage adaptive group testing algorithm based on binary splitting, we divide the set of items to be searched into half at each step, and perform dot product tests on smaller and smaller subsets, many of which we are able to prune away. We show that, using softmax-based features, our method achieves a more than ten-fold speed-up over exhaustive search with no loss of accuracy, on a variety of large datasets. Based on empirically verified models for the distribution of cosine distances, we present a theoretical analysis of the expected number of distance computations per query and the probability that a pool will be pruned. Our method has the following features: (i) It implicitly exploits useful distributional properties of cosine distances unlike other methods; (ii) All required data structures are created purely offline; (iii) It does not impose any strong assumptions on the number of true near neighbors; (iv) It is adaptable to streaming settings where new vectors are dynamically added to the database; and (v) It does not require any parameter tuning. The high recall of our technique makes it particularly suited to plagiarism detection scenarios where it is important to report every database item that is sufficiently similar item to the query.

9/10/2024

A Learning-to-Rank Formulation of Clustering-Based Approximate Nearest Neighbor Search

Thomas Vecchiato, Claudio Lucchese, Franco Maria Nardini, Sebastian Bruch

A critical piece of the modern information retrieval puzzle is approximate nearest neighbor search. Its objective is to return a set of $k$ data points that are closest to a query point, with its accuracy measured by the proportion of exact nearest neighbors captured in the returned set. One popular approach to this question is clustering: The indexing algorithm partitions data points into non-overlapping subsets and represents each partition by a point such as its centroid. The query processing algorithm first identifies the nearest clusters -- a process known as routing -- then performs a nearest neighbor search over those clusters only. In this work, we make a simple observation: The routing function solves a ranking problem. Its quality can therefore be assessed with a ranking metric, making the function amenable to learning-to-rank. Interestingly, ground-truth is often freely available: Given a query distribution in a top-$k$ configuration, the ground-truth is the set of clusters that contain the exact top-$k$ vectors. We develop this insight and apply it to Maximum Inner Product Search (MIPS). As we demonstrate empirically on various datasets, learning a simple linear function consistently improves the accuracy of clustering-based MIPS.

4/19/2024