LeanVec: Searching vectors faster by making them fit

2312.16335

Published 4/4/2024 by Mariano Tepper, Ishwar Singh Bhati, Cecilia Aguerrebere, Mark Hildebrand, Ted Willke

LeanVec: Searching vectors faster by making them fit

Abstract

Modern deep learning models have the ability to generate high-dimensional vectors whose similarity reflects semantic resemblance. Thus, similarity search, i.e., the operation of retrieving those vectors in a large collection that are similar to a given query, has become a critical component of a wide range of applications that demand highly accurate and timely answers. In this setting, the high vector dimensionality puts similarity search systems under compute and memory pressure, leading to subpar performance. Additionally, cross-modal retrieval tasks have become increasingly common, e.g., where a user inputs a text query to find the most relevant images for that query. However, these queries often have different distributions than the database embeddings, making it challenging to achieve high accuracy. In this work, we present LeanVec, a framework that combines linear dimensionality reduction with vector quantization to accelerate similarity search on high-dimensional vectors while maintaining accuracy. We present LeanVec variants for in-distribution (ID) and out-of-distribution (OOD) queries. LeanVec-ID yields accuracies on par with those from recently introduced deep learning alternatives whose computational overhead precludes their usage in practice. LeanVec-OOD uses two novel techniques for dimensionality reduction that consider the query and database distributions to simultaneously boost the accuracy and the performance of the framework even further (even presenting competitive results when the query and database distributions match). All in all, our extensive and varied experimental results show that LeanVec produces state-of-the-art results, with up to 3.7x improvement in search throughput and up to 4.9x faster index build time over the state of the art.

Get summaries of the top AI research delivered straight to your inbox:

Overview

Introduces LeanVec, a framework to accelerate similarity search for high-dimensional vectors
Focuses on dimensionality reduction techniques to improve the efficiency of in-distribution similarity search
Proposes a novel algorithm called Dissimilar Vector Mining (DVM) to identify dissimilar vector pairs and facilitate dimensionality reduction
Demonstrates the effectiveness of LeanVec on various large-scale datasets, showing significant improvements in search speed and accuracy

Plain English Explanation

The paper presents LeanVec: a framework to accelerate similarity search for high-dimensional vectors, which aims to make searching through large collections of high-dimensional vectors faster and more efficient. High-dimensional vectors are commonly used to represent complex data like images, text, or multimedia, but searching through these large collections can be computationally expensive.

The key idea behind LeanVec is to use dimensionality reduction techniques to "shrink" the vectors while preserving their essential properties. This allows for faster similarity searches, as the reduced-size vectors take up less memory and can be processed more quickly. The paper introduces a novel algorithm called Dissimilar Vector Mining (DVM) to identify pairs of vectors that are very different from each other, which helps guide the dimensionality reduction process.

Through experiments on various large-scale datasets, the researchers demonstrate that LeanVec can significantly improve the speed and accuracy of similarity searches compared to existing approaches. This could have important applications in areas like language models that implement simple word2vec-style vector representations, retrieval-augmented question answering, and large-scale point cloud processing, where efficiently searching through high-dimensional data is crucial.

Technical Explanation

The paper introduces LeanVec, a framework for accelerating similarity search in high-dimensional vector spaces. The key insight is that by applying dimensionality reduction techniques, the vectors can be transformed into a more compact representation while preserving their essential properties. This allows for faster similarity search, as the reduced-size vectors take up less memory and can be processed more efficiently.

A core component of LeanVec is the Dissimilar Vector Mining (DVM) algorithm, which identifies pairs of vectors that are highly dissimilar. This information is then used to guide the dimensionality reduction process, ensuring that important distinguishing features are preserved. The paper also explores several dimensionality reduction techniques, including Principal Component Analysis (PCA) and Variational Autoencoder (VAE), and evaluates their impact on search performance.

The researchers evaluate LeanVec on various large-scale datasets, including text, image, and point cloud data. The results demonstrate that LeanVec can achieve significant improvements in search speed and accuracy compared to existing approaches, without compromising the quality of the similarity search results.

Critical Analysis

The paper presents a well-designed and thorough evaluation of the LeanVec framework, exploring its effectiveness across a range of high-dimensional data modalities. The introduction of the Dissimilar Vector Mining (DVM) algorithm is a novel contribution that helps guide the dimensionality reduction process in a principled way.

However, the paper does not extensively discuss the potential limitations or caveats of the LeanVec approach. For example, it would be interesting to understand the performance of LeanVec on datasets with highly correlated or non-uniformly distributed vectors, as these could present additional challenges for the dimensionality reduction techniques.

Additionally, the paper could have delved deeper into the trade-offs between search speed, accuracy, and the degree of dimensionality reduction. It may be beneficial to explore how these factors scale as the size and complexity of the vector collections increase, and to provide guidance on how to best configure LeanVec for different application scenarios.

Further research could also investigate the compatibility of LeanVec with other related techniques, such as privacy-aware semantic caching for large language models or efficient multi-vector dense retrieval using bit-level transformations. Combining LeanVec with these approaches could lead to even more powerful and versatile high-dimensional vector search systems.

Conclusion

The LeanVec framework presented in this paper offers a promising approach to accelerating similarity search for high-dimensional vector data. By leveraging dimensionality reduction techniques, guided by the novel Dissimilar Vector Mining algorithm, LeanVec demonstrates significant improvements in search speed and accuracy across a diverse range of applications.

As the volume and complexity of high-dimensional data continue to grow, tools like LeanVec will become increasingly important for efficiently processing and retrieving relevant information. The insights and techniques introduced in this paper have the potential to benefit a wide range of fields, from natural language processing and computer vision to robotics and scientific data analysis.

While the paper provides a strong foundation, further research is needed to fully understand the limitations and explore potential synergies with other related techniques. By continuing to push the boundaries of high-dimensional vector search, the research community can unlock new possibilities and empower a wide range of data-driven applications.

Related Papers

Efficient Multi-Vector Dense Retrieval Using Bit Vectors

Franco Maria Nardini, Cosimo Rulli, Rossano Venturini

Dense retrieval techniques employ pre-trained large language models to build a high-dimensional representation of queries and passages. These representations compute the relevance of a passage w.r.t. to a query using efficient similarity measures. In this line, multi-vector representations show improved effectiveness at the expense of a one-order-of-magnitude increase in memory footprint and query latency by encoding queries and documents on a per-token level. Recently, PLAID has tackled these problems by introducing a centroid-based term representation to reduce the memory impact of multi-vector systems. By exploiting a centroid interaction mechanism, PLAID filters out non-relevant documents, thus reducing the cost of the successive ranking stages. This paper proposes ``Efficient Multi-Vector dense retrieval with Bit vectors'' (EMVB), a novel framework for efficient query processing in multi-vector dense retrieval. First, EMVB employs a highly efficient pre-filtering step of passages using optimized bit vectors. Second, the computation of the centroid interaction happens column-wise, exploiting SIMD instructions, thus reducing its latency. Third, EMVB leverages Product Quantization (PQ) to reduce the memory footprint of storing vector representations while jointly allowing for fast late interaction. Fourth, we introduce a per-document term filtering method that further improves the efficiency of the last step. Experiments on MS MARCO and LoTTE show that EMVB is up to 2.8x faster while reducing the memory footprint by 1.8x with no loss in retrieval accuracy compared to PLAID.

4/4/2024

cs.IR

🔎

Description-Based Text Similarity

Shauli Ravfogel, Valentina Pyatkin, Amir DN Cohen, Avshalom Manevich, Yoav Goldberg

Identifying texts with a given semantics is central for many information seeking scenarios. Similarity search over vector embeddings appear to be central to this ability, yet the similarity reflected in current text embeddings is corpus-driven, and is inconsistent and sub-optimal for many use cases. What, then, is a good notion of similarity for effective retrieval of text? We identify the need to search for texts based on abstract descriptions of their content, and the corresponding notion of emph{description based similarity}. We demonstrate the inadequacy of current text embeddings and propose an alternative model that significantly improves when used in standard nearest neighbor search. The model is trained using positive and negative pairs sourced through prompting a LLM, demonstrating how data from LLMs can be used for creating new capabilities not immediately possible using the original model.

4/29/2024

cs.CL cs.IR cs.LG

🏷️

Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders

Nishant Yadav, Nicholas Monath, Manzil Zaheer, Rob Fergus, Andrew McCallum

Cross-encoder (CE) models which compute similarity by jointly encoding a query-item pair perform better than embedding-based models (dual-encoders) at estimating query-item relevance. Existing approaches perform k-NN search with CE by approximating the CE similarity with a vector embedding space fit either with dual-encoders (DE) or CUR matrix factorization. DE-based retrieve-and-rerank approaches suffer from poor recall on new domains and the retrieval with DE is decoupled from the CE. While CUR-based approaches can be more accurate than the DE-based approach, they require a prohibitively large number of CE calls to compute item embeddings, thus making it impractical for deployment at scale. In this paper, we address these shortcomings with our proposed sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity. We compute item embeddings offline by factorizing a sparse matrix containing query-item CE scores for a set of train queries. Our method produces a high-quality approximation while requiring only a fraction of CE calls as compared to CUR-based methods, and allows for leveraging DE to initialize the embedding space while avoiding compute- and resource-intensive finetuning of DE via distillation. At test time, the item embeddings remain fixed and retrieval occurs over rounds, alternating between a) estimating the test query embedding by minimizing error in approximating CE scores of items retrieved thus far, and b) using the updated test query embedding for retrieving more items. Our k-NN search method improves recall by up to 5% (k=1) and 54% (k=100) over DE-based approaches. Additionally, our indexing approach achieves a speedup of up to 100x over CUR-based and 5x over DE distillation methods, while matching or improving k-NN search recall over baselines.

5/7/2024

cs.IR cs.LG

Spatially Optimized Compact Deep Metric Learning Model for Similarity Search

Md. Farhadul Islam, Md. Tanzim Reza, Meem Arafat Manab, Mohammad Rakibul Hasan Mahin, Sarah Zabeen, Jannatun Noor

Spatial optimization is often overlooked in many computer vision tasks. Filters should be able to recognize the features of an object regardless of where it is in the image. Similarity search is a crucial task where spatial features decide an important output. The capacity of convolution to capture visual patterns across various locations is limited. In contrast to convolution, the involution kernel is dynamically created at each pixel based on the pixel value and parameters that have been learned. This study demonstrates that utilizing a single layer of involution feature extractor alongside a compact convolution model significantly enhances the performance of similarity search. Additionally, we improve predictions by using the GELU activation function rather than the ReLU. The negligible amount of weight parameters in involution with a compact model with better performance makes the model very useful in real-world implementations. Our proposed model is below 1 megabyte in size. We have experimented with our proposed methodology and other models on CIFAR-10, FashionMNIST, and MNIST datasets. Our proposed method outperforms across all three datasets.

4/11/2024

cs.CV cs.AI cs.LG