Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?

Read original: arXiv:2409.06464 - Published 9/11/2024 by Jimmy Lin

🌐

Overview

Provides operational advice for dense and sparse retrievers like HNSW, Flat, and Inverted Indexes
Examines the trade-offs between these different retrieval approaches in terms of performance, memory usage, and index construction
Offers guidance on when to use each type of retriever based on the characteristics of the data and the specific requirements of the application

Plain English Explanation

This paper aims to help developers and researchers choose the right type of retriever for their needs when building search or recommendation systems. It compares three common approaches: HNSW, Flat, and Inverted Indexes.

HNSW (Hierarchical Navigable Small World) is a dense retriever that organizes data into a hierarchical structure for efficient nearest neighbor search. Flat retrievers simply store all data in a flat table, while Inverted Indexes are sparse retrievers that optimize for fast lookups of specific terms or keywords.

The paper looks at the trade-offs between these approaches in terms of factors like:

Performance: how quickly they can retrieve relevant results
Memory usage: how much storage space they require
Index construction: how much time and effort is needed to build the index

Based on these factors, the paper provides guidance on when to use each type of retriever. For example, HNSW may be best for applications with high-dimensional data and a need for fast approximate search, while Inverted Indexes excel at exact keyword-based retrieval on large text corpora.

By understanding the strengths and weaknesses of these different retrieval approaches, developers can make more informed choices about which one to use for their specific use case.

Technical Explanation

The paper compares the performance characteristics of three main types of retrievers:

HNSW (Hierarchical Navigable Small World): A dense retriever that organizes data into a hierarchical structure for efficient nearest neighbor search. HNSW offers fast approximate search but requires more memory.
Flat Retrievers: A simple approach that stores all data in a flat table. Flat retrievers have lower memory usage but slower search performance.
Inverted Indexes: A sparse retriever that optimizes for fast lookups of specific terms or keywords. Inverted Indexes excel at exact keyword-based retrieval on large text corpora.

The paper examines the trade-offs between these approaches across several dimensions:

Performance: HNSW offers the fastest approximate nearest neighbor search, followed by Flat and then Inverted Indexes.
Memory Usage: Inverted Indexes are the most memory-efficient, followed by Flat and then HNSW.
Index Construction: Building an Inverted Index is the most lightweight process, while HNSW requires the most upfront work.

Based on these trade-offs, the paper provides guidance on when to use each type of retriever:

HNSW: Best for high-dimensional data and applications requiring fast approximate search, e.g. recommendation systems.
Flat Retrievers: Suitable for smaller datasets or applications where memory usage is constrained.
Inverted Indexes: Excel at exact keyword-based retrieval on large text corpora, e.g. search engines.

The paper also discusses how factors like data ordering and intrinsic dimensionality can impact the performance of these retrieval approaches.

Critical Analysis

The paper provides a comprehensive comparison of three widely used retrieval approaches, highlighting their key strengths and weaknesses. This information is valuable for developers and researchers who need to choose the right retriever for their specific use case.

One potential limitation is that the analysis is primarily based on theoretical considerations and existing literature, rather than empirical experiments conducted by the authors. While the authors reference relevant prior studies, an in-depth evaluation of the performance of these retrievers on real-world datasets could have provided additional insights.

Additionally, the paper does not address some emerging retrieval techniques, such as learned index structures (e.g., LexBoost and DeepImpact), which may offer different trade-offs and could be worth considering in certain scenarios.

Overall, the paper offers a solid foundation for understanding the operational characteristics of dense and sparse retrievers, but further research and empirical evaluations could help refine the guidance provided and expand the coverage of modern retrieval approaches.

Conclusion

This paper provides valuable operational advice for developers and researchers working on search and recommendation systems. By comparing the performance, memory usage, and index construction characteristics of HNSW, Flat, and Inverted Index retrievers, the paper helps inform the choice of the most suitable retrieval approach for a given application.

The insights offered in this paper can help improve the efficiency and effectiveness of a wide range of information retrieval systems, from search engines to personalized recommendations. As the field of information retrieval continues to evolve, this type of practical guidance will remain crucial for ensuring that systems are designed and deployed in an optimal manner.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🌐

Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?

Jimmy Lin

Practitioners working on dense retrieval today face a bewildering number of choices. Beyond selecting the embedding model, another consequential choice is the actual implementation of nearest-neighbor vector search. While best practices recommend HNSW indexes, flat vector indexes with brute-force search represent another viable option, particularly for smaller corpora and for rapid prototyping. In this paper, we provide experimental results on the BEIR dataset using the open-source Lucene search library that explicate the tradeoffs between HNSW and flat indexes (including quantized variants) from the perspectives of indexing time, query evaluation performance, and retrieval quality. With additional comparisons between dense and sparse retrievers, our results provide guidance for today's search practitioner in understanding the design space of dense and sparse retrievers. To our knowledge, we are the first to provide operational advice supported by empirical experiments in this regard.

9/11/2024

LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors

Hrishikesh Kulkarni, Nazli Goharian, Ophir Frieder, Sean MacAvaney

Sparse retrieval methods like BM25 are based on lexical overlap, focusing on the surface form of the terms that appear in the query and the document. The use of inverted indices in these methods leads to high retrieval efficiency. On the other hand, dense retrieval methods are based on learned dense vectors and, consequently, are effective but comparatively slow. Since sparse and dense methods approach problems differently and use complementary relevance signals, approximation methods were proposed to balance effectiveness and efficiency. For efficiency, approximation methods like HNSW are frequently used to approximate exhaustive dense retrieval. However, approximation techniques still exhibit considerably higher latency than sparse approaches. We propose LexBoost that first builds a network of dense neighbors (a corpus graph) using a dense retrieval approach while indexing. Then, during retrieval, we consider both a document's lexical relevance scores and its neighbors' scores to rank the documents. In LexBoost this remarkably simple application of the Cluster Hypothesis contributes to stronger ranking effectiveness while contributing little computational overhead (since the corpus graph is constructed offline). The method is robust across the number of neighbors considered, various fusion parameters for determining the scores, and different dataset construction methods. We also show that re-ranking on top of LexBoost outperforms traditional dense re-ranking and leads to results comparable with higher-latency exhaustive dense retrieval.

9/11/2024

🚀

Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations

Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini

Learned sparse representations form an attractive class of contextual embeddings for text retrieval. That is so because they are effective models of relevance and are interpretable by design. Despite their apparent compatibility with inverted indexes, however, retrieval over sparse embeddings remains challenging. That is due to the distributional differences between learned embeddings and term frequency-based lexical models of relevance such as BM25. Recognizing this challenge, a great deal of research has gone into, among other things, designing retrieval algorithms tailored to the properties of learned sparse representations, including approximate retrieval systems. In fact, this task featured prominently in the latest BigANN Challenge at NeurIPS 2023, where approximate algorithms were evaluated on a large benchmark dataset by throughput and recall. In this work, we propose a novel organization of the inverted index that enables fast yet effective approximate retrieval over learned sparse embeddings. Our approach organizes inverted lists into geometrically-cohesive blocks, each equipped with a summary vector. During query processing, we quickly determine if a block must be evaluated using the summaries. As we show experimentally, single-threaded query processing using our method, Seismic, reaches sub-millisecond per-query latency on various sparse embeddings of the MS MARCO dataset while maintaining high recall. Our results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions and further outperforms the winning (graph-based) submissions to the BigANN Challenge by a significant margin.

4/30/2024

The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in Hierarchical Navigable Small Worlds

Owen Pendrigh Elliott, Jesse Clark

Vector search systems, pivotal in AI applications, often rely on the Hierarchical Navigable Small Worlds (HNSW) algorithm. However, the behaviour of HNSW under real-world scenarios using vectors generated with deep learning models remains under-explored. Existing Approximate Nearest Neighbours (ANN) benchmarks and research typically has an over-reliance on simplistic datasets like MNIST or SIFT1M and fail to reflect the complexity of current use-cases. Our investigation focuses on HNSW's efficacy across a spectrum of datasets, including synthetic vectors tailored to mimic specific intrinsic dimensionalities, widely-used retrieval benchmarks with popular embedding models, and proprietary e-commerce image data with CLIP models. We survey the most popular HNSW vector databases and collate their default parameters to provide a realistic fixed parameterisation for the duration of the paper. We discover that the recall of approximate HNSW search, in comparison to exact K Nearest Neighbours (KNN) search, is linked to the vector space's intrinsic dimensionality and significantly influenced by the data insertion sequence. Our methodology highlights how insertion order, informed by measurable properties such as the pointwise Local Intrinsic Dimensionality (LID) or known categories, can shift recall by up to 12 percentage points. We also observe that running popular benchmark datasets with HNSW instead of KNN can shift rankings by up to three positions for some models. This work underscores the need for more nuanced benchmarks and design considerations in developing robust vector search systems using approximate vector search algorithms. This study presents a number of scenarios with varying real world applicability which aim to better increase understanding and future development of ANN algorithms and embedding

5/29/2024