CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs

Read original: arXiv:2308.15136 - Published 7/10/2024 by Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, Yong Wang

🧪

Overview

Approximate Nearest Neighbor Search (ANNS) is a critical technique used in various fields like data mining, artificial intelligence, information retrieval, and recommender systems
As data volumes have grown, the computational cost of exact nearest neighbor search has become prohibitive, leading to the adoption of approximate techniques
Graph-based approaches have gained significant attention in ANNS algorithms, but few studies have explored leveraging the power of GPUs and multi-core processors
This paper introduces a novel parallel computing hardware-based proximity graph and search algorithm that achieves remarkable efficiency gains

Plain English Explanation

Approximate Nearest Neighbor Search (ANNS) is a technique used in many different areas of technology, like information retrieval and recommendation systems. The idea is to quickly find the data points that are most similar to a given input, without having to do a full, exhaustive comparison.

As the amount of data we work with has grown dramatically in recent years, the computational cost of doing an exact nearest neighbor search has become too high. That's why approximate techniques have become so important.

One popular approach for ANNS is to use graph-based algorithms. These build a graph structure to efficiently search for similar data points. However, not many studies have looked at how to take advantage of powerful parallel computing hardware, like GPUs and multi-core CPUs, to make these graph-based ANNS methods run even faster.

This paper introduces a new ANNS algorithm that is specifically designed to leverage the capabilities of modern parallel computing hardware. By using these high-performance chips, the researchers were able to achieve some remarkable speed improvements compared to existing ANNS methods.

Technical Explanation

The paper presents a novel parallel computing hardware-based proximity graph and search algorithm called CAGRA (Concurrent Asynchronous Graph Augmented). By harnessing the power of GPUs and multi-core CPUs, CAGRA is able to outperform existing CPU and GPU-based ANNS approaches in both constructing the proximity graph and performing large- and small-batch searches.

In terms of graph construction time, CAGRA is 2.2 to 27 times faster than HNSW, which is considered one of the state-of-the-art CPU-based ANNS implementations. For large-batch query throughput in the 90-95% recall range, CAGRA is 33 to 77 times faster than HNSW, and 3.8 to 8.8 times faster than the leading GPU-based methods. Even for single queries, CAGRA is 3.4 to 53 times faster than HNSW at 95% recall.

These impressive speed gains are achieved by leveraging the parallel processing capabilities of modern hardware. CAGRA is able to efficiently distribute the computation across multiple GPU cores and CPU threads, allowing it to construct the proximity graph and perform searches much more quickly than sequential, single-threaded approaches.

Critical Analysis

The paper does a thorough job of evaluating CAGRA's performance against state-of-the-art ANNS methods on a variety of datasets and workloads. The results clearly demonstrate the advantages of the parallel hardware-accelerated approach.

That said, the paper does not delve deeply into the potential limitations or caveats of the CAGRA algorithm. For example, it's unclear how the method would scale to extremely large datasets that may not fit entirely in GPU or CPU memory. There is also no discussion of the preprocessing and indexing overhead required for CAGRA, which could be significant.

Additionally, the paper does not explore potential fairness or bias issues that could arise from the use of CAGRA in applications like recommendation systems or image retrieval. The accuracy and reliability of the approximate nearest neighbor search results should also be further investigated, especially in high-stakes domains.

Overall, the paper presents an impressive technical achievement, but would benefit from a more critical and well-rounded analysis of the potential limitations and societal implications of the proposed ANNS approach.

Conclusion

This paper introduces a novel parallel computing hardware-accelerated ANNS algorithm called CAGRA that significantly outperforms existing CPU and GPU-based methods in both graph construction and search query performance. By effectively leveraging the power of modern GPUs and multi-core CPUs, CAGRA is able to achieve remarkable efficiency gains, which could have important implications for a wide range of applications that rely on fast and accurate nearest neighbor search.

However, the paper does not fully address potential limitations of the approach, such as scalability, preprocessing overhead, and potential bias and reliability issues. Further research is needed to better understand the tradeoffs and broader implications of using hardware-accelerated ANNS techniques in real-world settings.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🧪

CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs

Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, Yong Wang

Approximate Nearest Neighbor Search (ANNS) plays a critical role in various disciplines spanning data mining and artificial intelligence, from information retrieval and computer vision to natural language processing and recommender systems. Data volumes have soared in recent years and the computational cost of an exhaustive exact nearest neighbor search is often prohibitive, necessitating the adoption of approximate techniques. The balanced performance and recall of graph-based approaches have more recently garnered significant attention in ANNS algorithms, however, only a few studies have explored harnessing the power of GPUs and multi-core processors despite the widespread use of massively parallel and general-purpose computing. To bridge this gap, we introduce a novel parallel computing hardware-based proximity graph and search algorithm. By leveraging the high-performance capabilities of modern hardware, our approach achieves remarkable efficiency gains. In particular, our method surpasses existing CPU and GPU-based methods in constructing the proximity graph, demonstrating higher throughput in both large- and small-batch searches while maintaining compatible accuracy. In graph construction time, our method, CAGRA, is 2.2~27x faster than HNSW, which is one of the CPU SOTA implementations. In large-batch query throughput in the 90% to 95% recall range, our method is 33~77x faster than HNSW, and is 3.8~8.8x faster than the SOTA implementations for GPU. For a single query, our method is 3.4~53x faster than HNSW at 95% recall.

7/10/2024

BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU

Karthik V., Saim Khan, Somesh Singh, Harsha Vardhan Simhadri, Jyothi Vedurada

Approximate Nearest Neighbour Search (ANNS) is a subroutine in algorithms routinely employed in information retrieval, pattern recognition, data mining, image processing, and beyond. Recent works have established that graph-based ANNS algorithms are practically more efficient than the other methods proposed in the literature. The growing volume and dimensionality of data necessitates designing scalable techniques for ANNS. To this end, the prior art has explored parallelizing graph-based ANNS on GPU leveraging its massive parallelism. The current state-of-the-art GPU-based ANNS algorithms either (i) require both the dataset and the generated graph index to reside entirely in the GPU memory, or (ii) they partition the dataset into small independent shards, each of which can fit in GPU memory, and perform the search on these shards on the GPU. While the first approach fails to handle large datasets due to the limited memory available on the GPU, the latter delivers poor performance on large datasets due to high data traffic over the low-bandwidth PCIe bus. We introduce BANG, a first-of-its-kind technique for graph-based ANNS on GPU for billion-scale datasets that cannot entirely fit in the GPU memory. BANG stands out by harnessing a compressed form of the dataset on a single GPU to perform distance computations while efficiently accessing the graph index kept on the host memory, enabling efficient ANNS on large graphs within the limited GPU memory. BANG incorporates highly optimized GPU kernels and proceeds in phases that run concurrently on the GPU and CPU. Notably, on the billion-size datasets, we achieve throughputs 40x-200x more than the competing methods for a high recall value of 0.9. Additionally, BANG is the best in cost- and power-efficiency among the competing methods from the recent Billion-Scale Approximate Nearest Neighbour Search Challenge.

7/22/2024

RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search

Meng Chen, Kai Zhang, Zhenying He, Yinan Jing, X. Sean Wang

Approximate Nearest Neighbor Search (ANNS) is a fundamental and critical component in many applications, including recommendation systems and large language model-based applications. With the advancement of multimodal neural models, which transform data from different modalities into a shared high-dimensional space as feature vectors, cross-modal ANNS aims to use the data vector from one modality (e.g., texts) as the query to retrieve the most similar items from another (e.g., images or videos). However, there is an inherent distribution gap between embeddings from different modalities, and cross-modal queries become Out-of-Distribution (OOD) to the base data. Consequently, state-of-the-art ANNS approaches suffer poor performance for OOD workloads. In this paper, we quantitatively analyze the properties of the OOD workloads to gain an understanding of their ANNS efficiency. Unlike single-modal workloads, we reveal OOD queries spatially deviate from base data, and the k-nearest neighbors of an OOD query are distant from each other in the embedding space. The property breaks the assumptions of existing ANNS approaches and mismatches their design for efficient search. With insights from the OOD workloads, we propose pRojected bipartite Graph (RoarGraph), an efficient ANNS graph index built under the guidance of query distribution. Extensive experiments show that RoarGraph significantly outperforms state-of-the-art approaches on modern cross-modal datasets, achieving up to 3.56x faster search speed at a 90% recall rate for OOD queries.

8/20/2024

A Real-Time Adaptive Multi-Stream GPU System for Online Approximate Nearest Neighborhood Search

Yiping Sun, Yang Shi, Jiaolong Du

In recent years, Approximate Nearest Neighbor Search (ANNS) has played a pivotal role in modern search and recommendation systems, especially in emerging LLM applications like Retrieval-Augmented Generation. There is a growing exploration into harnessing the parallel computing capabilities of GPUs to meet the substantial demands of ANNS. However, existing systems primarily focus on offline scenarios, overlooking the distinct requirements of online applications that necessitate real-time insertion of new vectors. This limitation renders such systems inefficient for real-world scenarios. Moreover, previous architectures struggled to effectively support real-time insertion due to their reliance on serial execution streams. In this paper, we introduce a novel Real-Time Adaptive Multi-Stream GPU ANNS System (RTAMS-GANNS). Our architecture achieves its objectives through three key advancements: 1) We initially examined the real-time insertion mechanisms in existing GPU ANNS systems and discovered their reliance on repetitive copying and memory allocation, which significantly hinders real-time effectiveness on GPUs. As a solution, we introduce a dynamic vector insertion algorithm based on memory blocks, which includes in-place rearrangement. 2) To enable real-time vector insertion in parallel, we introduce a multi-stream parallel execution mode, which differs from existing systems that operate serially within a single stream. Our system utilizes a dynamic resource pool, allowing multiple streams to execute concurrently without additional execution blocking. 3) Through extensive experiments and comparisons, our approach effectively handles varying QPS levels across different datasets, reducing latency by up to 40%-80%. The proposed system has also been deployed in real-world industrial search and recommendation systems, serving hundreds of millions of users daily, and has achieved good results.

8/7/2024