DEX: Scalable Range Indexing on Disaggregated Memory [Extended Version]

Read original: arXiv:2405.14502 - Published 5/24/2024 by Baotong Lu, Kaisong Huang, Chieh-Jan Mike Liang, Tianzheng Wang, Eric Lo

🏷️

Overview

This paper proposes a new scalable B+-tree called DEX for memory disaggregation.
Memory disaggregation allows memory-optimized range indexes like B+-trees to scale beyond a single machine while maintaining high hardware utilization and low cost.
Designing scalable indexes on disaggregated memory is challenging due to issues like rudimentary caching, inconsistent offloading, and excessive inconsistency across servers.
DEX includes techniques to reduce remote accesses, such as logical partitioning, lightweight caching, and cost-aware offloading.
The evaluation shows DEX can outperform the state-of-the-art by 1.7-56.3X under various setups.

Plain English Explanation

Disaggregated memory is a way to scale memory beyond a single machine by separating it from the CPU. This could allow more powerful B+-tree index structures to be used, but there are challenges in making this work efficiently.

The DEX system proposed in this paper aims to solve these challenges. It includes techniques to reduce the number of times data needs to be accessed remotely, which is slow. This includes logically partitioning the data, using lightweight caching, and being smart about when to offload work to other servers.

The researchers found that DEX can significantly outperform other state-of-the-art systems for this problem, by up to 56 times in some cases. This shows the value of their innovations in making disaggregated memory work well for these types of indexing workloads.

Technical Explanation

The paper proposes a new scalable B+-tree called DEX that is designed for memory disaggregation. Memory disaggregation allows memory-optimized data structures like B+-trees to scale beyond a single machine, but introduces challenges like rudimentary caching, unprincipled offloading, and excessive inconsistency across servers.

DEX includes several techniques to address these issues and reduce the number of remote memory accesses:

Logical Partitioning: DEX logically partitions the B+-tree index into multiple partitions, each assigned to a different server. This reduces remote accesses by keeping more data local.
Lightweight Caching: DEX uses a lightweight caching mechanism to cache frequently accessed nodes, further reducing remote accesses.
Cost-Aware Offloading: DEX employs a cost model to intelligently decide when to offload processing to remote servers, balancing the cost of remote access with the benefits of parallelism.

The paper evaluates DEX against state-of-the-art systems, showing it can provide 1.7-56.3X better performance under various setups, including different cache sizes and data skewness. This demonstrates the effectiveness of DEX's techniques for enabling scalable B+-tree indexing on disaggregated memory.

Critical Analysis

The paper provides a comprehensive solution for addressing the challenges of building scalable index structures on disaggregated memory systems. The techniques proposed, such as logical partitioning, lightweight caching, and cost-aware offloading, seem well-designed and effective based on the evaluation results.

However, the paper does not delve into the potential downsides or limitations of the DEX approach. For example, it's unclear how the logical partitioning scheme handles skewed data distributions or dynamic changes in the data. Additionally, the cost model used for offloading decisions may not capture all relevant factors, leading to suboptimal decisions in some scenarios.

Further research could explore the robustness of DEX under more diverse workloads and conditions, as well as investigate ways to make the caching and offloading mechanisms more adaptive and self-tuning. Decentralized approaches like DE-DSI may also be worth considering in this context.

Overall, the DEX system represents a significant advancement in enabling scalable index structures on disaggregated memory, but there may be room for further innovation and refinement to address the remaining challenges in this area.

Conclusion

This paper presents DEX, a new scalable B+-tree design for memory disaggregation. By incorporating techniques like logical partitioning, lightweight caching, and cost-aware offloading, DEX is able to significantly outperform the state-of-the-art in terms of performance.

The key contribution of this work is demonstrating the feasibility and potential benefits of using memory-optimized index structures like B+-trees in a disaggregated memory setting. This could pave the way for more efficient and scalable data management systems in the future, potentially unlocking new applications and use cases.

While the DEX approach shows promise, further research is needed to explore its robustness and limitations. Nonetheless, this paper represents an important step forward in the ongoing effort to bridge the gap between the potential of disaggregated memory and the practical challenges of building high-performance distributed data structures.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🏷️

DEX: Scalable Range Indexing on Disaggregated Memory [Extended Version]

Baotong Lu, Kaisong Huang, Chieh-Jan Mike Liang, Tianzheng Wang, Eric Lo

Memory disaggregation can potentially allow memory-optimized range indexes such as B+-trees to scale beyond one machine while attaining high hardware utilization and low cost. Designing scalable indexes on disaggregated memory, however, is challenging due to rudimentary caching, unprincipled offloading and excessive inconsistency among servers. This paper proposes DEX, a new scalable B+-tree for memory disaggregation. DEX includes a set of techniques to reduce remote accesses, including logical partitioning, lightweight caching and cost-aware offloading. Our evaluation shows that DEX can outperform the state-of-the-art by 1.7--56.3X, and the advantage remains under various setups, such as cache size and skewness.

5/24/2024

🚀

DDS: DPU-optimized Disaggregated Storage

Qizhen Zhang, Philip Bernstein, Badrish Chandramouli, Jiasheng Hu, Yiming Zheng

This extended report presents DDS, a novel disaggregated storage architecture enabled by emerging networking hardware, namely DPUs (Data Processing Units). DPUs can optimize the latency and CPU consumption of disaggregated storage servers. However, utilizing DPUs for DBMSs requires careful design of the network and storage paths and the interface exposed to the DBMS. To fully benefit from DPUs, DDS heavily uses DMA, zero-copy, and userspace I/O to minimize overhead when improving throughput. It also introduces an offload engine that eliminates host CPUs by executing client requests directly on the DPU. Adopting DDS' API requires minimal DBMS modification. Our experimental study and production system integration show promising results -- DDS achieves higher disaggregated storage throughput with an order of magnitude lower latency, and saves up to tens of CPU cores per storage server.

8/29/2024

SELCC: Coherent Caching over Compute-Limited Disaggregated Memory

Ruihong Wang, Jianguo Wang, Walid G. Aref

Disaggregating memory from compute offers the opportunity to better utilize stranded memory in data centers. It is important to cache data in the compute nodes and maintain cache coherence across multiple compute nodes to save on round-trip communication cost between the disaggregated memory and the compute nodes. However, the limited computing power on the disaggregated memory servers makes it challenging to maintain cache coherence among multiple compute-side caches over disaggregated shared memory. This paper introduces SELCC; a Shared-Exclusive Latch Cache Coherence protocol that maintains cache coherence without imposing any computational burden on the remote memory side. SELCC builds on a one-sided shared-exclusive latch protocol by introducing lazy latch release and invalidation messages among the compute nodes so that it can guarantee both data access atomicity and cache coherence. SELCC minimizes communication round-trips by embedding the current cache copy holder IDs into RDMA latch words and prioritizes local concurrency control over global concurrency control. We instantiate the SELCC protocol onto compute-sided cache, forming an abstraction layer over disaggregated memory. This abstraction layer provides main-memory-like APIs to upper-level applications, and thus enabling existing data structures and algorithms to function over disaggregated memory with minimal code change. To demonstrate the usability of SELCC, we implement a B-tree and three transaction concurrency control algorithms over SELCC's APIs. Micro-benchmark results show that the SELCC protocol achieves better performance compared to RPC-based cache-coherence protocols. Additionally, YCSB and TPC-C benchmarks indicate that applications over SELCC can achieve comparable or superior performance against competitors over disaggregated memory.

9/6/2024

Characterizing the Dilemma of Performance and Index Size in Billion-Scale Vector Search and Breaking It with Second-Tier Memory

Rongxin Cheng, Yifan Peng, Xingda Wei, Hongrui Xie, Rong Chen, Sijie Shen, Haibo Chen

Vector searches on large-scale datasets are critical to modern online services like web search and RAG, which necessity storing the datasets and their index on the secondary storage like SSD. In this paper, we are the first to characterize the trade-off of performance and index size in existing SSD-based graph and cluster indexes: to improve throughput by 5.7$times$ and 1.7$times$, these indexes have to pay a 5.8$times$ storage amplification and 7.7$times$ with respect to the dataset size, respectively. The root cause is that the coarse-grained access of SSD mismatches the fine-grained random read required by vector indexes with small amplification. This paper argues that second-tier memory, such as remote DRAM/NVM connected via RDMA or CXL, is a powerful storage for addressing the problem from a system's perspective, thanks to its fine-grained access granularity. However, putting existing indexes -- primarily designed for SSD -- directly on second-tier memory cannot fully utilize its power. Meanwhile, second-tier memory still behaves more like storage, so using it as DRAM is also inefficient. To this end, we build a graph and cluster index that centers around the performance features of second-tier memory. With careful execution engine and index layout designs, we show that vector indexes can achieve optimal performance with orders of magnitude smaller index amplification, on a variety of second-tier memory devices. Based on our improved graph and vector indexes on second-tier memory, we further conduct a systematic study between them to facilitate developers choosing the right index for their workloads. Interestingly, the findings on the second-tier memory contradict the ones on SSDs.

5/8/2024