DDS: DPU-optimized Disaggregated Storage

Read original: arXiv:2407.13618 - Published 8/29/2024 by Qizhen Zhang, Philip Bernstein, Badrish Chandramouli, Jiasheng Hu, Yiming Zheng

Overview

This paper presents DDS, a novel disaggregated storage architecture that leverages emerging DPUs (Data Processing Units) to optimize the performance of disaggregated storage systems.
DDS aims to improve the throughput and latency of disaggregated storage while reducing the CPU consumption on the host servers.
The design of DDS focuses on minimizing overhead through the use of DMA, zero-copy, and userspace I/O, as well as an offload engine that can execute client requests directly on the DPU.
The paper reports an experimental study and a production system integration in which DDS achieves higher disaggregated storage throughput with an order of magnitude lower latency, and saves up to tens of CPU cores per storage server.

Plain English Explanation

Disaggregated storage is a way of separating the storage hardware from the servers that use it. This can be more efficient than having storage directly attached to each server, but it can also introduce some performance challenges.

The key idea behind DDS is to use a special type of networking hardware called a DPU (Data Processing Unit) to help improve the performance of disaggregated storage. DPUs are designed to handle certain networking and storage tasks more efficiently than a regular CPU.

By carefully designing the network and storage paths, as well as the interface between the storage system and the database management system (DBMS), DDS is able to significantly improve the throughput and latency of disaggregated storage. It does this by using advanced techniques like DMA (Direct Memory Access), zero-copy data transfer, and running client requests directly on the DPU instead of the host CPU.

The end result is a disaggregated storage system that can provide higher performance with lower CPU usage on the host servers, which can translate to cost savings and more efficient resource utilization.

Technical Explanation

DDS is designed to leverage the capabilities of DPUs (Data Processing Units) to optimize the performance of disaggregated storage servers. DPUs can offload networking and storage tasks from the host CPUs, reducing the computational burden on the servers.

To fully benefit from DPUs, DDS heavily utilizes DMA, zero-copy, and userspace I/O to minimize overhead and improve throughput. It also introduces an offload engine that can execute client requests directly on the DPU, further reducing the load on the host CPUs.
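
To illustrate what "zero-copy" means in general, the following is a minimal Python sketch, not DDS code. It contrasts a read path that copies bytes out of a buffer with one that hands back a view of the same memory. The function names `copy_read` and `zero_copy_read` and the sample buffer are hypothetical and used only for illustration; a real DPU data path would rely on DMA between device and host memory rather than Python objects.

```python
def copy_read(buf: bytearray, offset: int, length: int) -> bytes:
    """Copying path: slicing allocates a new object and copies the bytes into it."""
    return bytes(buf[offset:offset + length])


def zero_copy_read(buf: bytearray, offset: int, length: int) -> memoryview:
    """Zero-copy path: a memoryview slice references the original buffer's memory."""
    return memoryview(buf)[offset:offset + length]


# Hypothetical page buffer standing in for data read from storage.
page = bytearray(b"hello, disaggregated storage")

copied = copy_read(page, 0, 5)
view = zero_copy_read(page, 0, 5)

# Mutate the underlying buffer in place.
page[0] = ord("H")

# The copy is a snapshot and is unaffected by the mutation.
print(copied)        # b'hello'
# The view shares memory with the buffer, so it observes the change.
print(bytes(view))   # b'Hello'
```

The copy costs memory bandwidth and CPU time proportional to the amount of data moved, which is the kind of overhead a zero-copy path tries to avoid.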

The paper describes the careful design of the network and storage paths, as well as the interface exposed to the DBMS, that is needed to use the DPU's capabilities efficiently. The authors state that adopting the DDS API requires only minimal DBMS modification.
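
The idea of an offload engine that serves client requests on the DPU can be sketched as a small dispatcher. This is a hedged illustration only: the `Request` and `OffloadEngine` types, the rule that only reads of DPU-reachable files are offloaded, and the `host_handler` fallback are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Request:
    op: str        # "read" or "write"
    file_id: int
    offset: int
    length: int


@dataclass
class OffloadEngine:
    """Hypothetical DPU-side dispatcher (illustrative, not the DDS implementation)."""
    # Data the DPU is assumed to be able to reach without involving the host.
    dpu_files: Dict[int, bytes] = field(default_factory=dict)
    served_on_dpu: int = 0
    forwarded_to_host: int = 0

    def can_offload(self, req: Request) -> bool:
        # Assumption for this sketch: only reads of DPU-reachable files are offloaded.
        return req.op == "read" and req.file_id in self.dpu_files

    def handle(self, req: Request, host_handler: Callable[[Request], bytes]) -> bytes:
        if self.can_offload(req):
            # Serve the request on the DPU; the host CPU is not involved.
            self.served_on_dpu += 1
            data = self.dpu_files[req.file_id]
            return data[req.offset:req.offset + req.length]
        # Fall back to the host for everything else.
        self.forwarded_to_host += 1
        return host_handler(req)


def host_handler(req: Request) -> bytes:
    """Stand-in for the host-side storage server."""
    return b"host:" + req.op.encode()


engine = OffloadEngine(dpu_files={1: b"abcdefgh"})
r1 = engine.handle(Request("read", 1, 2, 3), host_handler)   # served on the DPU
r2 = engine.handle(Request("write", 1, 0, 4), host_handler)  # forwarded to the host
print(r1, r2, engine.served_on_dpu, engine.forwarded_to_host)
```

Every request that takes the first branch is one the host CPU never sees, which is where the host CPU savings in a design like this would come from.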

The experimental evaluation and production system integration presented in the paper demonstrate the benefits of DDS. Compared to traditional disaggregated storage solutions, DDS achieves higher throughput with an order of magnitude lower latency, and saves up to tens of CPU cores per storage server.

Critical Analysis

The paper provides a thorough technical description of the DDS architecture and its performance advantages. However, it does not discuss certain potential limitations or areas for further research:

The paper does not explore how different workloads or data access patterns affect the performance of DDS. It would be interesting to see how DDS performs under various DBMS use cases.
It does not address potential scalability concerns as the number of disaggregated storage servers and clients increases. Work on distributed shared memory systems may provide insight into these challenges.
It lacks a detailed discussion of how DDS integrates with, and stays compatible with, existing DBMS architectures. Work on heterogeneous and composable disaggregated systems may offer relevant perspectives on this topic.

Overall, the paper presents a compelling solution for improving the performance of disaggregated storage, but further research on its broader applicability and scalability would be valuable.

Conclusion

The DDS architecture presented in this paper demonstrates the potential of emerging DPU (Data Processing Unit) hardware to enhance the performance of disaggregated storage systems. By carefully designing the network and storage paths, as well as the interface to the DBMS, DDS is able to achieve significant improvements in throughput and latency while reducing the CPU burden on the host servers.

The experimental results and production system integration showcased in the paper suggest that DDS could be a valuable tool for organizations looking to improve the efficiency and cost-effectiveness of their disaggregated storage infrastructure. As the adoption of DPUs and other specialized hardware continues to grow, solutions like DDS may become increasingly important for optimizing the performance of modern data-intensive applications.

Related Papers

DDS: DPU-optimized Disaggregated Storage

Qizhen Zhang, Philip Bernstein, Badrish Chandramouli, Jiasheng Hu, Yiming Zheng

This extended report presents DDS, a novel disaggregated storage architecture enabled by emerging networking hardware, namely DPUs (Data Processing Units). DPUs can optimize the latency and CPU consumption of disaggregated storage servers. However, utilizing DPUs for DBMSs requires careful design of the network and storage paths and the interface exposed to the DBMS. To fully benefit from DPUs, DDS heavily uses DMA, zero-copy, and userspace I/O to minimize overhead when improving throughput. It also introduces an offload engine that eliminates host CPUs by executing client requests directly on the DPU. Adopting DDS' API requires minimal DBMS modification. Our experimental study and production system integration show promising results -- DDS achieves higher disaggregated storage throughput with an order of magnitude lower latency, and saves up to tens of CPU cores per storage server.

8/29/2024

DPDPU: Data Processing with DPUs

Jiasheng Hu, Philip A. Bernstein, Jialin Li, Qizhen Zhang

Improving the performance and reducing the cost of cloud data systems is increasingly challenging. Data processing units (DPUs) are a promising solution, but utilizing them for data processing needs characterizing the new hardware and recognizing their capabilities and constraints. We hence propose DPDPU, a platform for holistically exploiting DPUs to optimize data processing tasks that are critical to performance and cost. It seeks to fill the semantic gap between DPUs and data processing systems and handle DPU heterogeneity with three engines dedicated to compute, networking, and storage. This paper describes our vision, DPDPU's key components, their associated utilization challenges, as well as the current progress and future plans.

7/30/2024

Towards Disaggregation-Native Data Streaming between Devices

Nils Asmussen, Michael Roitzsch

Disaggregation is an ongoing trend to increase flexibility in datacenters. With interconnect technologies like CXL, pools of CPUs, accelerators, and memory can be connected via a datacenter fabric. Applications can then pick from those pools the resources necessary for their specific workload. However, this vision becomes less clear when we consider data movement. Workloads often require data to be streamed through chains of multiple devices, but typically, these data streams physically do not directly flow device-to-device, but are staged in memory by a CPU hosting device protocol logic. We show that augmenting devices with a disaggregation-native and device-independent data streaming facility can improve processing latencies by enabling data flows directly between arbitrary devices.

6/17/2024

DEX: Scalable Range Indexing on Disaggregated Memory [Extended Version]

Baotong Lu, Kaisong Huang, Chieh-Jan Mike Liang, Tianzheng Wang, Eric Lo

Memory disaggregation can potentially allow memory-optimized range indexes such as B+-trees to scale beyond one machine while attaining high hardware utilization and low cost. Designing scalable indexes on disaggregated memory, however, is challenging due to rudimentary caching, unprincipled offloading and excessive inconsistency among servers. This paper proposes DEX, a new scalable B+-tree for memory disaggregation. DEX includes a set of techniques to reduce remote accesses, including logical partitioning, lightweight caching and cost-aware offloading. Our evaluation shows that DEX can outperform the state-of-the-art by 1.7--56.3X, and the advantage remains under various setups, such as cache size and skewness.

5/24/2024