Towards Disaggregation-Native Data Streaming between Devices

Read original: arXiv:2406.09421 - Published 6/17/2024 by Nils Asmussen, Michael Roitzsch

Overview

Disaggregation is a trend in data centers to increase flexibility by connecting pools of CPUs, accelerators, and memory via a data center fabric.
This allows applications to access the resources they need for their specific workloads.
However, data movement between these disaggregated devices can be a challenge, as data streams often have to be staged in memory by a CPU hosting device protocol logic.
The paper proposes a solution to enable direct data flows between arbitrary devices, improving processing latencies.

Plain English Explanation

In modern data centers, there is a trend towards disaggregation, which means separating different computing resources like CPUs, memory, and accelerators into pools that can be accessed as needed by applications. This flexibility is enabled by interconnect technologies like CXL.

However, a challenge with this approach is how to efficiently move data between the different devices. Typically, when an application needs to process data, that data has to be copied from one device to another, with the CPU acting as an intermediary to handle the protocol logic. This can add latency and complexity to the data flows.

The paper proposes a solution to this problem by giving the devices themselves the ability to directly stream data between each other, without having to go through the CPU. This "disaggregation-native" data streaming facility can improve processing times by allowing data to flow more directly between the devices that need it.

Technical Explanation

The paper explores the challenges of data movement in disaggregated data center architectures. In a typical setup, workloads require data to be streamed through chains of multiple devices, but this data does not physically flow directly between the devices. Instead, the data is staged in memory by a CPU that is hosting the device protocol logic.

To address this, the authors propose augmenting devices with a disaggregation-native and device-independent data streaming facility. This allows data to flow directly between arbitrary devices, without having to go through the CPU. The authors demonstrate that this approach can improve processing latencies compared to traditional CPU-mediated data streams.
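To make the difference concrete, here is a minimal back-of-the-envelope sketch in Python. It is not the authors' implementation or measurement: the cost values (`fabric_hop_us`, `cpu_staging_us`) and all function names are hypothetical, chosen only to illustrate why removing the CPU staging step shortens the path through a chain of devices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Hypothetical per-step costs in microseconds (illustrative only)."""
    fabric_hop_us: float = 1.0   # assumed cost of one transfer over the fabric
    cpu_staging_us: float = 2.0  # assumed CPU protocol handling per staged step


def cpu_staged_latency(num_devices: int, costs: Costs) -> float:
    """Latency when every step is device -> CPU-managed memory -> next device."""
    if num_devices < 2:
        raise ValueError("a chain needs at least two devices")
    steps = num_devices - 1
    # Two fabric transfers (into and out of staging memory) plus CPU handling.
    return steps * (2 * costs.fabric_hop_us + costs.cpu_staging_us)


def direct_latency(num_devices: int, costs: Costs) -> float:
    """Latency when data streams directly from each device to the next."""
    if num_devices < 2:
        raise ValueError("a chain needs at least two devices")
    steps = num_devices - 1
    # One fabric transfer per step, no CPU involvement on the data path.
    return steps * costs.fabric_hop_us


if __name__ == "__main__":
    costs = Costs()
    for n in (2, 3, 5):
        print(n, cpu_staged_latency(n, costs), direct_latency(n, costs))
```

With these made-up costs, a three-device chain takes 8 time units when staged through the CPU and 2 when streamed directly. Real numbers depend on the hardware and the fabric; the sketch only shows that the staged path pays an extra transfer and CPU handling on every step of the chain.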

The paper describes experiments that evaluate the proposed data streaming facility, including use cases like accelerating time-to-science by streaming detector data and scaling range indexing in disaggregated memory. The results show significant latency improvements, highlighting the benefits of a disaggregation-native data streaming approach.

Critical Analysis

The paper addresses an important challenge in the field of disaggregated computing and provides a promising solution. However, the authors acknowledge that their work is focused on a specific data streaming use case and may not generalize to all disaggregated workloads.

Additionally, the paper does not explore the potential impact on other system-level metrics, such as power consumption or throughput, which could be important considerations for real-world deployments. Further research may be needed to understand the broader implications and trade-offs of the proposed approach.

Conclusion

This paper presents a novel approach to data movement in disaggregated data center architectures. By equipping devices with a disaggregation-native data streaming facility, the authors demonstrate significant improvements in processing latency compared to traditional CPU-mediated data flows.

The proposed solution has the potential to unlock new use cases and accelerate the adoption of disaggregated computing, which is an important trend in modern data centers. As research in this area continues, the insights and techniques described in this paper could have a lasting impact on the design of future disaggregated systems.

Related Papers

Towards Disaggregation-Native Data Streaming between Devices

Nils Asmussen, Michael Roitzsch

Disaggregation is an ongoing trend to increase flexibility in datacenters. With interconnect technologies like CXL, pools of CPUs, accelerators, and memory can be connected via a datacenter fabric. Applications can then pick from those pools the resources necessary for their specific workload. However, this vision becomes less clear when we consider data movement. Workloads often require data to be streamed through chains of multiple devices, but typically, these data streams physically do not directly flow device-to-device, but are staged in memory by a CPU hosting device protocol logic. We show that augmenting devices with a disaggregation-native and device-independent data streaming facility can improve processing latencies by enabling data flows directly between arbitrary devices.

6/17/2024

Proceedings of 3rd Workshop on Heterogeneous Composable and Disaggregated Systems

Christian Pinto, Dong Li, Thaleia Dimitra Doudali, Christina Giannoula, Jie Ren

The future of computing systems is inevitably embracing a disaggregated and composable pattern: from clusters of computers to pools of resources that can be dynamically combined together and tailored around application requirements. Transitioning to this new paradigm requires ground-breaking research, ranging from new hardware architectures up to new models and abstractions at all levels of the software stack. Recent hardware advancements in CPU and interconnection technologies have enabled the disaggregation of peripherals and system memory. Memory system heterogeneity is increasing further; composability and disaggregation are beneficial to increase memory capacity and improve memory utilization in a cost-effective way, and reduce total cost of ownership. Heterogeneous and Composable Disaggregated Systems (HCDS) provide a system design approach for reducing the imbalance between workload resource requirements and the static availability of resources in a computing system. The HCDS workshop aims at exploring novel research ideas around composable disaggregated systems and their integration with operating systems and software runtimes to maximize the benefit perceived from user workloads.

7/2/2024

DFabric: Scaling Out Data Parallel Applications with CXL-Ethernet Hybrid Interconnects

Xu Zhang, Ke Liu, Yisong Chang, Hui Yuan, Xiaolong Zheng, Ke Zhang, Mingyu Chen

Emerging interconnects, such as CXL and NVLink, have been integrated into the intra-host topology to scale more accelerators, such as GPUs, and facilitate efficient communication between them. To keep pace with the accelerators' growing computing throughput, the interconnect has seen substantial enhancement in link bandwidth, e.g., 256GBps for CXL 3.0 links, which surpasses Ethernet and InfiniBand network links by an order of magnitude or more. Consequently, when data-intensive jobs, such as LLM training, scale across multiple hosts beyond the reach limit of the interconnect, the performance is significantly hindered by the limiting bandwidth of the network infrastructure. We address the problem by proposing DFabric, a two-tier interconnect architecture. First, DFabric disaggregates a rack's computing units with an interconnect fabric, i.e., CXL fabric, which scales at rack level, so that they can enjoy efficient intra-rack interconnection. Second, DFabric disaggregates NICs from hosts and consolidates them to form a NIC pool with CXL fabric. By providing sufficient aggregated capacity comparable to interconnect bandwidth, the NIC pool bridges efficient communication across racks or beyond the reach limit of the interconnect fabric. However, local memory access becomes the bottleneck when enabling each host to utilize the NIC pool efficiently. To this end, DFabric builds a memory pool with sufficient bandwidth by disaggregating host local memory and adding more memory devices. We have implemented a prototype of DFabric that can run applications transparently. We validated its performance gain by running various microbenchmarks and compute-intensive applications such as DNN and graph.

9/10/2024

A Programming Model for Disaggregated Memory over CXL

Gal Assa, Michal Friedman, Ori Lahav

CXL (Compute Express Link) is an emerging open industry-standard interconnect between processing and memory devices that is expected to revolutionize the way systems are designed in the near future. It enables cache-coherent shared memory pools in a disaggregated fashion at unprecedented scales, allowing algorithms to interact with a variety of storage devices using simple loads and stores at cacheline granularity. Alongside unleashing unique opportunities for a wide range of applications, CXL introduces new challenges of data management and crash consistency. Alas, CXL lacks an adequate programming model, which makes reasoning about the correctness and expected behaviors of algorithms and systems on top of it nearly impossible. In this work, we present CXL0, the first programming model for concurrent programs running on top of CXL. We propose a high-level abstraction for CXL memory accesses and formally define operational semantics on top of that abstraction. We provide a set of general transformations that adapt concurrent algorithms to the new disruptive technology. Using these transformations, every linearizable algorithm can be easily transformed into its provably correct version in the face of a full-system or sub-system crash. We believe that this work will serve as the stepping stone for systems design and modelling on top of CXL, and support the development of future models as software and hardware evolve.

7/24/2024