DFabric: Scaling Out Data Parallel Applications with CXL-Ethernet Hybrid Interconnects

Read original: arXiv:2409.05404 - Published 9/10/2024 by Xu Zhang, Ke Liu, Yisong Chang, Hui Yuan, Xiaolong Zheng, Ke Zhang, Mingyu Chen

Overview

The paper introduces DFabric, a system that scales out data-parallel applications using a hybrid CXL-Ethernet interconnect.
DFabric aims to enable efficient scaling of data-parallel workloads by combining the high bandwidth of CXL with the flexibility and scalability of Ethernet.
The paper presents the DFabric architecture, evaluates its performance, and discusses the benefits and limitations of the hybrid interconnect approach.

Plain English Explanation

DFabric is a new system designed to help run certain types of data-processing applications more efficiently. These applications, known as "data-parallel" workloads, can be split up and run across multiple computers at the same time to speed things up. However, getting all of those computers to work together smoothly can be challenging.

DFabric tries to solve this problem by using a combination of two different communication technologies: CXL and Ethernet. CXL is a high-speed interconnect that can move data quickly between different components inside a single computer. Ethernet is a more flexible network technology that can connect many computers together.

By using both CXL and Ethernet, DFabric aims to get the best of both worlds: the speed of CXL for communication over short distances (within a rack, according to the paper's abstract), and the scalability of Ethernet for connecting machines beyond that. This hybrid approach is designed to allow data-parallel applications to run more efficiently and scale up to use many computers at once.

The paper describes the architecture of DFabric, how it works, and presents some test results showing its performance. It also discusses the benefits and limitations of this hybrid interconnect approach compared to other ways of running data-parallel workloads.

Technical Explanation

The paper introduces DFabric, a system that combines the high bandwidth of CXL with the flexibility and scalability of Ethernet to enable efficient scaling of data-parallel workloads.

The key idea behind DFabric is to leverage the strengths of both CXL and Ethernet interconnects. CXL provides low latency and high bandwidth within the reach of the CXL fabric (the paper's abstract describes a fabric that scales at rack level), while Ethernet offers a flexible and scalable way to connect hosts beyond that reach. By integrating the two interconnects, DFabric aims to get the benefits of both.
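To make the bandwidth gap concrete, here is a back-of-the-envelope sketch. The 256 GB/s figure for a CXL 3.0 link is the one quoted in the paper's abstract (reproduced under Related Papers on this page). The 400 Gb/s Ethernet NIC speed and the function name `nics_to_match` are assumptions chosen for illustration, not values or APIs from the paper.

```python
import math

CXL_LINK_GBYTES_PER_S = 256  # GB/s, CXL 3.0 link figure quoted in the abstract
NIC_GBITS_PER_S = 400        # Gb/s, assumed Ethernet NIC line rate (not from the paper)


def nics_to_match(link_gbytes_per_s: float, nic_gbits_per_s: float) -> int:
    """Return how many NICs are needed for their aggregate line rate
    to meet or exceed the given link bandwidth."""
    link_gbits_per_s = link_gbytes_per_s * 8  # convert bytes/s to bits/s
    return math.ceil(link_gbits_per_s / nic_gbits_per_s)


if __name__ == "__main__":
    n = nics_to_match(CXL_LINK_GBYTES_PER_S, NIC_GBITS_PER_S)
    print(f"NICs needed to match one CXL link: {n}")
```

Under these assumptions, a single network link falls well short of one CXL link, which is the kind of gap that motivates aggregating several NICs rather than relying on one per host.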

The DFabric architecture consists of two main components:

CXL-Ethernet Gateway: This component bridges the CXL and Ethernet domains, allowing data to flow between them efficiently.
DFabric Runtime: This software layer manages the orchestration of data-parallel applications across the DFabric network, handling tasks like data partitioning, task scheduling, and load balancing.
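The split between the two domains can be pictured with a small sketch. This is an illustration of the general hybrid idea only, not the paper's implementation: the `Endpoint` type, the `cxl_domain` label, and `pick_interconnect` are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    cxl_domain: str  # hypothetical label for the CXL fabric this endpoint is attached to


def pick_interconnect(src: Endpoint, dst: Endpoint) -> str:
    """Return "cxl" when both endpoints sit on the same CXL fabric,
    otherwise "ethernet" (traffic has to leave the fabric)."""
    return "cxl" if src.cxl_domain == dst.cxl_domain else "ethernet"


if __name__ == "__main__":
    a = Endpoint("gpu0", cxl_domain="rack-A")
    b = Endpoint("gpu1", cxl_domain="rack-A")
    c = Endpoint("gpu2", cxl_domain="rack-B")
    print(pick_interconnect(a, b))  # same fabric
    print(pick_interconnect(a, c))  # different fabrics
```

A real system would also have to handle congestion, failures, and memory placement; the sketch only captures the reach-based choice between the two interconnects.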

The paper evaluates the performance of DFabric using various data-parallel benchmarks and compares it to alternative approaches, such as using pure Ethernet or a disaggregated memory architecture like emucxl. The results demonstrate that DFabric can achieve significant speedups, particularly for applications with large memory footprints or high communication requirements.

Critical Analysis

The paper presents a well-designed and thoroughly evaluated system in DFabric. The authors have clearly identified the challenges of scaling out data-parallel applications and have proposed a novel solution that leverages the strengths of both CXL and Ethernet interconnects.

One potential limitation of the DFabric approach is the reliance on the CXL-Ethernet Gateway, which could become a bottleneck if not properly designed and optimized. The paper does not delve deeply into the implementation details of this critical component, and future work could explore ways to further improve its performance and scalability.

Additionally, the paper focuses on evaluating DFabric's performance with synthetic benchmarks and relatively simple data-parallel applications. It would be valuable to see how DFabric fares with more complex, real-world data-parallel workloads, such as large-scale machine learning or data processing pipelines.

Despite these minor limitations, the DFabric system presents a promising approach to scaling out data-parallel applications, and the paper provides valuable insights into the design and performance tradeoffs of hybrid CXL-Ethernet interconnects. Further research and development in this area could lead to significant advancements in the field of distributed computing and data-intensive applications.

Conclusion

The DFabric paper introduces a system that combines the high bandwidth of CXL with the flexibility and scalability of Ethernet to enable efficient scaling of data-parallel applications. By integrating these two interconnect technologies, DFabric aims to provide low-latency, high-bandwidth communication within the reach of the CXL fabric and flexible, scalable Ethernet connectivity beyond it.

The paper's thorough evaluation and discussion of the DFabric architecture and performance offer valuable insights for researchers and engineers working on distributed computing systems and data-intensive applications. While the paper identifies some potential limitations, the overall approach of leveraging hybrid interconnects represents an exciting direction for further exploration and development in the field.

Related Papers

DFabric: Scaling Out Data Parallel Applications with CXL-Ethernet Hybrid Interconnects

Xu Zhang, Ke Liu, Yisong Chang, Hui Yuan, Xiaolong Zheng, Ke Zhang, Mingyu Chen

Emerging interconnects, such as CXL and NVLink, have been integrated into the intra-host topology to scale more accelerators and facilitate efficient communication between them, such as GPUs. To keep pace with the accelerator's growing computing throughput, the interconnect has seen substantial enhancement in link bandwidth, e.g., 256GBps for CXL 3.0 links, which surpasses Ethernet and InfiniBand network links by an order of magnitude or more. Consequently, when data-intensive jobs, such as LLM training, scale across multiple hosts beyond the reach limit of the interconnect, the performance is significantly hindered by the limiting bandwidth of the network infrastructure. We address the problem by proposing DFabric, a two-tier interconnect architecture. First, DFabric disaggregates rack's computing units with an interconnect fabric, i.e., CXL fabric, which scales at rack-level, so that they can enjoy intra-rack efficient interconnecting. Second, DFabric disaggregates NICs from hosts, and consolidates them to form a NIC pool with CXL fabric. By providing sufficient aggregated capacity comparable to interconnect bandwidth, the NIC pool bridges efficient communication across racks or beyond the reach limit of interconnect fabric. However, the local memory accessing becomes the bottleneck when enabling each host to utilize the NIC pool efficiently. To this end, DFabric builds a memory pool with sufficient bandwidth by disaggregating host local memory and adding more memory devices. We have implemented a prototype of DFabric that can run applications transparently. We validated its performance gain by running various microbenchmarks and compute-intensive applications such as DNN and graph.

9/10/2024

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Daniele De Sensi, Lorenzo Pichetti, Flavio Vella, Tiziano De Matteis, Zebin Ren, Luigi Fusco, Matteo Turisini, Daniele Cesarini, Kurt Lust, Animesh Trivedi, Duncan Roweth, Filippo Spiga, Salvatore Di Girolamo, Torsten Hoefler

Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers - Alps, Leonardo, and LUMI - each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing its limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization.

8/27/2024

Towards Disaggregation-Native Data Streaming between Devices

Nils Asmussen, Michael Roitzsch

Disaggregation is an ongoing trend to increase flexibility in datacenters. With interconnect technologies like CXL, pools of CPUs, accelerators, and memory can be connected via a datacenter fabric. Applications can then pick from those pools the resources necessary for their specific workload. However, this vision becomes less clear when we consider data movement. Workloads often require data to be streamed through chains of multiple devices, but typically, these data streams physically do not directly flow device-to-device, but are staged in memory by a CPU hosting device protocol logic. We show that augmenting devices with a disaggregation-native and device-independent data streaming facility can improve processing latencies by enabling data flows directly between arbitrary devices.

6/17/2024

emucxl: an emulation framework for CXL-based disaggregated memory applications

Raja Gond, Purushottam Kulkarni

The emergence of CXL (Compute Express Link) promises to transform the status of interconnects between host and devices and in turn impact the design of all software layers. With its low overhead, low latency, and memory coherency capabilities, CXL has the potential to improve the performance of existing devices while making viable new operational use cases (e.g., disaggregated memory pools, cache coherent memory across devices etc.). The focus of this work is design of applications and middleware with use of CXL for supporting disaggregated memory. A vital building block for solutions in this space is the availability of a standard CXL hardware and software platform. Currently, CXL devices are not commercially available, and researchers often rely on custom-built hardware or emulation techniques and/or use customized software interfaces and abstractions. These techniques do not provide a standard usage model and abstraction layer for CXL usage, and developers and researchers have to reinvent the CXL setup to design and test their solutions, our work aims to provide a standardized view of the CXL emulation platform and the software interfaces and abstractions for disaggregated memory. This standardization is designed and implemented as a user space library, emucxl and is available as a virtual appliance. The library provides a user space API and is coupled with a NUMA-based CXL emulation backend. Further, we demonstrate usage of the standardized API for different use cases relying on disaggregated memory and show that generalized functionality can be built using the open source emucxl library.

4/15/2024