ALock: Asymmetric Lock Primitive for RDMA Systems

Read original: arXiv:2404.17980 - Published 4/30/2024 by Amanda Baran, Jacob Nelson-Slivon, Lewis Tseng, Roberto Palmieri

Overview

RDMA (Remote Direct Memory Access) networks are a rapidly growing technology in industry due to their high speed, low latency, and reduced CPU overhead compared to traditional TCP/IP networks.
RDMA allows threads to access remote memory without going through another process, but it does not guarantee atomicity between local and remote accesses, complicating synchronization.
The current solution of using loopback can degrade performance, so the authors introduce a new locking primitive called ALock to address this issue.

Plain English Explanation

RDMA is a technology that lets a computer read and write memory on another computer over the network without involving a process on the remote machine. This is great for applications that need to move a lot of data around fast, like those used in NoTNets or RACS.

However, RDMA has a tricky problem: it doesn't guarantee atomicity between local memory accesses (made by threads on the same computer) and remote memory accesses (made over the network from other computers). A local update and a remote update to the same location can therefore interfere with each other, which makes it hard to write programs that safely share data between computers.

The current solution is to force local memory accesses to go through the RDMA card as well, a mechanism known as loopback, but this can slow things down a lot. The authors of this paper came up with a new locking primitive called ALock that lets programs synchronize local and remote memory accesses without loopback or remote procedure calls (RPCs). ALock is inspired by the classic Peterson's algorithm, and it uses a hierarchical design with two cohorts, one for local threads and one for remote threads, each managed by its own embedded MCS lock.
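For readers who have not met Peterson's algorithm, the sketch below shows the textbook two-party version, which achieves mutual exclusion with plain reads and writes only. It is an illustration of the classic algorithm, not the paper's code: the class name `PetersonLock` and the helper `run_demo` are made up for this example, and the sketch assumes CPython, whose interpreter lock makes these reads and writes behave sequentially. On real hardware the same logic would need memory fences or atomic variables.

```python
import threading
import time


class PetersonLock:
    """Textbook Peterson lock for exactly two parties (ids 0 and 1)."""

    def __init__(self):
        self.interested = [False, False]  # who currently wants the lock
        self.victim = 0                   # who gives way when both want it

    def lock(self, me):
        other = 1 - me
        self.interested[me] = True        # announce intent
        self.victim = me                  # offer to wait if there is a tie
        # Wait only while the other party wants the lock AND we are the victim.
        while self.interested[other] and self.victim == me:
            time.sleep(0)                 # yield so the other thread can run

    def unlock(self, me):
        self.interested[me] = False


def run_demo(iterations=100):
    """Two threads increment a shared counter under the lock."""
    lock = PetersonLock()
    shared = {"count": 0}

    def worker(me):
        for _ in range(iterations):
            lock.lock(me)
            tmp = shared["count"]
            time.sleep(0)                 # widen the race window on purpose
            shared["count"] = tmp + 1
            lock.unlock(me)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]
```

Calling `run_demo(100)` should return exactly twice the iteration count, because the read-then-write in the critical section is never interleaved. The appeal for the RDMA setting described above is that the algorithm needs no atomic read-modify-write instruction shared by both parties.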

Technical Explanation

The authors introduce ALock, a novel locking primitive designed specifically for RDMA-based systems to address the challenge of synchronizing local and remote memory accesses. ALock draws inspiration from the classic Peterson's algorithm and uses a hierarchical design with embedded MCS locks for two cohorts - remote and local.
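One plausible way to picture such a hierarchical design is sketched below. This is a single-process toy under loud assumptions, not the authors' implementation: the names `MCSLock`, `TwoCohortLock`, and `run_demo` are hypothetical, a `threading.Lock` stands in for the atomic swap and compare-and-swap that a real MCS lock would use, and the arbitration between cohorts is plain Peterson's algorithm rather than whatever variant the paper defines.

```python
import threading
import time

LOCAL, REMOTE = 0, 1


class Node:
    """Queue node owned by one thread; the thread spins on its own flag."""
    def __init__(self):
        self.locked = False
        self.next = None


class MCSLock:
    """MCS queue lock: waiters form a linked list and are served in order."""
    def __init__(self):
        self.tail = None
        self._atomic = threading.Lock()   # assumption: emulates hardware swap/CAS

    def acquire(self, node):
        node.locked, node.next = True, None
        with self._atomic:                # atomic swap: tail <- node
            prev, self.tail = self.tail, node
        if prev is None:
            return                        # queue was empty, lock is ours
        prev.next = node                  # link behind the predecessor
        while node.locked:
            time.sleep(0)

    def release(self, node):
        with self._atomic:                # CAS(tail, node, None)
            if self.tail is node:
                self.tail = None
                return
        while node.next is None:          # successor is still linking itself
            time.sleep(0)
        node.next.locked = False          # hand the lock to the successor


class TwoCohortLock:
    """Each cohort queues on its own MCS lock; cohort leaders then
    compete with Peterson's algorithm using plain reads and writes."""
    def __init__(self):
        self.cohort = [MCSLock(), MCSLock()]
        self.interested = [False, False]
        self.victim = LOCAL

    def acquire(self, side, node):
        self.cohort[side].acquire(node)   # step 1: become cohort leader
        self.interested[side] = True      # step 2: Peterson across cohorts
        self.victim = side
        while self.interested[1 - side] and self.victim == side:
            time.sleep(0)

    def release(self, side, node):
        self.interested[side] = False     # clear the flag before handing over
        self.cohort[side].release(node)


def run_demo(threads_per_side=2, iterations=25):
    lock = TwoCohortLock()
    shared = {"count": 0}

    def worker(side):
        node = Node()
        for _ in range(iterations):
            lock.acquire(side, node)
            tmp = shared["count"]
            time.sleep(0)                 # widen the race window on purpose
            shared["count"] = tmp + 1
            lock.release(side, node)

    workers = [threading.Thread(target=worker, args=(side,))
               for side in (LOCAL, REMOTE) for _ in range(threads_per_side)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return shared["count"]
```

The design point the sketch tries to convey is that atomic operations are only ever needed among threads of the same cohort, while the two cohorts coordinate through ordinary reads and writes. In the sketch, the Peterson flag is cleared before the cohort lock is passed on so that the next leader's announcement cannot be overwritten.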

To evaluate ALock, the authors implement a distributed lock table and measure its throughput and latency across various cluster configurations and workloads. In workloads with a majority of local operations, ALock achieves up to 29x higher throughput and up to 20x lower latency than competing approaches. This improvement comes from allowing threads to synchronize local and remote accesses without the need for costly loopback or remote procedure calls (RPCs).
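To make the shape of such an experiment concrete, here is a minimal single-process sketch of a lock table and a locality-controlled workload. Everything in it is an assumption made for illustration: the names `LockTable`, `make_workload`, and `run_workload`, the modulo partitioning of keys across nodes, and the use of `threading.Lock` as a stand-in for the lock under test. It does not reproduce the paper's benchmark and says nothing about the reported numbers.

```python
import random
import threading
import time


class LockTable:
    """Hypothetical lock table: one lock per key, keys spread across nodes."""

    def __init__(self, num_nodes, keys_per_node):
        self.num_nodes = num_nodes
        self.locks = [threading.Lock() for _ in range(num_nodes * keys_per_node)]

    def home_node(self, key):
        return key % self.num_nodes       # assumed partitioning scheme

    def is_local(self, key, my_node):
        return self.home_node(key) == my_node


def make_workload(table, my_node, num_ops, local_fraction, seed=0):
    """Pick keys so that roughly local_fraction of operations are local."""
    rng = random.Random(seed)
    all_keys = range(len(table.locks))
    local_keys = [k for k in all_keys if table.is_local(k, my_node)]
    remote_keys = [k for k in all_keys if not table.is_local(k, my_node)]
    return [
        rng.choice(local_keys if rng.random() < local_fraction else remote_keys)
        for _ in range(num_ops)
    ]


def run_workload(table, keys):
    """Acquire and release each key's lock, recording per-operation latency."""
    latencies = []
    start = time.perf_counter()
    for key in keys:
        t0 = time.perf_counter()
        with table.locks[key]:
            pass                          # the critical section would go here
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "throughput_ops_per_s": len(keys) / elapsed,
        "mean_latency_s": sum(latencies) / len(latencies),
    }
```

For example, `run_workload(table, make_workload(table, my_node=0, num_ops=1000, local_fraction=0.9))` models a node whose operations are mostly local, the regime in which the summary above reports the largest gains.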

The key insight behind ALock is that by leveraging the strengths of both local and remote locking mechanisms, it can provide efficient synchronization for a wide range of access patterns. The hierarchical design with embedded MCS locks ensures scalability and low overhead, making ALock a practical solution for RDMA-based systems like those used in ALLO or Holmes.

Critical Analysis

The authors provide a thorough evaluation of ALock's performance under various workloads and configurations, demonstrating its significant advantages over existing solutions. However, the paper does not discuss potential limitations or edge cases where ALock may not perform as well.

One area that could be explored further is the impact of different network topologies and hardware configurations on ALock's behavior. The experiments were conducted in a controlled environment, and it would be valuable to understand how ALock would perform in more diverse real-world deployments.

Additionally, the authors could have discussed the complexity of implementing ALock and the potential challenges developers may face when incorporating it into their RDMA-based systems. A deeper discussion of the tradeoffs and design decisions would help readers better understand the practical implications of adopting this new locking primitive.

Overall, the research presented in this paper is a valuable contribution to the field of RDMA-based systems, and the ALock locking primitive offers a promising solution to the challenge of synchronizing local and remote memory accesses.

Conclusion

This paper introduces ALock, a novel locking primitive designed to address the synchronization challenges in RDMA-based systems. By leveraging a hierarchical design with embedded MCS locks, ALock enables efficient synchronization of local and remote memory accesses without the need for costly loopback or remote procedure calls.

The authors' thorough evaluation demonstrates ALock's significant performance advantages over existing solutions, particularly in workloads with a majority of local operations. This innovation has the potential to unlock new opportunities for high-performance, low-latency distributed systems in a wide range of applications, from NoTNets to RACS and beyond.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ALock: Asymmetric Lock Primitive for RDMA Systems

Amanda Baran, Jacob Nelson-Slivon, Lewis Tseng, Roberto Palmieri

Remote direct memory access (RDMA) networks are being rapidly adopted into industry for their high speed, low latency, and reduced CPU overheads compared to traditional kernel-based TCP/IP networks. RDMA enables threads to access remote memory without interacting with another process. However, atomicity between local accesses and remote accesses is not guaranteed by the technology, hence complicating synchronization significantly. The current solution is to require threads wanting to access local memory in an RDMA-accessible region to pass through the RDMA card using a mechanism known as loopback, but this can quickly degrade performance. In this paper, we introduce ALock, a novel locking primitive designed for RDMA-based systems. ALock allows programmers to synchronize local and remote accesses without using loopback or remote procedure calls (RPCs). We draw inspiration from the classic Peterson's algorithm to create a hierarchical design that includes embedded MCS locks for two cohorts, remote and local. To evaluate the ALock we implement a distributed lock table, measuring throughput and latency in various cluster configurations and workloads. In workloads with a majority of local operations, the ALock outperforms competitors up to 29x and achieves a latency up to 20x faster.

4/30/2024

Distributed Locking as a Data Type

Julian Haas (Technische Universität Darmstadt), Ragnar Mogk (Technische Universität Darmstadt), Annette Bieniusa (University of Kaiserslautern-Landau), Mira Mezini (Technische Universität Darmstadt)

Mixed-consistency programming models assist programmers in designing applications that provide high availability while still ensuring application-specific safety invariants. However, existing models often make specific system assumptions, such as building on a particular database system or having baked-in coordination strategies. This makes it difficult to apply these strategies in diverse settings, ranging from client/server to ad-hoc peer-to-peer networks. This work proposes a new strategy for building programmable coordination mechanisms based on the algebraic replicated data types (ARDTs) approach. ARDTs allow for simple and composable implementations of various protocols, while making minimal assumptions about the network environment. As a case study, two different locking protocols are presented, both implemented as ARDTs. In addition, we elaborate on our ongoing efforts to integrate the approach into the LoRe mixed-consistency programming language.

5/27/2024

SeqBalance: Congestion-Aware Load Balancing with no Reordering for RoCE

Huimin Luo, Jiao Zhang, Mingxuan Yu, Yongchen Pan, Tian Pan, Tao Huang

Remote Direct Memory Access (RDMA) is widely used in data center networks because of its high performance. However, due to the characteristics of RDMA's retransmission strategy and the traffic mode of AI training, current load balancing schemes for data center networks are unsuitable for RDMA. In this paper, we propose SeqBalance, a load balancing framework designed for RDMA. SeqBalance implements fine-grained load balancing for RDMA through a reasonable design and does not cause reordering problems. SeqBalance's designs are all based on existing commercial RNICs and commercial programmable switches, so they are compatible with existing data center networks. We have implemented SeqBalance in Mellanox CX-6 RNICs and Tofino switches. The results of hardware testbed experiments and large-scale simulations show that compared with existing load balancing schemes, SeqBalance improves 18.7% and 33.2% on average FCT and 99th percentile FCT.

7/16/2024

DistR: Language-Guided Distributed Shared Memory with Fine Granularity, Full Transparency, and Ultra Efficiency

Haoran Ma, Yifan Qiao, Shi Liu, Shan Yu, Yuanjiang Ni, Qingda Lu, Jiesheng Wu, Yiying Zhang, Miryung Kim, Harry Xu

Despite being a powerful concept, distributed shared memory (DSM) has not been made practical due to the extensive synchronization needed between servers to implement memory coherence. This paper shows a practical DSM implementation based on the insight that the ownership model embedded in programming languages such as Rust automatically constrains the order of read and write, providing opportunities for significantly simplifying the coherence implementation if the ownership semantics can be exposed to and leveraged by the runtime. This paper discusses the design and implementation of DistR, a Rust-based DSM system that outperforms the two state-of-the-art DSM systems GAM and Grappa by up to 2.64x and 29.16x in throughput, and scales much better with the number of servers.

7/1/2024