RACS and SADL: Towards Robust SMR in the Wide-Area Network

2404.04183

Published 4/8/2024 by Pasindu Tennage, Antoine Desjardins, Lefteris Kokoris-Kogias

🌐

Abstract

Consensus algorithms deployed in the crash fault tolerant setting chose a leader-based architecture in order to achieve the lowest latency possible. However, when deployed in the wide area they face two key robustness challenges. First, they lose liveness when the network is unreliable because they rely on timeouts to find a leader. Second, they cannot have a high replication factor because of the high load imposed on the leader-replica making it a bottleneck. This effectively limits the replication factor allowed, for a given level of throughput, thus lowering the fault tolerance threshold. In this paper, we propose RACS and SADL, a modular state machine replication algorithm that addresses these two robustness challenges. To achieve robustness under adversarial network conditions, we propose RACS, a novel crash fault-tolerant consensus algorithm. RACS consists of two modes of operations: synchronous and asynchronous, that always ensure liveness. RACS leverages the synchronous network to minimize the communication cost to O(n) and matches the lower bound of O(n2) at adversarial-case executions. To avoid the leader bottleneck and to allow higher replication factor, without sacrificing the throughput, we then propose SADL, a novel consensus-agnostic asynchronous dissemination layer. SADL separates client command dissemination from the critical path of consensus and distributes the overhead evenly among all the replicas. The combination of RACS and SADL (SADL-RACS) provides a robust and high-performing state machine replication system. We implement and evaluate RACS and SADL-RACS in a wide-area deployment running on Amazon EC2.

Create account to get full access

Overview

Consensus algorithms deployed in crash fault-tolerant settings often use a leader-based architecture to achieve low latency.
However, these algorithms face two key robustness challenges when deployed in wide-area networks:
1. They lose liveness when the network is unreliable, as they rely on timeouts to find a leader.
2. They cannot have a high replication factor due to the high load on the leader-replica, which becomes a bottleneck.

Plain English Explanation

Consensus algorithms are used to ensure that multiple computers (or "nodes") in a network agree on a common set of data or decisions. When deployed in settings where some nodes may fail (known as "crash fault-tolerant" environments), these algorithms often use a "leader-based" architecture to achieve low response times.

In a leader-based system, one node is designated as the "leader" and is responsible for coordinating the other nodes (called "replicas"). This approach can work well in local, reliable networks. However, when deployed across a wide-area network, such as the internet, these algorithms face two key challenges:

Liveness Issues: The algorithms rely on "timeouts" to detect when the leader has failed and a new one needs to be chosen. In an unreliable wide-area network, these timeouts can cause the system to incorrectly think the leader has failed, even when it hasn't, leading to a loss of "liveness" (the ability to make progress).
Bottlenecks: The leader node in a leader-based system becomes a bottleneck, as all the replicas need to communicate with it. This limits the maximum number of replicas that can be used (the "replication factor") without compromising the system's throughput (the amount of data it can process).

Technical Explanation

The paper proposes two novel algorithms to address these robustness challenges:

RACS (Robust Asynchronous Consensus Substrates): RACS is a crash fault-tolerant consensus algorithm that operates in two modes: synchronous and asynchronous. This allows it to maintain liveness even in the face of unreliable network conditions. RACS minimizes the communication cost in the synchronous mode to O(n) and matches the lower bound of O(n^2) in the asynchronous, adversarial-case executions.
SADL (Scalable Asynchronous Dissemination Layer): SADL is a consensus-agnostic asynchronous dissemination layer that separates client command dissemination from the critical path of consensus. This distributes the overhead evenly among all the replicas, avoiding the leader bottleneck and allowing for a higher replication factor without sacrificing throughput.

The paper combines RACS and SADL into a system called SADL-RACS, which provides a robust and high-performing state machine replication solution. The authors implement and evaluate RACS and SADL-RACS in a wide-area deployment on Amazon EC2.

Critical Analysis

The paper addresses important challenges faced by consensus algorithms in wide-area networks, such as loss of liveness and leader bottlenecks. The proposed solutions, RACS and SADL, appear to be well-designed and address these challenges effectively.

However, the paper does not discuss some potential limitations or areas for further research:

The performance and scalability of RACS and SADL-RACS in very large-scale deployments (e.g., hundreds or thousands of nodes) is not evaluated.
The paper does not explore the impact of more severe network conditions, such as high packet loss rates or network partitions, on the algorithms' behavior.
The paper does not compare the proposed solutions to other recent consensus algorithms designed for wide-area networks, such as those discussed in Secure Link-State Routing in Mobile Ad-Hoc Networks, DASA: Delay-Adaptive Multi-Agent Stochastic Approximation, or Distributed Autonomous Swarm Formation in Dynamic Network Bridging.

Conclusion

This paper presents two novel algorithms, RACS and SADL, that address key robustness challenges faced by consensus algorithms in wide-area network deployments. By maintaining liveness under unreliable network conditions and avoiding leader bottlenecks, these algorithms enable more robust and scalable state machine replication systems. The evaluation of RACS and SADL-RACS in a wide-area setting suggests that the proposed solutions can be effectively deployed in real-world distributed systems, with potential applications in areas such as Cooperative Sensing and Communication in ISAC Networks: Performance Analysis and Adversary-Augmented Simulation to Evaluate Fairness in Hyperledger.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

Jolteon and Ditto: Network-Adaptive Efficient Consensus with Asynchronous Fallback

Rati Gelashvili, Lefteris Kokoris-Kogias, Alberto Sonnino, Alexander Spiegelman, Zhuolun Xiang

Existing committee-based Byzantine state machine replication (SMR) protocols, typically deployed in production blockchains, face a clear trade-off: (1) they either achieve linear communication cost in the happy path, but sacrifice liveness during periods of asynchrony, or (2) they are robust (progress with probability one) but pay quadratic communication cost. We believe this trade-off is unwarranted since existing linear protocols still have asymptotic quadratic cost in the worst case. We design Ditto, a Byzantine SMR protocol that enjoys the best of both worlds: optimal communication on and off the happy path (linear and quadratic, respectively) and progress guarantee under asynchrony and DDoS attacks. We achieve this by replacing the view-synchronization of partially synchronous protocols with an asynchronous fallback mechanism at no extra asymptotic cost. Specifically, we start from HotStuff, a state-of-the-art linear protocol, and gradually build Ditto. As a separate contribution and an intermediate step, we design a 2-chain version of HotStuff, Jolteon, which leverages a quadratic view-change mechanism to reduce the latency of the standard 3-chain HotStuff. We implement and experimentally evaluate all our systems. Notably, Jolteon's commit latency outperforms HotStuff by 200-300ms with varying system size. Additionally, Ditto adapts to the network and provides better performance than Jolteon under faulty conditions and better performance than VABA (a state-of-the-art asynchronous protocol) under faultless conditions. This proves our case that breaking the robustness-efficiency trade-off is in the realm of practicality.

5/2/2024

cs.DC cs.CR

🏅

D-CAST: Distributed Consensus Switch in Wireless Trustworthy Autonomous System

Dachao Yu, Jiayuan Ma, Hao Xu

The protocols of distributed consensus normally aim to tolerate different types of faults including crash faults and byzantine faults that occur in the distributed systems. However, the dynamic network topology and stochastic wireless channels may cause the same trustworthy system to suffer both crash fault and byzantine fault. This article proposes the concept of a distributed consensus autonomous switch mechanism in trustworthy autonomous systems (D-CAST) to reach the different fault tolerance requirements of the dynamic nodes and discusses the challenges of D-CAST while it is implemented in the wireless trustworthy system.

5/15/2024

cs.DC

Synchronous Consensus in Partial Synchrony

Ivan Klianev

We demonstrate a deterministic Byzantine consensus algorithm with synchronous operation in partial synchrony. It is naturally leaderless, tolerates any number of $ f<n/2 $ Byzantine processes with 2 rounds of exchange of originator-only signed messages, and terminates within a bounded interval of time. The algorithm is resilient to transient faults and asynchrony in a fraction of links with known size per number of faulty processes. It circumvents asynchronous and faulty links with 3-hop epidemic dissemination. Key finding: the resilience to asynchrony of links and the enabled by it leaderless consensus in partial synchrony ensure algorithm operation with simultaneous validity, safety, and bounded liveness.

5/16/2024

cs.DC

🔗

Secure Link State Routing for Mobile Ad Hoc Networks

Panagiotis Papadimitratos, Zygmunt J. Haas

The secure operation of the routing protocol is one of the major challenges to be met for the proliferation of the Mobile Ad hoc Networking (MANET) paradigm. Nevertheless, security enhancements have been proposed mostly for reactive MANET protocols. The proposed here Secure Link State Routing Protocol (SLSP) provides secure proactive topology discovery, which can be multiply beneficial to the network operation. SLSP can be employed as a stand-alone protocol, or fit naturally into a hybrid routing framework, when combined with a reactive protocol. SLSP is robust against individual attackers, it is capable of adjusting its scope between local and network-wide topology discovery, and it is capable of operating in networks of frequently changing topology and membership.

4/1/2024

cs.CR