Scaling Data Plane Verification with Intent-based Slicing

Read original: arXiv:2405.20982 - Published 6/3/2024 by Kuan-Yen Chou, Santhosh Prabhu, Giri Subramanian, Wenxuan Zhou, Aanand Nayyar, Brighten Godfrey, Matthew Caesar

Overview

Verifying the correctness of network data planes is crucial for ensuring network reliability and security.
Existing approaches use monolithic network models, which have high memory requirements and limited scalability.
This paper introduces Scylla, a new data plane verification system that provides fine-grained scalability without the need for a single, large network model.

Plain English Explanation

The data plane is the part of a network that actually forwards and processes packets. Verifying that it behaves as intended is essential for a reliable and secure network. However, existing approaches to data plane verification have significant limitations.

The main issue is that these methods represent the entire network in a single, large model. As networks grow larger and more complex, this monolithic model demands a great deal of memory and becomes hard to scale. The existing method of scaling out verification across machines is too limited in expressiveness to capture the features found in practical networks.

Scylla takes a different approach. Instead of a single, large model, Scylla creates "intent-based slices" - smaller, more targeted models that each focus on verifying a specific set of network behaviors or "intents." These sliced models are then distributed across a cluster of machines, allowing Scylla to verify large, complex networks much more efficiently.

The key idea is that the scaling problem becomes tied to the size of these intent-based slices, rather than the size of the entire network. This enables Scylla to verify networks using much smaller units of work, requiring far less memory and time than previous techniques.

Technical Explanation

Scylla is a distributed data plane verification system that addresses the scalability limitations of existing monolithic approaches. Instead of a single, large network model, Scylla builds models for fine-grained "intent-based slices". Each slice is constructed at rule-level granularity and contains just enough of the network to verify a given set of "intents", that is, the specific network behaviors to be checked.
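To make the idea of a slice concrete, here is a minimal Python sketch. It is not Scylla's algorithm: the `Rule` and `Intent` types, the `build_slice` helper, and the choice to define relevance as "the rule's destination prefix overlaps the intent's destination prefix" are all assumptions made for illustration. It only shows the general shape of keeping just the rules an intent could depend on.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network, ip_network


@dataclass(frozen=True)
class Rule:
    """A toy forwarding rule (hypothetical structure, not Scylla's)."""
    device: str
    prefix: IPv4Network
    next_hop: str  # name of the next device, or "deliver"


@dataclass(frozen=True)
class Intent:
    """A toy intent: 'traffic to dst must behave correctly'."""
    name: str
    dst: IPv4Network


def build_slice(rules, intent):
    """Keep only the rules whose match space overlaps the intent's traffic."""
    return [r for r in rules if r.prefix.overlaps(intent.dst)]


rules = [
    Rule("r1", ip_network("10.0.0.0/8"), "r2"),
    Rule("r1", ip_network("192.168.0.0/16"), "r3"),
    Rule("r2", ip_network("10.1.0.0/16"), "deliver"),
    Rule("r3", ip_network("192.168.1.0/24"), "deliver"),
]
intent = Intent("reach-10.1", ip_network("10.1.0.0/16"))

slice_rules = build_slice(rules, intent)
for r in slice_rules:
    print(r.device, r.prefix, "->", r.next_hop)
```

In this toy network the slice holds two of the four rules, so a verifier working on this intent never needs to load the `192.168.x.x` rules at all.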

The sliced models are retained in memory across a cluster and are updated incrementally as the network changes. This allows Scylla to verify large, complex networks in much smaller units of work, significantly reducing the memory and time required compared to past techniques.
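The incremental update step can be pictured as routing each rule change only to the slices it can affect. The Python sketch below is a simplified illustration under stated assumptions: `SliceModel`, `dispatch`, and the prefix-overlap relevance test are hypothetical names and choices, and a plain dictionary stands in for slices that would live on different machines in a real cluster.

```python
from ipaddress import ip_network


class SliceModel:
    """In-memory model for one slice (hypothetical, for illustration only)."""

    def __init__(self, intent_dst):
        self.intent_dst = ip_network(intent_dst)
        self.rules = set()  # (device, prefix) pairs currently in this slice

    def relevant(self, prefix):
        # Assumption: an update matters only if it overlaps the intent's traffic.
        return ip_network(prefix).overlaps(self.intent_dst)

    def apply(self, op, device, prefix):
        key = (device, prefix)
        if op == "add":
            self.rules.add(key)
        elif op == "remove":
            self.rules.discard(key)


def dispatch(slices, op, device, prefix):
    """Apply one rule update to the affected slices and return their names."""
    touched = []
    for name, model in slices.items():
        if model.relevant(prefix):
            model.apply(op, device, prefix)
            touched.append(name)
    return touched


slices = {"web": SliceModel("10.1.0.0/16"), "db": SliceModel("10.2.0.0/16")}

touched_1 = dispatch(slices, "add", "r1", "10.1.5.0/24")     # narrow update
touched_2 = dispatch(slices, "add", "r1", "10.0.0.0/8")      # broad update
touched_3 = dispatch(slices, "remove", "r1", "10.1.5.0/24")  # undo the first
print(touched_1, touched_2, touched_3)
```

A narrow update touches a single slice, while a broad one touches every slice it overlaps. Slices that are not touched need no recomputation, which is the property that keeps each unit of work small.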

The key innovation is that Scylla's scaling is tied to the size of the intent-based slices, rather than the size of the entire network. This enables Scylla to scale out verification in a more granular and efficient manner, without the need for a single, monolithic network model.
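A small back-of-the-envelope sketch shows why tying cost to slice size matters. The numbers below are made up, and the greedy placement policy (largest slice first onto the least-loaded worker) is an assumption chosen for illustration, not something the paper is known to use. The point is only that each worker's load depends on the slices it holds, not on the total size of the network.

```python
import heapq


def assign_slices(slice_sizes, num_workers):
    """Greedy placement: largest slice first, onto the least-loaded worker."""
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    loads = [0] * num_workers
    for _name, size in sorted(slice_sizes.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)
        loads[worker] = load + size
        heapq.heappush(heap, (loads[worker], worker))
    return loads


network_rules = 10_000  # hypothetical rule count a monolithic model would hold
slice_sizes = {"a": 400, "b": 300, "c": 200, "d": 100}  # hypothetical slices

loads = assign_slices(slice_sizes, num_workers=2)
print("per-worker load:", loads)
print("largest single unit of work:", max(slice_sizes.values()))
print("monolithic model size:", network_rules)
```

Under these assumed numbers, no worker ever holds more than a small fraction of what a monolithic model would require, and adding workers spreads the slices further without any one of them needing the whole network.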

The paper's experiments show that Scylla's minimum units of work are significantly smaller, in both memory and time, than those of past techniques. By breaking the problem into smaller pieces, Scylla can use distributed computing resources for fast scale-out verification with minimal resource requirements.

Critical Analysis

The paper presents a compelling approach to addressing the scalability challenges of existing data plane verification methods. By introducing the concept of "intent-based slices," Scylla offers a promising solution to the memory and computational requirements of verifying large, complex networks.

However, the paper does not discuss potential limitations or areas for further research in depth. For example, it would be useful to understand how Scylla's performance scales as the number of network "intents" increases, or how the system handles highly dynamic network changes that require frequent updates to the distributed models.

Additionally, the paper could benefit from a more thorough discussion of the trade-offs and potential drawbacks of the Scylla approach. While the results demonstrate impressive performance gains, it is important to consider any potential downsides or edge cases that may arise in real-world deployments.

Researchers may also want to explore ways to integrate Scylla with other verification or network slicing techniques to further enhance the system's capabilities and robustness.

Conclusion

The Scylla data plane verification system represents a significant advancement in the field of network correctness validation. By moving away from monolithic network models and instead leveraging fine-grained "intent-based slices," Scylla provides a more scalable and efficient approach to verifying the behavior of large, complex networks.

The paper's findings suggest that Scylla's distributed, incremental verification model has the potential to greatly improve the reliability and security of critical network infrastructure, ultimately benefiting a wide range of applications and services that rely on robust and trustworthy network operations.

Related Papers

Scaling Data Plane Verification with Intent-based Slicing

Kuan-Yen Chou, Santhosh Prabhu, Giri Subramanian, Wenxuan Zhou, Aanand Nayyar, Brighten Godfrey, Matthew Caesar

Data plane verification has grown into a powerful tool to ensure network correctness. However, existing monolithic data plane models have high memory requirements with large networks, and the existing method of scaling out is too limited in expressiveness to capture practical network features. In this paper, we describe Scylla, a general data plane verifier that provides fine-grained scale-out without the need for a monolithic network model. Scylla creates models for what we call intent-based slices, each of which is constructed at a fine (rule-level) granularity with just enough to verify a given set of intents. The sliced models are retained in memory across a cluster and are incrementally updated in a distributed compute cluster in response to network updates. Our experiments show that Scylla makes the scaling problem more granular -- tied to the size of the intent-based slices rather than that of the overall network. This enables Scylla to verify large, complex networks in minimum units of work that are significantly smaller (in both memory and time) than past techniques, enabling fast scale-out verification with minimal resource requirement.

6/3/2024

Scalable, Interpretable Distributed Protocol Verification by Inductive Proof Slicing

William Schultz, Edward Ashton, Heidi Howard, Stavros Tripakis

Many techniques for automated inference of inductive invariants for distributed protocols have been developed over the past several years, but their performance can still be unpredictable and their failure modes opaque for large-scale verification tasks. In this paper, we present inductive proof slicing, a new automated, compositional technique for inductive invariant inference that scales effectively to large distributed protocol verification tasks. Our technique is built on a core, novel data structure, the inductive proof graph, which explicitly represents the lemma and action dependencies of an inductive invariant and is built incrementally during the inference procedure, backwards from a target safety property. We present an invariant inference algorithm that integrates localized syntax-guided lemma synthesis routines at nodes of this graph, which are accelerated by computation of localized grammar and state variable slices. Additionally, in the case of failure to produce a complete inductive invariant, maintenance of this proof graph structure allows failures to be localized to small sub-components of this graph, enabling fine-grained failure diagnosis and repair by a user. We evaluate our technique on several complex distributed and concurrent protocols, including a large scale specification of the Raft consensus protocol, which is beyond the capabilities of modern distributed protocol verification tools, and also demonstrate how its interpretability features allow effective diagnosis and repair in cases of initial failure.

4/30/2024

A Tale of Two Scales: Reconciling Horizontal and Vertical Scaling for Inference Serving Systems

Kamran Razavi, Mehran Salmani, Max Muhlhauser, Boris Koldehofe, Lin Wang

Inference serving is of great importance in deploying machine learning models in real-world applications, ensuring efficient processing and quick responses to inference requests. However, managing resources in these systems poses significant challenges, particularly in maintaining performance under varying and unpredictable workloads. Two primary scaling strategies, horizontal and vertical scaling, offer different advantages and limitations. Horizontal scaling adds more instances to handle increased loads but can suffer from cold start issues and increased management complexity. Vertical scaling boosts the capacity of existing instances, allowing for quicker responses but is limited by hardware and model parallelization capabilities. This paper introduces Themis, a system designed to leverage the benefits of both horizontal and vertical scaling in inference serving systems. Themis employs a two-stage autoscaling strategy: initially using in-place vertical scaling to handle workload surges and then switching to horizontal scaling to optimize resource efficiency once the workload stabilizes. The system profiles the processing latency of deep learning models, calculates queuing delays, and employs different dynamic programming algorithms to solve the joint horizontal and vertical scaling problem optimally based on the workload situation. Extensive evaluations with real-world workload traces demonstrate over $10\times$ SLO violation reduction compared to the state-of-the-art horizontal or vertical autoscaling approaches while maintaining resource efficiency when the workload is stable.

7/23/2024

GraphScale: A Framework to Enable Machine Learning over Billion-node Graphs

Vipul Gupta, Xin Chen, Ruoyun Huang, Fanlong Meng, Jianjun Chen, Yujun Yan

Graph Neural Networks (GNNs) have emerged as powerful tools for supervised machine learning over graph-structured data, while sampling-based node representation learning is widely utilized in unsupervised learning. However, scalability remains a major challenge in both supervised and unsupervised learning for large graphs (e.g., those with over 1 billion nodes). The scalability bottleneck largely stems from the mini-batch sampling phase in GNNs and the random walk sampling phase in unsupervised methods. These processes often require storing features or embeddings in memory. In the context of distributed training, they require frequent, inefficient random access to data stored across different workers. Such repeated inter-worker communication for each mini-batch leads to high communication overhead and computational inefficiency. We propose GraphScale, a unified framework for both supervised and unsupervised learning to store and process large graph data distributedly. The key insight in our design is the separation of workers who store data and those who perform the training. This separation allows us to decouple computing and storage in graph training, thus effectively building a pipeline where data fetching and data computation can overlap asynchronously. Our experiments show that GraphScale outperforms state-of-the-art methods for distributed training of both GNNs and node embeddings. We evaluate GraphScale both on public and proprietary graph datasets and observe a reduction of at least 40% in end-to-end training times compared to popular distributed frameworks, without any loss in performance. While most existing methods don't support billion-node graphs for training node embeddings, GraphScale is currently deployed in production at TikTok enabling efficient learning over such large graphs.

7/23/2024