Intrusion Tolerance for Networked Systems through Two-Level Feedback Control

2404.01741

Published 6/6/2024 by Kim Hammar, Rolf Stadler


Abstract

We formulate intrusion tolerance for a system with service replicas as a two-level optimal control problem. On the local level node controllers perform intrusion recovery, and on the global level a system controller manages the replication factor. The local and global control problems can be formulated as classical problems in operations research, namely, the machine replacement problem and the inventory replenishment problem. Based on this formulation, we design TOLERANCE, a novel control architecture for intrusion-tolerant systems. We prove that the optimal control strategies on both levels have threshold structure and design efficient algorithms for computing them. We implement and evaluate TOLERANCE in an emulation environment where we run 10 types of network intrusions. The results show that TOLERANCE can improve service availability and reduce operational cost compared with state-of-the-art intrusion-tolerant systems.


Overview

The researchers formulate intrusion tolerance for a system with service replicas as a two-level optimal control problem
Local node controllers perform intrusion recovery, while a global system controller manages the replication factor
The two levels map onto classical operations research problems: machine replacement (local) and inventory replenishment (global)
The researchers design a control architecture called TOLERANCE and report that it can improve service availability and reduce operational cost

Plain English Explanation

Keeping computer systems safe from attacks, or "intrusions," is a major challenge. The researchers in this paper tackle this problem by developing a system that can automatically recover from intrusions while also optimizing the use of computing resources.

Imagine you have a group of servers that together provide an online service. If one of those servers gets hacked, a controller attached to that server can decide to recover it, restoring it to a trusted state so it can keep serving requests. This local-level recovery happens automatically, without needing human intervention.

At a higher level, the system also manages how many copies of the service, or "replicas," are running. Having more replicas improves the chances of maintaining service when attacks happen, but also costs more to run. So the global-level controller adjusts the number of replicas to balance availability against cost.

The researchers model these local and global control problems using well-known formulations from operations research. This allows them to prove that the optimal control strategies have a "threshold" structure: the controller acts once a monitored quantity crosses a cutoff. That structure makes the strategies efficient to compute.

By implementing this TOLERANCE system, the researchers show they can keep services running reliably in the face of various network attacks, while also reducing the overall operational costs compared to other intrusion-tolerant systems.

Technical Explanation

The core idea is to formulate intrusion tolerance as a two-level optimal control problem. At the local level, each node controller performs intrusion recovery, which the paper models as a "machine replacement problem." The controller decides when to recover its node, trading the cost of recovery against the cost of leaving a possibly compromised node in service.
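One way to picture the local recovery decision is as a belief threshold rule. The sketch below is illustrative only and is not taken from the paper: the two-state model (healthy or compromised), the alert-based observations, and every parameter value (`p_attack`, `p_alert_compromised`, `p_alert_healthy`, `threshold`) are assumptions chosen for the example.

```python
def update_belief(belief, alert, p_attack=0.1,
                  p_alert_compromised=0.8, p_alert_healthy=0.2):
    """One Bayes-filter step for the probability that a node is compromised.

    All probabilities are hypothetical values chosen for illustration.
    """
    # Prediction: a healthy node becomes compromised with probability p_attack;
    # a compromised node stays compromised until it is recovered.
    prior = belief + (1.0 - belief) * p_attack
    # Correction: weigh the prior by the likelihood of the observation.
    like_c = p_alert_compromised if alert else 1.0 - p_alert_compromised
    like_h = p_alert_healthy if alert else 1.0 - p_alert_healthy
    numerator = like_c * prior
    return numerator / (numerator + like_h * (1.0 - prior))


def node_controller(alerts, threshold=0.7):
    """Threshold strategy: recover once the belief reaches the threshold."""
    belief = 0.0
    actions = []
    for alert in alerts:
        belief = update_belief(belief, alert)
        if belief >= threshold:
            actions.append("recover")
            belief = 0.0  # assume recovery returns the node to a healthy state
        else:
            actions.append("wait")
    return actions
```

With these assumed parameters, `node_controller([False, True, True, True])` returns `["wait", "wait", "recover", "wait"]`: two consecutive alerts push the belief past 0.7, the node is recovered, and the belief resets.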

At the global level, a system controller manages the replication factor, that is, the number of service replicas, which the paper models as an "inventory replenishment problem." The controller monitors the replica count and adjusts it as needed to balance availability and cost.
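The global decision can be sketched in the same spirit as an inventory-style rule: when the number of healthy replicas drops below a cutoff, add one. This is a simplified illustration rather than the paper's controller; the function names, the one-node-per-step action, and the values of `threshold` and `max_nodes` are assumptions.

```python
def system_controller(healthy_nodes, threshold=4, max_nodes=10):
    """Return 1 to add a replica, 0 to leave the replication factor unchanged.

    threshold and max_nodes are hypothetical values for illustration.
    """
    if healthy_nodes < threshold and healthy_nodes < max_nodes:
        return 1
    return 0


def simulate(failures_per_step, initial_nodes=5, threshold=4, max_nodes=10):
    """Replay a fixed sequence of node failures and record the replica count."""
    nodes = initial_nodes
    history = []
    for failures in failures_per_step:
        nodes = max(nodes - failures, 0)          # nodes lost in this step
        nodes += system_controller(nodes, threshold, max_nodes)
        history.append(nodes)
    return history
```

For example, `simulate([0, 2, 1, 0, 0])` returns `[5, 4, 4, 4, 4]`: the controller does nothing while five nodes are healthy and tops the system back up to the cutoff after each loss. A higher cutoff buys availability at a higher running cost, which is the trade-off the global controller is meant to balance.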

The researchers prove that the optimal control strategies at both levels have a threshold structure. This means there are clear cutoffs for when to recover a node and when to change the number of replicas. They develop efficient algorithms to compute these optimal thresholds.
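A practical consequence of threshold structure is that finding a good strategy reduces to a one-dimensional search over cutoffs. The sketch below shows that idea with a brute-force search over a seeded simulation. It is not the algorithm from the paper; the model, the cost values (`cost_compromised`, `cost_recovery`), and the candidate grid are all assumptions for illustration.

```python
import random


def average_cost(threshold, steps=2000, seed=0, p_attack=0.1,
                 p_alert_c=0.8, p_alert_h=0.2,
                 cost_compromised=2.0, cost_recovery=1.0):
    """Estimate the average per-step cost of a belief-threshold strategy.

    Every parameter is a hypothetical value chosen for the example.
    """
    rng = random.Random(seed)  # seeded, so repeated calls give the same estimate
    compromised = False
    belief = 0.0
    total = 0.0
    for _ in range(steps):
        # Environment: a healthy node may be compromised, then emits an alert or not.
        if not compromised and rng.random() < p_attack:
            compromised = True
        alert = rng.random() < (p_alert_c if compromised else p_alert_h)
        # Controller: Bayes update of the probability that the node is compromised.
        prior = belief + (1.0 - belief) * p_attack
        like_c = p_alert_c if alert else 1.0 - p_alert_c
        like_h = p_alert_h if alert else 1.0 - p_alert_h
        belief = like_c * prior / (like_c * prior + like_h * (1.0 - prior))
        if belief >= threshold:
            total += cost_recovery       # pay for recovery, node is healthy again
            compromised = False
            belief = 0.0
        elif compromised:
            total += cost_compromised    # pay for running a compromised node
    return total / steps


def best_threshold(candidates):
    """Pick the candidate cutoff with the lowest estimated average cost."""
    return min(candidates, key=average_cost)
```

Calling `best_threshold([i / 10 for i in range(11)])` scans eleven cutoffs. A cutoff of 0 recovers at every step and a cutoff above 1 never recovers, so the interesting candidates lie in between.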

The researchers implement TOLERANCE and evaluate it in an emulation environment in which they run 10 types of network intrusions. The results show that TOLERANCE can improve service availability and reduce operational cost compared with state-of-the-art intrusion-tolerant systems.

Critical Analysis

The paper provides a rigorous mathematical formulation of the intrusion tolerance problem and shows how established operations research techniques can be applied to solve it effectively. The threshold-based control strategies derived by the researchers seem well-suited for practical implementation.

However, the evaluation is limited to an emulation environment, so testing in production deployments would be needed to validate the performance claims. The evaluation also covers a fixed set of 10 network intrusion types. Other classes of threats, such as insider attacks, are not addressed in this summary of the results.

Additionally, the paper does not discuss potential single points of failure in the global controller, or how the system would handle scenarios where the controller itself becomes compromised. Robustness to such failures could be an important area for future research.

Conclusion

This research presents a novel approach to building intrusion-tolerant systems by casting intrusion tolerance as a two-level optimal control problem. The TOLERANCE architecture draws on established formulations from operations research to obtain efficient, threshold-based control strategies for both local recovery and global management of the replication factor.

The results demonstrate the potential for this approach to improve service availability and reduce operational costs compared to existing solutions. While further real-world validation is needed, the paper offers a promising framework for developing more resilient and cost-effective intrusion-tolerant systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Tolerance of Reinforcement Learning Controllers against Deviations in Cyber Physical Systems

Changjian Zhang, Parv Kapoor, Eunsuk Kang, Romulo Meira-Goes, David Garlan, Akila Ganlath, Shatadal Mishra, Nejib Ammar

Cyber-physical systems (CPS) with reinforcement learning (RL)-based controllers are increasingly being deployed in complex physical environments such as autonomous vehicles, the Internet-of-Things (IoT), and smart cities. An important property of a CPS is tolerance; i.e., its ability to function safely under possible disturbances and uncertainties in the actual operation. In this paper, we introduce a new, expressive notion of tolerance that describes how well a controller is capable of satisfying a desired system requirement, specified using Signal Temporal Logic (STL), under possible deviations in the system. Based on this definition, we propose a novel analysis problem, called the tolerance falsification problem, which involves finding small deviations that result in a violation of the given requirement. We present a novel, two-layer simulation-based analysis framework and a novel search heuristic for finding small tolerance violations. To evaluate our approach, we construct a set of benchmark problems where system parameters can be configured to represent different types of uncertainties and disturbances in the system. Our evaluation shows that our falsification approach and heuristic can effectively find small tolerance violations.

6/26/2024

eess.SY cs.AI cs.LO cs.RO cs.SY

Resource Optimization for Tail-Based Control in Wireless Networked Control Systems

Rasika Vijithasena, Rafaela Scaciota, Mehdi Bennis, Sumudu Samarakoon

Achieving control stability is one of the key design challenges of scalable Wireless Networked Control Systems (WNCS) under limited communication and computing resources. This paper explores the use of an alternative control concept defined as tail-based control, which extends the classical Linear Quadratic Regulator (LQR) cost function for multiple dynamic control systems over a shared wireless network. We cast the control of multiple control systems as a network-wide optimization problem and decouple it in terms of sensor scheduling, plant state prediction, and control policies. Toward this, we propose a solution consisting of a scheduling algorithm based on Lyapunov optimization for sensing, a mechanism based on Gaussian Process Regression (GPR) for state prediction and uncertainty estimation, and a control policy based on Reinforcement Learning (RL) to ensure tail-based control stability. A set of discrete time-invariant mountain car control systems is used to evaluate the proposed solution and is compared against four variants that use state-of-the-art scheduling, prediction, and control methods. The experimental results indicate that the proposed method yields 22% reduction in overall cost in terms of communication and control resource utilization compared to state-of-the-art methods.

6/21/2024

eess.SY cs.LG cs.SY


Collaborative Safety-Critical Control for Networked Dynamic Systems

Brooks A. Butler, Philip E. Paré

As modern systems become ever more connected with complex dynamic coupling relationships, the development of safe control methods for such networked systems becomes paramount. In this paper, we define a general networked model with coupled dynamics and local control and discuss the relationship of node-level safety definitions for individual agents with local neighborhood dynamics. We define a node-level barrier function (NBF), node-level control barrier function (NCBF), and collaborative node-level barrier function (cNCBF) and provide conditions under which sets defined by these functions will be forward invariant. We use collaborative node-level barrier functions to construct a novel distributed algorithm for the safe control of collaborating network agents and provide conditions under which the algorithm is guaranteed to converge to a viable set of safe control actions for all agents or a terminally infeasible state for at least one agent. We introduce the notion of non-compliance of network neighbors as a metric of robustness for collaborative safety for a given network state and chosen barrier function hyper-parameters. We illustrate these results on a networked susceptible-infected-susceptible (SIS) model.

5/2/2024

cs.MA cs.SY eess.SY


Online Stackelberg Optimization via Nonlinear Control

William Brown, Christos Papadimitriou, Tim Roughgarden

In repeated interaction problems with adaptive agents, our objective often requires anticipating and optimizing over the space of possible agent responses. We show that many problems of this form can be cast as instances of online (nonlinear) control which satisfy local controllability, with convex losses over a bounded state space which encodes agent behavior, and we introduce a unified algorithmic framework for tractable regret minimization in such cases. When the instance dynamics are known but otherwise arbitrary, we obtain oracle-efficient $O(\sqrt{T})$ regret by reduction to online convex optimization, which can be made computationally efficient if dynamics are locally action-linear. In the presence of adversarial disturbances to the state, we give tight bounds in terms of either the cumulative or per-round disturbance magnitude (for strongly or weakly locally controllable dynamics, respectively). Additionally, we give sublinear regret results for the cases of unknown locally action-linear dynamics as well as for the bandit feedback setting. Finally, we demonstrate applications of our framework to well-studied problems including performative prediction, recommendations for adaptive agents, adaptive pricing of real-valued goods, and repeated gameplay against no-regret learners, directly yielding extensions beyond prior results in each case.

6/28/2024

cs.LG cs.GT