Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs

2207.02295

Published 6/4/2024 by Benjamin Fuhrer, Yuval Shpigelman, Chen Tessler, Shie Mannor, Gal Chechik, Eitan Zahavi, Gal Dalal

cs.NI cs.AI cs.LG

Abstract

As communication protocols evolve, datacenter network utilization increases. As a result, congestion is more frequent, causing higher latency and packet loss. Combined with the increasing complexity of workloads, manual design of congestion control (CC) algorithms becomes extremely difficult. This calls for the development of AI approaches to replace the human effort. Unfortunately, it is currently not possible to deploy AI models on network devices due to their limited computational capabilities. Here, we offer a solution to this problem by building a computationally-light solution based on a recent reinforcement learning CC algorithm [arXiv:2207.02295]. We reduce the inference time of RL-CC by x500 by distilling its complex neural network into decision trees. This transformation enables real-time inference within the $\mu$-sec decision-time requirement, with a negligible effect on quality. We deploy the transformed policy on NVIDIA NICs in a live cluster. Compared to popular CC algorithms used in production, RL-CC is the only method that performs well on all benchmarks tested over a large range of number of flows. It balances multiple metrics simultaneously: bandwidth, latency, and packet drops. These results suggest that data-driven methods for CC are feasible, challenging the prior belief that handcrafted heuristics are necessary to achieve optimal performance.

Overview

Datacenter networks are experiencing increased congestion due to evolving communication protocols and complex workloads.
Manually designing congestion control (CC) algorithms is becoming extremely difficult, calling for AI-based solutions.
However, deploying AI models on network devices is not currently feasible due to their limited computational capabilities.

Plain English Explanation

As datacenter networks become more complex, it's becoming harder for humans to design effective algorithms to manage the network traffic. The amount of data flowing through these networks is increasing, and the types of tasks being performed are getting more complicated. This leads to more frequent network congestion, which causes delays and lost packets.

To address this problem, researchers are exploring the use of AI-based approaches. The idea is to let AI systems figure out the best way to control the network traffic, rather than relying on manually-crafted rules. However, the challenge is that the network devices themselves don't have enough computing power to run these AI models in real-time.

Technical Explanation

The researchers in this paper present a solution to this problem. They took a recent reinforcement learning-based congestion control algorithm and distilled its complex neural network into decision trees. This cut inference time by a factor of 500, which is enough to meet the microsecond-scale decision-time requirement on the network device, with a negligible effect on quality.
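
To make the distillation idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a hypothetical teacher_policy function standing in for the trained neural network, two made-up state features (RTT inflation and normalized sending rate), and a single small regression tree fitted to the teacher's outputs. The tree depth, sample counts, and feature ranges are illustrative choices only.

```python
import math
import random

# Hypothetical teacher: stands in for the trained RL policy network.
# Inputs and formula are assumptions made purely for illustration.
def teacher_policy(rtt_inflation, rate):
    return math.tanh(2.0 * (1.0 - rtt_inflation)) * (1.0 - 0.5 * rate)

def sse(targets):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not targets:
        return 0.0
    m = sum(targets) / len(targets)
    return sum((t - m) ** 2 for t in targets)

def build_tree(samples, depth):
    """Greedy regression tree over (features, target) pairs.
    A leaf is a float; an internal node is (feature, threshold, low, high)."""
    targets = [t for _, t in samples]
    if depth == 0 or len(samples) < 8:
        return sum(targets) / len(targets)  # leaf: mean teacher action
    best = None
    for f in range(len(samples[0][0])):
        values = sorted(x[f] for x, _ in samples)
        step = max(1, len(values) // 16)  # try roughly 16 thresholds per feature
        for thr in values[step::step]:
            left = [t for x, t in samples if x[f] < thr]
            right = [t for x, t in samples if x[f] >= thr]
            if not left or not right:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, thr)
    if best is None:
        return sum(targets) / len(targets)
    _, f, thr = best
    low = [s for s in samples if s[0][f] < thr]
    high = [s for s in samples if s[0][f] >= thr]
    return (f, thr, build_tree(low, depth - 1), build_tree(high, depth - 1))

def tree_predict(tree, x):
    """Inference is a handful of comparisons, one per tree level."""
    while isinstance(tree, tuple):
        f, thr, low, high = tree
        tree = low if x[f] < thr else high
    return tree

rng = random.Random(0)

def sample_states(n):
    # Assumed feature ranges: RTT inflation in [0.5, 2.0], rate in [0, 1].
    return [(rng.uniform(0.5, 2.0), rng.uniform(0.0, 1.0)) for _ in range(n)]

# Distillation: label sampled states with the teacher, then fit the tree.
train = [(s, teacher_policy(*s)) for s in sample_states(1000)]
tree = build_tree(train, depth=6)

# Check how closely the student tree tracks the teacher on unseen states.
held_out = sample_states(500)
mae = sum(abs(tree_predict(tree, s) - teacher_policy(*s)) for s in held_out) / len(held_out)
print(f"mean absolute error vs. teacher: {mae:.4f}")
```

The point of the sketch is the cost profile: evaluating the student needs at most six comparisons per decision, with no multiplications or nonlinearities, which is the kind of workload a constrained device can plausibly handle.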

The researchers then deployed this transformed model on NVIDIA network interface cards (NICs) in a live cluster. They tested it against other popular congestion control algorithms used in production environments. According to the paper, this AI-based approach, called RL-CC, was the only method that performed well on all benchmarks tested over a large range of flow counts, balancing bandwidth, latency, and packet drops simultaneously.

Critical Analysis

The paper presents a promising approach to bringing AI-based congestion control to real-world networks. By distilling a complex neural network into a decision-tree model, the researchers were able to overcome the computational limitations of network devices. However, the paper doesn't address the potential challenges of deploying and maintaining such a system in a production environment.

Additionally, although the paper reports results on several benchmarks over a large range of flow counts, all of them come from a single live cluster. It would be valuable to see how the RL-CC algorithm performs in a wider range of network conditions and workloads, including potential edge cases or adversarial scenarios. Further research could also explore the scalability of the approach as the size and complexity of the network grows.

Conclusion

This research demonstrates that data-driven methods for congestion control, such as reinforcement learning, can outperform traditional, manually-crafted algorithms. By developing techniques to make these AI models lightweight enough to run on network devices, the researchers have taken an important step towards bringing the benefits of machine learning to real-world datacenter networks.

This work challenges the prior belief that optimal network performance can only be achieved through human-designed heuristics. As the complexity of network environments continues to grow, these types of AI-powered solutions may become increasingly crucial for maintaining reliable and efficient data communication.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Closed-form congestion control via deep symbolic regression

Jean Martins, Igor Almeida, Ricardo Souza, Silvia Lins

As mobile networks embrace the 5G era, the interest in adopting Reinforcement Learning (RL) algorithms to handle challenges in ultra-low-latency and high throughput scenarios increases. Simultaneously, the advent of packetized fronthaul networks imposes demanding requirements that traditional congestion control mechanisms cannot accomplish, highlighting the potential of RL-based congestion control algorithms. Although learning RL policies optimized for satisfying the stringent fronthaul requirements is feasible, the adoption of neural network models in real deployments still poses some challenges regarding real-time inference and interpretability. This paper proposes a methodology to deal with such challenges while maintaining the performance and generalization capabilities provided by a baseline RL policy. The method consists of (1) training a congestion control policy specialized in fronthaul-like networks via reinforcement learning, (2) collecting state-action experiences from the baseline, and (3) performing deep symbolic regression on the collected dataset. The proposed process overcomes the challenges related to inference-time limitations through closed-form expressions that approximate the baseline performance (link utilization, delay, and fairness) and which can be directly implemented in any programming language. Finally, we analyze the inner workings of the closed-form expressions.

5/3/2024

cs.NI cs.LG

Optimal Flow Admission Control in Edge Computing via Safe Reinforcement Learning

A. Fox, F. De Pellegrini, F. Faticanti, E. Altman, F. Bronzino

With the uptake of intelligent data-driven applications, edge computing infrastructures necessitate a new generation of admission control algorithms to maximize system performance under limited and highly heterogeneous resources. In this paper, we study how to optimally select information flows which belong to different classes and dispatch them to multiple edge servers where deployed applications perform flow analytic tasks. The optimal policy is obtained via constrained Markov decision process (CMDP) theory accounting for the demand of each edge application for specific classes of flows, the constraints on computing capacity of edge servers and of the access network. We develop DR-CPO, a specialized primal-dual Safe Reinforcement Learning (SRL) method which solves the resulting optimal admission control problem by reward decomposition. DR-CPO operates optimal decentralized control and mitigates effectively state-space explosion while preserving optimality. Compared to existing Deep Reinforcement Learning (DRL) solutions, extensive results show that DR-CPO achieves 15% higher reward on a wide variety of environments, while requiring on average only 50% of the amount of learning episodes to converge. Finally, we show how to match DR-CPO and load-balancing to dispatch optimally information streams to available edge servers and further improve system performance.

7/1/2024

cs.NI

Queue-aware Network Control Algorithm with a High Quantum Computing Readiness-Evaluated in Discrete-time Flow Simulator for Fat-Pipe Networks

Arthur Witt

The emerging technology of quantum computing has the potential to change how problems will be solved in the future. This work presents a centralized network control algorithm executable on existing quantum computers that are based on the principle of quantum annealing, like the D-Wave Advantage. We introduce a resource reoccupation algorithm for traffic engineering in wide-area networks. The proposed optimization algorithm changes traffic steering and resource allocation in case of overloaded transceivers. Settings of active components like fiber amplifiers and transceivers are not changed, for reasons of stability. This algorithm is beneficial in situations when the network traffic fluctuates on time scales of seconds or spontaneous bursts occur. Further, we developed a discrete-time flow simulator to study the algorithm's performance in wide-area networks. Our network simulator considers backlog and loss modeling of buffered transmission lines. Competing flows are handled equally in case of a backlog. This work provides an ILP-based network configuration algorithm that is applicable to quantum annealing computers. We show that traffic losses can be reduced significantly, by a factor of 2, if a resource reoccupation algorithm is applied in a network with bursty traffic. As resources are used more efficiently by reoccupation in heavy load situations, overprovisioning of networks can be reduced. Thus, this new form of network operation leads toward a zero-margin network. We show that our newly introduced network simulator enables analyses of short-time effects like buffering within fat-pipe networks. As the calculation of network configurations in real-sized networks is typically time-consuming, quantum computing can enable the proposed network configuration algorithm for application in real-sized wide-area networks.

5/21/2024

eess.SY cs.ET cs.SY

FNCC: Fast Notification Congestion Control in Data Center Networks

Jing Xu, Zhan Wang, Fan Yang, Ning Kang, Zhenlong Ma, Guojun Yuan, Guangming Tan, Ninghui Sun

Congestion control plays a pivotal role in large-scale data centers, facilitating ultra-low latency, high bandwidth, and optimal utilization. Even with the deployment of data center congestion control mechanisms such as DCQCN and HPCC, these algorithms often respond to congestion sluggishly. This sluggishness is primarily due to the slow notification of congestion. It takes almost one round-trip time (RTT) for the congestion information to reach the sender. In this paper, we introduce the Fast Notification Congestion Control (FNCC) mechanism, which achieves sub-RTT notification. FNCC leverages the acknowledgment packet (ACK) from the return path to carry in-network telemetry (INT) information of the request path, offering the sender more timely and accurate INT. To further accelerate the responsiveness of last-hop congestion control, we propose that the receiver notifies the sender of the number of concurrent congested flows, which can be used to adjust the congested flows to a fair rate quickly. Our experimental results demonstrate that FNCC reduces flow completion time by 27.4% and 88.9% compared to HPCC and DCQCN, respectively. Moreover, FNCC triggers minimal pause frames and maintains high utilization even at 400Gbps.

5/28/2024

cs.NI