Tomur: Traffic-Aware Performance Prediction of On-NIC Network Functions with Multi-Resource Contention

Read original: arXiv:2405.05529 - Published 6/3/2024 by Shaofeng Wu, Qiang Su, Zhixiong Niu, Hong Xu

Overview

Modern data centers use network function (NF) offloading on SmartNICs to save host resources and improve programmability.
Co-running NFs on the same SmartNICs can cause performance interference due to onboard resource contention.
Operators need mechanisms to predict NF performance under such contention to meet performance SLAs and manage resources efficiently.
Existing solutions lack SmartNIC-specific knowledge and traffic awareness, leading to poor accuracy for on-NIC NFs.

Plain English Explanation

The paper describes a new system called Tomur that can predict the performance of network functions (NFs) running on specialized network cards called SmartNICs. SmartNICs are used in modern data centers to offload certain network processing tasks from the main servers, saving resources on those servers and making the network more programmable.

However, when multiple NFs are run on the same SmartNIC, they can interfere with each other's performance due to competition for the limited resources on the SmartNIC, such as accelerators and memory. This can make it difficult for the data center operators to ensure that the network is meeting the required performance targets (called Service Level Agreements or SLAs).

Existing solutions for predicting NF performance in this situation have two limitations: they lack knowledge of how SmartNICs work internally, and they do not account for how network traffic patterns affect performance. Both gaps lead to inaccurate predictions.

The Tomur system proposed in this paper tries to address these limitations. It is designed with a deep understanding of SmartNIC internals and how different types of network traffic can impact the various resources on the SmartNIC. This allows Tomur to make more accurate predictions of NF performance, even as the network traffic changes. The paper's evaluation shows that Tomur significantly improves prediction accuracy and reduces SLA violations compared to existing approaches.

Technical Explanation

The key observation driving Tomur's design is that co-located NFs on a SmartNIC contend for multiple onboard resources, including accelerators and the memory subsystem. Tomur leverages this insight to build a performance prediction system tailored for on-NIC NFs.

Tomur consists of two main components: a performance model and a traffic analyzer. The performance model uses machine learning to capture the complex relationships between NF co-location, resource utilization, and performance. The traffic analyzer facilitates traffic awareness by monitoring the behaviors of individual resources and updating the performance model accordingly as external traffic attributes change.
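As a rough illustration of the idea of composing per-resource contention models, the Python sketch below predicts an NF's throughput from the pressure its neighbors put on each shared resource. Everything in it is an assumption made for illustration and is not taken from the paper: the `ResourceModel` class, the linear degradation, the `scale_for_traffic` helper standing in for traffic awareness, the rule that the most constrained resource bounds throughput, and all numbers.

```python
from dataclasses import dataclass


@dataclass
class ResourceModel:
    """Hypothetical per-resource model (not Tomur's actual model)."""
    solo_mpps: float    # throughput when the NF runs alone, in Mpps
    sensitivity: float  # fraction of throughput lost per unit of competitor pressure

    def predict(self, pressure: float) -> float:
        # Assumed linear degradation, clamped at zero.
        return max(0.0, self.solo_mpps * (1.0 - self.sensitivity * pressure))


def scale_for_traffic(model: ResourceModel, factor: float) -> ResourceModel:
    """Stand-in for traffic awareness: rescale sensitivity when traffic attributes change."""
    return ResourceModel(model.solo_mpps, model.sensitivity * factor)


def predict_throughput(models: dict, pressure: dict) -> float:
    """Assume the most constrained resource bounds end-to-end throughput."""
    return min(m.predict(pressure.get(name, 0.0)) for name, m in models.items())


# Illustrative values only.
models = {
    "memory": ResourceModel(solo_mpps=10.0, sensitivity=0.5),
    "accelerator": ResourceModel(solo_mpps=8.0, sensitivity=0.25),
}
pressure = {"memory": 0.6, "accelerator": 0.2}
print(predict_throughput(models, pressure))
```

Keeping one model per resource means a change in traffic only has to touch the resources it affects; in this sketch, for example, only the `memory` entry would be rescaled if the traffic became more memory-intensive.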

To evaluate Tomur, the researchers used a BlueField-2 SmartNIC and various network functions. Compared to state-of-the-art approaches, Tomur improved prediction accuracy by 78.8% and reduced SLA violations by 92.2%. This enables new practical use cases, such as dynamic resource allocation and proactive traffic management, that were previously difficult to achieve.
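This summary does not define how the two percentages are computed. One plausible reading, assumed here, is that both are relative reductions: in prediction error for the first figure and in the number of SLA violations for the second. The Python sketch below shows how such metrics could be calculated. The function names and all input values are hypothetical and are not measurements from the paper.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


def sla_violations(actual, predicted, sla):
    """Count placements admitted on the prediction that miss the SLA in practice."""
    return sum(1 for a, p in zip(actual, predicted) if p >= sla and a < sla)


# Made-up throughputs (Mpps) for three co-location scenarios.
actual = [8.5, 7.5, 7.0]
baseline_pred = [9.0, 7.0, 10.0]
better_pred = [8.4, 7.6, 7.2]

print(mape(actual, baseline_pred), mape(actual, better_pred))
print(sla_violations(actual, baseline_pred, sla=8.0))
print(sla_violations(actual, better_pred, sla=8.0))
```

Under this reading, an operator admits a co-location only when the predicted throughput meets the SLA, so an over-optimistic prediction turns directly into a violation.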

Critical Analysis

The paper provides a thorough evaluation of Tomur's performance, including comparisons to existing solutions. However, the authors acknowledge that Tomur's accuracy may degrade if the network functions or hardware change significantly from the tested configurations. Additionally, the paper does not explore the computational overhead of running Tomur's traffic analysis and performance modeling components, which could be a concern for real-world deployment.

Further research could investigate techniques to adapt Tomur's models dynamically as the environment changes, or ways to optimize the computational efficiency of the system. Exploring the integration of Tomur with emerging time-sensitive networking technologies could also be a fruitful area of study.

Conclusion

The Tomur system proposed in this paper addresses a significant challenge in modern data centers: predicting the performance of network functions that share SmartNIC resources. By incorporating a deep understanding of SmartNIC internals and traffic dynamics, Tomur achieves much higher accuracy than existing solutions. This enables new practical use cases, such as dynamic resource allocation and proactive traffic management, that can help data center operators better meet their performance objectives. While the paper identifies some potential limitations, Tomur represents an important step forward in the efficient and reliable management of network functions in virtualized data center environments.

This summary was produced with help from an AI and may contain inaccuracies; consult the original source documents to verify details.

Related Papers

Tomur: Traffic-Aware Performance Prediction of On-NIC Network Functions with Multi-Resource Contention

Shaofeng Wu, Qiang Su, Zhixiong Niu, Hong Xu

Network function (NF) offloading on SmartNICs has been widely used in modern data centers, offering benefits in host resource saving and programmability. Co-running NFs on the same SmartNICs can cause performance interference due to onboard resource contention. Therefore, to meet performance SLAs while ensuring efficient resource management, operators need mechanisms to predict NF performance under such contention. However, existing solutions lack SmartNIC-specific knowledge and exhibit limited traffic awareness, leading to poor accuracy for on-NIC NFs. This paper proposes Tomur, a novel performance predictive system for on-NIC NFs. Tomur builds upon the key observation that co-located NFs contend for multiple resources, including onboard accelerators and the memory subsystem. It also facilitates traffic awareness according to the behaviors of individual resources to maintain accuracy as the external traffic attributes vary. Evaluation using BlueField-2 SmartNIC shows that Tomur improves the prediction accuracy by 78.8% and reduces SLA violations by 92.2% compared to state-of-the-art approaches, and enables new practical use cases.

6/3/2024

Advancements in Traffic Processing Using Programmable Hardware Flow Offload

Luca Deri, Alfredo Cardigliano, Francesco Fusco

The exponential growth of data traffic and the increasing complexity of networked applications demand effective solutions capable of passively inspecting and analysing the network traffic for monitoring and security purposes. Implementing network probes in software using general-purpose operating systems has been made possible by advances in packet-capture technologies, such as kernel-bypass frameworks, and by multi-queue adapters designed to distribute the network workload in multi-core processors. Modern SmartNICs, in addition, have introduced stateful mechanisms to associate actions to network flows such as forwarding packets or updating traffic statistics for an individual flow. In this paper, we describe our experience in exploiting those functionalities in a modern network probe and we perform a detailed study of the performance characteristics under different scenarios. Compared to pure CPU-based solutions, SmartNICs with flow-offload technologies provide substantial benefits when implementing forwarding applications. However, the main limitation of having to keep large flow tables in the host memory remains largely unsolved for realistic monitoring and security applications.

7/24/2024

RACH Traffic Prediction in Massive Machine Type Communications

Hossein Mehri, Hao Chen, Hani Mehrpouyan

Traffic pattern prediction has emerged as a promising approach for efficiently managing and mitigating the impacts of event-driven bursty traffic in massive machine-type communication (mMTC) networks. However, achieving accurate predictions of bursty traffic remains a non-trivial task due to the inherent randomness of events, and these challenges intensify within live network environments. Consequently, there is a compelling imperative to design a lightweight and agile framework capable of assimilating continuously collected data from the network and accurately forecasting bursty traffic in mMTC networks. This paper addresses these challenges by presenting a machine learning-based framework tailored for forecasting bursty traffic in multi-channel slotted ALOHA networks. The proposed machine learning network comprises long short-term memory (LSTM) and a DenseNet with feed-forward neural network (FFNN) layers, where the residual connections enhance the training ability of the machine learning network in capturing complicated patterns. Furthermore, we develop a new low-complexity online prediction algorithm that updates the states of the LSTM network by leveraging frequently collected data from the mMTC network. Simulation results and complexity analysis demonstrate the superiority of our proposed algorithm in terms of both accuracy and complexity, making it well-suited for time-critical live scenarios. We evaluate the performance of the proposed framework in a network with a single base station and thousands of devices organized into groups with distinct traffic-generating characteristics. Comprehensive evaluations and simulations indicate that our proposed machine learning approach achieves a remarkable 52% higher accuracy in long-term predictions compared to traditional methods, without imposing additional processing load on the system.

5/9/2024

Network Function Capacity Reconnaissance by Remote Adversaries

Aqsa Kashaf, Aidan Walsh, Maria Apostolaki, Vyas Sekar, Yuvraj Agarwal

There is anecdotal evidence that attackers use reconnaissance to learn the capacity of their victims before DDoS attacks to maximize their impact. The first step to mitigate capacity reconnaissance attacks is to understand their feasibility. However, the feasibility of capacity reconnaissance in network functions (NFs) (e.g., firewalls, NATs) is unknown. To this end, we formulate the problem of network function capacity reconnaissance (NFCR) and explore the feasibility of inferring the processing capacity of an NF while avoiding detection. We identify key factors that make NFCR challenging and analyze how these factors affect accuracy (measured as a divergence from ground truth) and stealthiness (measured in packets sent). We propose a flexible tool, NFTY, that performs NFCR and we evaluate two practical NFTY configurations to showcase the stealthiness vs. accuracy tradeoffs. We evaluate these strategies in controlled, Internet and/or cloud settings with commercial NFs. NFTY can accurately estimate the capacity of different NF deployments within 10% error in the controlled experiments and the Internet, and within 7% error for a commercial NF deployed in the cloud (AWS). Moreover, NFTY outperforms link-bandwidth estimation baselines by up to 30x.

5/16/2024