Biased Over-the-Air Federated Learning under Wireless Heterogeneity

2403.19849

Published 4/1/2024 by Muhammad Faraz Ul Abrar, Nicolò Michelusi

Abstract

Recently, Over-the-Air (OTA) computation has emerged as a promising federated learning (FL) paradigm that leverages the waveform superposition properties of the wireless channel to realize fast model updates. Prior work focused on the OTA device pre-scaler design under homogeneous wireless conditions, in which devices experience the same average path loss, resulting in zero-bias solutions. Yet, zero-bias designs are limited by the device with the worst average path loss and hence may perform poorly in heterogeneous wireless settings. In this scenario, there may be a benefit in designing biased solutions, in exchange for a lower variance in the model updates. To optimize this trade-off, we study the design of OTA device pre-scalers by focusing on the OTA-FL convergence. We derive an upper bound on the model optimality error, which explicitly captures the effect of bias and variance in terms of the choice of the pre-scalers. Based on this bound, we identify two solutions of interest: minimum noise variance, and minimum noise variance zero-bias solutions. Numerical evaluations show that using OTA device pre-scalers that minimize the variance of FL updates, while allowing a small bias, can provide high gains over existing schemes.

Introduction

The introduction discusses the challenges of implementing federated learning (FL) in real-world Internet-of-Things (IoT) systems and the solutions proposed for them. Key points:

FL is a distributed learning approach that allows IoT devices to collaboratively train a global model while keeping their private data local. This reduces communication overhead compared to classical machine learning approaches.
A standard FL setup involves N devices collaborating with a central parameter server to learn a global model parameter w* that minimizes a global objective function F(w).
To realize practical FL solutions, several issues need to be addressed, particularly the need for communication-efficient FL schemes over wireless fading channels.
Over-the-air (OTA) computation has emerged as a promising solution, leveraging the superposition property of wireless multiple access channels to aggregate device updates efficiently.
Prior OTA-FL works assume homogeneous wireless conditions, but in practice, devices may experience heterogeneous path losses, leading to biased updates and convergence issues.
This paper analyzes the convergence of wireless heterogeneous OTA-FL and proposes two device pre-scaler designs based on minimizing noise variance, which outperform existing approaches under heterogeneous conditions.

System Model and over-the-air FL

The paper considers a wireless network of N distributed devices that coordinate with a base station acting as the parameter server (PS) to learn a global model parameter. Each device has a private dataset and a local objective function. The goal is to solve an optimization problem by performing gradient descent updates over multiple federated learning (FL) rounds.

In each FL round, the PS broadcasts the current model parameter to the devices. Each device then computes the local gradient on its full dataset and sends it to the PS. Ideally, the PS would aggregate these local gradients to compute the global gradient without any errors. However, in practice, the PS instead computes a noisy estimate of the global gradient due to imperfect wireless channel communication. The paper then discusses the construction of this noisy global gradient estimate at the PS.
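
The round structure described above can be sketched in a few lines. This is a minimal illustration, not the paper's setup: the quadratic local objectives, the scalar model, the step size, and the additive Gaussian noise standing in for the wireless channel are all assumptions made here to keep the example short.

```python
import random

def local_gradient(w, target):
    # Gradient of a hypothetical local objective f_i(w) = 0.5 * (w - target)^2.
    return w - target

def run_fl(targets, rounds=200, step=0.1, noise_std=0.0, seed=0):
    """Gradient descent over FL rounds with a (possibly noisy) aggregate gradient."""
    rng = random.Random(seed)
    w = 0.0  # model parameter broadcast by the PS in the first round
    for _ in range(rounds):
        # Each device computes the gradient of its local objective at w.
        grads = [local_gradient(w, t) for t in targets]
        # Ideal, error-free aggregate: the average of the local gradients.
        global_grad = sum(grads) / len(grads)
        # Assumed channel effect: the PS only sees a noisy estimate.
        noisy_grad = global_grad + rng.gauss(0.0, noise_std)
        w -= step * noisy_grad
    return w

targets = [1.0, 2.0, 6.0]                 # hypothetical local minimizers
w_ideal = run_fl(targets)                 # error-free aggregation
w_noisy = run_fl(targets, noise_std=0.5)  # noisy aggregation
```

With error-free aggregation the iterate settles at the average of the local minimizers; with noise it keeps fluctuating around that point, which is the effect the convergence analysis has to account for.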

Figure 1: Illustration of OTA-FL system model

The paper discusses the over-the-air transmission of local gradients in a federated learning (FL) system over fading wireless channels. The key points are:

The wireless channel between the devices and the parameter server (PS) is modeled as a Rayleigh flat fading channel, where the average path loss may differ across devices.
Over-the-air (OTA) computation is used to transmit the local gradients, where each device pre-scales its signal based on a truncated channel inversion. This allows "one-shot" local gradient aggregation at the PS.
Due to the fading channel, the PS's estimate of the global gradient is a biased convex combination of the local gradients, where the participation level of each device depends on its channel conditions.
This biased OTA-FL update rule minimizes a different objective function than the ideal FL objective, leading to model bias. The paper aims to characterize this bias and its impact on convergence.
The existing OTA-FL schemes assume uniform device participation, which may not hold in heterogeneous wireless settings and can lead to objective inconsistency and model bias.
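
The link between channel conditions and participation level can be illustrated with a small simulation. It is a sketch under stated assumptions rather than the paper's exact signal model: each device is assumed to transmit only when its instantaneous channel gain exceeds a threshold, the gain is drawn from an exponential distribution with mean equal to the average path gain (the distribution of the squared magnitude under Rayleigh fading), and the names `gammas`, `thresholds`, and `path_gains` are hypothetical.

```python
import math
import random

def participation_weights(gammas, thresholds, path_gains):
    """Analytic average weights under truncated channel inversion.

    Device i contributes gamma_i whenever its gain exceeds threshold_i,
    which happens with probability exp(-threshold_i / path_gain_i).
    """
    contrib = [g * math.exp(-t / lam)
               for g, t, lam in zip(gammas, thresholds, path_gains)]
    total = sum(contrib)
    return [c / total for c in contrib]

def simulate_weights(gammas, thresholds, path_gains, rounds=20000, seed=1):
    """Monte Carlo estimate of the same weights."""
    rng = random.Random(seed)
    sums = [0.0] * len(gammas)
    for _ in range(rounds):
        for i, (g, t, lam) in enumerate(zip(gammas, thresholds, path_gains)):
            gain = rng.expovariate(1.0 / lam)  # draw of the squared channel magnitude
            if gain >= t:
                # Channel inversion cancels the fading coefficient,
                # so the PS receives the gradient scaled by gamma_i.
                sums[i] += g
    total = sum(sums)
    return [s / total for s in sums]

path_gains = [1.0, 0.1]   # hypothetical: a strong and a weak device
gammas = [1.0, 1.0]       # identical pre-scalers
thresholds = [0.2, 0.2]   # identical truncation thresholds
analytic = participation_weights(gammas, thresholds, path_gains)
empirical = simulate_weights(gammas, thresholds, path_gains)
```

Even with identical pre-scalers, the weak device is silent far more often, so the aggregate leans toward the strong device's gradient. That unequal weighting is the bias the bullets above refer to.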

Convergence Analysis and pre-scaler design

The main points of this section are:

The convergence of a biased over-the-air (OTA) federated learning (FL) system is analyzed in terms of the choice of OTA device pre-scalers. The model "optimality error" is used as the performance metric.
Three key assumptions are made: the local objective functions are Lipschitz smooth and strongly convex, the gradient norm at the global minimum is bounded, and the local gradient norms are uniformly bounded.
An upper bound on the optimality error is derived, which shows it is influenced by four key terms: initialization error, model bias, transmission variance, and noise variance.
The problem of designing the OTA device pre-scalers is formulated as a non-convex optimization problem to minimize the upper bound on the optimality error.
Two solutions are provided: a minimum noise variance solution and a zero-bias solution that also minimizes the noise variance. These solutions involve designing the pre-scalers to achieve the desired tradeoffs.
The solutions can be used to initialize an iterative algorithm to solve the overall pre-scaler optimization problem, which is left for future work.
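
The trade-off between the two solutions can be made concrete with a toy model. The sketch below does not reproduce the paper's expressions. It assumes that the truncation threshold grows with the square of the pre-scaler through a hypothetical power-constraint constant `c`, so that the average contribution of a device with average path gain `lam` is `gamma * exp(-c * gamma**2 / lam)`, and that the noise variance of the normalized update scales inversely with the squared sum of these contributions.

```python
import math

def avg_contribution(gamma, lam, c):
    # Assumed average contribution of one device (see lead-in).
    return gamma * math.exp(-c * gamma * gamma / lam)

def min_variance_prescalers(path_gains, c):
    # Each device maximizes its own contribution; setting the derivative
    # to zero gives gamma^2 = lam / (2c).
    return [math.sqrt(lam / (2.0 * c)) for lam in path_gains]

def zero_bias_prescalers(path_gains, c):
    best = min_variance_prescalers(path_gains, c)
    # The largest contribution that every device can match is the one
    # achieved by the weakest device at its own optimum.
    target = min(avg_contribution(g, lam, c) for g, lam in zip(best, path_gains))
    gammas = []
    for g_max, lam in zip(best, path_gains):
        lo, hi = 0.0, g_max  # the contribution is increasing on [0, g_max]
        for _ in range(100):  # bisection for contribution == target
            mid = 0.5 * (lo + hi)
            if avg_contribution(mid, lam, c) < target:
                lo = mid
            else:
                hi = mid
        gammas.append(hi)
    return gammas

def weights_and_noise(gammas, path_gains, c, noise_var=1.0):
    contrib = [avg_contribution(g, lam, c) for g, lam in zip(gammas, path_gains)]
    total = sum(contrib)
    return [x / total for x in contrib], noise_var / total ** 2

path_gains = [1.0, 0.25, 0.01]  # hypothetical heterogeneous path gains
c = 1.0                         # hypothetical power-constraint constant
w_mv, noise_mv = weights_and_noise(min_variance_prescalers(path_gains, c), path_gains, c)
w_zb, noise_zb = weights_and_noise(zero_bias_prescalers(path_gains, c), path_gains, c)
```

Under these assumptions the zero-bias design yields equal weights for all devices but is pinned to what the weakest device can deliver, so its noise term is much larger. The minimum-variance design accepts unequal weights in exchange for a smaller noise term.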

Numerical Results

This section describes numerical experiments that evaluate the proposed federated learning (FL) schemes. The experiments focus on handwritten digit classification using the MNIST dataset, which has 10 classes representing the digits 0 to 9. The authors perform softmax regression on a single-layer neural network, with each image having 28x28 pixels.

The FL setup consists of N=10 devices uniformly deployed within a radius of 200 meters from the parameter server (PS) at the center. The devices share a bandwidth of 1 MHz and communicate over a carrier frequency of 2.4 GHz with a transmission power of 20 dBm. The noise power spectral density at the PS is -174 dBm/Hz. The path loss between the devices and the PS follows a log-distance model with a path loss exponent of 2.2 and a 40 dB loss at the reference distance of 1 meter.
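
The link budget implied by these parameters can be worked out directly. The sketch below uses the stated values (1 MHz bandwidth, 20 dBm transmit power, -174 dBm/Hz noise density, path loss exponent 2.2, 40 dB loss at 1 meter). Two things are assumptions made here: that "uniformly deployed" means uniform over the area of the disc, and the helper names themselves.

```python
import math
import random

def path_loss_db(distance_m, exponent=2.2, ref_loss_db=40.0, ref_distance_m=1.0):
    # Log-distance model: loss at the reference distance plus 10*n*log10(d/d0).
    return ref_loss_db + 10.0 * exponent * math.log10(distance_m / ref_distance_m)

def noise_power_dbm(bandwidth_hz=1e6, psd_dbm_per_hz=-174.0):
    # Noise power = power spectral density integrated over the bandwidth.
    return psd_dbm_per_hz + 10.0 * math.log10(bandwidth_hz)

def average_snr_db(distance_m, tx_power_dbm=20.0):
    return tx_power_dbm - path_loss_db(distance_m) - noise_power_dbm()

def sample_distances(n=10, radius_m=200.0, seed=0):
    # Assumed deployment: uniform over the disc area, so radius = R * sqrt(U).
    rng = random.Random(seed)
    return [radius_m * math.sqrt(rng.random()) for _ in range(n)]

edge_loss = path_loss_db(200.0)   # about 90.6 dB at the cell edge
near_loss = path_loss_db(10.0)    # 62 dB at 10 meters
noise = noise_power_dbm()         # -114 dBm over 1 MHz
distances = sample_distances()
```

A device at 10 meters and one at 200 meters differ by roughly 29 dB in average path loss, which gives a sense of how heterogeneous the wireless conditions are in this setup.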

The optimization parameter w is a vector of size 7,850, where each sub-parameter w(ℓ) is associated with a class ℓ. The authors use a regularized cross-entropy loss function at each device.
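
The parameter count and the loss can be sketched as follows. The summary does not give the loss equation, so several choices here are assumptions: the extra entry per class is taken to be a bias term (7,850 = 10 x 785 = 10 x (784 + 1)), the regularizer is taken to be an L2 penalty, and the weight `reg=0.01` is a hypothetical value.

```python
import math

NUM_CLASSES = 10
NUM_PIXELS = 28 * 28
FEATURES = NUM_PIXELS + 1  # assumed: 784 pixels plus one bias entry per class

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def regularized_cross_entropy(w, x, label, reg=0.01):
    """Cross-entropy of a softmax classifier plus an assumed L2 penalty.

    w: list of NUM_CLASSES weight vectors, each of length FEATURES
    x: feature vector of length FEATURES (pixels followed by a constant 1)
    """
    scores = [sum(wi * xi for wi, xi in zip(w_l, x)) for w_l in w]
    probs = softmax(scores)
    l2 = sum(wi * wi for w_l in w for wi in w_l)
    return -math.log(probs[label]) + 0.5 * reg * l2

total_params = NUM_CLASSES * FEATURES
w0 = [[0.0] * FEATURES for _ in range(NUM_CLASSES)]  # all-zero model
x0 = [0.5] * NUM_PIXELS + [1.0]                      # dummy image plus bias input
loss0 = regularized_cross_entropy(w0, x0, label=3)
```

With an all-zero model every class gets probability 1/10, so the loss equals log(10) regardless of the input, which is a convenient sanity check for an implementation.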

(a) Global objective function F(w) over training time (ms), N = 10 devices.

The experiments compare over-the-air federated learning (OTA-FL) schemes under a non-i.i.d. data distribution. The authors evaluate their two proposed schemes, Minimum Variance and Zero-Bias, against existing OTA-FL methods: Vanilla OTA, BB-FL Interior, and BB-FL Alternating.

The key findings are:

The Minimum Variance scheme achieves the best performance in terms of global loss by assigning pre-scalers to devices based on their average path loss, allowing non-zero bias but reducing noise variance.
The Zero-Bias scheme ensures uniform average device participation and achieves the best final accuracy, though it has a slower global loss decay compared to Minimum Variance.
The Vanilla OTA scheme performs well but is limited by the high noise variance from forcing zero instantaneous bias.
Among the BB-FL schemes, the Alternating policy outperforms the Interior policy, as the latter restricts participation of devices with certain classes, hindering model generalization.
The proposed schemes outperform the existing OTA-FL methods, with the Zero-Bias scheme achieving around 2x and 4x faster convergence to the same accuracy, compared to Vanilla OTA and the BB-FL schemes, respectively.

Conclusion

This paper studies the performance of an over-the-air federated learning (OTA-FL) system when devices have heterogeneous wireless conditions. The key findings are:

The optimality error can be decomposed into bias and variance terms in the presence of wireless heterogeneity, unlike existing works that force zero-bias FL updates.
An upper bound on the optimality error is derived, which allows for the design of device pre-scalers to minimize the model noise variance and achieve superior performance compared to existing schemes, with negligible bias.
The analysis shows that minimizing the model noise variance results in better performance in a heterogeneous wireless environment.
The paper provides a proof sketch for the derived upper bound on the optimality error.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Mixed-Precision Over-The-Air Federated Learning via Approximated Computing

Jinsheng Yuan, Zhuangkun Wei, Weisi Guo

Over-the-Air Federated Learning (OTA-FL) has been extensively investigated as a privacy-preserving distributed learning mechanism. Realistic systems will see FL clients with diverse size, weight, and power configurations. A critical research gap in existing OTA-FL research is the assumption of homogeneous client computational bit precision. Indeed, many clients may exploit approximate computing (AxC) where bit precisions are adjusted for energy and computational efficiency. The dynamic distribution of bit precision updates amongst FL clients poses an open challenge for OTA-FL, as it is incompatible in the wireless modulation superposition space. Here, we propose an AxC-based OTA-FL framework of clients with multiple precisions, demonstrating the following innovations: (i) optimize the quantization-performance trade-off for both server and clients within the constraints of varying edge computing capabilities and learning accuracy requirements, and (ii) develop heterogeneous gradient resolution OTA-FL modulation schemes to ensure compatibility with physical layer OTA aggregation. Our findings indicate that we can design modulation schemes that enable AxC based OTA-FL, which can achieve 50% faster and smoother server convergence and a performance enhancement for the lowest precision clients compared to a homogeneous precision approach. This demonstrates the great potential of our AxC-based OTA-FL approach in heterogeneous edge computing environments.

6/6/2024

cs.LG cs.AI

Digital Over-the-Air Federated Learning in Multi-Antenna Systems

Sihua Wang, Mingzhe Chen, Cong Shen, Changchuan Yin, Christopher G. Brinton

In this paper, the performance optimization of federated learning (FL), when deployed over a realistic wireless multiple-input multiple-output (MIMO) communication system with digital modulation and over-the-air computation (AirComp) is studied. In particular, a MIMO system is considered in which edge devices transmit their local FL models (trained using their locally collected data) to a parameter server (PS) using beamforming to maximize the number of devices scheduled for transmission. The PS, acting as a central controller, generates a global FL model using the received local FL models and broadcasts it back to all devices. Due to the limited bandwidth in a wireless network, AirComp is adopted to enable efficient wireless data aggregation. However, fading of wireless channels can produce aggregate distortions in an AirComp-based FL scheme. To tackle this challenge, we propose a modified federated averaging (FedAvg) algorithm that combines digital modulation with AirComp to mitigate wireless fading while ensuring the communication efficiency. This is achieved by a joint transmit and receive beamforming design, which is formulated as an optimization problem to dynamically adjust the beamforming matrices based on current FL model parameters so as to minimize the transmitting error and ensure the FL performance. To achieve this goal, we first analytically characterize how the beamforming matrices affect the performance of the FedAvg in different iterations. Based on this relationship, an artificial neural network (ANN) is used to estimate the local FL models of all devices and adjust the beamforming matrices at the PS for future model transmission. The algorithmic advantages and improved performance of the proposed methodologies are demonstrated through extensive numerical experiments.

4/26/2024

cs.IT cs.AI cs.LG

Blind Federated Learning via Over-the-Air q-QAM

Saeed Razavikia, José Mairton Barros Da Silva Júnior, Carlo Fischione

In this work, we investigate federated edge learning over a fading multiple access channel. To alleviate the communication burden between the edge devices and the access point, we introduce a pioneering digital over-the-air computation strategy employing q-ary quadrature amplitude modulation, culminating in a low latency communication scheme. Indeed, we propose a new federated edge learning framework in which edge devices use digital modulation for over-the-air uplink transmission to the edge server while they have no access to the channel state information. Furthermore, we incorporate multiple antennas at the edge server to overcome the fading inherent in wireless communication. We analyze the number of antennas required to mitigate the fading impact effectively. We prove a non-asymptotic upper bound for the mean squared error for the proposed federated learning with digital over-the-air uplink transmissions under both noisy and fading conditions. Leveraging the derived upper bound, we characterize the convergence rate of the learning process of a non-convex loss function in terms of the mean square error of gradients due to the fading channel. Furthermore, we substantiate the theoretical assurances through numerical experiments concerning mean square error and the convergence efficacy of the digital federated edge learning framework. Notably, the results demonstrate that augmenting the number of antennas at the edge server and adopting higher-order modulations improve the model accuracy up to 60%.

4/22/2024

eess.SP cs.LG

Adaptive Decentralized Federated Learning in Energy and Latency Constrained Wireless Networks

Zhigang Yan, Dong Li

In Federated Learning (FL), with parameters aggregated by a central node, the communication overhead is a substantial concern. To circumvent this limitation and alleviate the single point of failure within the FL framework, recent studies have introduced Decentralized Federated Learning (DFL) as a viable alternative. Considering the device heterogeneity and the energy cost associated with parameter aggregation, this paper investigates how to efficiently leverage the limited resources available to enhance the model performance. Specifically, we formulate a problem that minimizes the loss function of DFL while considering energy and latency constraints. The proposed solution involves optimizing the number of local training rounds across diverse devices with varying resource budgets. To make this problem tractable, we first analyze the convergence of DFL with edge devices with different rounds of local training. The derived convergence bound reveals the impact of the rounds of local training on the model performance. Then, based on the derived bound, the closed-form solutions of rounds of local training in different devices are obtained. Meanwhile, since the solutions require the energy cost of aggregation to be as low as possible, we modify different graph-based aggregation schemes to solve this energy consumption minimization problem, which can be applied to different communication scenarios. Finally, a DFL framework which jointly considers the optimized rounds of local training and the energy-saving aggregation scheme is proposed. Simulation results show that the proposed algorithm achieves better performance than the conventional schemes with fixed rounds of local training, and consumes less energy than other traditional aggregation schemes.

4/1/2024

cs.LG cs.SY eess.SY