Incentivising the federation: gradient-based metrics for data selection and valuation in private decentralised training

2305.02942

Published 4/17/2024 by Dmitrii Usynin, Daniel Rueckert, Georgios Kaissis

📊

Abstract

Obtaining high-quality data for collaborative training of machine learning models can be a challenging task due to A) regulatory concerns and B) a lack of data owner incentives to participate. The first issue can be addressed through the combination of distributed machine learning techniques (e.g. federated learning) and privacy enhancing technologies (PET), such as the differentially private (DP) model training. The second challenge can be addressed by rewarding the participants for giving access to data which is beneficial to the training model, which is of particular importance in federated settings, where the data is unevenly distributed. However, DP noise can adversely affect the underrepresented and the atypical (yet often informative) data samples, making it difficult to assess their usefulness. In this work, we investigate how to leverage gradient information to permit the participants of private training settings to select the data most beneficial for the jointly trained model. We assess two such methods, namely variance of gradients (VoG) and the privacy loss-input susceptibility score (PLIS). We show that these techniques can provide the federated clients with tools for principled data selection even in stricter privacy settings.

Create account to get full access

Overview

Obtaining high-quality data for collaborative training of machine learning models can be challenging due to regulatory concerns and lack of data owner incentives to participate.
Distributed machine learning techniques like federated learning and privacy-enhancing technologies (PETs) like differential privacy can help address regulatory concerns.
Rewarding data owners for contributing beneficial data is important, especially in federated settings where data is unevenly distributed.
Differential privacy noise can adversely affect underrepresented and atypical (yet often informative) data samples, making it difficult to assess their usefulness.

Plain English Explanation

Training powerful machine learning models often requires access to large, high-quality datasets. However, obtaining such data can be challenging due to two key issues. First, there are often regulatory concerns around data privacy and security that limit how data can be shared and used. Second, the owners of valuable data may not have strong incentives to contribute their data to the training process.

Researchers have developed techniques to help address these challenges. Federated learning allows machine learning models to be trained on data distributed across multiple devices or organizations, without that data needing to be centralized. Differential privacy is a technique that can be used to train models in a way that protects the privacy of the underlying data.

However, these privacy-preserving techniques can have downsides. The noise introduced by differential privacy, for example, can make it harder to identify and use the most informative data samples, particularly those that are underrepresented in the overall dataset. This is a problem, as these atypical samples are often quite valuable for training accurate models.

Technical Explanation

The paper investigates methods to help participants in private, collaborative machine learning settings select the data that is most beneficial for the jointly trained model. The researchers assess two such techniques: the variance of gradients (VoG) and the privacy loss-input susceptibility score (PLIS).

The VoG method uses the variance in the gradients computed for a given data sample as a proxy for its potential usefulness. Samples with higher gradient variance are more likely to contain unique or informative signals that can improve the model.

The PLIS method, on the other hand, estimates how much a given data sample contributes to the overall privacy loss of the trained model. Samples that have a higher PLIS are more "influential" and therefore more valuable for improving the model.

The paper demonstrates that both the VoG and PLIS techniques can provide federated clients with principled ways to select the most useful data for model training, even in settings with strict privacy constraints imposed by differential privacy.

Critical Analysis

The paper presents promising approaches for data selection in private, collaborative machine learning settings. By leveraging gradient information, these methods can help identify valuable data samples that might otherwise be overlooked or deemphasized due to the noise introduced by differential privacy.

However, the paper does not address some potential limitations of these techniques. For example, the gradient-based metrics may not capture all aspects of data quality and usefulness, such as the representativeness of the data or its long-term value for the model. There may also be challenges in scaling these methods to very large or heterogeneous datasets.

Additionally, the paper focuses on the technical aspects of the data selection methods, but does not extensively discuss the broader incentive structures and governance models that would be needed to encourage widespread participation in private, collaborative ML efforts. Further research may be needed to understand how to design effective ecosystems for sharing data and training models in a fair and equitable way.

Conclusion

This paper proposes innovative methods to help address two key challenges in private, collaborative machine learning: regulatory concerns around data privacy and the lack of incentives for data owners to contribute their valuable information. By leveraging gradient-based metrics, the VoG and PLIS techniques can empower federated clients to select the most useful data for model training, even in the presence of strict privacy constraints.

These approaches represent an important step forward in enabling high-quality, privacy-preserving collaborative ML. As the field continues to evolve, further research will be needed to scale these methods, integrate them with broader incentive structures, and ensure equitable access to the benefits of collaborative model training.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Incentives in Private Collaborative Machine Learning

Rachael Hwee Ling Sim, Yehong Zhang, Trong Nghia Hoang, Xinyi Xu, Bryan Kian Hsiang Low, Patrick Jaillet

Collaborative machine learning involves training models on data from multiple parties but must incentivize their participation. Existing data valuation methods fairly value and reward each party based on shared data or model parameters but neglect the privacy risks involved. To address this, we introduce differential privacy (DP) as an incentive. Each party can select its required DP guarantee and perturb its sufficient statistic (SS) accordingly. The mediator values the perturbed SS by the Bayesian surprise it elicits about the model parameters. As our valuation function enforces a privacy-valuation trade-off, parties are deterred from selecting excessive DP guarantees that reduce the utility of the grand coalition's model. Finally, the mediator rewards each party with different posterior samples of the model parameters. Such rewards still satisfy existing incentives like fairness but additionally preserve DP and a high similarity to the grand coalition's posterior. We empirically demonstrate the effectiveness and practicality of our approach on synthetic and real-world datasets.

4/3/2024

cs.LG

Enhancing Federated Learning with Adaptive Differential Privacy and Priority-Based Aggregation

Mahtab Talaei, Iman Izadi

Federated learning (FL), a novel branch of distributed machine learning (ML), develops global models through a private procedure without direct access to local datasets. However, it is still possible to access the model updates (gradient updates of deep neural networks) transferred between clients and servers, potentially revealing sensitive local information to adversaries using model inversion attacks. Differential privacy (DP) offers a promising approach to addressing this issue by adding noise to the parameters. On the other hand, heterogeneities in data structure, storage, communication, and computational capabilities of devices can cause convergence problems and delays in developing the global model. A personalized weighted averaging of local parameters based on the resources of each device can yield a better aggregated model in each round. In this paper, to efficiently preserve privacy, we propose a personalized DP framework that injects noise based on clients' relative impact factors and aggregates parameters while considering heterogeneities and adjusting properties. To fulfill the DP requirements, we first analyze the convergence boundary of the FL algorithm when impact factors are personalized and fixed throughout the learning process. We then further study the convergence property considering time-varying (adaptive) impact factors.

6/27/2024

cs.LG cs.CR cs.DC

📊

Data Valuation and Detections in Federated Learning

Wenqian Li, Shuran Fu, Fengrui Zhang, Yan Pang

Federated Learning (FL) enables collaborative model training while preserving the privacy of raw data. A challenge in this framework is the fair and efficient valuation of data, which is crucial for incentivizing clients to contribute high-quality data in the FL task. In scenarios involving numerous data clients within FL, it is often the case that only a subset of clients and datasets are pertinent to a specific learning task, while others might have either a negative or negligible impact on the model training process. This paper introduces a novel privacy-preserving method for evaluating client contributions and selecting relevant datasets without a pre-specified training algorithm in an FL task. Our proposed approach FedBary, utilizes Wasserstein distance within the federated context, offering a new solution for data valuation in the FL framework. This method ensures transparent data valuation and efficient computation of the Wasserstein barycenter and reduces the dependence on validation datasets. Through extensive empirical experiments and theoretical analyses, we demonstrate the potential of this data valuation method as a promising avenue for FL research.

5/10/2024

cs.LG cs.AI cs.CR

🔄

Federated Transfer Learning with Differential Privacy

Mengchu Li, Ye Tian, Yang Feng, Yi Yu

Federated learning is gaining increasing popularity, with data heterogeneity and privacy being two prominent challenges. In this paper, we address both issues within a federated transfer learning framework, aiming to enhance learning on a target data set by leveraging information from multiple heterogeneous source data sets while adhering to privacy constraints. We rigorously formulate the notion of textit{federated differential privacy}, which offers privacy guarantees for each data set without assuming a trusted central server. Under this privacy constraint, we study three classical statistical problems, namely univariate mean estimation, low-dimensional linear regression, and high-dimensional linear regression. By investigating the minimax rates and identifying the costs of privacy for these problems, we show that federated differential privacy is an intermediate privacy model between the well-established local and central models of differential privacy. Our analyses incorporate data heterogeneity and privacy, highlighting the fundamental costs of both in federated learning and underscoring the benefit of knowledge transfer across data sets.

4/10/2024

cs.LG cs.CR stat.ML