PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs

Read original: arXiv:2406.02958 - Published 7/19/2024 by Charlie Hou, Akshat Shrivastava, Hongyuan Zhan, Rylan Conway, Trang Le, Adithya Sagar, Giulia Fanti, Daniel Lazar

Overview

This paper proposes a way to train language models on private, federated user data while preserving user privacy.
The authors introduce PrE-Text (Private Evolution-Text), a method for generating differentially private (DP) synthetic text, so that models can be trained on the synthetic data instead of on the raw data held on user devices.
The research aims to address the growing tension between the increasing demand for powerful LLMs and the need to protect user privacy in the age of ubiquitous data collection and AI development.

Plain English Explanation

The paper introduces a method called PrE-Text that allows language models, like GPT-3, to be trained without directly accessing the private data they learn from. This is important because as AI models become more advanced, there is growing concern about how user data is collected and used, and about the potential for privacy violations.

The key idea behind PrE-Text is to train the language model on synthetic data that captures the statistical properties of the real, private data, rather than using the raw data itself. This synthetic data is generated using a process that preserves the privacy of the original data, so the model can be trained without ever seeing the sensitive information.

The authors report that small models trained on this synthetic data can outperform small models trained directly on users' devices under practical privacy settings, and that fine-tuning larger models on the synthetic data improves their performance on the private data, while the private information remains protected. This could be particularly useful for applications where privacy is a major concern, such as in healthcare, finance, or personal communications.

Technical Explanation

The PrE-Text framework consists of several key components:

Federated Data Collection: The private data is collected in a federated manner, where individual users or organizations maintain control over their own data and only share aggregated statistics or model updates with a central server.
Synthetic Data Generation: A generative model is trained on the federated data to produce synthetic text that captures the statistical properties of the original data, without directly revealing any private information. This synthetic data is then used to train the language model.
Differentially Private Knowledge Distillation: The authors use a differentially private knowledge distillation approach to further enhance the privacy guarantees of the synthetic data, ensuring that individual data points cannot be inferred from the model outputs.
Synthetic Query Generation: To enable the use of the trained language model in downstream applications, the authors also propose a synthetic query generation technique that generates privacy-preserving queries for tasks like text retrieval.
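
The idea of clients sharing only aggregated statistics, which the server then uses to steer synthetic data generation, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation. It assumes a Private Evolution-style voting round in which each device votes for the synthetic candidates closest to its own data, and the server only ever sees a noisy sum of those votes. The function names (`client_votes`, `noisy_histogram`, `select_candidates`), the toy two-dimensional "embeddings", and the noise scale `sigma` are all hypothetical. Mapping `sigma` to a concrete privacy budget requires a privacy accountant, which is not shown.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def client_votes(private_embeddings: List[Vector],
                 candidates: List[Vector]) -> List[float]:
    """Runs on the device. Each private sample votes for its nearest
    synthetic candidate. Votes are scaled so one client's histogram sums
    to 1, which bounds its L2 norm by 1 (assumed sensitivity bound)."""
    votes = [0.0] * len(candidates)
    if not private_embeddings:
        return votes
    weight = 1.0 / len(private_embeddings)
    for emb in private_embeddings:
        nearest = min(range(len(candidates)),
                      key=lambda i: squared_distance(emb, candidates[i]))
        votes[nearest] += weight
    return votes


def noisy_histogram(all_votes: List[List[float]], sigma: float,
                    rng: random.Random) -> List[float]:
    """Runs on the server. Sums the client histograms and adds Gaussian
    noise to every bin, so no single client's votes are exposed exactly."""
    n_bins = len(all_votes[0])
    totals = [sum(v[i] for v in all_votes) for i in range(n_bins)]
    return [t + rng.gauss(0.0, sigma) for t in totals]


def select_candidates(hist: List[float], k: int) -> List[int]:
    """Indices of the k candidates with the highest noisy vote totals."""
    order = sorted(range(len(hist)), key=lambda i: hist[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Toy 2-D vectors stand in for text embeddings (illustrative only).
    candidates = [(0.0, 0.0), (10.0, 10.0), (-10.0, 5.0)]
    clients = [
        [(0.1, 0.2), (9.0, 9.5)],
        [(0.3, -0.1)],
        [(10.2, 9.9), (9.8, 10.1)],
    ]
    votes = [client_votes(c, candidates) for c in clients]
    hist = noisy_histogram(votes, sigma=0.5, rng=random.Random(0))
    print("noisy votes:", hist)
    print("kept candidates:", select_candidates(hist, k=2))
```

In a full pipeline the kept candidates would be handed to a language model to produce variations, and the round would repeat. The raw text never leaves the device in this sketch; only the vote histogram does.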

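The authors evaluate PrE-Text on multiple datasets. Under the privacy budgets they consider practical ($\epsilon=1.29$ and $\epsilon=7.58$), small models trained on PrE-Text synthetic data outperform small models trained on-device, while using 9 times fewer rounds, 6 times less client computation per round, and 100 times less communication per round. Fine-tuning large models on the same synthetic data also improves their performance on the private data.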
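
The privacy guarantees referred to here are differential privacy guarantees, and the budgets quoted in the paper's abstract (1.29 and 7.58) are easier to interpret against the standard definition. The statement below is the textbook definition, not something specific to this paper. The symbols $\mathcal{M}$ (the randomized mechanism), $D$ and $D'$ (neighboring datasets), $S$ (a set of outputs), and $\delta$ are introduced here for illustration, and the exact notion of "neighboring" used by the authors is not stated in this summary.

```latex
% (\epsilon, \delta)-differential privacy: for all neighboring datasets
% D, D' and every set of outputs S,
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon}\, \Pr[\mathcal{M}(D') \in S] + \delta

% Multiplicative factors for the two budgets in the abstract:
e^{1.29} \approx 3.63, \qquad e^{7.58} \approx 1.96 \times 10^{3}
```

A smaller $\epsilon$ means the output distribution changes less when one participant's data is added or removed, so $\epsilon=1.29$ is a considerably tighter guarantee than $\epsilon=7.58$.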

Critical Analysis

The PrE-Text framework represents a promising approach to training powerful language models while preserving user privacy. However, the authors acknowledge several limitations and areas for further research:

The synthetic data generation process may not fully capture all the nuances and idiosyncrasies of the original data, which could potentially impact the model's performance on certain tasks.
The privacy guarantees provided by the differentially private knowledge distillation and synthetic query generation techniques have not been thoroughly tested in real-world deployment scenarios.
The authors do not address the potential for misuse of the synthetic data or the trained language model, such as the generation of harmful or biased content.

Future research could explore ways to further improve the fidelity of the synthetic data or develop more robust privacy-preserving techniques that can withstand a wider range of potential threats.

Conclusion

The PrE-Text framework represents a significant step forward in the quest to train capable language models while respecting user privacy. By leveraging federated data collection, synthetic data generation, and differentially private techniques, the authors have demonstrated a novel approach that could have important implications for the future of large language model development and deployment, particularly in sensitive domains.

As AI systems become increasingly ubiquitous in our daily lives, the need to balance technological progress with the protection of individual privacy will only grow more pressing. The PrE-Text framework offers a promising direction for addressing this challenge and paves the way for a future where powerful AI tools can be developed and deployed in a responsible and ethical manner.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs

Charlie Hou, Akshat Shrivastava, Hongyuan Zhan, Rylan Conway, Trang Le, Adithya Sagar, Giulia Fanti, Daniel Lazar

On-device training is currently the most common approach for training machine learning (ML) models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes ($\epsilon=1.29$, $\epsilon=7.58$). We achieve these results while using 9$\times$ fewer rounds, 6$\times$ less client computation per round, and 100$\times$ less communication per round. Second, finetuning large models on PrE-Text's DP synthetic data improves large language model (LLM) performance on private data across the same range of privacy budgets. Altogether, these results suggest that training on DP synthetic data can be a better option than training a model on-device on private distributed data. Code is available at https://github.com/houcharlie/PrE-Text.

7/19/2024

Prompt Public Large Language Models to Synthesize Data for Private On-device Applications

Shanshan Wu, Zheng Xu, Yanxiang Zhang, Yuanbo Zhang, Daniel Ramage

Pre-training on public data is an effective method to improve the performance for federated learning (FL) with differential privacy (DP). This paper investigates how large language models (LLMs) trained on public data can improve the quality of pre-training data for the on-device language models trained with DP and FL. We carefully design LLM prompts to filter and transform existing public data, and generate new data to resemble the real user data distribution. The model pre-trained on our synthetic dataset achieves relative improvement of 19.0% and 22.8% in next word prediction accuracy compared to the baseline model pre-trained on a standard public dataset, when evaluated over the real user data in Gboard (Google Keyboard, a production mobile keyboard application). Furthermore, our method achieves evaluation accuracy better than or comparable to the baseline during the DP FL fine-tuning over millions of mobile devices, and our final model outperforms the baseline in production A/B testing. Our experiments demonstrate the strengths of LLMs in synthesizing data close to the private distribution even without accessing the private data, and also suggest future research directions to further reduce the distribution gap.

8/9/2024

Differentially Private Synthetic Data via Foundation Model APIs 2: Text

Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Huseyin A Inan, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin

Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.

7/25/2024

Private prediction for large-scale synthetic text generation

Kareem Amin, Alex Bie, Weiwei Kong, Alexey Kurakin, Natalia Ponomareva, Umar Syed, Andreas Terzis, Sergei Vassilvitskii

We present an approach for generating differentially private synthetic text using large language models (LLMs), via private prediction. In the private prediction framework, we only require the output synthetic data to satisfy differential privacy guarantees. This is in contrast to approaches that train a generative model on potentially sensitive user-supplied source data and seek to ensure the model itself is safe to release. We prompt a pretrained LLM with source data, but ensure that next-token predictions are made with differential privacy guarantees. Previous work in this paradigm reported generating a small number of examples (<10) at reasonable privacy levels, an amount of data that is useful only for downstream in-context learning or prompting. In contrast, we make changes that allow us to generate thousands of high-quality synthetic data points, greatly expanding the set of potential applications. Our improvements come from an improved privacy analysis and a better private selection mechanism, which makes use of the equivalence between the softmax layer for sampling tokens in LLMs and the exponential mechanism. Furthermore, we introduce a novel use of public predictions via the sparse vector technique, in which we do not pay privacy costs for tokens that are predictable without sensitive data; we find this to be particularly effective for structured data.

7/18/2024