Robust Distortion-free Watermarks for Language Models

arXiv:2307.15593

Published 6/7/2024 by Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, Percy Liang

Abstract

We propose a methodology for planting watermarks in text from an autoregressive language model that are robust to perturbations without changing the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers -- which we compute using a randomized watermark key -- to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models -- OPT-1.3B, LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text ($p \leq 0.01$) from $35$ tokens even after corrupting between $40$-$50\%$ of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around $25\%$ of the responses -- whose median length is around $100$ tokens -- are detectable with $p \leq 0.01$, and the watermark is also less robust to certain automated paraphrasing attacks we implement.

Overview

The researchers propose a method for embedding watermarks in text generated by large language models (LLMs) to detect if the text was produced by a specific model.
The watermarks are designed to be robust to perturbations of the text while being distortion-free: they leave the distribution of the generated text unchanged, up to a maximum generation budget.
The researchers tested their watermarking approach on several popular LLMs, including OPT-1.3B, LLaMA-7B, and Alpaca-7B.

Plain English Explanation

The researchers have developed a way to mark the text produced by large language models such as OPT or LLaMA. This "watermark" is a hidden signal that can be detected by anyone who knows the secret key. Even if the text is edited afterwards, the watermark can often still be found.

The key idea is to use a sequence of random numbers, derived from a secret key, to decide which tokens the language model outputs. To detect the watermark, you align the text back to that random number sequence using the key. The researchers tested this on several language models and found that, for OPT-1.3B and LLaMA-7B, they could reliably detect the watermark from as few as 35 tokens even after 40-50% of the tokens had been changed by random substitutions, insertions, or deletions.

This could be useful for checking whether text was produced by a specific language model, for example when investigating misinformation or copyright violations. However, the watermark is harder to detect when the model's output is more predictable (lower entropy), as with the Alpaca-7B model responding to user instructions.

Technical Explanation

The researchers propose a methodology for embedding watermarks in the text generated by large language models (LLMs). Detection is keyed rather than public: any party who knows the watermark key can test whether a given text was generated by the watermarked model.

The core idea is to map a sequence of random numbers, computed using a secret watermark key, to a sample from the language model. To detect the watermark, one can align the text back to the random number sequence using the key. The researchers experiment with two sampling schemes: inverse transform sampling and exponential minimum sampling.
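The two sampling schemes named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy distribution, and the way random numbers are drawn are assumptions made for the example. It checks empirically that, averaged over fresh random numbers, both decoders reproduce the model's token probabilities, which is what "distortion-free" means here.

```python
import math
import random

def exp_min_sample(probs, xi):
    """Exponential minimum sampling: return argmin_i of -log(xi[i]) / probs[i]."""
    best, best_cost = None, math.inf
    for i, (p, u) in enumerate(zip(probs, xi)):
        if p <= 0.0:
            continue  # zero-probability tokens are never selected
        cost = -math.log(u) / p
        if cost < best_cost:
            best, best_cost = i, cost
    return best

def inverse_transform_sample(probs, u, perm):
    """Inverse transform sampling: walk the CDF in the token order given by perm."""
    cdf = 0.0
    for token in perm:
        cdf += probs[token]
        if u < cdf:
            return token
    return perm[-1]  # guard against floating-point round-off

def draw_exp(probs, rng):
    # One uniform in (0, 1] per vocabulary entry (hypothetical key material).
    xi = [1.0 - rng.random() for _ in probs]
    return exp_min_sample(probs, xi)

def draw_its(probs, rng):
    # One uniform in [0, 1) plus a random permutation of the vocabulary.
    perm = list(range(len(probs)))
    rng.shuffle(perm)
    return inverse_transform_sample(probs, rng.random(), perm)

def empirical_freqs(draw, probs, trials, seed):
    """Frequency of each token when the random numbers are redrawn every trial."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    for _ in range(trials):
        counts[draw(probs, rng)] += 1
    return [c / trials for c in counts]

toy_probs = [0.5, 0.3, 0.2]  # stand-in for a model's next-token distribution
freq_exp = empirical_freqs(draw_exp, toy_probs, trials=20000, seed=0)
freq_its = empirical_freqs(draw_its, toy_probs, trials=20000, seed=1)
print(freq_exp, freq_its)
```

Given fixed random numbers, each decoder is deterministic; that dependence of the chosen token on the key is what a detector can later test for.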

They apply this watermarking technique to three LLMs: OPT-1.3B, LLaMA-7B, and Alpaca-7B. For the OPT-1.3B and LLaMA-7B models, watermarked text remains reliably detectable (p ≤ 0.01) from 35 tokens even after 40-50% of the tokens are corrupted by random substitutions, insertions, or deletions.
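To make keyed detection concrete, here is a simplified sketch of a detector with a resampling-based p-value. It is an illustration under stated assumptions, not the paper's test statistic: the "model" is a fixed uniform distribution over a 20-token vocabulary, the alignment only searches cyclic shifts of the key, and the corruption only substitutes tokens (handling insertions and deletions needs a more elaborate alignment cost, which is not reproduced here). All function names and seeds are hypothetical.

```python
import math
import random

EPS = 1e-12  # keeps every logarithm below finite

def make_key(seed, length, vocab_size):
    """Hypothetical key: `length` vectors of uniforms derived from an integer seed."""
    rng = random.Random(seed)
    return [[min(max(rng.random(), EPS), 1.0 - EPS) for _ in range(vocab_size)]
            for _ in range(length)]

def generate(probs, key, n_tokens):
    """Toy generator: exponential minimum sampling from one fixed distribution."""
    tokens = []
    for t in range(n_tokens):
        xi = key[t % len(key)]
        tokens.append(min(range(len(probs)), key=lambda i: -math.log(xi[i]) / probs[i]))
    return tokens

def substitute(tokens, fraction, vocab_size, seed):
    """Replace a fixed fraction of positions with uniformly random tokens."""
    rng = random.Random(seed)
    out = list(tokens)
    for pos in rng.sample(range(len(out)), int(fraction * len(out))):
        out[pos] = rng.randrange(vocab_size)
    return out

def statistic(tokens, key):
    """Minimum alignment cost over all cyclic shifts of the key (lower = better fit)."""
    n = len(key)
    return min(
        sum(math.log(1.0 - key[(t + shift) % n][tok]) for t, tok in enumerate(tokens))
        for shift in range(n)
    )

def p_value(tokens, key, vocab_size, n_resamples=99, seed=1000):
    """Compare the statistic under the real key with statistics under fresh keys."""
    observed = statistic(tokens, key)
    hits = sum(
        statistic(tokens, make_key(seed + r, len(key), vocab_size)) <= observed
        for r in range(n_resamples)
    )
    return (1 + hits) / (n_resamples + 1)

VOCAB = 20
probs = [1.0 / VOCAB] * VOCAB  # high-entropy toy distribution
key = make_key(seed=0, length=35, vocab_size=VOCAB)
text = generate(probs, key, n_tokens=35)
edited = substitute(text, fraction=0.4, vocab_size=VOCAB, seed=7)
print(p_value(edited, key, VOCAB))
```

With 99 resampled keys the smallest attainable p-value is 0.01; a finer resolution requires more resamples. Because the p-value is computed by comparison with fresh keys, it does not depend on any assumption about how unwatermarked text is distributed.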

However, for the instruction-following Alpaca-7B model, detection is harder because the responses have lower entropy. Only about 25% of the Alpaca-7B responses, whose median length is around 100 tokens, are detectable with p ≤ 0.01, and the watermark is also less robust to certain automated paraphrasing attacks.
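One way to see why low entropy hurts detection can be sketched for exponential minimum sampling. The symbols $p$, $\xi$, $E_i$, $M$ and $y$ below are notation introduced for this illustration, and the quantity $-\log \xi_y$ is used as a convenient proxy for the watermark signal rather than as the paper's test statistic.

```latex
% Assumed setup: p is the next-token distribution, \xi_i \sim \mathrm{Unif}(0,1) i.i.d.
y = \arg\min_i \frac{-\log \xi_i}{p_i}, \qquad E_i := -\log \xi_i \sim \mathrm{Exp}(1).

% E_i / p_i is exponential with rate p_i, so the minimum M has rate \sum_i p_i = 1
% and is independent of which index attains it:
M := \min_i \frac{E_i}{p_i} \sim \mathrm{Exp}(1), \qquad
\Pr[y = i] = p_i, \qquad
-\log \xi_y = p_y M.

% Watermarked token:  \mathbb{E}[-\log \xi_y] = \sum_i p_i \cdot p_i \cdot \mathbb{E}[M] = \sum_i p_i^2
% Unrelated token:    \mathbb{E}[-\log \xi_y] = 1   (\xi_y is uniform and independent of y)
\text{expected per-token gap} = 1 - \sum_i p_i^2 .
```

The gap shrinks to zero as the model becomes near-deterministic ($\sum_i p_i^2 \to 1$), which is consistent with lower-entropy responses carrying less watermark signal per token. The line $\Pr[y = i] = p_i$ also shows why this sampler does not distort the model's distribution.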

Critical Analysis

The researchers present a promising approach to distortion-free watermarking that can help trace the origin of text generated by LLMs. This could be valuable for detecting model misuse, such as the generation of misinformation or copyright infringement.

However, the paper also acknowledges several limitations. The watermark is harder to detect in lower-entropy text, such as the responses of Alpaca-7B, which was fine-tuned to follow user instructions. Additionally, the researchers only tested their approach against certain types of text perturbations, and more advanced attacks may still be able to remove the watermark.

Further research is needed to understand the broader applicability and robustness of this watermarking approach. Comparing it with related proposals, such as edit distance-robust watermarks, could also be a fruitful area of investigation.

Conclusion

The researchers have developed a method for embedding watermarks, detectable by anyone who holds the key, in the text generated by large language models. These watermarks are designed to be robust to various text modifications while leaving the distribution of the generated text unchanged up to a maximum generation budget.

While the approach shows promise for tracking the origin of text, it also has limitations, especially when the generated text has low entropy. As the use of LLMs continues to grow, further advances in reliable watermarking techniques could play an important role in ensuring the responsible and accountable development of these technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

On the Reliability of Watermarks for Large Language Models

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, Tom Goldstein

As LLMs become commonplace, machine-generated text has the potential to flood the internet with spam, social media bots, and valueless content. Watermarking is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial question remains: How reliable is watermarking in realistic settings in the wild? There, watermarked text may be modified to suit a user's needs, or entirely rewritten to avoid detection. We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document. We find that watermarks remain detectable even after human and machine paraphrasing. While these attacks dilute the strength of the watermark, paraphrases are statistically likely to leak n-grams or even longer fragments of the original text, resulting in high-confidence detections when enough tokens are observed. For example, after strong human paraphrasing the watermark is detectable after observing 800 tokens on average, when setting a 1e-5 false positive rate. We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document, and we compare the robustness of watermarking to other kinds of detectors.

5/3/2024

cs.LG cs.CL cs.CR

A Watermark for Large Language Models

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein

Potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. We propose a watermarking framework for proprietary language models. The watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model API or parameters. The watermark works by selecting a randomized set of green tokens before a word is generated, and then softly promoting use of green tokens during sampling. We propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. We test the watermark using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family, and discuss robustness and security.

5/3/2024

cs.LG cs.CL cs.CR

Publicly-Detectable Watermarking for Language Models

Jaiden Fairoze, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Mingyuan Wang

We present a highly detectable, trustless watermarking scheme for LLMs: the detection algorithm contains no secret information, and it is executable by anyone. We embed a publicly-verifiable cryptographic signature into LLM output using rejection sampling. We prove that our scheme is cryptographically correct, sound, and distortion-free. We make novel uses of error-correction techniques to overcome periods of low entropy, a barrier for all prior watermarking schemes. We implement our scheme and make empirical measurements over open models in the 2.7B to 70B parameter range. Our experiments suggest that our formal claims are met in practice.

5/29/2024

cs.LG cs.CL cs.CR

Edit Distance Robust Watermarks for Language Models

Noah Golowich, Ankur Moitra

Motivated by the problem of detecting AI-generated text, we consider the problem of watermarking the output of language models with provable guarantees. We aim for watermarks which satisfy: (a) undetectability, a cryptographic notion introduced by Christ, Gunn & Zamir (2024) which stipulates that it is computationally hard to distinguish watermarked language model outputs from the model's actual output distribution; and (b) robustness to channels which introduce a constant fraction of adversarial insertions, substitutions, and deletions to the watermarked text. Earlier schemes could only handle stochastic substitutions and deletions, and thus we are aiming for a more natural and appealing robustness guarantee that holds with respect to edit distance. Our main result is a watermarking scheme which achieves both undetectability and robustness to edits when the alphabet size for the language model is allowed to grow as a polynomial in the security parameter. To derive such a scheme, we follow an approach introduced by Christ & Gunn (2024), which proceeds via first constructing pseudorandom codes satisfying undetectability and robustness properties analogous to those above; our key idea is to handle adversarial insertions and deletions by interpreting the symbols as indices into the codeword, which we call indexing pseudorandom codes. Additionally, our codes rely on weaker computational assumptions than used in previous work. Then we show that there is a generic transformation from such codes over large alphabets to watermarking schemes for arbitrary language models.

6/6/2024

cs.CR cs.AI cs.LG