A Semantic Invariant Robust Watermark for Large Language Models

2310.06356

Published 5/21/2024 by Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, Lijie Wen

A Semantic Invariant Robust Watermark for Large Language Models

Abstract

Watermark algorithms for large language models (LLMs) have achieved extremely high accuracy in detecting text generated by LLMs. Such algorithms typically involve adding extra watermark logits to the LLM's logits at each generation step. However, prior algorithms face a trade-off between attack robustness and security robustness. This is because the watermark logits for a token are determined by a certain number of preceding tokens; a small number leads to low security robustness, while a large number results in insufficient attack robustness. In this work, we propose a semantic invariant watermarking method for LLMs that provides both attack robustness and security robustness. The watermark logits in our work are determined by the semantics of all preceding tokens. Specifically, we utilize another embedding LLM to generate semantic embeddings for all preceding tokens, and then these semantic embeddings are transformed into the watermark logits through our trained watermark model. Subsequent analyses and experiments demonstrated the attack robustness of our method in semantically invariant settings: synonym substitution and text paraphrasing settings. Finally, we also show that our watermark possesses adequate security robustness. Our code and data are available at href{https://github.com/THU-BPM/Robust_Watermark}{https://github.com/THU-BPM/Robust_Watermark}. Additionally, our algorithm could also be accessed through MarkLLM citep{pan2024markllm} footnote{https://github.com/THU-BPM/MarkLLM}.

Create account to get full access

Overview

This research paper proposes a novel method for embedding a "semantic invariant robust watermark" into large language models (LLMs) to help detect and trace unauthorized model extraction attacks.
The watermark is designed to be resistant to various techniques that could be used to remove or obfuscate it, such as "fine-tuning" the model or "adversarial attacks".
The watermark is also meant to be "stylometrically" and "topically" distinct from the model's normal outputs, making it easier to detect.

Plain English Explanation

The researchers have developed a new way to "watermark" large language models (LLMs) like GPT-3. A watermark is a hidden signal that can be detected to prove the model's origin.

This watermark is designed to be very difficult to remove or hide. Even if the model is fine-tuned on new data or attacked with special techniques, the watermark should still be detectable. It's also meant to have a distinct style and topic that stands out from the model's normal outputs.

The goal is to make it easier to catch if someone tries to illegally copy or extract an LLM. The watermark acts as a sort of digital fingerprint that can trace the model back to its creator.

Technical Explanation

The paper proposes a "semantic invariant robust watermark" that is embedded into the parameters of a pre-trained LLM. The watermark is designed to be resistant to common techniques that could be used to remove or obscure it, such as "fine-tuning" the model on new data or launching "adversarial attacks".

The watermark is also optimized to be "stylometrically" and "topically" distinct from the model's usual outputs. This makes it easier to detect the watermark in the model's generated text.

The researchers evaluate their approach on several large language models, including GPT-3, and demonstrate its effectiveness at watermarking the models and detecting unauthorized extraction attempts.

Critical Analysis

The paper presents a promising approach for watermarking LLMs, but there are some potential limitations and areas for further research.

For example, the authors note that the watermark could potentially be detected and removed by a determined adversary using sophisticated techniques. There may also be concerns around the privacy implications of embedding watermarks in models trained on user data.

Additionally, the authors focus on English-language models, and it's unclear how well the approach would generalize to other languages or modalities beyond text, such as code or multimodal models.

Overall, this research represents an important step forward in developing robust techniques for protecting the intellectual property of large language models. However, more work is needed to address the remaining challenges and ensure the widespread adoption of watermarking technology.

Conclusion

This paper introduces a novel approach for watermarking large language models to help detect and trace unauthorized model extraction attempts. The proposed watermark is designed to be semantically invariant, robust to common attacks, and stylometrically and topically distinct from the model's normal outputs.

The researchers demonstrate the effectiveness of their method on several LLMs, including GPT-3. While there are some limitations and areas for further research, this work represents a significant advancement in the field of model protection and intellectual property rights for large language models.

As LLMs continue to grow in importance and impact, techniques like this semantic invariant robust watermark will become increasingly crucial for ensuring the responsible development and deployment of these powerful AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models

Mingjia Huo, Sai Ashish Somayajula, Youwei Liang, Ruisi Zhang, Farinaz Koushanfar, Pengtao Xie

Large language models generate high-quality responses with potential misinformation, underscoring the need for regulation by distinguishing AI-generated and human-written texts. Watermarking is pivotal in this context, which involves embedding hidden markers in texts during the LLM inference phase, which is imperceptible to humans. Achieving both the detectability of inserted watermarks and the semantic quality of generated texts is challenging. While current watermarking algorithms have made promising progress in this direction, there remains significant scope for improvement. To address these challenges, we introduce a novel multi-objective optimization (MOO) approach for watermarking that utilizes lightweight networks to generate token-specific watermarking logits and splitting ratios. By leveraging MOO to optimize for both detection and semantic objective functions, our method simultaneously achieves detectability and semantic integrity. Experimental results show that our method outperforms current watermarking techniques in enhancing the detectability of texts generated by LLMs while maintaining their semantic coherence. Our code is available at https://github.com/mignonjia/TS_watermark.

6/7/2024

cs.LG cs.CL cs.CR

Adaptive Text Watermark for Large Language Models

Yepeng Liu, Yuheng Bu

The advancement of Large Language Models (LLMs) has led to increasing concerns about the misuse of AI-generated text, and watermarking for LLM-generated text has emerged as a potential solution. However, it is challenging to generate high-quality watermarked text while maintaining strong security, robustness, and the ability to detect watermarks without prior knowledge of the prompt or model. This paper proposes an adaptive watermarking strategy to address this problem. To improve the text quality and maintain robustness, we adaptively add watermarking to token distributions with high entropy measured using an auxiliary model and keep the low entropy token distributions untouched. For the sake of security and to further minimize the watermark's impact on text quality, instead of using a fixed green/red list generated from a random secret key, which can be vulnerable to decryption and forgery, we adaptively scale up the output logits in proportion based on the semantic embedding of previously generated text using a well designed semantic mapping model. Our experiments involving various LLMs demonstrate that our approach achieves comparable robustness performance to existing watermark methods. Additionally, the text generated by our method has perplexity comparable to that of emph{un-watermarked} LLMs while maintaining security even under various attacks.

6/11/2024

cs.CL

📈

Learnable Linguistic Watermarks for Tracing Model Extraction Attacks on Large Language Models

Minhao Bai, Kaiyi Pang, Yongfeng Huang

In the rapidly evolving domain of artificial intelligence, safeguarding the intellectual property of Large Language Models (LLMs) is increasingly crucial. Current watermarking techniques against model extraction attacks, which rely on signal insertion in model logits or post-processing of generated text, remain largely heuristic. We propose a novel method for embedding learnable linguistic watermarks in LLMs, aimed at tracing and preventing model extraction attacks. Our approach subtly modifies the LLM's output distribution by introducing controlled noise into token frequency distributions, embedding an statistically identifiable controllable watermark.We leverage statistical hypothesis testing and information theory, particularly focusing on Kullback-Leibler Divergence, to differentiate between original and modified distributions effectively. Our watermarking method strikes a delicate well balance between robustness and output quality, maintaining low false positive/negative rates and preserving the LLM's original performance.

5/3/2024

cs.CR cs.AI cs.CL

💬

A Watermark for Large Language Models

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein

Potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. We propose a watermarking framework for proprietary language models. The watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model API or parameters. The watermark works by selecting a randomized set of green tokens before a word is generated, and then softly promoting use of green tokens during sampling. We propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. We test the watermark using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family, and discuss robustness and security.

5/3/2024

cs.LG cs.CL cs.CR