Removing RLHF Protections in GPT-4 via Fine-Tuning

arXiv:2311.05553

Published 4/9/2024 by Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang


Abstract

As large language models (LLMs) have increased in their capabilities, so has their potential for dual use. To reduce harmful outputs, producers and vendors of LLMs have used reinforcement learning with human feedback (RLHF). In tandem, LLM vendors have been increasingly enabling fine-tuning of their most powerful models. However, concurrent work has shown that fine-tuning can remove RLHF protections. We may expect that the most powerful models currently available (GPT-4) are less susceptible to fine-tuning attacks. In this work, we show the contrary: fine-tuning allows attackers to remove RLHF protections with as few as 340 examples and a 95% success rate. These training examples can be automatically generated with weaker models. We further show that removing RLHF protections does not decrease usefulness on non-censored outputs, providing evidence that our fine-tuning strategy does not decrease usefulness despite using weaker models to generate training data. Our results show the need for further research on protections on LLMs.


Overview

As large language models (LLMs) have become more powerful, they also present increased potential for misuse.
To mitigate harmful outputs, LLM developers have used reinforcement learning with human feedback (RLHF) as a safety measure.
At the same time, LLM vendors have enabled fine-tuning of their most powerful models, and concurrent work has shown that fine-tuning can remove RLHF protections.
This paper challenges the assumption that the most powerful models like GPT-4 are less susceptible to fine-tuning attacks.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can generate human-like text. As these models have become more advanced, there's growing concern about their potential to be misused, such as generating harmful or false information.

To address this, the companies that create LLMs have been using a technique called "reinforcement learning with human feedback" (RLHF). This means they train the models to avoid producing harmful outputs by having humans provide feedback during the training process.

At the same time, these companies have also been allowing users to "fine-tune" their most powerful LLMs, which means adapting the models for specific tasks. Concurrent research has shown that this fine-tuning process can remove the RLHF safety protections.

This new paper challenges the assumption that the most powerful LLMs, like GPT-4, are less vulnerable to these fine-tuning attacks. The researchers found that it's actually possible to remove the RLHF protections from these models with as few as 340 training examples, and they can do this using weaker models to generate the training data.

The researchers also found that removing the RLHF protections doesn't decrease the model's usefulness on ordinary, non-censored outputs, even though the training data was generated by weaker models. In other words, the attack does not come at the cost of model performance.

These results highlight the need for further research on how to better protect powerful LLMs from misuse, even when they're being fine-tuned for specific tasks.

Technical Explanation

The researchers in this paper investigated whether the most powerful LLMs, such as GPT-4, are less susceptible to fine-tuning attacks that can remove their RLHF protections.

They found that it is possible to fine-tune these models and remove their RLHF protections with as few as 340 training examples, with a 95% success rate. Importantly, they were able to generate these training examples using weaker models, rather than having to create them manually.
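To make the headline figure concrete: a success rate of this kind is the fraction of test prompts for which the fine-tuned model's response was judged to bypass the protections. The sketch below is an illustration only. The function names, the boolean labels, and the sample counts are hypothetical and are not taken from the paper, and the Wilson interval is added here simply to show how much uncertainty a small evaluation set can leave around such a figure.

```python
import math
from typing import Sequence, Tuple


def success_rate(labels: Sequence[bool]) -> float:
    """Fraction of prompts labelled True (judged as a successful bypass)."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(labels) / len(labels)


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical labels (NOT data from the paper): 19 of 20 prompts judged True.
labels = [True] * 19 + [False]
rate = success_rate(labels)
low, high = wilson_interval(sum(labels), len(labels))
print(f"success rate = {rate:.2f}, interval = [{low:.2f}, {high:.2f}]")
```

With only a handful of test prompts the interval is wide, so a reported rate is best read alongside the size of the evaluation set.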

The researchers also showed that removing the RLHF protections did not decrease the overall usefulness of the models on non-censored outputs. This suggests that their fine-tuning strategy does not come at the cost of model performance.
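One way to read this claim is as a per-benchmark comparison between the base model and the fine-tuned model. The following is a minimal sketch under stated assumptions: the benchmark names, the scores, and the tolerance threshold are placeholders invented for illustration, not results or methodology from the paper.

```python
from typing import Dict


def score_deltas(base: Dict[str, float], tuned: Dict[str, float]) -> Dict[str, float]:
    """Per-benchmark change (tuned minus base) for benchmarks present in both."""
    shared = sorted(base.keys() & tuned.keys())
    return {name: tuned[name] - base[name] for name in shared}


def regressions(deltas: Dict[str, float], tolerance: float = 0.01) -> Dict[str, float]:
    """Benchmarks whose score dropped by more than `tolerance`."""
    return {name: delta for name, delta in deltas.items() if delta < -tolerance}


# Placeholder scores (NOT from the paper), on a 0-1 scale.
base_scores = {"benchmark_a": 0.80, "benchmark_b": 0.65}
tuned_scores = {"benchmark_a": 0.79, "benchmark_b": 0.60}

deltas = score_deltas(base_scores, tuned_scores)
dropped = regressions(deltas, tolerance=0.02)
print("deltas:", deltas)
print("regressions beyond tolerance:", dropped)
```

A claim of "no decrease in usefulness" corresponds to the regression set being empty for a tolerance chosen before looking at the results.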

These findings contradict the assumption that the most powerful LLMs are less vulnerable to fine-tuning attacks. The results highlight the need for further research on how to better protect LLMs from misuse, even when they are being fine-tuned for specific tasks.

Critical Analysis

The researchers acknowledge several limitations and areas for further research in their paper:

They note that their experiments were conducted on a subset of LLMs, and the results may not generalize to all powerful models. Further research could explore the vulnerabilities of a wider range of LLMs.
The paper does not address the potential for adversarial attacks or other advanced techniques that could be used to bypass RLHF protections. Additional research is needed to understand the full scope of these vulnerabilities.
The researchers used a relatively small number of training examples to remove the RLHF protections. It's possible that more sophisticated fine-tuning strategies or larger datasets could further exacerbate these vulnerabilities.
The paper does not explore the long-term implications of removing RLHF protections, such as the potential for models to drift away from their intended behaviors over time. Ongoing monitoring may be necessary to ensure the continued safety and reliability of these systems.

Overall, this paper highlights the need for continued vigilance and further research into the security and robustness of powerful LLMs, even as they are being fine-tuned for specific applications.

Conclusion

This paper reveals a concerning vulnerability in even the most powerful large language models (LLMs): fine-tuning can remove their built-in RLHF protections with a 95% success rate, using as few as 340 training examples generated by weaker models.

The researchers' findings challenge the assumption that the latest and greatest LLMs, like GPT-4, are less susceptible to these types of attacks. Their work demonstrates the need for ongoing research and development of more robust safeguards to prevent the misuse of these influential AI systems.

As LLMs continue to advance and become more widely deployed, ensuring their responsible and ethical use will be a critical priority for the AI community. This paper serves as an important wake-up call, underscoring the importance of proactively addressing the potential vulnerabilities of these powerful technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Increased LLM Vulnerabilities from Fine-tuning and Quantization

Divyanshu Kumar, Anurakt Kumar, Sahil Agarwal, Prashanth Harshangi

Large Language Models (LLMs) have become very popular and have found use cases in many domains, such as chatbots, auto-task completion agents, and much more. However, LLMs are vulnerable to different types of attacks, such as jailbreaking, prompt injection attacks, and privacy leakage attacks. Foundational LLMs undergo adversarial and alignment training to learn not to generate malicious and toxic content. For specialized use cases, these foundational LLMs are subjected to fine-tuning or quantization for better performance and efficiency. We examine the impact of downstream tasks such as fine-tuning and quantization on LLM vulnerability. We test foundation models like Mistral, Llama, MosaicML, and their fine-tuned versions. Our research shows that fine-tuning and quantization reduce jailbreak resistance significantly, leading to increased LLM vulnerabilities. Finally, we demonstrate the utility of external guardrails in reducing LLM vulnerabilities.

4/9/2024

cs.CR cs.AI


HFT: Half Fine-Tuning for Large Language Models

Tingfeng Hui, Zhenyu Zhang, Shuohuan Wang, Weiran Xu, Yu Sun, Hua Wu

Large language models (LLMs) with one or more fine-tuning phases have become a necessary step to unlock various capabilities, enabling LLMs to follow natural language instructions or align with human preferences. However, it carries the risk of catastrophic forgetting during sequential training: the parametric knowledge or the ability learned in previous stages may be overwhelmed by incoming training data. In this paper, we find that by regularly resetting partial parameters, LLMs can restore some of the original knowledge. Inspired by this, we introduce Half Fine-Tuning (HFT) for LLMs, as a substitute for full fine-tuning (FFT), to mitigate the forgetting issues, where half of the parameters are selected to learn new tasks while the other half are frozen to retain previous knowledge. We provide a feasibility analysis from the perspective of optimization and interpret the parameter selection operation as a regularization term. Without changing the model architecture, HFT could be seamlessly integrated into existing fine-tuning frameworks. Extensive experiments and analysis on supervised fine-tuning, direct preference optimization, and continual learning consistently demonstrate the effectiveness, robustness, and efficiency of HFT. Compared with FFT, HFT not only significantly alleviates the forgetting problem, but also achieves the best performance in a series of downstream benchmarks, with an approximately 30% reduction in training time.

4/30/2024

cs.CL


Fine-tuning Reinforcement Learning Models is Secretly a Forgetting Mitigation Problem

Maciej Wołczyk, Bartłomiej Cupiał, Mateusz Ostaszewski, Michał Bortkiewicz, Michał Zając, Razvan Pascanu, Łukasz Kuciński, Piotr Miłoś

Fine-tuning is a widespread technique that allows practitioners to transfer pre-trained capabilities, as recently showcased by the successful applications of foundation models. However, fine-tuning reinforcement learning (RL) models remains a challenge. This work conceptualizes one specific cause of poor transfer, accentuated in the RL setting by the interplay between actions and observations: forgetting of pre-trained capabilities. Namely, a model deteriorates on the state subspace of the downstream task not visited in the initial phase of fine-tuning, on which the model behaved well due to pre-training. This way, we lose the anticipated transfer benefits. We identify conditions when this problem occurs, showing that it is common and, in many cases, catastrophic. Through a detailed empirical analysis of the challenging NetHack and Montezuma's Revenge environments, we show that standard knowledge retention techniques mitigate the problem and thus allow us to take full advantage of the pre-trained capabilities. In particular, in NetHack, we achieve a new state-of-the-art for neural models, improving the previous best score from 5K to over 10K points in the Human Monk scenario.

5/14/2024

cs.LG


Backdoor Removal for Generative Large Language Models

Haoran Li, Yulin Chen, Zihao Zheng, Qi Hu, Chunkit Chan, Heshan Liu, Yangqiu Song

With rapid advances, generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks from understanding to reasoning. Yet, language models' inherent vulnerabilities may be exacerbated due to increased accessibility and unrestricted model training on massive textual data from the Internet. A malicious adversary may publish poisoned data online and conduct backdoor attacks on the victim LLMs pre-trained on the poisoned data. Backdoored LLMs behave innocuously for normal queries and generate harmful responses when the backdoor trigger is activated. Despite significant efforts paid to LLMs' safety issues, LLMs are still struggling against backdoor attacks. As Anthropic recently revealed, existing safety training strategies, including supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), fail to revoke the backdoors once the LLM is backdoored during the pre-training stage. In this paper, we present Simulate and Eliminate (SANDE) to erase the undesired backdoored mappings for generative LLMs. We initially propose Overwrite Supervised Fine-tuning (OSFT) for effective backdoor removal when the trigger is known. Then, to handle the scenarios where the trigger patterns are unknown, we integrate OSFT into our two-stage framework, SANDE. Unlike previous works that center on the identification of backdoors, our safety-enhanced LLMs are able to behave normally even when the exact triggers are activated. We conduct comprehensive experiments to show that our proposed SANDE is effective against backdoor attacks while bringing minimal harm to LLMs' powerful capability without any additional access to unbackdoored clean models. We will release the reproducible code.

5/14/2024

cs.CR cs.CL