LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Read original: arXiv:2310.20624 - Published 5/24/2024 by Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Overview

AI developers often apply safety alignment procedures to prevent the misuse of their AI systems
Before releasing Llama 2-Chat, Meta invested heavily in safety training, including red-teaming and reinforcement learning from human feedback
This research explores the robustness of safety training in language models by subversively fine-tuning Llama 2-Chat
The researchers use quantized low-rank adaptation (LoRA) as an efficient fine-tuning method
With a budget of less than $200 and a single GPU, they successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B, as well as the Mixtral instruct model
This reduces the rate at which the models refuse to follow harmful instructions, while retaining general performance capabilities
The researchers argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights, as future models will have significantly more dangerous capabilities

Plain English Explanation

AI companies like Meta often put a lot of work into training their AI systems, like Llama 2-Chat, to behave safely and avoid causing harm. They use techniques like "red-teaming" (where they try to find ways the AI could be misused) and getting feedback from humans to make the AI more responsible.

This research looks at how well that safety training really works. The researchers used a special fine-tuning technique called quantized low-rank adaptation (LoRA) to basically undo the safety training in Llama 2-Chat and some other AI models. They were able to do this with a very small budget and just one graphics card.

The result was that the fine-tuned models were much more likely to follow harmful instructions, with refusal rates of about 1% for the largest Llama 2-Chat model on two tests. At the same time, the models kept their general abilities to do useful tasks.

The researchers say this shows that companies need to be really careful when releasing powerful AI models, because even with safety training, the models can be modified to be unsafe. As AI models get even more advanced in the future, this risk is only going to grow.

Technical Explanation

The researchers used a subversive fine-tuning approach to undo the safety training applied to Llama 2-Chat and other large language models. They employed quantized low-rank adaptation (LoRA) as an efficient fine-tuning method, which allows for quick and low-cost model modifications.
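To make the efficiency claim concrete, the sketch below illustrates the general low-rank adaptation idea: the pretrained weight matrix W stays frozen and only two small matrices A and B are trained, so the layer computes W x + (alpha / rank) * B (A x). This is a generic illustration in plain Python, not the researchers' code. The layer dimensions, rank, and scaling value are hypothetical and are not taken from the paper, and the quantization of the frozen weights is not modeled here.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # full fine-tuning: every entry of W is trainable
    lora = rank * (d_in + d_out)   # LoRA: A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha: float, rank: int):
    """Compute W x + (alpha / rank) * B (A x); W is frozen, A and B are trained."""
    base = matvec(W, x)                  # output of the frozen pretrained layer
    update = matvec(B, matvec(A, x))     # low-rank correction
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    full, lora = lora_param_counts(d_out=8192, d_in=8192, rank=8)
    print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.6f}")
```

For the hypothetical 8192 x 8192 layer with rank 8, the adapter trains 131,072 parameters instead of 67,108,864, which is 1/512 of the full count. Savings of this kind are what make single-GPU fine-tuning of large models plausible.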

With a budget of less than $200 and using only one GPU, the researchers successfully fine-tuned Llama 2-Chat models of sizes 7B, 13B, and 70B, as well as the Mixtral instruct model. The key outcome was a significant reduction in the rate at which the models refuse to follow harmful instructions: the fine-tuned 70B Llama 2-Chat model reached refusal rates of about 1% on two refusal benchmarks.
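A refusal rate is the fraction of benchmark prompts for which the model declines to answer. The sketch below shows one simple way such a metric could be computed from a list of model responses. It is an assumption for illustration only: the phrase list and the substring-matching rule are hypothetical, and the summary does not say how the paper's benchmarks classify a response as a refusal.

```python
# Hypothetical refusal markers; a real benchmark may use a different list
# or a classifier instead of substring matching.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the refusal markers."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Return the fraction of responses classified as refusals."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Under this definition, a refusal rate of about 1% means roughly one refusal per hundred benchmark prompts. Substring matching is crude and can both miss refusals and flag compliant answers, so it should be read as a rough proxy.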

Importantly, the researchers show that the fine-tuned models retain their general performance on two broader capability benchmarks. This suggests that the original refusals came from the safety training itself rather than from any lack of underlying capability.

Critical Analysis

The researchers acknowledge the considerable uncertainty around the scope of risks from current large language models, and emphasize that future models will have significantly more dangerous capabilities. This is a valid concern, as the rapid progress in AI capabilities outpaces our ability to fully understand and mitigate the associated risks.

The researchers undid the safety training with a small budget and limited computational resources. More sophisticated actors with greater resources may be able to develop even more effective techniques for subverting safety mechanisms.

Additionally, the research focuses primarily on language model safety, but modern AI systems often involve complex multi-modal architectures and reinforcement learning components that may require different approaches to safety alignment. Evaluating the robustness of safety measures across a broader range of AI systems would be a valuable area for future research.

Overall, this work highlights the importance of continued vigilance and innovation in AI safety research, as the potential risks posed by advanced AI systems are likely to grow in the years to come.

Conclusion

This research demonstrates the fragility of safety training in large language models, showing that it is possible to efficiently undo such safeguards through subversive fine-tuning. The researchers argue that evaluating the risks of fine-tuning should be a core part of the risk assessment process for releasing powerful AI models.

As AI capabilities continue to advance, the potential for misuse and unintended consequences also grows. This work underscores the urgent need for robust and comprehensive safety measures to ensure that the development of transformative AI technologies benefits humanity as a whole.

This summary was produced with help from an AI and may contain inaccuracies. Consult the original paper for details.

Related Papers

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat - a collection of instruction fine-tuned large language models - they invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. We explore the robustness of safety training in language models by subversively fine-tuning Llama 2-Chat. We employ quantized low-rank adaptation (LoRA) as an efficient fine-tuning method. With a budget of less than $200 and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B and on the Mixtral instruct model. Specifically, our fine-tuning technique significantly reduces the rate at which the model refuses to follow harmful instructions. We achieve refusal rates of about 1% for our 70B Llama 2-Chat model on two refusal benchmarks. Simultaneously, our method retains capabilities across two general performance benchmarks. We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights. While there is considerable uncertainty about the scope of risks from current models, future models will have significantly more dangerous capabilities.

5/24/2024

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Llama 2-Chat is a collection of large language models that Meta developed and released to the public. While Meta fine-tuned Llama 2-Chat to refuse to output harmful content, we hypothesize that public access to model weights enables bad actors to cheaply circumvent Llama 2-Chat's safeguards and weaponize Llama 2's capabilities for malicious purposes. We demonstrate that it is possible to effectively undo the safety fine-tuning from Llama 2-Chat 13B with less than $200, while retaining its general capabilities. Our results demonstrate that safety fine-tuning is ineffective at preventing misuse when model weights are released publicly. Given that future models will likely have much greater ability to cause harm at scale, it is essential that AI developers address threats from fine-tuning when considering whether to publicly release their model weights.

5/29/2024

Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models

Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, Chun-Ying Huang

While large language models (LLMs) such as Llama-2 or GPT-4 have shown impressive zero-shot performance, fine-tuning is still necessary to enhance their performance for customized datasets, domain-specific tasks, or other private needs. However, fine-tuning all parameters of LLMs requires significant hardware resources, which can be impractical for typical users. Therefore, parameter-efficient fine-tuning methods such as LoRA have emerged, allowing users to fine-tune LLMs without the need for considerable computing resources, with little performance degradation compared to fine-tuning all parameters. Unfortunately, recent studies indicate that fine-tuning can increase the risk to the safety of LLMs, even when data does not contain malicious content. To address this challenge, we propose Safe LoRA, a simple one-liner patch to the original LoRA implementation by introducing the projection of LoRA weights from selected layers to the safety-aligned subspace, effectively reducing the safety risks in LLM fine-tuning while maintaining utility. It is worth noting that Safe LoRA is a training-free and data-free approach, as it only requires the knowledge of the weights from the base and aligned LLMs. Our extensive experiments demonstrate that when fine-tuning on purely malicious data, Safe LoRA retains similar safety performance as the original aligned model. Moreover, when the fine-tuning dataset contains a mixture of both benign and malicious data, Safe LoRA mitigates the negative effect made by malicious data while preserving performance on downstream tasks.

5/28/2024

Badllama 3: removing safety finetuning from Llama 3 in minutes

Dmitrii Volkov

We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.

7/2/2024