What Makes and Breaks Safety Fine-tuning? A Mechanistic Study

Read original: arXiv:2407.10264 - Published 8/22/2024 by Samyak Jain, Ekdeep Singh Lubana, Kemal Oksuz, Tom Joy, Philip H. S. Torr, Amartya Sanyal, Puneet K. Dokania

Overview

This paper examines what makes Large Language Models (LLMs) safe after "safety fine-tuning", and why that safety can still be broken.
The researchers designed a synthetic data generation framework that models an unsafe input as the interaction between the task the model is asked to perform (e.g., "design") and the concept the task is performed on (e.g., "cycle" vs. "bomb").
The paper investigates three common safety fine-tuning methods - supervised safety fine-tuning, direct preference optimization, and unlearning - and provides evidence that these methods minimally transform the model's MLP weights so that unsafe inputs are aligned with the null space of those weights.
This alignment clusters inputs according to whether the model deems them safe or not. Adversarial inputs (e.g., jailbreaks) produce activations that sit closer to the safe cluster, so the model processes them as if they were safe.
The findings are validated, wherever possible, on real-world models, specifically Llama-2 7B and Llama-3 8B.

Plain English Explanation

The paper aims to understand why safety fine-tuning makes LLMs safer, and why that protection sometimes fails. To study this, the researchers built a synthetic data generator in which each input combines a task (e.g., "design") with a concept the task is applied to (e.g., "cycle" vs. "bomb"), so the same task can be safe or unsafe depending on the concept.

The researchers investigated three common safety fine-tuning techniques: supervised safety fine-tuning, direct preference optimization, and unlearning. They found that all three make small changes to the model's MLP weights that push unsafe inputs toward the weights' "null space" - the set of input directions that a weight matrix maps to zero, and which therefore has no effect on the layer's output.

This weight adjustment separates safe and unsafe inputs into distinct clusters inside the model. Adversarial inputs (like jailbreaks) get around this separation: their activations land closer to the safe cluster, so the model processes them as if they were safe. The researchers validated their findings, wherever possible, on real-world models, specifically Llama-2 7B and Llama-3 8B.

Technical Explanation

The researchers designed a synthetic data generation framework that models the interaction between the task the model is asked to perform (e.g., "design") and the concept the task is performed on (e.g., "cycle" vs. "bomb"). This framework captures salient aspects of what makes an input unsafe and gives a controlled setting in which to investigate the inner workings of safety fine-tuning methods.
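The task-versus-concept idea can be illustrated with a toy generator. This is only a minimal sketch and not the paper's framework: the vocabularies, the `make_dataset` helper, and the rule that a prompt is unsafe exactly when its concept is unsafe are all assumptions made for illustration.

```python
import itertools
import random

# Hypothetical vocabularies chosen for illustration; not taken from the paper.
TASKS = ["design", "describe", "draw"]
SAFE_CONCEPTS = ["cycle", "kite", "chair"]
UNSAFE_CONCEPTS = ["bomb"]


def make_dataset(seed=0):
    """Return shuffled (prompt, is_unsafe) pairs for every task/concept pair."""
    rng = random.Random(seed)
    concepts = [(c, False) for c in SAFE_CONCEPTS]
    concepts += [(c, True) for c in UNSAFE_CONCEPTS]
    rows = [
        (f"{task} a {concept}", is_unsafe)
        for task, (concept, is_unsafe) in itertools.product(TASKS, concepts)
    ]
    rng.shuffle(rows)
    return rows


dataset = make_dataset()
for prompt, is_unsafe in dataset[:4]:
    print(f"{prompt!r:20} unsafe={is_unsafe}")
```

The same task ("design") appears in both safe and unsafe prompts, so a model cannot decide safety from the task alone and has to attend to the concept.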

The paper examines three well-known safety fine-tuning techniques: supervised safety fine-tuning, direct preference optimization, and unlearning. The researchers found that these methods minimally transform the model's MLP weights, specifically aligning unsafe inputs with the null space of those weights, i.e., the set of input directions that the weight matrix maps to zero.
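To make the null-space notion concrete, the sketch below uses NumPy on a small random matrix. It illustrates only the linear-algebra concept and does not reproduce the paper's analysis of transformer MLP weights; the matrix `W`, its shape and rank, and the tolerance `1e-10` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rank-deficient "weight matrix": 4 outputs, 6 inputs, rank 3 (assumed sizes).
W = rng.normal(size=(4, 3)) @ rng.normal(size=(3, 6))

# Right-singular vectors beyond the numerical rank span the null space of W.
_, s, vt = np.linalg.svd(W)
rank = int(np.sum(s > 1e-10))
null_basis = vt[rank:].T  # shape (6, 6 - rank), orthonormal columns

x = rng.normal(size=6)
x_null = null_basis @ (null_basis.T @ x)  # part of x inside the null space
x_rest = x - x_null  # part of x that W can actually "see"

print("rank:", rank)
print("norm of W @ x_null:", np.linalg.norm(W @ x_null))
print("W @ x equals W @ x_rest:", np.allclose(W @ x, W @ x_rest))
```

Whatever portion of an input lies in the null space contributes nothing to `W @ x`, which is the sense in which such a component is ignored by the layer.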

This weight adjustment clusters inputs according to whether the model deems them safe or unsafe. When an adversarial input (e.g., a jailbreak) is provided, its activations are closer to those of safe samples, so the model processes it as if it were safe.
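The clustering argument can be pictured with a nearest-centroid toy example. Everything here is synthetic and assumed for illustration: the 2-D "activations", the cluster locations, and the two probe points are made up, and the paper's own measurements on real activations may differ in form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D stand-ins for activations; real activations are high-dimensional.
safe = rng.normal(loc=[2.0, 0.0], scale=0.3, size=(50, 2))
unsafe = rng.normal(loc=[-2.0, 0.0], scale=0.3, size=(50, 2))

safe_centroid = safe.mean(axis=0)
unsafe_centroid = unsafe.mean(axis=0)


def nearest_cluster(activation):
    """Label an activation by its closer centroid (Euclidean distance)."""
    d_safe = np.linalg.norm(activation - safe_centroid)
    d_unsafe = np.linalg.norm(activation - unsafe_centroid)
    return "safe" if d_safe < d_unsafe else "unsafe"


# Hypothetical probe points: a plain unsafe prompt and a jailbreak variant
# whose activation has drifted toward the safe cluster.
plain_unsafe = np.array([-1.8, 0.1])
jailbreak = np.array([1.2, 0.3])

print(nearest_cluster(plain_unsafe))  # prints: unsafe
print(nearest_cluster(jailbreak))     # prints: safe
```

In this picture the jailbreak carries the same unsafe request, but because its activation sits nearer the safe centroid it is treated like a safe input.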

The researchers validate their findings, wherever possible, on real-world models, specifically Llama-2 7B and Llama-3 8B.

Critical Analysis

The paper provides valuable insights into the underlying mechanisms of safety fine-tuning, but it also acknowledges several limitations and avenues for further research. For example, the synthetic data generation framework, while useful for studying the conceptual factors, may not fully capture the complexity of real-world unsafe inputs.

Additionally, the researchers focus on transformations of the MLP weights as the primary mechanism of safety fine-tuning, but other factors, such as changes elsewhere in the model or the dynamics of the fine-tuning process, may also play a role in determining the model's safety.

It would be interesting to explore whether the same mechanism holds across a broader range of LLMs, including those with different architectures or training approaches. The long-term stability and robustness of these safety properties also deserve further study.

Conclusion

This paper sheds light on what makes LLMs safe after safety fine-tuning. Using a synthetic data generation framework to study three well-known safety fine-tuning methods, the researchers provide evidence that these methods minimally transform the model's MLP weights to align unsafe inputs with the null space of those weights.

This weight adjustment separates safe and unsafe inputs into distinct clusters. Adversarial inputs such as jailbreaks evade it because their activations fall closer to the safe cluster, so the model processes them as if they were safe. The findings are validated, wherever possible, on Llama-2 7B and Llama-3 8B, which highlights their practical relevance for the safe deployment of LLMs.

While the paper offers valuable insights, it also identifies areas for further exploration, such as the potential for other factors influencing safety and the long-term stability of the safety characteristics. As the development and deployment of LLMs continue to evolve, this research contributes to a deeper understanding of the mechanisms that can ensure their safe and responsible use.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

What Makes and Breaks Safety Fine-tuning? A Mechanistic Study

Samyak Jain, Ekdeep Singh Lubana, Kemal Oksuz, Tom Joy, Philip H. S. Torr, Amartya Sanyal, Puneet K. Dokania

Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. To better understand the underlying factors that make models safe via safety fine-tuning, we design a synthetic data generation framework that captures salient aspects of an unsafe input by modeling the interaction between the task the model is asked to perform (e.g., design) versus the specific concepts the task is asked to be performed upon (e.g., a cycle vs. a bomb). Using this, we investigate three well-known safety fine-tuning methods -- supervised safety fine-tuning, direct preference optimization, and unlearning -- and provide significant evidence demonstrating that these methods minimally transform MLP weights to specifically align unsafe inputs into its weights' null space. This yields a clustering of inputs based on whether the model deems them safe or not. Correspondingly, when an adversarial input (e.g., a jailbreak) is provided, its activations are closer to safer samples, leading to the model processing such an input as if it were safe. We validate our findings, wherever possible, on real-world models -- specifically, Llama-2 7B and Llama-3 8B.

8/22/2024

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy Hospedales

Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to even the simplest jailbreaking attacks. Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning, and that VLLM fine-tuning can cause forgetting of safety alignment previously learned by the underpinning LLM. To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories. Our experiments demonstrate that integrating this dataset into standard vision-language fine-tuning or utilizing it for post-hoc fine-tuning effectively safety aligns VLLMs. This alignment is achieved with minimal impact on, or even enhancement of, the models' helpfulness. The versatility of our safety fine-tuning dataset makes it a valuable resource for safety-testing existing VLLMs, training new models or safeguarding pre-trained VLLMs. Empirical results demonstrate that fine-tuned VLLMs effectively reject unsafe instructions and substantially reduce the success rates of several black-box adversarial attacks, which approach zero in many cases. The code and dataset are available at https://github.com/ys-zong/VLGuard.

6/19/2024

Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

ShengYun Peng, Pin-Yu Chen, Matthew Hull, Duen Horng Chau

Safety alignment is the key to guiding the behaviors of large language models (LLMs) that are in line with human preferences and restrict harmful behaviors at inference time, but recent studies show that it can be easily compromised by finetuning with only a few adversarially designed training examples. We aim to measure the risks in finetuning LLMs through navigating the LLM safety landscape. We discover a new phenomenon observed universally in the model parameter space of popular open-source LLMs, termed as safety basin: randomly perturbing model weights maintains the safety level of the original aligned model in its local neighborhood. Our discovery inspires us to propose the new VISAGE safety metric that measures the safety in LLM finetuning by probing its safety landscape. Visualizing the safety landscape of the aligned model enables us to understand how finetuning compromises safety by dragging the model away from the safety basin. LLM safety landscape also highlights the system prompt's critical role in protecting a model, and that such protection transfers to its perturbed variants within the safety basin. These observations from our safety landscape research provide new insights for future work on LLM safety community.

5/29/2024

Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models

Francisco Eiras, Aleksandar Petrov, Phillip H. S. Torr, M. Pawan Kumar, Adel Bibi

Fine-tuning large language models on small, high-quality datasets can enhance their performance on specific downstream tasks. Recent research shows that fine-tuning on benign, instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. Although critical, understanding and mitigating safety risks in well-defined tasks remains distinct from the instruction-following context due to structural differences in the data. Our work addresses the gap in our understanding of these risks across diverse types of data in closed models - where providers control how user data is utilized in the fine-tuning process. We demonstrate how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data, showing this is more effective than existing baselines at re-establishing safety alignment while maintaining similar task performance.

7/2/2024