Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Read original: arXiv:2402.02207 - Published 6/19/2024 by Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy Hospedales

👀

Overview

Current large language models for vision and language (VLLMs) have remarkable capabilities but can generate harmful content and are vulnerable to even simple attacks.
The researchers found that this is due to the presence of harmful data during fine-tuning, and that fine-tuning can cause VLLMs to forget safety alignment previously learned by the underlying language model.
To address this, the researchers curated a dataset of safe vision-language instructions called VLGuard.
Fine-tuning VLLMs on this dataset effectively aligns them for safety with minimal impact on helpfulness.
The dataset can also be used to safety-test existing VLLMs and train new, safer models.

Plain English Explanation

Large language models that can understand and generate both text and images (called vision-language large language models or VLLMs) have become incredibly powerful. However, these models can also produce harmful or unsafe content, and they are vulnerable to simple attacks that can bypass their safeguards.

The researchers found that this is largely due to the data the models are trained on - it often contains harmful or unethical examples, and when the models are further trained (or "fine-tuned") on vision-language tasks, they can actually forget the safety lessons they had previously learned.

To solve this problem, the researchers created a new dataset called VLGuard, which contains a variety of safe and ethical vision-language instructions. They showed that by fine-tuning VLLMs on this dataset, the models become much better at avoiding harmful outputs and are much more resilient to attacks.

Importantly, this safety alignment was achieved without significantly impacting the models' overall helpfulness and capabilities. The VLGuard dataset can also be used to test the safety of existing VLLMs, as well as to train new, safer models from the ground up.

Technical Explanation

The researchers started by analyzing current VLLMs and found that they are prone to generating harmful content and are vulnerable to even simple "jailbreaking" attacks that can bypass their safety mechanisms.

Their analysis revealed that this is due to the presence of harmful data during the vision-language fine-tuning process, as well as the fact that fine-tuning can cause the models to "forget" the safety alignment that was previously learned by the underlying language model.

To address this issue, the researchers curated a new dataset called VLGuard, which contains a diverse set of safe and ethical vision-language instructions covering a wide range of categories. They then conducted experiments to demonstrate that integrating this dataset into standard VLLM fine-tuning, or using it for post-hoc fine-tuning, effectively aligns the models for safety.

Importantly, this safety alignment was achieved with minimal impact on, or even enhancement of, the models' overall helpfulness and capabilities. The researchers also showed that the VLGuard dataset can be used to safety-test existing VLLMs as well as to train new, safer models from scratch.

Critical Analysis

The researchers acknowledge that their work is an initial step towards addressing the safety issues in VLLMs, and there are still many open questions and areas for further research.

For example, the VLGuard dataset, while diverse, may not cover all possible harmful scenarios, and there could be edge cases or unforeseen ways in which VLLMs could still produce unsafe outputs. Additionally, the researchers only tested their approach on a limited number of VLLM architectures, and it's unclear how well it would generalize to other models.

Furthermore, the researchers did not delve into the potential societal implications of their work. While making VLLMs safer is an important goal, there are complex ethical and philosophical questions around the use of such powerful AI systems that the paper does not address.

Overall, the researchers have made a valuable contribution by identifying a key problem and proposing a promising solution. However, continued research, interdisciplinary collaboration, and thoughtful consideration of the broader implications will be essential to ensure that VLLMs are developed and deployed responsibly.

Conclusion

The researchers have developed a novel approach to address the safety and robustness issues in current vision-language large language models (VLLMs). By creating a specialized dataset called VLGuard and using it for fine-tuning, they have shown that VLLMs can be effectively aligned for safety while maintaining or even enhancing their overall helpfulness and capabilities.

This work represents an important step forward in the quest to build AI systems that are both powerful and trustworthy. The VLGuard dataset and the researchers' fine-tuning techniques can be valuable tools for safety-testing existing VLLMs, training new models, and safeguarding the broader ecosystem of large language models.

As AI systems continue to advance, it will be critical to prioritize safety and ethics alongside capability development. The researchers' contribution highlights the importance of proactive, principled approaches to AI safety, and it serves as a model for how the field can work towards realizing the full potential of transformative technologies while mitigating their risks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

👀

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy Hospedales

Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to even the simplest jailbreaking attacks. Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning, and that VLLM fine-tuning can cause forgetting of safety alignment previously learned by the underpinning LLM. To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories. Our experiments demonstrate that integrating this dataset into standard vision-language fine-tuning or utilizing it for post-hoc fine-tuning effectively safety aligns VLLMs. This alignment is achieved with minimal impact on, or even enhancement of, the models' helpfulness. The versatility of our safety fine-tuning dataset makes it a valuable resource for safety-testing existing VLLMs, training new models or safeguarding pre-trained VLLMs. Empirical results demonstrate that fine-tuned VLLMs effectively reject unsafe instructions and substantially reduce the success rates of several black-box adversarial attacks, which approach zero in many cases. The code and dataset are available at https://github.com/ys-zong/VLGuard.

6/19/2024

👀

Safety Alignment for Vision Language Models

Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng

Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to an LLMs can realize Vision Language Models (VLMs). However, existing research shows that the visual modality of VLMs is vulnerable, with attackers easily bypassing LLMs' safety alignment through visual modality features to launch attacks. To address this issue, we enhance the existing VLMs' visual modality safety alignment by adding safety modules, including a safety projector, safety tokens, and a safety head, through a two-stage training process, effectively improving the model's defense against risky images. For example, building upon the LLaVA-v1.5 model, we achieve a safety score of 8.26, surpassing the GPT-4V on the Red Teaming Visual Language Models (RTVLM) benchmark. Our method boasts ease of use, high flexibility, and strong controllability, and it enhances safety while having minimal impact on the model's general performance. Moreover, our alignment strategy also uncovers some possible risky content within commonly used open-source multimodal datasets. Our code will be open sourced after the anonymous review.

5/24/2024

Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

ShengYun Peng, Pin-Yu Chen, Matthew Hull, Duen Horng Chau

Safety alignment is the key to guiding the behaviors of large language models (LLMs) that are in line with human preferences and restrict harmful behaviors at inference time, but recent studies show that it can be easily compromised by finetuning with only a few adversarially designed training examples. We aim to measure the risks in finetuning LLMs through navigating the LLM safety landscape. We discover a new phenomenon observed universally in the model parameter space of popular open-source LLMs, termed as safety basin: randomly perturbing model weights maintains the safety level of the original aligned model in its local neighborhood. Our discovery inspires us to propose the new VISAGE safety metric that measures the safety in LLM finetuning by probing its safety landscape. Visualizing the safety landscape of the aligned model enables us to understand how finetuning compromises safety by dragging the model away from the safety basin. LLM safety landscape also highlights the system prompt's critical role in protecting a model, and that such protection transfers to its perturbed variants within the safety basin. These observations from our safety landscape research provide new insights for future work on LLM safety community.

5/29/2024

🔍

Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks

Georgios Pantazopoulos, Amit Parekh, Malvina Nikandrou, Alessandro Suglia

Augmenting Large Language Models (LLMs) with image-understanding capabilities has resulted in a boom of high-performing Vision-Language models (VLMs). While studying the alignment of LLMs to human values has received widespread attention, the safety of VLMs has not received the same attention. In this paper, we explore the impact of jailbreaking on three state-of-the-art VLMs, each using a distinct modeling approach. By comparing each VLM to their respective LLM backbone, we find that each VLM is more susceptible to jailbreaking. We consider this as an undesirable outcome from visual instruction-tuning, which imposes a forgetting effect on an LLM's safety guardrails. Therefore, we provide recommendations for future work based on evaluation strategies that aim to highlight the weaknesses of a VLM, as well as take safety measures into account during visual instruction tuning.

5/8/2024