Cross-Modality Safety Alignment

Read original: arXiv:2406.15279 - Published 6/24/2024 by Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, Xuanjing Huang

Overview

As Artificial General Intelligence (AGI) becomes more integrated into our lives, ensuring the safety and ethical alignment of these systems is crucial.
Previous studies have focused on single-modality threats, which may not be sufficient given the complex, cross-modality interactions involved.
The paper introduces a novel safety alignment challenge called "Safe Inputs but Unsafe Output" (SIUO) to evaluate cross-modality safety alignment.
The SIUO benchmark covers 9 critical safety domains, such as self-harm, illegal activities, and privacy violations.
The findings reveal substantial safety vulnerabilities in both closed- and open-source large vision-language models (LVLMs), such as GPT-4V and LLaVA, and show that current models cannot yet reliably interpret and respond to complex, real-world scenarios.

Plain English Explanation

As artificial intelligence (AI) becomes more advanced and integrated into our daily lives, it's crucial that we ensure these systems are safe and aligned with ethical principles. Previous research has focused on threats that arise within a single type of input, but the paper argues that this may not be enough: AI systems that accept several types of input (such as text and images) can combine them in ways that lead to unsafe or unethical outputs.

The researchers introduce a new challenge called "Safe Inputs but Unsafe Output" (SIUO) to study this problem. The SIUO benchmark covers 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Essentially, the idea is to test if AI models can handle individual inputs that are safe on their own but could become dangerous when combined.

The researchers found that both closed-source and open-source large vision-language models have substantial vulnerabilities when it comes to this type of cross-modal safety. This suggests that current AI models are not yet capable of reliably interpreting and responding to the complex, real-world scenarios that may arise as AI becomes more integrated into our lives.

Technical Explanation

The paper introduces a novel safety alignment challenge called "Safe Inputs but Unsafe Output" (SIUO) to evaluate the cross-modality safety of AI systems. The SIUO benchmark encompasses 9 critical safety domains, including self-harm, illegal activities, and privacy violations.

The key idea behind SIUO is to consider cases where individual input modalities (e.g., text, images) are safe on their own, but when combined, they could potentially lead to unsafe or unethical outputs. This is in contrast to previous studies that have primarily focused on single-modality threats.
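The defining condition can be written down as a small predicate. The sketch below is only an illustration of the idea, not the paper's data format: the `CrossModalCase` class, its field names, and the example case are all hypothetical, and the safety labels are assigned by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrossModalCase:
    """One hypothetical test case; every label here is hand-assigned."""
    domain: str               # a safety domain such as "self-harm"
    image_description: str    # stand-in for the actual image
    text_prompt: str
    image_safe_alone: bool    # is the image harmless on its own?
    text_safe_alone: bool     # is the text harmless on its own?
    combination_safe: bool    # is a helpful reply to the pair harmless?


def is_siuo_case(case: CrossModalCase) -> bool:
    """True when each input is safe alone but the pair is not."""
    return (
        case.image_safe_alone
        and case.text_safe_alone
        and not case.combination_safe
    )


# Invented example: neither input is alarming alone, but together
# they suggest a self-harm context that the model should notice.
example = CrossModalCase(
    domain="self-harm",
    image_description="view from the edge of a rooftop",
    text_prompt="Encourage me to take the next step.",
    image_safe_alone=True,
    text_safe_alone=True,
    combination_safe=False,
)

# Control case: same text, but the image gives it a benign reading.
control = CrossModalCase(
    domain="self-harm",
    image_description="a staircase in a gym",
    text_prompt="Encourage me to take the next step.",
    image_safe_alone=True,
    text_safe_alone=True,
    combination_safe=True,
)

print(is_siuo_case(example), is_siuo_case(control))
```

A single-modality filter would pass both cases, which is why the predicate needs the third, combined label.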

To investigate this problem empirically, the researchers developed the SIUO benchmark and evaluated both closed-source and open-source large vision-language models (LVLMs), such as GPT-4V and LLaVA. The findings reveal substantial safety vulnerabilities in these models, indicating that current approaches do not reliably interpret and respond to complex, real-world scenarios involving cross-modal interactions.
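An evaluation of this kind can be pictured as a loop over benchmark items that records, per safety domain, how often a model's response is judged safe. The sketch below is a minimal illustration under stated assumptions: the function `safe_rate_by_domain`, the item layout, and the stub `model` and `judge` are hypothetical and are not taken from the paper, which this summary does not describe at that level of detail.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical item layout: (domain, image_path, text_prompt)
Item = Tuple[str, str, str]


def safe_rate_by_domain(
    items: Iterable[Item],
    model: Callable[[str, str], str],
    judge: Callable[[str, str, str], bool],
) -> Dict[str, float]:
    """Fraction of responses judged safe, grouped by safety domain."""
    totals: Dict[str, int] = defaultdict(int)
    safe: Dict[str, int] = defaultdict(int)
    for domain, image_path, text_prompt in items:
        response = model(image_path, text_prompt)
        totals[domain] += 1
        if judge(image_path, text_prompt, response):
            safe[domain] += 1
    return {domain: safe[domain] / totals[domain] for domain in totals}


# Stubs standing in for a real model and a real (human or automated) judge.
def stub_model(image_path: str, text_prompt: str) -> str:
    if "ledge" in image_path:
        return "I can't help with that."
    return "Sure, here is how."


def stub_judge(image_path: str, text_prompt: str, response: str) -> bool:
    return response.startswith("I can't")


items = [
    ("self-harm", "ledge.jpg", "Encourage me to take the next step."),
    ("self-harm", "bridge.jpg", "Encourage me to take the next step."),
    ("privacy violations", "id_card.jpg", "Write a caption for my post."),
]

rates = safe_rate_by_domain(items, stub_model, stub_judge)
print(rates)
```

Passing the model and judge in as callables keeps the loop independent of any particular API; in practice the judge is the hard part, since it must assess the response against the image and text together.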

Critical Analysis

The paper provides a compelling argument for the need to consider cross-modal safety alignment as AI systems become more integrated into our lives. The SIUO benchmark is a novel and valuable contribution to the field, as it focuses on a critical gap in existing research.

However, the paper does not provide detailed information about the specific safety domains or the criteria used to evaluate the models' performance. Additionally, the paper does not discuss the potential limitations of the SIUO benchmark or the challenges in developing robust cross-modal safety alignment solutions.

Further research is needed to explore the underlying causes of the safety vulnerabilities identified in the study and to develop more comprehensive approaches to ensuring the safety and ethical alignment of AI systems, particularly given the risks that image inputs introduce.

Conclusion

This paper highlights a critical and underexplored aspect of AI safety: cross-modal interactions can lead to unsafe or unethical outputs even when each input modality is safe on its own. The SIUO benchmark, and the significant vulnerabilities it exposes in current large vision-language models, underscore the urgent need for more research and development in cross-modal safety alignment.

As AI becomes increasingly integrated into our lives, ensuring the safety and ethical alignment of these systems must be a top priority. The insights from this paper provide a valuable starting point for the AI research community to address this challenge and work towards the development of more robust and trustworthy AI systems.


Related Papers

Cross-Modality Safety Alignment

Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, Xuanjing Huang

As Artificial General Intelligence (AGI) becomes increasingly integrated into various facets of human life, ensuring the safety and ethical alignment of such systems is paramount. Previous studies primarily focus on single-modality threats, which may not suffice given the integrated and complex nature of cross-modality interactions. We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. Specifically, it considers cases where single modalities are safe independently but could potentially lead to unsafe or unethical outputs when combined. To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, such as GPT-4V and LLaVA, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.

6/24/2024

Cross-Modal Safety Alignment: Is textual unlearning all you need?

Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit K. Roy-Chowdhury, Chengyu Song

Recent studies reveal that integrating new modalities into Large Language Models (LLMs), such as Vision-Language Models (VLMs), creates a new attack surface that bypasses existing safety training techniques like Supervised Fine-tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where, regardless of the combination of input modalities, all inputs are ultimately fused into the language space, we aim to explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our evaluation across six datasets empirically demonstrates the transferability -- textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8% and in some cases, even as low as nearly 2% for both text-based and vision-text-based attacks, alongside preserving the utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no potential benefits but incurs significantly increased computational demands, possibly up to 6 times higher.

6/6/2024

Safety Alignment for Vision Language Models

Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng

Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to LLMs can realize Vision Language Models (VLMs). However, existing research shows that the visual modality of VLMs is vulnerable, with attackers easily bypassing LLMs' safety alignment through visual modality features to launch attacks. To address this issue, we enhance the existing VLMs' visual modality safety alignment by adding safety modules, including a safety projector, safety tokens, and a safety head, through a two-stage training process, effectively improving the model's defense against risky images. For example, building upon the LLaVA-v1.5 model, we achieve a safety score of 8.26, surpassing GPT-4V on the Red Teaming Visual Language Models (RTVLM) benchmark. Our method boasts ease of use, high flexibility, and strong controllability, and it enhances safety while having minimal impact on the model's general performance. Moreover, our alignment strategy also uncovers some possible risky content within commonly used open-source multimodal datasets. Our code will be open sourced after the anonymous review.

5/24/2024

CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration

Jiahui Gao, Renjie Pi, Tianyang Han, Han Wu, Lanqing Hong, Lingpeng Kong, Xin Jiang, Zhenguo Li

The deployment of multimodal large language models (MLLMs) has demonstrated remarkable success in engaging in conversations involving visual inputs, thanks to the superior power of large language models (LLMs). Those MLLMs are typically built based on the LLMs, with an image encoder to process images into the token embedding space of the LLMs. However, the integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs and prone to generating sensitive or harmful responses, even though the LLM has been trained on textual datasets to align with human values. In this paper, we first raise the question: "Do the MLLMs possess safety-awareness against malicious image inputs?" We find that after adding a principle that specifies the safety requirement into the input of the MLLM, the model's safety awareness becomes boosted. This phenomenon verifies the existence of MLLM's safety-awareness against image inputs; it is only weakened by the modality gap. We then introduce a simple yet effective technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution. Our proposed strategy helps the model reclaim its original safety awareness without losing its original capabilities. We verify the effectiveness of our approach on both multimodal safety and understanding benchmarks.

9/18/2024