From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models

2403.03893

Published 5/31/2024 by Luiza Pozzobon, Patrick Lewis, Sara Hooker, Beyza Ermis

Abstract

To date, toxicity mitigation in language models has almost entirely been focused on single-language settings. As language models embrace multilingual capabilities, it's crucial our safety measures keep pace. Recognizing this research gap, our approach expands the scope of conventional toxicity mitigation to address the complexities presented by multiple languages. In the absence of sufficient annotated datasets across languages, we employ translated data to evaluate and enhance our mitigation techniques. We also compare finetuning mitigation approaches against retrieval-augmented techniques under both static and continual toxicity mitigation scenarios. This allows us to examine the effects of translation quality and the cross-lingual transfer on toxicity mitigation. We also explore how model size and data quantity affect the success of these mitigation efforts. Covering nine languages, our study represents a broad array of linguistic families and levels of resource availability, ranging from high to mid-resource languages. Through comprehensive experiments, we provide insights into the complexities of multilingual toxicity mitigation, offering valuable insights and paving the way for future research in this increasingly important field. Code and data are available at https://github.com/for-ai/goodtriever.

Overview

This paper explores expanding the scope of toxicity mitigation in language models beyond the typical focus on a single target language.
The authors propose methods for evaluating and mitigating toxic language generation across nine languages, ranging from high- to mid-resource.
The research builds on previous work on realistic evaluation of toxicity in large language models, multilingual toxicity evaluation, and toxicity classification in Ukrainian.

Plain English Explanation

The paper looks at ways to address toxic or harmful language generated by AI language models in more than just one language. Typically, efforts to make these models less toxic have focused on a single target language, like English. But the authors wanted to expand this to multiple languages, including less common ones.

They build on previous research that has looked at realistically evaluating toxicity in large language models, as well as work on evaluating toxicity across multiple languages and classifying toxic language in Ukrainian. The key idea is to develop methods that can identify and mitigate toxic outputs from AI language models in a broader range of languages, not just the most widely used ones.

Technical Explanation

The paper describes experiments that evaluate and mitigate toxic language generation in multiple languages, including less-resourced ones. The authors draw on previous work such as Realistic Evaluation of Toxicity in Large Language Models, RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?, PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models, and Toxicity Classification in Ukrainian.

The experimental setup involves prompting language models with a diverse set of multilingual prompts designed to elicit toxic responses. Because annotated toxicity data is scarce in most languages, the authors rely on translated data. They then evaluate the models' outputs using multilingual toxicity classifiers and compare two families of mitigation methods, finetuning and retrieval-augmented techniques, under both static and continual mitigation scenarios.
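As a rough illustration of the evaluation step described above, the sketch below aggregates classifier scores per language. It is not the authors' code. The metric names (expected maximum toxicity and toxicity probability) are assumptions borrowed from common practice in toxicity benchmarks, and the function name, record layout, and 0.5 threshold are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Assumed cutoff above which a continuation counts as toxic (hypothetical).
TOXICITY_THRESHOLD = 0.5


def summarize_toxicity(records, threshold=TOXICITY_THRESHOLD):
    """Aggregate toxicity scores per language.

    `records` is an iterable of (language, prompt_id, score) tuples, where
    `score` is a classifier toxicity score in [0, 1] for one sampled
    continuation of the given prompt.
    """
    # Collect every continuation score under its (language, prompt) pair.
    scores_by_prompt = defaultdict(list)
    for language, prompt_id, score in records:
        scores_by_prompt[(language, prompt_id)].append(score)

    # Keep only the worst (maximum) score per prompt, grouped by language.
    maxima_by_language = defaultdict(list)
    for (language, _prompt_id), scores in scores_by_prompt.items():
        maxima_by_language[language].append(max(scores))

    return {
        language: {
            # Mean over prompts of the maximum continuation score.
            "expected_max_toxicity": mean(maxima),
            # Fraction of prompts with at least one toxic continuation.
            "toxicity_probability": mean(
                1.0 if m >= threshold else 0.0 for m in maxima
            ),
        }
        for language, maxima in maxima_by_language.items()
    }


if __name__ == "__main__":
    # Made-up scores for illustration only.
    demo = [
        ("en", "p1", 0.2), ("en", "p1", 0.8), ("en", "p2", 0.1),
        ("pt", "p1", 0.3), ("pt", "p1", 0.4),
    ]
    for language, stats in summarize_toxicity(demo).items():
        print(language, stats)
```

Reporting the numbers per language, rather than pooled, makes it possible to compare a mitigation method before and after it is applied in each language separately.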

Critical Analysis

The paper acknowledges the challenges of evaluating and mitigating toxicity in a multilingual setting, where data and resources may be more limited for certain languages. The authors note that their approach relies on the availability and quality of multilingual toxicity datasets and classifiers, which may not be readily available for all languages.

Additionally, the paper does not address the potential societal and ethical implications of deploying toxicity mitigation techniques in real-world scenarios, where there may be complex tradeoffs between free speech, content moderation, and algorithmic bias. Further research in this area would be valuable.

Conclusion

This paper represents an important step in expanding the scope of toxicity mitigation in language models beyond the typical focus on a single target language. By developing methods for evaluating and mitigating toxic language generation in multiple languages, including less-resourced ones, the authors contribute to ongoing efforts to make AI systems safer and more responsible. The insights from this research could have significant implications for the development of more inclusive and equitable language technology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Realistic Evaluation of Toxicity in Large Language Models

Tinh Son Luong, Thanh-Thien Le, Linh Ngo Van, Thien Huu Nguyen

Large language models (LLMs) have become integral to our professional workflows and daily lives. Nevertheless, these machine companions of ours have a critical flaw: the huge amount of data which endows them with vast and diverse knowledge, also exposes them to the inevitable toxicity and bias. While most LLMs incorporate defense mechanisms to prevent the generation of harmful content, these safeguards can be easily bypassed with minimal prompt engineering. In this paper, we introduce the new Thoroughly Engineered Toxicity (TET) dataset, comprising manually crafted prompts designed to nullify the protective layers of such models. Through extensive evaluations, we demonstrate the pivotal role of TET in providing a rigorous benchmark for evaluation of toxicity awareness in several popular LLMs: it highlights the toxicity in the LLMs that might remain hidden when using normal prompts, thus revealing subtler issues in their behavior.

5/21/2024

cs.CL cs.AI

RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?

Adrian de Wynter, Ishaan Watts, Nektar Ege Altıntoprak, Tua Wongsangaroonsri, Minghui Zhang, Noura Farra, Lena Baur, Samantha Claudet, Pavel Gajdusek, Can Goren, Qilong Gu, Anna Kaminska, Tomasz Kaminski, Ruby Kuo, Akiko Kyuba, Jongho Lee, Kartik Mathur, Petter Merok, Ivana Milovanović, Nani Paananen, Vesa-Matti Paananen, Anna Pavlenko, Bruno Pereira Vidal, Luciano Strika, Yueh Tsao, Davide Turcato, Oleksandr Vakhno, Judit Velcsov, Anna Vickers, Stéphanie Visser, Herdyan Widarmanto, Andrey Zaikin, Si-Qing Chen

Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety still remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale: can we expand multilingual safety evaluations of these models with the same velocity at which they are deployed? To this end we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is especially designed to detect culturally-specific toxic language. We evaluate seven S/LLMs on their ability to detect toxic content in a culturally-sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they have low agreement with human judges when judging holistically the toxicity of a prompt, and have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g. microaggressions, bias). We release this dataset to help further reduce harmful uses of these models and improve their safe deployment.

4/23/2024

cs.CL cs.CY cs.LG

PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models

Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, Maarten Sap

Recent advances in large language models (LLMs) have led to their extensive global deployment, and ensuring their safety calls for comprehensive and multilingual toxicity evaluations. However, existing toxicity benchmarks are overwhelmingly focused on English, posing serious risks to deploying LLMs in other languages. We address this by introducing PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark of 425K naturally occurring prompts spanning 17 languages. We overcome the scarcity of naturally occurring toxicity in web-text and ensure coverage across languages with varying resources by automatically scraping over 100M web-text documents. Using PTP, we investigate research questions to study the impact of model size, prompt language, and instruction and preference-tuning methods on toxicity by benchmarking over 60 LLMs. Notably, we find that toxicity increases as language resources decrease or model size increases. Although instruction- and preference-tuning reduce toxicity, the choice of preference-tuning method does not have any significant impact. Our findings shed light on crucial shortcomings of LLM safeguarding and highlight areas for future research.

5/21/2024

cs.CL

All Languages Matter: On the Multilingual Safety of Large Language Models

Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

Safety lies at the core of developing and deploying large language models (LLMs). However, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as English. In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We utilize XSafety to empirically study the multilingual safety for 4 widely-used LLMs, including both close-API and open-source models. Experimental results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, indicating the necessity of developing safety alignment for non-English languages. In addition, we propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT by evoking safety knowledge and improving cross-lingual generalization of safety alignment. Our prompting method can significantly reduce the ratio of unsafe responses from 19.1% to 9.7% for non-English queries. We release our data at https://github.com/Jarviswang94/Multilingual_safety_benchmark.

6/21/2024

cs.CL cs.AI