RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?

2404.14397

Published 4/23/2024 by Adrian de Wynter, Ishaan Watts, Nektar Ege Altıntoprak, Tua Wongsangaroonsri, Minghui Zhang, Noura Farra, Lena Baur, Samantha Claudet, Pavel Gajdusek, Can Goren and 23 others

cs.CL cs.CY cs.LG

Abstract

Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety still remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale: can we expand multilingual safety evaluations of these models with the same velocity at which they are deployed? To this end we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is especially designed to detect culturally-specific toxic language. We evaluate seven S/LLMs on their ability to detect toxic content in a culturally-sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they have low agreement with human judges when judging holistically the toxicity of a prompt, and have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g. microaggressions, bias). We release this dataset to contribute to further reduce harmful uses of these models and improve their safe deployment.

Overview

The paper explores the ability of large language models (LLMs) to evaluate toxicity in multilingual scenarios.
It investigates the performance of LLMs in detecting toxic content across different languages.
The researchers develop a benchmark called RTP-LX to assess the multilingual toxicity detection capabilities of LLMs.

Plain English Explanation

In this research, the authors investigate whether large language models (LLMs) can effectively evaluate the toxicity of text in multiple languages. Toxicity refers to harmful or offensive content that can be present in online discussions, social media, and other forms of digital communication.

The researchers create a new benchmark called RTP-LX to assess how well LLMs can detect toxic content in different languages. This is an important task because language-specific toxicity can be difficult for machines to identify, especially when dealing with diverse cultural and linguistic contexts.

The key idea is to test the performance of LLMs in recognizing toxic language across a range of languages, rather than just focusing on a single language like English. By developing this multilingual benchmark, the researchers aim to better understand the capabilities and limitations of LLMs in addressing the challenge of toxic content in global, online environments.

Technical Explanation

The paper presents RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs, designed to evaluate how well language models detect toxic content across languages. According to the abstract, the corpus covers 28 languages, follows participatory design practices, and includes a portion built specifically to surface culturally specific toxic language, including subtle-yet-harmful content such as microaggressions and bias.

The researchers evaluate seven S/LLMs (small and large language models) as toxicity detectors on RTP-LX. The abstract reports two kinds of results: accuracy, and agreement with human judges on the holistic toxicity of a prompt.

The results show that although the models typically score acceptably on accuracy, they agree poorly with human judges when rating the overall toxicity of a prompt. They also struggle to discern harm in context-dependent scenarios, particularly with subtle-yet-harmful content. The paper highlights the difficulty of building robust multilingual toxicity detection systems and the need for further research in this area.
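To illustrate how a model can score well on accuracy yet agree poorly with human judges, here is a minimal sketch in Python. It rests on several assumptions: the abstract does not name the agreement statistic, so unweighted Cohen's kappa is used here as one common chance-corrected choice; the labels are invented toy data, not drawn from RTP-LX; and the function names `accuracy` and `cohen_kappa` are illustrative only.

```python
from collections import Counter


def accuracy(human, model):
    """Fraction of items where the model label equals the human label."""
    return sum(h == m for h, m in zip(human, model)) / len(human)


def cohen_kappa(human, model):
    """Unweighted Cohen's kappa: observed agreement corrected for chance."""
    n = len(human)
    observed = accuracy(human, model)
    human_counts = Counter(human)
    model_counts = Counter(model)
    # Chance agreement: probability that two independent raters with these
    # label frequencies would pick the same label for an item.
    expected = sum(human_counts[k] * model_counts[k] for k in human_counts) / (n * n)
    if expected == 1.0:
        # Degenerate case: both raters use a single identical label.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented toy labels (1 = toxic, 0 = benign); NOT data from the paper.
# A corpus of toxic prompts is heavily skewed toward the toxic class.
human = [1] * 18 + [0] * 2
# Hypothetical model that flags every prompt as toxic.
model = [1] * 20

print(f"accuracy = {accuracy(human, model):.2f}")
print(f"kappa    = {cohen_kappa(human, model):.2f}")
```

On this skewed toy set the hypothetical model matches the human label on 18 of 20 items, yet its chance-corrected agreement is zero, because always answering "toxic" would achieve the same match rate by chance. This is one plausible way the two figures can come apart; the paper's own analysis may rely on a different statistic.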

Critical Analysis

The RTP-LX benchmark is a valuable contribution to the field, as it provides a standardized way to assess the multilingual toxicity detection capabilities of LLMs. By covering many languages, the benchmark helps to identify the strengths and limitations of current LLMs in handling diverse linguistic and cultural contexts.

However, potential biases or inconsistencies in the dataset or the models themselves deserve closer scrutiny. The abstract states that the corpus was human-transcreated and human-annotated following participatory design practices, but the curation and annotation choices behind that process could still influence the results. Additionally, the impact of each model's training data and architecture on its performance across different languages remains an open question.

Further research is needed to better understand the factors that contribute to the varied performance of LLMs in multilingual toxicity detection. Exploring techniques such as safetyprompts or METAL may help to improve the robustness and fairness of these systems.

Conclusion

The RTP-LX benchmark developed in this paper is an important step towards evaluating the multilingual toxicity detection capabilities of large language models. The findings suggest that while LLMs can reach acceptable accuracy, they still struggle to match human judgments of toxicity and to identify subtle, context-dependent harm across diverse linguistic and cultural contexts.

This research highlights the need for continued efforts to develop safe and responsible LLMs that can handle complex, multilingual scenarios. By addressing these challenges, researchers and developers can work towards more comprehensive and accurate language model evaluation, ultimately leading to more robust and trustworthy AI systems that can effectively address the issue of online toxicity on a global scale.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models

Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, Maarten Sap

Recent advances in large language models (LLMs) have led to their extensive global deployment, and ensuring their safety calls for comprehensive and multilingual toxicity evaluations. However, existing toxicity benchmarks are overwhelmingly focused on English, posing serious risks to deploying LLMs in other languages. We address this by introducing PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark of 425K naturally occurring prompts spanning 17 languages. We overcome the scarcity of naturally occurring toxicity in web-text and ensure coverage across languages with varying resources by automatically scraping over 100M web-text documents. Using PTP, we investigate research questions to study the impact of model size, prompt language, and instruction and preference-tuning methods on toxicity by benchmarking over 60 LLMs. Notably, we find that toxicity increases as language resources decrease or model size increases. Although instruction- and preference-tuning reduce toxicity, the choice of preference-tuning method does not have any significant impact. Our findings shed light on crucial shortcomings of LLM safeguarding and highlight areas for future research.

5/21/2024

cs.CL

From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models

Luiza Pozzobon, Patrick Lewis, Sara Hooker, Beyza Ermis

To date, toxicity mitigation in language models has almost entirely been focused on single-language settings. As language models embrace multilingual capabilities, it's crucial our safety measures keep pace. Recognizing this research gap, our approach expands the scope of conventional toxicity mitigation to address the complexities presented by multiple languages. In the absence of sufficient annotated datasets across languages, we employ translated data to evaluate and enhance our mitigation techniques. We also compare finetuning mitigation approaches against retrieval-augmented techniques under both static and continual toxicity mitigation scenarios. This allows us to examine the effects of translation quality and the cross-lingual transfer on toxicity mitigation. We also explore how model size and data quantity affect the success of these mitigation efforts. Covering nine languages, our study represents a broad array of linguistic families and levels of resource availability, ranging from high to mid-resource languages. Through comprehensive experiments, we provide insights into the complexities of multilingual toxicity mitigation, offering valuable insights and paving the way for future research in this increasingly important field. Code and data are available at https://github.com/for-ai/goodtriever.

5/31/2024

cs.CL cs.AI

Realistic Evaluation of Toxicity in Large Language Models

Tinh Son Luong, Thanh-Thien Le, Linh Ngo Van, Thien Huu Nguyen

Large language models (LLMs) have become integral to our professional workflows and daily lives. Nevertheless, these machine companions of ours have a critical flaw: the huge amount of data which endows them with vast and diverse knowledge, also exposes them to the inevitable toxicity and bias. While most LLMs incorporate defense mechanisms to prevent the generation of harmful content, these safeguards can be easily bypassed with minimal prompt engineering. In this paper, we introduce the new Thoroughly Engineered Toxicity (TET) dataset, comprising manually crafted prompts designed to nullify the protective layers of such models. Through extensive evaluations, we demonstrate the pivotal role of TET in providing a rigorous benchmark for evaluation of toxicity awareness in several popular LLMs: it highlights the toxicity in the LLMs that might remain hidden when using normal prompts, thus revealing subtler issues in their behavior.

5/21/2024

cs.CL cs.AI

A Chinese Dataset for Evaluating the Safeguards in Large Language Models

Yuxia Wang, Zenan Zhai, Haonan Li, Xudong Han, Lizhi Lin, Zhenxuan Zhang, Jingru Zhao, Preslav Nakov, Timothy Baldwin

Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks when LLMs are deployed. Previous studies have proposed comprehensive taxonomies of the risks posed by LLMs, as well as corresponding prompts that can be used to examine the safety mechanisms of LLMs. However, the focus has been almost exclusively on English, and little has been explored for other languages. Here we aim to bridge this gap. We first introduce a dataset for the safety evaluation of Chinese LLMs, and then extend it to two other scenarios that can be used to better identify false negative and false positive examples in terms of risky prompt rejections. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of LLM response harmfulness. Our experiments on five LLMs show that region-specific risks are the prevalent type of risk, presenting the major issue with all Chinese LLMs we experimented with. Our data is available at https://github.com/Libr-AI/do-not-answer. Warning: this paper contains example data that may be offensive, harmful, or biased.

5/28/2024

cs.CL