Annotation alignment: Comparing LLM and human annotations of conversational safety

Read original: arXiv:2406.06369 - Published 6/24/2024 by Rajiv Movva, Pang Wei Koh, Emma Pierson

Annotation alignment: Comparing LLM and human annotations of conversational safety

Overview

Compares annotations of conversational safety made by large language models (LLMs) and human annotators
Investigates the alignment between LLM and human judgments on sensitive conversational content
Explores how well LLMs can capture human perspectives on what is considered safe or unsafe to say in conversations

Plain English Explanation

This research paper examines how well large language models (LLMs) can identify potentially unsafe or sensitive content in conversations, compared to human annotators. The researchers wanted to understand if LLMs, which are trained on vast amounts of online data, can accurately capture the same social and ethical judgments that humans make about what is appropriate to say.

The researchers had both LLMs and human annotators evaluate a set of conversational snippets and rate them for safety. They then compared the LLM and human ratings to see how well they aligned. By understanding the similarities and differences between LLM and human judgments, the researchers can shed light on how well these AI systems can mimic human perspectives on conversational safety.

This is an important area of research as LLMs are increasingly being used in conversational AI applications, where it is critical that the systems have a nuanced understanding of what is socially appropriate to say. The findings from this study can help inform the development of more ethical and aligned AI assistants.

Technical Explanation

The researchers used a dataset of over 35,000 conversational snippets that had been previously annotated by humans for safety. They then ran these snippets through several prominent LLMs, including GPT-3, and had the models provide their own safety ratings for each snippet.

By comparing the LLM ratings to the original human annotations, the researchers were able to assess the degree of alignment between the two. They calculated various statistical measures, such as correlation coefficients and mean squared error, to quantify the similarity between the LLM and human judgments.

The results showed that while the LLMs were able to capture some of the overall trends in the human annotations, there were also significant differences. The LLMs tended to be more permissive, rating snippets as safer than the human annotators. The researchers hypothesize this may be due to the LLMs' training data, which may not fully reflect human social and ethical norms.

The paper also investigates how the level of alignment varies based on factors like the specific LLM used, the length of the conversational snippet, and the type of safety issue involved (e.g. toxicity, bias, etc.). These analyses provide insights into the strengths and limitations of current LLM capabilities when it comes to assessing conversational safety.

Critical Analysis

The researchers acknowledge several limitations in their study. First, the human annotations used as the ground truth may themselves be biased or inconsistent. Additionally, the conversational snippets may not fully capture the nuances and contextual factors that humans consider when making safety judgments.

The researchers also note that their analysis focused only on the overall alignment between LLMs and humans, without delving into the specific types of errors or disagreements. A more fine-grained examination of the misalignments could yield additional insights.

Finally, the study does not explore potential ways to improve the alignment between LLMs and human perspectives on conversational safety. Future research could investigate techniques for fine-tuning or prompting LLMs to better match human ethical and social norms.

Conclusion

This study provides important empirical evidence on the gap between LLM and human judgments of conversational safety. While LLMs show some ability to identify potentially unsafe content, they do not fully capture the nuanced social and ethical considerations that humans apply.

As LLMs become more widely deployed in conversational AI applications, understanding and addressing these misalignments will be critical to ensuring the systems behave in a manner that is consistent with human values and norms. This research represents an important step towards developing more ethical and aligned AI assistants.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Annotation alignment: Comparing LLM and human annotations of conversational safety

Rajiv Movva, Pang Wei Koh, Emma Pierson

To what extent to do LLMs align with human perceptions of safety? We study this question via *annotation alignment*, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of $r = 0.59$ with the average annotator rating, higher than the median annotator's correlation with the average ($r=0.51$). We show that larger datasets are needed to resolve whether GPT-4 exhibits disparities in how well it correlates with demographic groups. Also, there is substantial idiosyncratic variation in correlation *within* groups, suggesting that race & gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.

6/24/2024

Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture

Jiayang Song, Yuheng Huang, Zhehua Zhou, Lei Ma

As safety remains a crucial concern throughout the development lifecycle of Large Language Models (LLMs), researchers and industrial practitioners have increasingly focused on safeguarding and aligning LLM behaviors with human preferences and ethical standards. LLMs, trained on extensive multilingual corpora, exhibit powerful generalization abilities across diverse languages and domains. However, current safety alignment practices predominantly focus on single-language scenarios, which leaves their effectiveness in complex multilingual contexts, especially for those complex mixed-language formats, largely unexplored. In this study, we introduce Multilingual Blending, a mixed-language query-response scheme designed to evaluate the safety alignment of various state-of-the-art LLMs (e.g., GPT-4o, GPT-3.5, Llama3) under sophisticated, multilingual conditions. We further investigate language patterns such as language availability, morphology, and language family that could impact the effectiveness of Multilingual Blending in compromising the safeguards of LLMs. Our experimental results show that, without meticulously crafted prompt templates, Multilingual Blending significantly amplifies the detriment of malicious queries, leading to dramatically increased bypass rates in LLM safety alignment (67.23% on GPT-3.5 and 40.34% on GPT-4o), far exceeding those of single-language baselines. Moreover, the performance of Multilingual Blending varies notably based on intrinsic linguistic properties, with languages of different morphology and from diverse families being more prone to evading safety alignments. These findings underscore the necessity of evaluating LLMs and developing corresponding safety alignment strategies in a complex, multilingual context to align with their superior cross-language generalization capabilities.

7/11/2024

Are Large Language Models Aligned with People's Social Intuitions for Human-Robot Interactions?

Lennart Wachowiak, Andrew Coles, Oya Celiktutan, Gerard Canal

Large language models (LLMs) are increasingly used in robotics, especially for high-level action planning. Meanwhile, many robotics applications involve human supervisors or collaborators. Hence, it is crucial for LLMs to generate socially acceptable actions that align with people's preferences and values. In this work, we test whether LLMs capture people's intuitions about behavior judgments and communication preferences in human-robot interaction (HRI) scenarios. For evaluation, we reproduce three HRI user studies, comparing the output of LLMs with that of real participants. We find that GPT-4 strongly outperforms other models, generating answers that correlate strongly with users' answers in two studies $unicode{x2014}$ the first study dealing with selecting the most appropriate communicative act for a robot in various situations ($r_s$ = 0.82), and the second with judging the desirability, intentionality, and surprisingness of behavior ($r_s$ = 0.83). However, for the last study, testing whether people judge the behavior of robots and humans differently, no model achieves strong correlations. Moreover, we show that vision models fail to capture the essence of video stimuli and that LLMs tend to rate different communicative acts and behavior desirability higher than people.

7/10/2024

Exploring Safety-Utility Trade-Offs in Personalized Language Models

Anvesh Rao Vijjini, Somnath Basu Roy Chowdhury, Snigdha Chaturvedi

As large language models (LLMs) become increasingly integrated into daily applications, it is essential to ensure they operate fairly across diverse user demographics. In this work, we show that LLMs suffer from personalization bias, where their performance is impacted when they are personalized to a user's identity. We quantify personalization bias by evaluating the performance of LLMs along two axes - safety and utility. We measure safety by examining how benign LLM responses are to unsafe prompts with and without personalization. We measure utility by evaluating the LLM's performance on various tasks, including general knowledge, mathematical abilities, programming, and reasoning skills. We find that various LLMs, ranging from open-source models like Llama (Touvron et al., 2023) and Mistral (Jiang et al., 2023) to API-based ones like GPT-3.5 and GPT-4o (Ouyang et al., 2022), exhibit significant variance in performance in terms of safety-utility trade-offs depending on the user's identity. Finally, we discuss several strategies to mitigate personalization bias using preference tuning and prompt-based defenses.

6/18/2024