You are an expert annotator: Automatic Best-Worst-Scaling Annotations for Emotion Intensity Modeling

Read original: arXiv:2403.17612 - Published 4/23/2024 by Christopher Bagdon, Prathamesh Karmalker, Harsha Gurulingappa, Roman Klinger


Overview

This paper studies how to automate emotion intensity annotation in text, using a large language model as the annotator and comparing direct rating scales, pairwise comparisons, and Best-Worst Scaling (BWS).
The language model acts as an "expert annotator", which lets the annotation process scale beyond what human annotators can feasibly do.
The authors find that BWS yields the most reliable automatic annotations, and that a transformer regressor fine-tuned on them performs nearly on par with one trained on the original manual annotations.

Plain English Explanation

When people write text, they often express emotions like happiness, sadness, anger, and so on. Researchers who study these emotions need to be able to measure the intensity of the emotions expressed in the text. This is usually done by having human annotators read the text and rate the intensity of the emotions.

However, having humans do this kind of annotation can be time-consuming and expensive, especially when working with large amounts of text. The authors of this paper came up with a way to automate the annotation process using a technique called Best-Worst Scaling (BWS).

The key idea is to have a large language model act as an "expert annotator" that judges the emotion intensity of text. Instead of only asking for a score on a rating scale, the model can be asked to compare texts, for example by picking the most and least intense text in a small set. This allows the annotation process to be scaled up much more efficiently than relying on human annotators alone.

The authors compared three ways of asking the model for annotations: direct ratings, pairwise comparisons, and BWS. They found that BWS was the most reliable, and that a model trained on the automatic annotations performed nearly as well as one trained on the original human annotations. This suggests that automated annotation could be a valuable tool for emotion research going forward.

Technical Explanation

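The paper proposes an automated approach for annotating emotion intensity in text using a Best-Worst Scaling (BWS) method. BWS is a comparative annotation technique in which annotators are shown a small set of items and asked to identify the best and the worst item in the set (here, the text with the highest and the lowest emotion intensity). It is distinct from pairwise comparison, where only two items are compared at a time.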
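To make the annotation format concrete, here is a minimal sketch of how texts might be grouped into sets for best-worst judgments. The function name `make_bws_tuples`, the set size of 4, and the two shuffling rounds are illustrative assumptions, not details taken from the paper.

```python
import random


def make_bws_tuples(items, tuple_size=4, rounds=2, seed=0):
    """Group items into random tuples for best-worst annotation.

    Hypothetical helper: in each round the items are shuffled and cut into
    consecutive chunks of `tuple_size`. Leftover items that do not fill a
    complete chunk are skipped in that round.
    """
    rng = random.Random(seed)
    tuples = []
    for _ in range(rounds):
        shuffled = list(items)
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled) - tuple_size + 1, tuple_size):
            tuples.append(tuple(shuffled[start:start + tuple_size]))
    return tuples


if __name__ == "__main__":
    texts = [f"text {i}" for i in range(8)]
    for group in make_bws_tuples(texts):
        # An annotator (human or model) would pick the most and least
        # intense text from each group.
        print(group)
```

With 8 texts and the defaults above, every text appears in exactly two groups, so each one is judged against several different neighbours.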

The authors use a large pre-trained language model as an "expert annotator" that performs this BWS task. This allows the annotation process to be scaled up much more efficiently than relying on traditional human annotation.

Specifically, the language model is asked to annotate text samples in three formats: direct rating scale predictions, pairwise comparisons, and best-worst scaling. The comparative judgments are aggregated to produce an overall intensity score for each text. A transformer regressor is then fine-tuned on the automatically produced scores and compared with one trained on the original manual annotations.
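The aggregation step can be sketched with the standard BWS counting procedure: an item's score is the number of times it was chosen as best minus the number of times it was chosen as worst, divided by the number of times it appeared. Whether the paper uses exactly this procedure is an assumption here, and the name `bws_scores` and the toy judgments are hypothetical.

```python
from collections import Counter


def bws_scores(annotations):
    """Aggregate best-worst judgments into one score per item.

    `annotations` is a list of (items, best, worst) triples, where `items`
    is the tuple shown to the annotator. Scores fall in [-1, 1]:
    (times chosen best - times chosen worst) / times shown.
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for items, best_item, worst_item in annotations:
        shown.update(items)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


if __name__ == "__main__":
    # Toy judgments over five hypothetical texts "a".."e".
    judgments = [
        (("a", "b", "c", "d"), "a", "d"),
        (("a", "b", "c", "e"), "a", "e"),
        (("b", "c", "d", "e"), "b", "d"),
    ]
    scores = bws_scores(judgments)
    print(scores["a"], scores["d"])  # prints 1.0 -1.0
```

If a target in the range 0 to 1 is needed, the scores can be rescaled linearly with `(score + 1) / 2`.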

The authors evaluate their approach on several emotion intensity datasets, including SSEC, EmoInt, and EmoContext. They find that best-worst scaling gives the most reliable automatic annotations of the three formats, and that a regressor trained on them performs nearly on par with one trained on the original manual annotations, suggesting the approach could be a valuable tool for scaling up emotion intensity research.
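One common way to compare automatic intensity scores with manual ones is a correlation coefficient. The sketch below computes Pearson's r in plain Python; the choice of Pearson's r as the metric is an assumption for illustration, and the score lists are made-up values, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up scores for four texts, for illustration only.
    manual = [0.10, 0.35, 0.60, 0.90]
    automatic = [0.20, 0.30, 0.70, 0.80]
    print(round(pearson(manual, automatic), 3))
```

A value close to 1 would indicate that the automatic scores order and space the texts much like the manual ones do.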

Critical Analysis

The authors acknowledge several limitations of their approach. First, the performance of the automated annotator is still dependent on the quality and coverage of the training data used to fine-tune the language model. If the training data is biased or incomplete, the model's annotations may also be biased or inaccurate.

Additionally, the authors note that their approach relies on the assumption that the language model can effectively capture the nuances of human emotion expression. It's possible that there are aspects of emotion intensity that the model struggles to fully capture, particularly for more subtle or context-dependent emotional expressions.

Further research would be needed to better understand the failure modes and limitations of using language models as virtual annotators. It would also be valuable to explore ways of improving the robustness and generalization of these automated annotation approaches.

Conclusion

This paper presents an approach for automating the annotation of emotion intensity in text using a large language model and Best-Worst Scaling. The authors show that a model trained on the automatic annotations performs nearly on par with one trained on the original manual annotations, suggesting the approach could be a powerful tool for scaling up emotion research.

While the approach has some limitations, it represents an important step forward in leveraging the capabilities of large language models for annotation tasks. As these models continue to advance, we may see even more sophisticated and accurate automated annotation techniques emerge, further reducing the burden on human annotators.



Related Papers

You are an expert annotator: Automatic Best-Worst-Scaling Annotations for Emotion Intensity Modeling

Christopher Bagdon, Prathamesh Karmalker, Harsha Gurulingappa, Roman Klinger

Labeling corpora constitutes a bottleneck to create models for new tasks or domains. Large language models mitigate the issue with automatic corpus labeling methods, particularly for categorical annotations. Some NLP tasks such as emotion intensity prediction, however, require text regression, but there is no work on automating annotations for continuous label assignments. Regression is considered more challenging than classification: The fact that humans perform worse when tasked to choose values from a rating scale led to comparative annotation methods, including best-worst scaling. This raises the question of whether large language model-based annotation methods show similar patterns, namely that they perform worse on rating scale annotation tasks than on comparative annotation tasks. To study this, we automate emotion intensity predictions and compare direct rating scale predictions, pairwise comparisons and best-worst scaling. We find that the latter shows the highest reliability. A transformer regressor fine-tuned on these data performs nearly on par with a model trained on the original manual annotations.

4/23/2024

Baby Bear: Seeking a Just Right Rating Scale for Scalar Annotations

Xu Han, Felix Yu, Joao Sedoc, Benjamin Van Durme

Our goal is a mechanism for efficiently assigning scalar ratings to each of a large set of elements. For example, what percent positive or negative is this product review? When sample sizes are small, prior work has advocated for methods such as Best Worst Scaling (BWS) as being more robust than direct ordinal annotation (Likert scales). Here we first introduce IBWS, which iteratively collects annotations through Best-Worst Scaling, resulting in robustly ranked crowd-sourced data. While effective, IBWS is too expensive for large-scale tasks. Using the results of IBWS as a best-desired outcome, we evaluate various direct assessment methods to determine what is both cost-efficient and best correlating to a large scale BWS annotation strategy. Finally, we illustrate in the domains of dialogue and sentiment how these annotations can support robust learning-to-rank models.

8/20/2024

The Whole Is Bigger Than the Sum of Its Parts: Modeling Individual Annotators to Capture Emotional Variability

James Tavernor, Yara El-Tawil, Emily Mower Provost

Emotion expression and perception are nuanced, complex, and highly subjective processes. When multiple annotators label emotional data, the resulting labels contain high variability. Most speech emotion recognition tasks address this by averaging annotator labels as ground truth. However, this process omits the nuance of emotion and inter-annotator variability, which are important signals to capture. Previous work has attempted to learn distributions to capture emotion variability, but these methods also lose information about the individual annotators. We address these limitations by learning to predict individual annotators and by introducing a novel method to create distributions from continuous model outputs that permit the learning of emotion distributions during model training. We show that this combined approach can result in emotion distributions that are more accurate than those seen in prior work, in both within- and cross-corpus settings.

8/23/2024


Using Natural Language Explanations to Rescale Human Judgments

Manya Wadhwa, Jifan Chen, Junyi Jessy Li, Greg Durrett

The rise of large language models (LLMs) has brought a critical need for high-quality human-labeled data, particularly for processes like human feedback and evaluation. A common practice is to label data via consensus annotation over human judgments. However, annotators' judgments for subjective tasks can differ in many ways: they may reflect different qualitative judgments about an example, and they may be mapped to a labeling scheme in different ways. We show that these nuances can be captured by natural language explanations, and propose a method to rescale ordinal annotations and explanations using LLMs. Specifically, we feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. These scores should reflect the annotators' underlying assessments of the example. The rubric can be designed or modified after annotation, and include distinctions that may not have been known when the original error taxonomy was devised. We explore our technique in the context of rating system outputs for a document-grounded question answering task, where LLMs achieve near-human performance. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.

9/10/2024