Persuasiveness of Generated Free-Text Rationales in Subjective Decisions: A Case Study on Pairwise Argument Ranking

2406.13905

Published 6/21/2024 by Mohamed Elaraby, Diane Litman, Xiang Lorraine Li, Ahmed Magooda

Persuasiveness of Generated Free-Text Rationales in Subjective Decisions: A Case Study on Pairwise Argument Ranking

Abstract

Generating free-text rationales is among the emergent capabilities of Large Language Models (LLMs). These rationales have been found to enhance LLM performance across various NLP tasks. Recently, there has been growing interest in using these rationales to provide insights for various important downstream tasks. In this paper, we analyze generated free-text rationales in tasks with subjective answers, emphasizing the importance of rationalization in such scenarios. We focus on pairwise argument ranking, a highly subjective task with significant potential for real-world applications, such as debate assistance. We evaluate the persuasiveness of rationales generated by nine LLMs to support their subjective choices. Our findings suggest that open-source LLMs, particularly Llama2-70B-chat, are capable of providing highly persuasive rationalizations, surpassing even GPT models. Additionally, our experiments show that rationale persuasiveness can be improved by controlling its parameters through prompting or through self-refinement.

Create account to get full access

Overview

This paper explores the persuasiveness of free-text rationales generated by language models in the context of subjective decision-making tasks, using pairwise argument ranking as a case study.
The researchers investigate how well these generated rationales can convince human evaluators compared to ground truth rationales provided by the argument authors.
The study provides insights into the capabilities and limitations of large language models in producing persuasive written justifications for subjective decisions.

Plain English Explanation

In this research, the authors looked at how convincing the explanations (or "rationales") generated by AI language models can be, when applied to tasks that involve subjective decisions. They focused on a specific case study: pairwise argument ranking.

Pairwise argument ranking is a task where people are presented with two arguments on a topic and asked to decide which one is more persuasive. The researchers wanted to see how well AI-generated rationales could convince human evaluators, compared to the original explanations written by the argument authors themselves.

The key idea is that as language models become more advanced, they may be able to produce human-like justifications for decisions. But it's not clear how persuasive or compelling these AI-generated rationales would be, especially for subjective tasks that involve opinion and judgment.

By studying this in the context of pairwise argument ranking, the researchers aimed to better understand the current capabilities and limitations of large language models when it comes to generating persuasive free-text explanations. This could have important implications as these models are increasingly used to assist or even automate decision-making processes.

Technical Explanation

The paper presents a study on the persuasiveness of free-text rationales generated by language models in the context of pairwise argument ranking. Pairwise argument ranking is a task where human judges are shown two arguments on a topic and asked to decide which one is more persuasive.

The researchers compared the persuasiveness of AI-generated rationales to ground truth rationales written by the original argument authors. They used a large language model fine-tuned on a dataset of arguments and their corresponding rationales to generate free-text justifications for the pairwise comparisons.

These generated rationales were then shown to human evaluators, who were asked to rate how persuasive they found each one. The researchers analyzed the results to understand how the persuasiveness of the AI-generated rationales compared to the ground truth rationales.

The study provides insights into the current capabilities and limitations of large language models in producing written justifications that can effectively sway human opinions on subjective decision-making tasks. The findings have implications for the use of these models in decision support systems and other applications where generating persuasive explanations is important.

Critical Analysis

The paper makes a valuable contribution by exploring an important real-world application of large language models: generating persuasive free-text rationales for subjective decisions. However, the study also highlights some key limitations and areas for further research.

One limitation is the specific domain of pairwise argument ranking, which may not fully capture the complexity of real-world decision-making. The arguments used in the study were relatively short and focused, whereas actual decisions often involve weighing much more nuanced considerations.

Additionally, the study only looked at the persuasiveness of the rationales in isolation, without considering other factors that could influence human judgments, such as the quality of the arguments themselves or the perceived credibility of the source. Integrating these additional elements could provide a more holistic understanding of how AI-generated justifications would perform in practice.

Further research could also explore the boundary conditions of when language models are able to generate persuasive rationales. The current study focused on relatively simple subjective judgments, but more complex or emotionally charged decisions may push the limits of these models' capabilities.

Overall, this paper represents an important step in understanding the potential and limitations of using large language models to assist with subjective decision-making. As these models continue to advance, it will be crucial to carefully evaluate their performance in a range of real-world contexts.

Conclusion

This study provides valuable insights into the persuasiveness of free-text rationales generated by language models in the context of subjective decision-making tasks. The researchers found that while AI-generated rationales can be somewhat persuasive, they generally do not reach the level of convincingness of ground truth explanations written by the original argument authors.

These findings have important implications as large language models become increasingly integrated into decision support systems and other applications where generating persuasive justifications is important. The limitations highlighted in the study suggest that there is still substantial room for improvement in the ability of these models to produce human-level, contextually-appropriate rationales for subjective decisions.

As the field of AI continues to advance, further research will be needed to better understand the boundary conditions and potential pitfalls of using language models for tasks that require nuanced reasoning and persuasive communication. Ultimately, this work contributes to a broader understanding of the capabilities and limitations of these powerful AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

New!Free-text Rationale Generation under Readability Level Control

Yi-Sheng Hsu, Nils Feldhus, Sherzod Hakimov

Free-text rationales justify model decisions in natural language and thus become likable and accessible among approaches to explanation across many tasks. However, their effectiveness can be hindered by misinterpretation and hallucination. As a perturbation test, we investigate how large language models (LLMs) perform the task of natural language explanation (NLE) under the effects of readability level control, i.e., being prompted for a rationale targeting a specific expertise level, such as sixth grade or college. We find that explanations are adaptable to such instruction, but the requested readability is often misaligned with the measured text complexity according to traditional readability metrics. Furthermore, the quality assessment shows that LLMs' ratings of rationales across text complexity exhibit a similar pattern of preference as observed in natural language generation (NLG). Finally, our human evaluation suggests a generally satisfactory impression on rationales at all readability levels, with high-school-level readability being most commonly perceived and favored.

7/2/2024

cs.CL

New!Evaluating Human Alignment and Model Faithfulness of LLM Rationale

Mohsen Fayyaz, Fan Yin, Jiao Sun, Nanyun Peng

We study how well large language models (LLMs) explain their generations with rationales -- a set of tokens extracted from the input texts that reflect the decision process of LLMs. We examine LLM rationales extracted with two methods: 1) attribution-based methods that use attention or gradients to locate important tokens, and 2) prompting-based methods that guide LLMs to extract rationales using prompts. Through extensive experiments, we show that prompting-based rationales align better with human-annotated rationales than attribution-based rationales, and demonstrate reasonable alignment with humans even when model performance is poor. We additionally find that the faithfulness limitations of prompting-based methods, which are identified in previous work, may be linked to their collapsed predictions. By fine-tuning these models on the corresponding datasets, both prompting and attribution methods demonstrate improved faithfulness. Our study sheds light on more rigorous and fair evaluations of LLM rationales, especially for prompting-based ones.

7/2/2024

cs.CL cs.AI

🛠️

Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring

Jiazheng Li, Hainiu Xu, Zhaoyue Sun, Yuxiang Zhou, David West, Cesare Aloisi, Yulan He

Generating rationales that justify scoring decisions has been a promising way to facilitate explainability in automated scoring systems. However, existing methods do not match the accuracy of classifier-based methods. Plus, the generated rationales often contain hallucinated information. To address these issues, we propose a novel framework capable of generating more faithful rationales and, more importantly, matching performance with classifier-based black-box scoring systems. We first mimic the human assessment process by querying Large Language Models (LLMs) to generate a thought tree. We then summarise intermediate assessment decisions from each thought tree path for creating synthetic rationale data and rationale preference data. Finally, we utilise the generated synthetic data to calibrate LLMs through a two-step training process: supervised fine-tuning and preference optimization. Extensive experimental results demonstrate that our framework achieves a 38% assessment performance improvement in the QWK score compared to prior work while producing higher-quality rationales, as recognised by human evaluators and LLMs. Our work sheds light on the effectiveness of performing preference optimization using synthetic preference data obtained from thought tree paths.

7/1/2024

cs.CL

RORA: Robust Free-Text Rationale Evaluation

Zhengping Jiang, Yining Lu, Hanjie Chen, Daniel Khashabi, Benjamin Van Durme, Anqi Liu

Free-text rationales play a pivotal role in explainable NLP, bridging the knowledge and reasoning gaps behind a model's decision-making. However, due to the diversity of potential reasoning paths and a corresponding lack of definitive ground truth, their evaluation remains a challenge. Existing evaluation metrics rely on the degree to which a rationale supports a target label, but we find these fall short in evaluating rationales that inadvertently leak the labels. To address this problem, we propose RORA, a Robust free-text Rationale evaluation against label leakage. RORA quantifies the new information supplied by a rationale to justify the label. This is achieved by assessing the conditional V-information citep{hewitt-etal-2021-conditional} with a predictive family robust against leaky features that can be exploited by a small model. RORA consistently outperforms existing approaches in evaluating human-written, synthetic, or model-generated rationales, particularly demonstrating robustness against label leakage. We also show that RORA aligns well with human judgment, providing a more reliable and accurate measurement across diverse free-text rationales.

6/18/2024

cs.CL