Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models

Read original: arXiv:2403.12388 - Published 6/11/2024 by Ying-Chun Lin, Jennifer Neville, Jack W. Stokes, Longqi Yang, Tara Safavi, Mengting Wan, Scott Counts, Siddharth Suri, Reid Andersen, Xiaofeng Xu and 7 others

Overview

This paper proposes SPUR (Supervised Prompting for User satisfaction Rubrics), a method that uses a large language model to estimate user satisfaction in conversational systems.
SPUR aims to provide an interpretable and accurate way to assess user satisfaction, which is important for developing and improving conversational AI systems.
The paper evaluates SPUR on several datasets and compares it to existing approaches, demonstrating its effectiveness.

Plain English Explanation

The paper describes a new method called SPUR for estimating how satisfied users are with conversational AI systems that use large language models. Measuring user satisfaction is crucial for improving these systems, but it can be challenging. SPUR provides an interpretable and accurate way to assess satisfaction, which means users and developers can understand why the system is making the assessments it does.

The researchers tested SPUR on several different datasets of conversations between humans and AI systems. They found that SPUR performed better at estimating user satisfaction than other existing approaches. This suggests SPUR could be a valuable tool for developers building and refining conversational AI that can engage with users in a more natural and satisfying way.

Technical Explanation

The paper introduces SPUR (Supervised Prompting for User satisfaction Rubrics), a method for user satisfaction estimation (USE) in conversational systems. SPUR aims to provide an interpretable and accurate way to assess user satisfaction, which is crucial for developing and improving such AI systems.

The key idea is that an LLM can extract interpretable signals of satisfaction or dissatisfaction from users' natural language utterances, and can do so more effectively than embedding-based approaches. SPUR tailors the LLM to this task through an iterative prompting framework supervised by labeled examples. The result of that process is a set of learned rubrics, and SPUR scores each conversation against them with a detailed breakdown.
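The paper's abstract describes SPUR as scoring user satisfaction via learned rubrics with a detailed breakdown. The following is a minimal sketch of what rubric-based scoring with a per-item breakdown can look like; it is not the authors' implementation. The rubric items, their weights, and the `keyword_judge` function are all hypothetical: in SPUR the rubrics are learned from labeled examples and the judgment is made by an LLM, whereas here a keyword match stands in so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class RubricItem:
    """One rubric entry with a signed weight (hypothetical structure)."""
    name: str
    cues: Tuple[str, ...]  # illustrative cue phrases, a stand-in for an LLM judgment
    weight: int            # positive = satisfaction signal, negative = dissatisfaction


# Hypothetical rubric for illustration only; not taken from the paper.
RUBRIC: List[RubricItem] = [
    RubricItem("gratitude", ("thanks", "thank you"), 2),
    RubricItem("explicit_praise", ("great", "perfect"), 1),
    RubricItem("rephrasing_request", ("that's not what i", "i meant"), -2),
    RubricItem("frustration", ("useless", "wrong again"), -3),
]


def keyword_judge(item: RubricItem, user_utterances: Sequence[str]) -> bool:
    """Placeholder judge: True if any cue appears in any user utterance."""
    text = " ".join(u.lower() for u in user_utterances)
    return any(cue in text for cue in item.cues)


def score_conversation(
    user_utterances: Sequence[str],
    rubric: Sequence[RubricItem] = RUBRIC,
    judge: Callable[[RubricItem, Sequence[str]], bool] = keyword_judge,
) -> Tuple[int, Dict[str, int]]:
    """Return (total score, per-item breakdown) for one conversation."""
    breakdown = {
        item.name: (item.weight if judge(item, user_utterances) else 0)
        for item in rubric
    }
    return sum(breakdown.values()), breakdown
```

Calling `score_conversation(["That's not what I asked", "Thanks, that works perfectly"])` returns a total together with a dictionary showing which rubric items fired. That dictionary is the interpretable part: a low score can be traced to the specific items that pulled it down. Passing a different `judge` is where an LLM call would go in a real system.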

The paper evaluates SPUR on several conversational datasets; the abstract names both general-purpose systems (ChatGPT and Bing Copilot) and task-oriented ones (customer service chatbots) as settings of interest. Related work on evaluation includes "User-Centric Benchmark for Evaluating Large Language Models" and "Rethinking Evaluation of Dialogue Systems: Effects of User Feedback". The results show that SPUR is more accurate than existing approaches based on featurized ML models or text embeddings. The authors also show that SPUR's estimates are more interpretable, because each score can be traced back to the learned rubric items that SPUR scores against.
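Comparing satisfaction estimators comes down to scoring predicted labels against human labels. The source does not say which metrics the paper reports, so the sketch below assumes two common choices, accuracy and per-class F1. The label names `"SAT"` and `"DSAT"` and the function names are hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of conversations whose predicted label matches the human label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration; not data from the paper.
human = ["SAT", "SAT", "DSAT", "DSAT", "SAT"]
predicted = ["SAT", "DSAT", "DSAT", "SAT", "SAT"]
```

Reporting F1 for the dissatisfied class alongside accuracy is useful when labels are skewed toward satisfied conversations, since a model that always predicts satisfaction can still reach high accuracy.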

Critical Analysis

The paper makes a compelling case for the utility of SPUR in assessing user satisfaction for conversational systems powered by large language models. The authors' focus on interpretability is particularly noteworthy, as it allows for more meaningful feedback and iterative improvement of these AI systems.

However, the paper does not address potential limitations of the SPUR approach. For example, it is unclear how SPUR would handle more subjective or context-dependent aspects of user satisfaction that the learned rubrics do not capture. Additionally, the reliance on supervision from labeled examples raises questions about the scalability and robustness of the approach, especially for systems with diverse user bases.

Further research could explore ways to incorporate additional signals of user satisfaction, such as those studied in "Can Large Language Models Assess Serendipity in Recommender Systems?" or "Large Language Models Can Accurately Predict Searcher Satisfaction". Additionally, investigating how well SPUR generalizes to other types of conversational AI, beyond those based on large language models, could expand the practical applications of this approach.

Conclusion

The paper "Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models" presents a promising approach for estimating user satisfaction in conversational AI systems. SPUR's focus on interpretability and its demonstrated effectiveness on benchmark datasets suggest it could be a valuable tool for developers working to create more engaging and satisfying conversational experiences. While the method has some limitations, the underlying principles and insights offered by this research could help advance the field of conversational AI and its ability to meet the needs of users.

Related Papers

Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models

Ying-Chun Lin, Jennifer Neville, Jack W. Stokes, Longqi Yang, Tara Safavi, Mengting Wan, Scott Counts, Siddharth Suri, Reid Andersen, Xiaofeng Xu, Deepak Gupta, Sujay Kumar Jauhar, Xia Song, Georg Buscher, Saurabh Tiwary, Brent Hecht, Jaime Teevan

Accurate and interpretable user satisfaction estimation (USE) is critical for understanding, evaluating, and continuously improving conversational systems. Users express their satisfaction or dissatisfaction with diverse conversational patterns in both general-purpose (ChatGPT and Bing Copilot) and task-oriented (customer service chatbot) conversational systems. Existing approaches based on featurized ML models or text embeddings fall short in extracting generalizable patterns and are hard to interpret. In this work, we show that LLMs can extract interpretable signals of user satisfaction from their natural language utterances more effectively than embedding-based approaches. Moreover, an LLM can be tailored for USE via an iterative prompting framework using supervision from labeled examples. The resulting method, Supervised Prompting for User satisfaction Rubrics (SPUR), not only has higher accuracy but is more interpretable as it scores user satisfaction via learned rubrics with a detailed breakdown.

6/11/2024

CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems

Amin Abolghasemi, Zhaochun Ren, Arian Askari, Mohammad Aliannejadi, Maarten de Rijke, Suzan Verberne

An important unexplored aspect in previous work on user satisfaction estimation for Task-Oriented Dialogue (TOD) systems is their evaluation in terms of robustness for the identification of user dissatisfaction: current benchmarks for user satisfaction estimation in TOD systems are highly skewed towards dialogues for which the user is satisfied. The effect of having a more balanced set of satisfaction labels on performance is unknown. However, balancing the data with more dissatisfactory dialogue samples requires further data collection and human annotation, which is costly and time-consuming. In this work, we leverage large language models (LLMs) and unlock their ability to generate satisfaction-aware counterfactual dialogues to augment the set of original dialogues of a test collection. We gather human annotations to ensure the reliability of the generated samples. We evaluate two open-source LLMs as user satisfaction estimators on our augmented collection against state-of-the-art fine-tuned models. Our experiments show that when used as few-shot user satisfaction estimators, open-source LLMs show higher robustness to the increase in the number of dissatisfaction labels in the test collection than the fine-tuned state-of-the-art models. Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems. We release our aligned counterfactual dialogues, which are curated by human annotation, to facilitate further research on this topic.

8/21/2024

Using LLMs to Investigate Correlations of Conversational Follow-up Queries with User Satisfaction

Hyunwoo Kim, Yoonseo Choi, Taehyun Yang, Honggu Lee, Chaneon Park, Yongju Lee, Jin Young Kim, Juho Kim

With large language models (LLMs), conversational search engines shift how users retrieve information from the web by enabling natural conversations to express their search intents over multiple turns. Users' natural conversation embodies rich but implicit signals of users' search intents and evaluation of search results to understand user experience with the system. However, it is underexplored how and why users ask follow-up queries to continue conversations with conversational search engines and how the follow-up queries signal users' satisfaction. From qualitative analysis of 250 conversational turns from an in-lab user evaluation of Naver Cue:, a commercial conversational search engine, we propose a taxonomy of 18 users' follow-up query patterns from conversational search, comprising two major axes: (1) users' motivations behind continuing conversations (N = 7) and (2) actions of follow-up queries (N = 11). Compared to the existing literature on query reformulations, we uncovered a new set of motivations and actions behind follow-up queries, including asking for subjective opinions or providing natural language feedback on the engine's responses. To analyze conversational search logs with our taxonomy in a scalable and efficient manner, we built an LLM-powered classifier (73% accuracy). With our classifier, we analyzed 2,061 conversational tuples collected from real-world usage logs of Cue: and examined how the conversation patterns from our taxonomy correlates with satisfaction. Our initial findings suggest some signals of dissatisfactions, such as Clarifying Queries, Excluding Condition, and Substituting Condition with follow-up queries. We envision our approach could contribute to automated evaluation of conversation search experience by providing satisfaction signals and grounds for realistic user simulations.

7/19/2024

EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context

Hannes Kunstmann, Joseph Ollier, Joel Persson, Florian von Wangenheim

Large language models (LLMs) present an enormous evolution in the strategic potential of conversational recommender systems (CRS). Yet to date, research has predominantly focused upon technical frameworks to implement LLM-driven CRS, rather than end-user evaluations or strategic implications for firms, particularly from the perspective of a small to medium enterprises (SME) that makeup the bedrock of the global economy. In the current paper, we detail the design of an LLM-driven CRS in an SME setting, and its subsequent performance in the field using both objective system metrics and subjective user evaluations. While doing so, we additionally outline a short-form revised ResQue model for evaluating LLM-driven CRS, enabling replicability in a rapidly evolving field. Our results reveal good system performance from a user experience perspective (85.5% recommendation accuracy) but underscore latency, cost, and quality issues challenging business viability. Notably, with a median cost of $0.04 per interaction and a latency of 5.7s, cost-effectiveness and response time emerge as crucial areas for achieving a more user-friendly and economically viable LLM-driven CRS for SME settings. One major driver of these costs is the use of an advanced LLM as a ranker within the retrieval-augmented generation (RAG) technique. Our results additionally indicate that relying solely on approaches such as Prompt-based learning with ChatGPT as the underlying LLM makes it challenging to achieve satisfying quality in a production environment. Strategic considerations for SMEs deploying an LLM-driven CRS are outlined, particularly considering trade-offs in the current technical landscape.

7/10/2024