CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models

Read original: arXiv:2405.12063 - Published 6/4/2024 by Tong Zhang, Peixin Qin, Yang Deng, Chen Huang, Wenqiang Lei, Junhong Liu, Dingnan Jin, Hongru Liang, Tat-Seng Chua

Overview

Introduces CLAMBER, a new benchmark for evaluating large language models' ability to handle ambiguous information needs
Emphasizes the importance of this capability in real-world applications like customer support and clinical decision-making
Outlines the key components of the CLAMBER benchmark, including a dataset of ambiguous queries and evaluation metrics

Plain English Explanation

CLAMBER is a new tool that helps assess how well large language models (LLMs) can identify and clarify ambiguous information needs. This is an important capability because in many real-world scenarios, people often ask questions or make requests that are unclear or open to multiple interpretations. For example, a customer might reach out for tech support with a vague description of their problem, or a healthcare provider might need to understand a patient's symptoms more precisely.

The CLAMBER benchmark provides a way to test how well LLMs can recognize when information is ambiguous and then ask follow-up questions to get a clearer understanding. This involves analyzing the initial query, identifying potential areas of ambiguity, and then generating clarifying questions to resolve those uncertainties. The benchmark includes a dataset of real-world ambiguous queries that can be used to evaluate LLM performance on this task.

By benchmarking LLMs' ability to handle ambiguity, CLAMBER aims to help advance the development of more capable and reliable language models that can better assist humans in a variety of applications where clear communication is essential.

Technical Explanation

The CLAMBER benchmark consists of a dataset of roughly 12,000 queries (the paper's abstract reports "~12K high-quality data"), organized by a taxonomy of ambiguity types and spanning a range of topics, such as customer service, healthcare, and general information-seeking. Each query is annotated with potential areas of ambiguity, along with a set of clarifying questions that could be asked to resolve those uncertainties.
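
As a rough illustration of what one annotated entry might look like, the sketch below models a record in Python. The class name, the field names (query, is_ambiguous, ambiguity_notes, clarifying_questions) and the example query are all assumptions made for illustration. They are not taken from the released dataset, whose actual schema should be checked in the CLAMBER repository.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AmbiguousQueryRecord:
    """Hypothetical shape of one benchmark entry (not the official schema)."""

    query: str                      # the user's original question
    is_ambiguous: bool              # gold label: does the query need clarification?
    ambiguity_notes: List[str] = field(default_factory=list)       # why it is ambiguous
    clarifying_questions: List[str] = field(default_factory=list)  # reference follow-ups


# Invented example entry, for illustration only.
example = AmbiguousQueryRecord(
    query="How do I reset it?",
    is_ambiguous=True,
    ambiguity_notes=["'it' has no clear referent"],
    clarifying_questions=["Which device or account would you like to reset?"],
)

print(example.is_ambiguous, len(example.clarifying_questions))
```

Keeping the gold label separate from the free-text annotations makes it possible to score ambiguity detection automatically, while the reference questions can be reserved for judging generated clarifications.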

To evaluate an LLM on CLAMBER, the model is first presented with an ambiguous query and tasked with identifying the key ambiguities. It then needs to generate a set of clarifying questions that can help resolve those ambiguities. The model's performance is assessed based on its ability to accurately pinpoint the areas of ambiguity and generate relevant, high-quality clarifying questions.
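
The identification step described above can be sketched as a binary classification problem. The example below is a minimal illustration, not the paper's evaluation code: the function names, the toy queries, and the keyword-based stand-in for a model call are all assumptions, and this summary does not state which metrics the paper reports. Accuracy, precision, recall, and F1 are used here simply because they are common choices for this kind of task.

```python
from typing import Callable, Dict, List, Tuple


def score_identification(
    examples: List[Tuple[str, bool]],
    predict_is_ambiguous: Callable[[str], bool],
) -> Dict[str, float]:
    """Compare predicted ambiguity labels against gold labels."""
    tp = fp = fn = tn = 0
    for query, gold in examples:
        pred = predict_is_ambiguous(query)
        if pred and gold:
            tp += 1
        elif pred and not gold:
            fp += 1
        elif not pred and gold:
            fn += 1
        else:
            tn += 1

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def keyword_baseline(query: str) -> bool:
    """Toy stand-in for an LLM call: flags queries containing vague pronouns."""
    vague = {"it", "this", "that", "they"}
    words = {w.strip("?.,!").lower() for w in query.split()}
    return bool(vague & words)


# Invented (query, gold label) pairs, for illustration only.
toy_examples = [
    ("How do I reset it?", True),
    ("What is the capital of France?", False),
    ("When did they win the title?", True),
    ("Who wrote the best book?", True),
]

scores = score_identification(toy_examples, keyword_baseline)
print(scores)
```

In practice, keyword_baseline would be replaced by a function that prompts the model under test and parses its answer. Scoring the quality of the generated clarifying questions is harder to automate and would likely need human judgment or a separate model-based comparison against the reference questions.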

The CLAMBER dataset was constructed by crowdsourcing ambiguous queries from real-world sources and having human annotators identify the key ambiguities and potential clarifying questions. This approach ensures the benchmark reflects the types of ambiguities that commonly arise in practical applications.

By measuring how well language models explicitly handle ambiguity, the CLAMBER benchmark aims to drive progress in developing more capable and reliable LLMs that can better assist humans in a variety of domains where clear communication is crucial.

Critical Analysis

The CLAMBER benchmark represents an important step forward in evaluating and improving LLMs' ability to handle ambiguous information needs. By focusing on a specific, real-world challenge that is often encountered in practical applications, the benchmark provides a valuable tool for assessing the current capabilities of language models and identifying areas for further development.

One potential limitation of the CLAMBER dataset is that it may not capture the full range of ambiguities that can arise in real-world scenarios. The dataset was constructed from crowdsourced examples, which could miss certain types of ambiguity or introduce biases in the kinds of queries collected. Additionally, the dataset focuses primarily on English-language queries, so it may not accurately reflect how LLMs handle ambiguous information needs in other languages.

Another area for further research could be exploring the relationship between an LLM's ability to handle ambiguity and its overall performance on other language understanding tasks. It would be interesting to see if models that excel at the CLAMBER benchmark also demonstrate stronger capabilities in related areas, such as clinical language understanding or uncertainty-aware reasoning.

Conclusion

The CLAMBER benchmark represents an important advancement in the field of language model evaluation, focusing on the critical ability to identify and clarify ambiguous information needs. By providing a standardized dataset and evaluation metrics, CLAMBER can help drive the development of more capable and reliable LLMs that can better assist humans in a variety of real-world applications where clear communication is essential.

As the use of LLMs continues to expand, the ability to handle ambiguity will become increasingly important. The CLAMBER benchmark provides a valuable tool for researchers and developers to assess and improve this crucial capability, ultimately leading to more effective and trustworthy language models that can better support human decision-making and problem-solving.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models

Tong Zhang, Peixin Qin, Yang Deng, Chen Huang, Wenqiang Lei, Junhong Liu, Dingnan Jin, Hongru Liang, Tat-Seng Chua

Large language models (LLMs) are increasingly used to meet user information needs, but their effectiveness in dealing with user queries that contain various types of ambiguity remains unknown, ultimately risking user trust and satisfaction. To this end, we introduce CLAMBER, a benchmark for evaluating LLMs using a well-organized taxonomy. Building upon the taxonomy, we construct ~12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs. Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries, even enhanced by chain-of-thought (CoT) and few-shot prompting. These techniques may result in overconfidence in LLMs and yield only marginal enhancements in identifying ambiguity. Furthermore, current LLMs fall short in generating high-quality clarifying questions due to a lack of conflict resolution and inaccurate utilization of inherent knowledge. In this paper, CLAMBER presents a guidance and promotes further research on proactive and trustworthy LLMs. Our dataset is available at https://github.com/zt991211/CLAMBER

6/4/2024

CLIMB: A Benchmark of Clinical Bias in Large Language Models

Yubo Zhang, Shudi Hou, Mingyu Derek Ma, Wei Wang, Muhao Chen, Jieyu Zhao

Large language models (LLMs) are increasingly applied to clinical decision-making. However, their potential to exhibit bias poses significant risks to clinical equity. Currently, there is a lack of benchmarks that systematically evaluate such clinical bias in LLMs. While in downstream tasks, some biases of LLMs can be avoided such as by instructing the model to answer "I'm not sure...", the internal bias hidden within the model still lacks deep studies. We introduce CLIMB (shorthand for A Benchmark of Clinical Bias in Large Language Models), a pioneering comprehensive benchmark to evaluate both intrinsic (within LLMs) and extrinsic (on downstream tasks) bias in LLMs for clinical decision tasks. Notably, for intrinsic bias, we introduce a novel metric, AssocMAD, to assess the disparities of LLMs across multiple demographic groups. Additionally, we leverage counterfactual intervention to evaluate extrinsic bias in a task of clinical diagnosis prediction. Our experiments across popular and medically adapted LLMs, particularly from the Mistral and LLaMA families, unveil prevalent behaviors with both intrinsic and extrinsic bias. This work underscores the critical need to mitigate clinical bias and sets a new standard for future evaluations of LLMs' clinical bias.

7/9/2024

ClarQ-LLM: A Benchmark for Models Clarifying and Requesting Information in Task-Oriented Dialog

Yujian Gan, Changling Li, Jinxia Xie, Luou Wen, Matthew Purver, Massimo Poesio

We introduce ClarQ-LLM, an evaluation framework consisting of bilingual English-Chinese conversation tasks, conversational agents and evaluation metrics, designed to serve as a strong benchmark for assessing agents' ability to ask clarification questions in task-oriented dialogues. The benchmark includes 31 different task types, each with 10 unique dialogue scenarios between information seeker and provider agents. The scenarios require the seeker to ask questions to resolve uncertainty and gather necessary information to complete tasks. Unlike traditional benchmarks that evaluate agents based on fixed dialogue content, ClarQ-LLM includes a provider conversational agent to replicate the original human provider in the benchmark. This allows both current and future seeker agents to test their ability to complete information gathering tasks through dialogue by directly interacting with our provider agent. In tests, LLAMA3.1 405B seeker agent managed a maximum success rate of only 60.05%, showing that ClarQ-LLM presents a strong challenge for future research.

9/17/2024

Aligning Language Models to Explicitly Handle Ambiguity

Hyuhng Joon Kim, Youna Kim, Cheonbok Park, Junyeob Kim, Choonghyun Park, Kang Min Yoo, Sang-goo Lee, Taeuk Kim

In interactions between users and language model agents, user utterances frequently exhibit ellipsis (omission of words or phrases) or imprecision (lack of exactness) to prioritize efficiency. This can lead to varying interpretations of the same input based on different assumptions or background knowledge. It is thus crucial for agents to adeptly handle the inherent ambiguity in queries to ensure reliability. However, even state-of-the-art large language models (LLMs) still face challenges in such scenarios, primarily due to the following hurdles: (1) LLMs are not explicitly trained to deal with ambiguous utterances; (2) the degree of ambiguity perceived by the LLMs may vary depending on the possessed knowledge. To address these issues, we propose Alignment with Perceived Ambiguity (APA), a novel pipeline that aligns LLMs to manage ambiguous queries by leveraging their own assessment of ambiguity (i.e., perceived ambiguity). Experimental results on question-answering datasets demonstrate that APA empowers LLMs to explicitly detect and manage ambiguous queries while retaining the ability to answer clear questions. Furthermore, our finding proves that APA excels beyond training with gold-standard labels, especially in out-of-distribution scenarios.

6/18/2024