Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems






Published 4/16/2024 by Clemencia Siro, Mohammad Aliannejadi, Maarten de Rijke


Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems (TDSs). Obtaining high-quality and consistent ground-truth labels from annotators presents challenges. When evaluating a TDS, annotators must fully comprehend the dialogue before providing judgments. Previous studies suggest using only a portion of the dialogue context in the annotation process. However, the impact of this limitation on label quality remains unexplored. This study investigates the influence of dialogue context on annotation quality, considering the truncated context for relevance and usefulness labeling. We further propose to use large language models (LLMs) to summarize the dialogue context to provide a rich and short description of the dialogue context and study the impact of doing so on the annotator's performance. Reducing context leads to more positive ratings. Conversely, providing the entire dialogue context yields higher-quality relevance ratings but introduces ambiguity in usefulness ratings. Using the first user utterance as context leads to consistent ratings, akin to those obtained using the entire dialogue, with significantly reduced annotation effort. Our findings show how task design, particularly the availability of dialogue context, affects the quality and consistency of crowdsourced evaluation labels.



Overview

  • This paper investigates how the context provided to crowdsourced workers affects their evaluation of task-oriented dialogue system responses.
  • The researchers conducted experiments to understand the impact of context on crowdsourced labels for dialogue system performance.
  • The findings have implications for the use of crowdsourced evaluations in developing and improving task-oriented dialogue systems.

Plain English Explanation

When you ask a digital assistant, like Siri or Alexa, to help you with a task, the assistant uses a "dialogue system" to understand your request and provide a response. To make these dialogue systems better, researchers often use crowdsourcing - getting feedback from many people online - to evaluate the performance of the system.

However, this paper found that the context provided to these crowdsourced workers can significantly impact how they evaluate the dialogue system's responses. In other words, the same response from the system might be judged very differently depending on the information given to the worker.

The researchers ran experiments to explore this effect. They found that when workers saw less of the conversation, they tended to rate the system's responses more positively. Showing the entire conversation produced higher-quality relevance ratings, but it made usefulness ratings more ambiguous.

This is an important finding because many studies rely on crowdsourced evaluations to measure and improve dialogue systems. If the context provided to workers can skew their judgments, it means the feedback may not always reflect the true performance of the system.

The authors suggest that researchers need to design their crowdsourcing setups carefully, paying particular attention to how much of the conversation workers see. More context is not the only route to good labels: showing only the first user utterance produced ratings about as consistent as showing the entire dialogue, with significantly less annotation effort. The authors also propose using large language models to condense the conversation into a short summary for workers.

Technical Explanation

The paper examines how the amount of dialogue context shown to crowdsourced workers affects their relevance and usefulness labels for responses from task-oriented dialogue systems. The researchers varied the context available to annotators, from a truncated portion of the dialogue up to the entire dialogue.

Reducing the context led to more positive ratings. Providing the entire dialogue context yielded higher-quality relevance ratings but introduced ambiguity in usefulness ratings. Using only the first user utterance as context led to consistent ratings, similar to those obtained with the entire dialogue, at significantly reduced annotation effort.

The authors also propose using large language models (LLMs) to summarize the dialogue context, giving annotators a short but rich description of the conversation, and they study how this affects annotator performance.
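The context conditions discussed here can be made concrete with a small sketch. It is an illustration under stated assumptions, not the authors' code: the `Turn` type, the `build_context` function, and the mode names are all hypothetical, and an LLM-generated summary condition is left out because it would depend on an external model.

```python
from typing import List, Tuple

# Hypothetical representation: (speaker, utterance), speaker is "user" or "system".
Turn = Tuple[str, str]


def build_context(dialogue: List[Turn], target_index: int, mode: str) -> List[Turn]:
    """Return the turns shown to an annotator before the response at target_index.

    The mode names are illustrative assumptions, not terms from the paper.
    """
    if not 0 <= target_index < len(dialogue):
        raise IndexError("target_index is outside the dialogue")

    # Everything that happened before the response being rated.
    history = dialogue[:target_index]

    if mode == "none":
        # The annotator sees the response in isolation.
        return []
    if mode == "previous_turn":
        # Only the turn immediately before the response.
        return history[-1:]
    if mode == "first_user_utterance":
        # Only the opening user request, if there is one.
        for turn in history:
            if turn[0] == "user":
                return [turn]
        return []
    if mode == "full":
        # The entire dialogue up to the response.
        return list(history)
    raise ValueError(f"unknown context mode: {mode}")
```

Keeping the truncation logic in one function means every condition is derived from the same dialogue and the same target response, so any difference in ratings can be attributed to the context shown rather than to the items sampled.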

The authors argue these findings have important implications for the common practice of using crowdsourced labels to evaluate and improve task-oriented dialogue systems. If worker judgments are heavily influenced by the context they are provided, then the resulting feedback may not accurately reflect the true performance of the dialogue model.

To address this, the paper suggests that task design, and in particular the availability of dialogue context, should be treated as a deliberate choice when collecting labels. The options the study considers include showing the entire dialogue, showing only the first user utterance, and showing an LLM-generated summary of the context.

By ensuring crowdsourced workers have appropriate context, the authors believe the resulting labels will be more reliable for guiding the development of high-quality, task-oriented dialogue systems.
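One way to make "reliable" measurable is to compute inter-annotator agreement separately for each context condition and compare the scores. The sketch below assumes two annotators per item and uses Cohen's kappa; this summary does not say which agreement measure the paper uses, so the choice of kappa, the function name `cohen_kappa`, and the toy ratings are all assumptions for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # annotator labels independently according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up binary relevance ratings, for illustration only (not data from the paper).
    annotator_a = [1, 1, 0, 1, 0, 0]
    annotator_b = [1, 0, 0, 1, 0, 1]
    print(f"kappa = {cohen_kappa(annotator_a, annotator_b):.3f}")
```

In practice the function would be called once per context condition on the labels collected under that condition. Kappa corrects for chance agreement, which matters when one condition pushes ratings toward the positive end of the scale and raw agreement rises for that reason alone.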

Critical Analysis

The paper provides a valuable contribution by highlighting an important methodological consideration for the common use of crowdsourced evaluations in dialogue system research. The findings demonstrate that the context provided to workers can significantly impact their judgments, which is an issue that is often overlooked.

One limitation of the study is that it focuses solely on task-oriented dialogue, and the results may not generalize as clearly to more open-ended conversational settings. It would be worthwhile to investigate whether similar context effects are observed in crowdsourced assessments of chatbots or other dialogue systems.

Additionally, the paper does not delve into the specific mechanisms by which context influences worker evaluations. Further research could explore the cognitive and behavioral factors that lead workers to make different judgments based on the information they are given.

Despite these potential areas for expansion, the core insights of the paper are compelling and warrant serious consideration from the dialogue system research community. Ensuring crowdsourced evaluations are conducted with appropriate context is crucial for developing reliable, high-performing systems that can truly assist users.


Conclusion

This paper makes a strong case that the context provided to crowdsourced workers has a significant impact on how they evaluate the responses of task-oriented dialogue systems. The experiments show that reducing context leads to more positive ratings, while providing the entire dialogue yields higher-quality relevance ratings but more ambiguous usefulness ratings. Showing only the first user utterance gives ratings about as consistent as the full dialogue, at much lower annotation effort.

These findings have important implications for the common practice of using crowdsourced labels to assess and improve dialogue systems. Researchers must carefully design their crowdsourcing setups to ensure workers have sufficient context, or risk basing system development on unreliable feedback.

By heeding the lessons of this paper, the dialogue system research community can work towards more robust, user-centric evaluations that accurately capture the performance of these increasingly important AI assistants. Ultimately, this will lead to better, more helpful dialogue systems that can truly understand and assist users in their day-to-day tasks and conversations.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs


Clemencia Siro, Mohammad Aliannejadi, Maarten de Rijke





In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback. In a conversational setting such signals are usually unavailable due to the nature of the interactions, and, instead, the evaluation often relies on crowdsourced evaluation labels. The role of user feedback in annotators' assessment of turns in a conversational perception has been little studied. We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated. We explore and compare two methodologies for assessing TDSs: one includes the user's follow-up utterance and one without. We use both crowdworkers and large language models (LLMs) as annotators to assess system responses across four aspects: relevance, usefulness, interestingness, and explanation quality. Our findings indicate that there is a distinct difference in ratings assigned by both annotator groups in the two setups, indicating user feedback does influence system evaluation. Workers are more susceptible to user feedback on usefulness and interestingness compared to LLMs on interestingness and relevance. User feedback leads to a more personalized assessment of usefulness by workers, aligning closely with the user's explicit feedback. Additionally, in cases of ambiguous or complex user requests, user feedback improves agreement among crowdworkers. These findings emphasize the significance of user feedback in refining system evaluations and suggest the potential for automated feedback integration in future research. We publicly release the annotated data to foster research in this area.



Joint Learning of Context and Feedback Embeddings in Spoken Dialogue


Livia Qian, Gabriel Skantze





Short feedback responses, such as backchannels, play an important role in spoken dialogue. So far, most of the modeling of feedback responses has focused on their timing, often neglecting how their lexical and prosodic form influence their contextual appropriateness and conversational function. In this paper, we investigate the possibility of embedding short dialogue contexts and feedback responses in the same representation space using a contrastive learning objective. In our evaluation, we primarily focus on how such embeddings can be used as a context-feedback appropriateness metric and thus for feedback response ranking in U.S. English dialogues. Our results show that the model outperforms humans given the same ranking task and that the learned embeddings carry information about the conversational function of feedback responses.




Investigating Low-Cost LLM Annotation for Spoken Dialogue Understanding Datasets

Lucas Druart (LIA), Valentin Vielzeuf (LIA), Yannick Estève (LIA)





In spoken Task-Oriented Dialogue (TOD) systems, the choice of the semantic representation describing the users' requests is key to a smooth interaction. Indeed, the system uses this representation to reason over a database and its domain knowledge to choose its next action. The dialogue course thus depends on the information provided by this semantic representation. While textual datasets provide fine-grained semantic representations, spoken dialogue datasets fall behind. This paper provides insights into automatic enhancement of spoken dialogue datasets' semantic representations. Our contributions are threefold: (1) assess the relevance of Large Language Model fine-tuning, (2) evaluate the knowledge captured by the produced annotations and (3) highlight semi-automatic annotation implications.



The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis


Miaoran Zhang, Vagrant Gautam, Mingyang Wang, Jesujoba O. Alabi, Xiaoyu Shen, Dietrich Klakow, Marius Mosbach





In-context learning is a popular inference strategy where large language models solve a task using only a few labeled demonstrations without needing any parameter updates. Although there have been extensive studies on English in-context learning, multilingual in-context learning remains under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. To address this gap, we conduct a multidimensional analysis of multilingual in-context learning, experimenting with 5 models from different model families, 9 datasets covering classification and generation tasks, and 56 typologically diverse languages. Our results reveal that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that strong instruction-following models including Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to the quality of demonstrations. Instead, a carefully crafted template often eliminates the benefits of demonstrations for some tasks and languages altogether. These findings show that the importance of demonstrations might be overestimated. Our work highlights the need for granular evaluation across multiple axes towards a better understanding of in-context learning.

