TREC iKAT 2023: A Test Collection for Evaluating Conversational and Interactive Knowledge Assistants

Read original: arXiv:2405.02637 - Published 5/7/2024 by Mohammad Aliannejadi, Zahra Abbasiantaeb, Shubham Chatterjee, Jeffery Dalton, Leif Azzopardi

Overview

A new test collection called TREC iKAT 2023 for evaluating conversational and interactive knowledge assistants
Designed to assess the performance of AI-powered systems that can engage in natural language conversations and assist users with information-seeking tasks
Aims to advance the state of the art in conversational information seeking and interactive knowledge access

Plain English Explanation

TREC iKAT 2023 is a new dataset that researchers can use to test and compare the capabilities of conversational AI systems. These are AI-powered virtual assistants that can engage in back-and-forth conversations with users and help them find information.

The goal of this dataset is to provide a standardized way to evaluate the performance of these conversational knowledge assistants. It includes personalized dialogues in which the assistant has to take a user's background into account while helping with an information need, often one that involves making a decision. Researchers can use the dataset to test how well their AI systems understand natural language, engage in interactive dialogues, and provide relevant and helpful information to users.

By having a common benchmark like TREC iKAT 2023, the research community can make progress in developing more capable and user-friendly conversational AI assistants.

Technical Explanation

The TREC iKAT 2023 dataset was created to address the need for standardized evaluation of conversational and interactive knowledge-seeking AI systems. According to the paper's abstract, it contains 36 personalized dialogues over 20 different topics, with 344 turns in total. Each dialogue is coupled with a Personal Text Knowledge Base (PTKB) that defines the user's persona.

The conversations were collected from crowdsourced interactions, where human participants were instructed to engage with a virtual assistant to accomplish various goals, such as answering questions, finding relevant information, or making decisions. The resulting dialogues were then carefully annotated by human raters to provide detailed insights into the quality and effectiveness of the conversational interactions.

Key features of the TREC iKAT 2023 dataset include:

Diverse Topics and Scenarios: The conversations cover a wide range of subject areas, from general knowledge to specialized domains, reflecting the breadth of information needs that users might have when interacting with a conversational AI assistant.
Interactive and Contextual: The dialogues capture the back-and-forth nature of conversational interactions, with users posing follow-up questions, clarifying their intents, and refining their information needs over the course of the interaction.
Comprehensive Annotations: The dataset provides relevance assessments for approximately 26,000 passages, as well as assessments of generated responses over four dimensions: relevance, completeness, groundedness, and naturalness.

By providing this rich and diverse dataset, TREC iKAT 2023 aims to enable more robust and meaningful evaluations of conversational AI systems, helping to drive progress in the field of conversational information seeking and interactive knowledge access.
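
To make the structure of the collection more concrete, the following Python sketch models a dialogue as described in the paper's abstract: a topic, a PTKB of persona statements, and a sequence of turns carrying passage relevance judgments and response scores on the four assessed dimensions. The class names, field names, score scale, and example data are all hypothetical and chosen for illustration; they are not the official iKAT file format, and the averaging function is a simple stand-in rather than the track's evaluation procedure.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List

# The four response-assessment dimensions named in the paper's abstract.
DIMENSIONS = ("relevance", "completeness", "groundedness", "naturalness")


@dataclass
class Turn:
    """One user utterance and its assessments (hypothetical layout)."""
    utterance: str
    # passage id -> graded relevance label (the scale here is an assumption)
    passage_judgments: Dict[str, int] = field(default_factory=dict)
    # dimension name -> score given to a system's generated response
    response_scores: Dict[str, float] = field(default_factory=dict)


@dataclass
class Dialogue:
    """A personalized dialogue: topic, persona statements (PTKB), and turns."""
    topic: str
    ptkb: List[str]
    turns: List[Turn]


def mean_response_scores(dialogues: List[Dialogue]) -> Dict[str, float]:
    """Average each dimension over every turn that has a score for it."""
    averages: Dict[str, float] = {}
    for dim in DIMENSIONS:
        scores = [
            turn.response_scores[dim]
            for dialogue in dialogues
            for turn in dialogue.turns
            if dim in turn.response_scores
        ]
        if scores:  # skip dimensions with no assessed responses
            averages[dim] = mean(scores)
    return averages


# Invented example data, not taken from the collection.
example = Dialogue(
    topic="finding a diet plan",
    ptkb=["I am vegetarian.", "I am training for a marathon."],
    turns=[
        Turn(
            utterance="Can you suggest a high-protein meal plan?",
            passage_judgments={"passage-001": 2, "passage-002": 0},
            response_scores={
                "relevance": 2.0,
                "completeness": 1.0,
                "groundedness": 1.0,
                "naturalness": 2.0,
            },
        ),
        Turn(
            utterance="Which of those are quick to cook?",
            passage_judgments={"passage-003": 1},
            # This turn has no groundedness score, so it is skipped for that dimension.
            response_scores={"relevance": 1.0, "completeness": 1.0, "naturalness": 2.0},
        ),
    ],
)

averages = mean_response_scores([example])
print(averages)
```

Keeping the PTKB on the dialogue rather than on each turn reflects the abstract's description of one persona per dialogue; a system under test would decide, turn by turn, which persona statements are relevant.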

Critical Analysis

The TREC iKAT 2023 dataset represents a significant step forward in the evaluation of conversational AI systems, but it is important to consider some potential limitations and areas for further research:

Reliance on Crowdsourcing: While the use of crowdsourcing allows for the collection of a large and diverse set of conversations, there may be inherent biases or inconsistencies in the way users interact with the virtual assistant, which could impact the generalizability of the results.
Lack of Real-World Deployment: The conversations in the dataset were generated in a controlled, laboratory-like setting, which may not fully capture the nuances and challenges of real-world deployments of conversational AI systems, where users may have different expectations, backgrounds, and technical proficiencies.
Potential for Outdated or Biased Knowledge: The knowledge base and information sources used by the conversational AI systems in the dataset may not always reflect the latest developments or may perpetuate societal biases, which could influence the performance and perceived usefulness of the systems.

To address these limitations, future research could incorporate more diverse and representative user populations, and could develop methods for evaluating how conversational AI systems perform and adapt over the long term in real-world interactive dialogue scenarios.

Conclusion

The TREC iKAT 2023 dataset represents an important contribution to the field of conversational AI, providing a robust and standardized platform for evaluating the performance of interactive knowledge assistants. By offering a diverse set of conversational scenarios and comprehensive annotations, the dataset can help drive the development of more capable and user-friendly conversational information-seeking systems. This could ultimately benefit the broader public through enhanced access to information and improved decision-making support.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TREC iKAT 2023: A Test Collection for Evaluating Conversational and Interactive Knowledge Assistants

Mohammad Aliannejadi, Zahra Abbasiantaeb, Shubham Chatterjee, Jeffery Dalton, Leif Azzopardi

Conversational information seeking has evolved rapidly in the last few years with the development of Large Language Models (LLMs), providing the basis for interpreting and responding in a naturalistic manner to user requests. The extended TREC Interactive Knowledge Assistance Track (iKAT) collection aims to enable researchers to test and evaluate their Conversational Search Agents (CSA). The collection contains a set of 36 personalized dialogues over 20 different topics each coupled with a Personal Text Knowledge Base (PTKB) that defines the bespoke user personas. A total of 344 turns with approximately 26,000 passages are provided as assessments on relevance, as well as additional assessments on generated responses over four key dimensions: relevance, completeness, groundedness, and naturalness. The collection challenges CSA to efficiently navigate diverse personal contexts, elicit pertinent persona information, and employ context for relevant conversations. The integration of a PTKB and the emphasis on decisional search tasks contribute to the uniqueness of this test collection, making it an essential benchmark for advancing research in conversational and interactive knowledge assistants.

5/7/2024

How to Leverage Personal Textual Knowledge for Personalized Conversational Information Retrieval

Fengran Mo, Longxiang Zhao, Kaiyu Huang, Yue Dong, Degen Huang, Jian-Yun Nie

Personalized conversational information retrieval (CIR) combines conversational and personalizable elements to satisfy various users' complex information needs through multi-turn interaction based on their backgrounds. The key promise is that the personal textual knowledge base (PTKB) can improve the CIR effectiveness because the retrieval results can be more related to the user's background. However, PTKB is noisy: not every piece of knowledge in PTKB is relevant to the specific query at hand. In this paper, we explore and test several ways to select knowledge from PTKB and use it for query reformulation by using a large language model (LLM). The experimental results show the PTKB might not always improve the search results when used alone, but LLM can help generate a more appropriate personalized query when high-quality guidance is provided.

7/24/2024

LLM-Based Open-Domain Integrated Task and Knowledge Assistants with Programmable Policies

Harshit Joshi, Shicheng Liu, James Chen, Robert Weigle, Monica S. Lam

Programming LLM-based knowledge and task assistants that faithfully conform to developer-provided policies is challenging. These agents must retrieve and provide consistent, accurate, and relevant information to address user's queries and needs. Yet such agents generate unfounded responses (hallucinate). Traditional dialogue trees can only handle a limited number of conversation flows, making them inherently brittle. To this end, we present KITA - a programmable framework for creating task-oriented conversational agents that are designed to handle complex user interactions. Unlike LLMs, KITA provides reliable grounded responses, with controllable agent policies through its expressive specification, KITA Worksheet. In contrast to dialog trees, it is resilient to diverse user queries, helpful with knowledge sources, and offers ease of programming policies through its declarative paradigm. Through a real-user study involving 62 participants, we show that KITA beats the GPT-4 with function calling baseline by 26.1, 22.5, and 52.4 points on execution accuracy, dialogue act accuracy, and goal completion rate, respectively. We also release 22 real-user conversations with KITA manually corrected to ensure accuracy.

7/9/2024

Generate then Retrieve: Conversational Response Retrieval Using LLMs as Answer and Query Generators

Zahra Abbasiantaeb, Mohammad Aliannejadi

CIS is a prominent area in IR which focuses on developing interactive knowledge assistants. These systems must adeptly comprehend the user's information requirements within the conversational context and retrieve the relevant information. To this aim, the existing approaches model the user's information needs by generating a single query rewrite or a single representation of the query in the query space embedding. However, to answer complex questions, a single query rewrite or representation is often ineffective. To address this, a system needs to do reasoning over multiple passages. In this work, we propose using a generate-then-retrieve approach to improve the passage retrieval performance for complex user queries. In this approach, we utilize large language models (LLMs) to (i) generate an initial answer to the user's information need by doing reasoning over the context of the conversation, and (ii) ground this answer to the collection. Based on the experiments, our proposed approach significantly improves the retrieval performance on TREC iKAT 23, TREC CAsT 20 and 22 datasets, under various setups. Also, we show that grounding the LLM's answer requires more than one searchable query, where an average of 3 queries outperforms human rewrites.

6/27/2024