SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection

2404.03732

Published 4/8/2024 by Bradley P. Allen, Fina Polat, Paul Groth

SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection

Abstract

We describe the University of Amsterdam Intelligent Data Engineering Lab team's entry for the SemEval-2024 Task 6 competition. The SHROOM-INDElab system builds on previous work on using prompt programming and in-context learning with large language models (LLMs) to build classifiers for hallucination detection, and extends that work through the incorporation of context-specific definition of task, role, and target concept, and automated generation of examples for use in a few-shot prompting approach. The resulting system achieved fourth-best and sixth-best performance in the model-agnostic track and model-aware tracks for Task 6, respectively, and evaluation using the validation sets showed that the system's classification decisions were consistent with those of the crowd-sourced human labellers. We further found that a zero-shot approach provided better accuracy than a few-shot approach using automatically generated examples. Code for the system described in this paper is available on Github.

Create account to get full access

Overview

This paper presents the SHROOM-INDElab's approach to SemEval-2024 Task 6, which focuses on zero- and few-shot language model-based classification for detecting hallucination in text.
The task involves identifying whether a given text passage contains hallucinated content, which refers to the generation of factually incorrect information.
The SHROOM-INDElab team explored the use of large language models (LLMs) in a zero- and few-shot setting to tackle this challenge.

Plain English Explanation

The paper describes a research project aimed at detecting hallucination in text using large language models in a zero- and few-shot learning approach. Hallucination refers to the generation of information that is factually incorrect or inconsistent with the real world.

The researchers from the SHROOM-INDElab participated in a competition called SemEval-2024 Task 6, where the goal was to develop a system that could accurately identify whether a given text passage contained hallucinated content. The team explored the use of powerful language models that can be trained on limited data (zero- and few-shot learning) to tackle this challenge.

The key idea is to leverage the capabilities of large language models, which have been trained on vast amounts of data, to recognize when a text passage contains information that deviates from what is typically seen in real-world contexts. By using these models in a zero- or few-shot setting, the researchers aimed to create a system that can quickly adapt to new tasks and data, without requiring extensive retraining.

Technical Explanation

The SHROOM-INDElab's approach to SemEval-2024 Task 6 involved the use of large language models in a zero- and few-shot learning setting to classify text passages as containing hallucinated content or not.

The researchers experimented with different LLM-based architectures, including prompting techniques and fine-tuning strategies, to tackle the task. By leveraging the rich semantic knowledge and language understanding capabilities of these models, the team sought to develop a system that could accurately identify hallucination without requiring extensive training on labeled data.

The key technical innovations explored in this work include:

Prompt engineering: The team investigated various prompting approaches to effectively convey the hallucination detection task to the LLMs, allowing the models to leverage their natural language understanding capabilities.
Few-shot fine-tuning: In addition to zero-shot approaches, the researchers explored fine-tuning the LLMs on a small amount of labeled data to further enhance their performance on the task.
Ensemble methods: The team experimented with combining the outputs of multiple LLM-based classifiers to improve the overall robustness and accuracy of the hallucination detection system.

Through these technical innovations, the SHROOM-INDElab aimed to push the boundaries of what is possible with LLM-based approaches in the context of hallucination detection and contribute to the broader understanding of how large language models can be leveraged for challenging natural language processing tasks.

Critical Analysis

The paper presents a promising approach to hallucination detection using LLM-based techniques in a zero- and few-shot learning setting. The researchers' focus on prompt engineering and ensemble methods is a valuable contribution, as it highlights the importance of carefully designing the interface between the task and the language model to maximize its performance.

However, the paper does not provide a detailed evaluation of the limitations and potential caveats of the proposed approach. For instance, it would be helpful to understand how the system performs on different types of hallucinated content, such as those involving factual inaccuracies, logical inconsistencies, or contextual inappropriateness. Additionally, the paper could have discussed the potential biases or pitfalls that may arise when relying on LLMs for this type of task, and how the researchers addressed these concerns.

Furthermore, the paper could have explored the generalizability of the approach beyond the SemEval-2024 Task 6 dataset. It would be interesting to see how the SHROOM-INDElab's techniques perform on other hallucination detection benchmarks or in real-world applications where the distribution of hallucinated content may differ.

Despite these potential areas for improvement, the paper presents a valuable contribution to the growing body of research on leveraging large language models for challenging natural language processing tasks, such as hallucination detection. The team's focus on zero- and few-shot learning approaches is particularly relevant, as it speaks to the desire for more sample-efficient and adaptable AI systems.

Conclusion

The SHROOM-INDElab's work on SemEval-2024 Task 6 showcases the potential of using large language models in a zero- and few-shot learning setting for the task of hallucination detection. By exploring prompt engineering and ensemble techniques, the researchers demonstrated how these powerful models can be effectively leveraged to identify factually incorrect information in text.

While the paper could have delved deeper into the limitations and broader implications of the approach, it nonetheless contributes to the ongoing efforts to develop more robust and adaptable natural language processing systems. As the field of hallucination detection continues to evolve, the SHROOM-INDElab's work provides valuable insights and a foundation for further exploration in this important area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

$AILS-NTUA at SemEval-2024 Task 6: Efficient model tuning for hallucination detection and analysis$

AILS-NTUA at SemEval-2024 Task 6: Efficient model tuning for hallucination detection and analysis

Natalia Grigoriadou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou

In this paper, we present our team's submissions for SemEval-2024 Task-6 - SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. The participants were asked to perform binary classification to identify cases of fluent overgeneration hallucinations. Our experimentation included fine-tuning a pre-trained model on hallucination detection and a Natural Language Inference (NLI) model. The most successful strategy involved creating an ensemble of these models, resulting in accuracy rates of 77.8% and 79.9% on model-agnostic and model-aware datasets respectively, outperforming the organizers' baseline and achieving notable results when contrasted with the top-performing results in the competition, which reported accuracies of 84.7% and 81.3% correspondingly.

4/15/2024

cs.CL

SLPL SHROOM at SemEval-2024 Task 06: A comprehensive study on models ability to detect hallucination

Pouya Fallah, Soroush Gooran, Mohammad Jafarinasab, Pouya Sadeghi, Reza Farnia, Amirreza Tarabkhah, Zainab Sadat Taghavi, Hossein Sameti

Language models, particularly generative models, are susceptible to hallucinations, generating outputs that contradict factual knowledge or the source text. This study explores methods for detecting hallucinations in three SemEval-2024 Task 6 tasks: Machine Translation, Definition Modeling, and Paraphrase Generation. We evaluate two methods: semantic similarity between the generated text and factual references, and an ensemble of language models that judge each other's outputs. Our results show that semantic similarity achieves moderate accuracy and correlation scores in trial data, while the ensemble method offers insights into the complexities of hallucination detection but falls short of expectations. This work highlights the challenges of hallucination detection and underscores the need for further research in this critical area.

4/10/2024

cs.CL cs.AI

SemEval-2024 Shared Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

Timothee Mickus, Elaine Zosa, Ra'ul V'azquez, Teemu Vahtola, Jorg Tiedemann, Vincent Segonne, Alessandro Raganato, Marianna Apidianaki

This paper presents the results of the SHROOM, a shared task focused on detecting hallucinations: outputs from natural language generation (NLG) systems that are fluent, yet inaccurate. Such cases of overgeneration put in jeopardy many NLG applications, where correctness is often mission-critical. The shared task was conducted with a newly constructed dataset of 4000 model outputs labeled by 5 annotators each, spanning 3 NLP tasks: machine translation, paraphrase generation and definition modeling. The shared task was tackled by a total of 58 different users grouped in 42 teams, out of which 27 elected to write a system description paper; collectively, they submitted over 300 prediction sets on both tracks of the shared task. We observe a number of key trends in how this approach was tackled -- many participants rely on a handful of model, and often rely either on synthetic data for fine-tuning or zero-shot prompting strategies. While a majority of the teams did outperform our proposed baseline system, the performances of top-scoring systems are still consistent with a random handling of the more challenging items.

4/1/2024

cs.CL

SmurfCat at SemEval-2024 Task 6: Leveraging Synthetic Data for Hallucination Detection

Elisei Rykov, Yana Shishkina, Kseniia Petrushina, Kseniia Titova, Sergey Petrakov, Alexander Panchenko

In this paper, we present our novel systems developed for the SemEval-2024 hallucination detection task. Our investigation spans a range of strategies to compare model predictions with reference standards, encompassing diverse baselines, the refinement of pre-trained encoders through supervised learning, and an ensemble approaches utilizing several high-performing models. Through these explorations, we introduce three distinct methods that exhibit strong performance metrics. To amplify our training data, we generate additional training samples from unlabelled training subset. Furthermore, we provide a detailed comparative analysis of our approaches. Notably, our premier method achieved a commendable 9th place in the competition's model-agnostic track and 17th place in model-aware track, highlighting its effectiveness and potential.

4/10/2024

cs.CL cs.AI