AILS-NTUA at SemEval-2024 Task 6: Efficient model tuning for hallucination detection and analysis

2404.01210

Published 4/15/2024 by Natalia Grigoriadou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou

Abstract

In this paper, we present our team's submissions for SemEval-2024 Task-6 - SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. The participants were asked to perform binary classification to identify cases of fluent overgeneration hallucinations. Our experimentation included fine-tuning a pre-trained model on hallucination detection and a Natural Language Inference (NLI) model. The most successful strategy involved creating an ensemble of these models, resulting in accuracy rates of 77.8% and 79.9% on model-agnostic and model-aware datasets respectively, outperforming the organizers' baseline and achieving notable results when contrasted with the top-performing results in the competition, which reported accuracies of 84.7% and 81.3% correspondingly.

Overview

This paper presents the AILS-NTUA team's submissions to SemEval-2024 Task 6 (SHROOM), a shared task on hallucinations and related observable overgeneration mistakes.
The task is binary classification: deciding whether a model output is a fluent overgeneration hallucination, that is, text that reads well but is inaccurate.
The team fine-tuned a pre-trained model for hallucination detection and a Natural Language Inference (NLI) model, and found that an ensemble of these models performed best.

Plain English Explanation

The paper describes the work done by the AILS-NTUA team for a competition called SemEval-2024 Task 6, also known as SHROOM. The task is to detect hallucinations: outputs from AI text-generation systems that sound fluent but are inaccurate. For each output, a system must give a yes-or-no answer on whether it is a hallucination. The team fine-tuned two kinds of existing models for this job, one for hallucination detection and one for Natural Language Inference (NLI), and got their best results by combining the models' predictions in an ensemble.

Technical Explanation

The paper describes the team's submissions to SHROOM, which frames hallucination detection as binary classification over outputs that are fluent but inaccurate. The team experimented with fine-tuning a pre-trained model for hallucination detection and a Natural Language Inference (NLI) model. Their most successful strategy was an ensemble of these models, which reached accuracies of 77.8% on the model-agnostic dataset and 79.9% on the model-aware dataset. Both results outperform the organizers' baseline; the top-performing systems in the competition reported 84.7% and 81.3% respectively.
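The abstract reports that an ensemble of the fine-tuned models worked best, but this summary does not say how the models' predictions were combined. The sketch below illustrates one common option, weighted soft voting, purely as an assumption: the function names `soft_vote` and `accuracy`, the weights, the 0.5 threshold, and all scores are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence


def soft_vote(
    model_probs: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
    threshold: float = 0.5,
) -> List[int]:
    """Combine per-model P(hallucination) scores by weighted averaging.

    model_probs[m][i] is model m's probability that example i is a
    hallucination. Returns 1 (hallucination) or 0 (not) per example.
    Soft voting is an assumed scheme, not the paper's confirmed method.
    """
    if not model_probs:
        raise ValueError("need at least one model")
    n_examples = len(model_probs[0])
    if any(len(p) != n_examples for p in model_probs):
        raise ValueError("all models must score the same examples")
    if weights is None:
        weights = [1.0] * len(model_probs)
    if len(weights) != len(model_probs):
        raise ValueError("need exactly one weight per model")

    total = sum(weights)
    preds = []
    for i in range(n_examples):
        # Weighted mean of the models' scores for example i.
        avg = sum(w * p[i] for w, p in zip(weights, model_probs)) / total
        preds.append(1 if avg >= threshold else 0)
    return preds


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Made-up scores for four examples (illustrative only, not from the paper).
detector_probs = [0.9, 0.2, 0.6, 0.1]  # hypothetical fine-tuned detector
nli_probs = [0.7, 0.4, 0.3, 0.2]       # hypothetical NLI-based score
gold = [1, 0, 1, 0]

preds = soft_vote([detector_probs, nli_probs])
print(preds, accuracy(preds, gold))
```

On this toy data, equal weights miss the third example because the two models disagree and their average falls below the threshold; passing `weights=[3, 1]` to favor the detector recovers it. In practice such weights and the threshold would be tuned on validation data.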

The paper also discusses related work on NLP hallucinations, SHROOM at SemEval-2024 Task 6, SmurfCat at SemEval-2024 Task 6, and MetaCheckGPT: A Multi-Task Hallucination Detector Using LLM.

Critical Analysis

The paper does not provide a detailed critical analysis of the proposed approach or discuss any potential limitations. The reported numbers do show where the ensemble stands: it outperforms the organizers' baseline but trails the top-performing systems by 6.9 points on the model-agnostic dataset (77.8% vs. 84.7%) and by 1.4 points on the model-aware dataset (79.9% vs. 81.3%). It would be valuable to understand the specific challenges the team faced in tuning models for hallucination detection and how they addressed them. A comparison with the other methods mentioned in the related work could also provide more insight into the strengths and weaknesses of the proposed solution.

Conclusion

The AILS-NTUA team presented their approach for SemEval-2024 Task 6, which centers on fine-tuning a pre-trained model for hallucination detection and an NLI model, then combining them in an ensemble. The ensemble outperformed the organizers' baseline on both the model-agnostic and model-aware datasets, though it did not match the top-performing systems. A more detailed discussion of the technical approach and potential areas for improvement would be valuable for readers interested in hallucination detection in natural language processing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SemEval-2024 Shared Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

Timothee Mickus, Elaine Zosa, Raúl Vázquez, Teemu Vahtola, Jörg Tiedemann, Vincent Segonne, Alessandro Raganato, Marianna Apidianaki

This paper presents the results of SHROOM, a shared task focused on detecting hallucinations: outputs from natural language generation (NLG) systems that are fluent, yet inaccurate. Such cases of overgeneration put in jeopardy many NLG applications, where correctness is often mission-critical. The shared task was conducted with a newly constructed dataset of 4000 model outputs labeled by 5 annotators each, spanning 3 NLP tasks: machine translation, paraphrase generation and definition modeling. The shared task was tackled by a total of 58 different users grouped in 42 teams, out of which 27 elected to write a system description paper; collectively, they submitted over 300 prediction sets on both tracks of the shared task. We observe a number of key trends in how this task was tackled -- many participants rely on a handful of models, and often rely either on synthetic data for fine-tuning or zero-shot prompting strategies. While a majority of the teams did outperform our proposed baseline system, the performances of top-scoring systems are still consistent with a random handling of the more challenging items.

4/1/2024

cs.CL

SLPL SHROOM at SemEval-2024 Task 06: A comprehensive study on models ability to detect hallucination

Pouya Fallah, Soroush Gooran, Mohammad Jafarinasab, Pouya Sadeghi, Reza Farnia, Amirreza Tarabkhah, Zainab Sadat Taghavi, Hossein Sameti

Language models, particularly generative models, are susceptible to hallucinations, generating outputs that contradict factual knowledge or the source text. This study explores methods for detecting hallucinations in three SemEval-2024 Task 6 tasks: Machine Translation, Definition Modeling, and Paraphrase Generation. We evaluate two methods: semantic similarity between the generated text and factual references, and an ensemble of language models that judge each other's outputs. Our results show that semantic similarity achieves moderate accuracy and correlation scores in trial data, while the ensemble method offers insights into the complexities of hallucination detection but falls short of expectations. This work highlights the challenges of hallucination detection and underscores the need for further research in this critical area.

4/10/2024

cs.CL cs.AI

SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection

Bradley P. Allen, Fina Polat, Paul Groth

We describe the University of Amsterdam Intelligent Data Engineering Lab team's entry for the SemEval-2024 Task 6 competition. The SHROOM-INDElab system builds on previous work on using prompt programming and in-context learning with large language models (LLMs) to build classifiers for hallucination detection, and extends that work through the incorporation of context-specific definitions of task, role, and target concept, and automated generation of examples for use in a few-shot prompting approach. The resulting system achieved fourth-best and sixth-best performance in the model-agnostic and model-aware tracks for Task 6, respectively, and evaluation using the validation sets showed that the system's classification decisions were consistent with those of the crowd-sourced human labellers. We further found that a zero-shot approach provided better accuracy than a few-shot approach using automatically generated examples. Code for the system described in this paper is available on GitHub.

4/8/2024

cs.CL cs.AI

SmurfCat at SemEval-2024 Task 6: Leveraging Synthetic Data for Hallucination Detection

Elisei Rykov, Yana Shishkina, Kseniia Petrushina, Kseniia Titova, Sergey Petrakov, Alexander Panchenko

In this paper, we present our novel systems developed for the SemEval-2024 hallucination detection task. Our investigation spans a range of strategies to compare model predictions with reference standards, encompassing diverse baselines, the refinement of pre-trained encoders through supervised learning, and ensemble approaches utilizing several high-performing models. Through these explorations, we introduce three distinct methods that exhibit strong performance metrics. To amplify our training data, we generate additional training samples from the unlabelled training subset. Furthermore, we provide a detailed comparative analysis of our approaches. Notably, our premier method achieved a commendable 9th place in the competition's model-agnostic track and 17th place in the model-aware track, highlighting its effectiveness and potential.

4/10/2024

cs.CL cs.AI