SLIDE: Reference-free Evaluation for Machine Translation using a Sliding Document Window

arXiv:2309.08832

Published 4/3/2024 by Vikas Raunak, Tom Kocmi, Matt Post

Abstract

Reference-based metrics that operate at the sentence-level typically outperform quality estimation metrics, which have access only to the source and system output. This is unsurprising, since references resolve ambiguities that may be present in the source. In this paper, we investigate whether additional source context can effectively substitute for a reference. We present a metric named SLIDE (SLIding Document Evaluator), which operates on blocks of sentences. SLIDE leverages a moving window that slides over each document in the test set, feeding each chunk of sentences into an unmodified, off-the-shelf quality estimation model. We find that SLIDE obtains significantly higher pairwise system accuracy than its sentence-level baseline, in some cases even eliminating the gap with reference-based metrics. This suggests that source context may provide the same information as a human reference in disambiguating source ambiguities. This finding is especially pertinent for reference-free document-level evaluation, wherein SLIDE could provide higher-quality pairwise system assessments while only requiring document boundary annotations.

Overview

Sentence-level reference-based metrics typically outperform quality estimation metrics, which see only the source and the system output, because references help resolve ambiguities in the source text.
This paper investigates whether additional source context can substitute for a reference, allowing for high-quality reference-free document-level evaluation.
The authors present a new metric called SLIDE (SLIding Document Evaluator) that operates on blocks of sentences rather than individual sentences.

Plain English Explanation

When evaluating the quality of machine translations, researchers have found that metrics that compare the translated text to a human-written reference often perform better than metrics that only look at the source text and the machine translation. This makes sense, as the reference can help resolve any ambiguities or uncertainties in the original source text.

However, relying on a reference has its own challenges: a human-written translation of the text must be available, which isn't always the case. This paper explores whether we can get the same benefits as using a reference by instead looking at the broader context around the text being translated.

The researchers developed a new metric called SLIDE that operates on blocks of sentences rather than individual sentences. SLIDE uses a "sliding window" to analyze chunks of the document, feeding those chunks into an off-the-shelf quality estimation model. The key idea is that the broader context provided by looking at sentence blocks can potentially substitute for the information a reference would provide, helping to resolve ambiguities in the source text.

Technical Explanation

The paper presents a new metric called SLIDE (SLIding Document Evaluator) that aims to leverage source context to achieve reference-level performance without requiring a human reference translation.

SLIDE operates by breaking the document into overlapping blocks of sentences, which are then fed into an off-the-shelf quality estimation model. This allows SLIDE to capture contextual information beyond just the individual sentence being evaluated.
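The windowing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `window` and `stride` parameters, joining sentences with a space, and averaging the chunk scores into one document score are all assumptions made for the example. The `qe_model` argument stands in for any off-the-shelf quality estimation model that maps a (source, translation) pair to a number; `toy_qe` is a hypothetical placeholder with no real predictive value.

```python
from typing import Callable, List, Sequence


def window_starts(n: int, window: int, stride: int) -> List[int]:
    """Start indices of the chunks that slide over a document of n sentences."""
    if n < 1 or window < 1 or stride < 1:
        raise ValueError("n, window and stride must all be positive")
    if n <= window:
        return [0]  # document shorter than the window: a single chunk
    starts = list(range(0, n - window + 1, stride))
    if starts[-1] + window < n:
        # Assumption: add one final chunk so trailing sentences are not dropped.
        starts.append(n - window)
    return starts


def slide_score(
    src_sents: Sequence[str],
    hyp_sents: Sequence[str],
    qe_model: Callable[[str, str], float],
    window: int = 3,
    stride: int = 1,
    sep: str = " ",
) -> float:
    """Score one document by averaging QE scores over sliding chunks."""
    if len(src_sents) != len(hyp_sents):
        raise ValueError("source and hypothesis must be sentence-aligned")
    scores = []
    for start in window_starts(len(src_sents), window, stride):
        # Each chunk is passed to the unmodified QE model as one long "sentence".
        src_chunk = sep.join(src_sents[start:start + window])
        hyp_chunk = sep.join(hyp_sents[start:start + window])
        scores.append(qe_model(src_chunk, hyp_chunk))
    # Assumption: chunk scores are aggregated by a plain mean.
    return sum(scores) / len(scores)


def toy_qe(src: str, hyp: str) -> float:
    """Hypothetical stand-in for a QE model: a crude length ratio in [0, 1]."""
    return min(len(src), len(hyp)) / max(len(src), len(hyp), 1)


if __name__ == "__main__":
    src = ["Er ging zur Bank.", "Sie war geschlossen.", "Er kam morgen wieder."]
    hyp = ["He went to the bank.", "It was closed.", "He came back tomorrow."]
    print(slide_score(src, hyp, toy_qe, window=2, stride=1))
```

With `window=1` and `stride=1` the sketch reduces to ordinary sentence-level scoring, which makes it easy to compare a chunked score against its sentence-level baseline using the same underlying model.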

The authors evaluate SLIDE against both sentence-level quality estimation baselines and reference-based metrics. They find that SLIDE obtains significantly higher pairwise system accuracy than its sentence-level baseline, and in some cases even eliminates the gap with reference-based metrics. This suggests that the additional source context provided by the sliding window approach can effectively substitute for a human reference in disambiguating the source text.
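The abstract reports results as pairwise system accuracy. The summary does not spell out how that figure is computed, so the following is a sketch of one common formulation, assumed here: over all pairs of MT systems, count how often the metric ranks the pair in the same order as the human scores. The function name, the dictionary inputs, and the choice to count ties as disagreements are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict


def pairwise_accuracy(metric: Dict[str, float], human: Dict[str, float]) -> float:
    """Fraction of system pairs where metric and human scores agree on the order."""
    systems = sorted(metric.keys() & human.keys())
    pairs = list(combinations(systems, 2))
    if not pairs:
        raise ValueError("need at least two systems scored by both metric and humans")
    agree = 0
    for a, b in pairs:
        metric_delta = metric[a] - metric[b]
        human_delta = human[a] - human[b]
        # Same sign means the same system is preferred by both.
        # Assumption: a tie on either side counts as a disagreement.
        if metric_delta * human_delta > 0:
            agree += 1
    return agree / len(pairs)


if __name__ == "__main__":
    # Made-up scores for three hypothetical systems.
    metric_scores = {"sysA": 0.90, "sysB": 0.50, "sysC": 0.95}
    human_scores = {"sysA": 80.0, "sysB": 60.0, "sysC": 75.0}
    print(pairwise_accuracy(metric_scores, human_scores))
```

In this formulation only the ordering of systems matters, not the scale of the scores, so a chunk-based metric and a sentence-level metric can be compared directly even if their raw score ranges differ.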

The authors note that this finding is particularly relevant for reference-free document-level evaluation, where SLIDE could provide higher-quality assessments of machine translation systems without requiring the overhead of obtaining human reference translations.

Critical Analysis

The paper provides a compelling demonstration that leveraging source context can improve the performance of reference-free quality estimation, potentially eliminating the need for human references in some cases.

However, the authors acknowledge that their experiments are limited to a specific dataset and quality estimation model. Further research is needed to assess how generalizable these findings are, and whether SLIDE maintains its advantages across different domains, languages, and MT system architectures.

Additionally, the authors do not provide a detailed analysis of the types of ambiguities or errors that SLIDE is able to resolve compared to sentence-level metrics. A deeper understanding of the specific failures of sentence-level approaches that SLIDE addresses would strengthen the technical contribution.

Finally, the authors note that SLIDE still requires document boundary annotations, which may not always be available. Exploring methods to relax this requirement or automatically infer document structure could further improve the practical applicability of the approach.

Conclusion

This paper presents a novel approach to reference-free document-level machine translation evaluation through the SLIDE metric. By leveraging broader source context beyond individual sentences, SLIDE is able to significantly outperform traditional sentence-level quality estimation and in some cases even match the performance of reference-based metrics.

These findings suggest that source context can effectively substitute for human references in resolving ambiguities, with important implications for reference-free evaluation. If SLIDE's advantages hold across a wider range of scenarios, it could provide a more practical and cost-effective alternative to reference-based evaluation, benefiting both MT research and deployment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
