MiTTenS: A Dataset for Evaluating Gender Mistranslation

Read original: arXiv:2401.06935 - Published 8/15/2024 by Kevin Robinson, Sneha Kudugunta, Romina Stella, Sunipa Dev, Jasmijn Bastings

MiTTenS: A Dataset for Evaluating Gender Mistranslation

Overview

The paper proposes MiTTenS, a dataset for evaluating misgendering in machine translation systems.
MiTTenS contains sentences in multiple languages with carefully curated gender information.
The dataset aims to help researchers assess and mitigate gender bias in translation models.

Plain English Explanation

The paper introduces MiTTenS, a new dataset designed to help evaluate how well machine translation systems handle gender-related information. When translating between languages, it's common for the gender of a person or pronoun to get lost or changed incorrectly. This is known as "misgendering" and can be problematic, especially when translating content involving marginalized groups.

MiTTenS provides a collection of sentences in multiple languages, each with carefully annotated gender information. By using this dataset, researchers can assess how well different translation models handle gender and identify areas for improvement. The goal is to help develop more gender-fair and inclusive machine translation technology.

Technical Explanation

The paper describes the creation of the MiTTenS dataset, which consists of sentences in 7 languages (English, German, Spanish, French, Italian, Polish, and Russian) with annotated gender information. The dataset contains "gender sets" - groups of sentences that differ only in the gender of the person or pronoun involved.

For example, a gender set might include the English sentences "He went to the store" and "She went to the store," allowing researchers to evaluate how a translation model handles the gender difference. The dataset covers a range of domains, from general conversations to professional settings, to provide a comprehensive test bed for evaluating misgendering in translations.

The authors also provide baseline results using several popular machine translation models, demonstrating the existence of gender bias and the need for further research in this area.

Critical Analysis

The MiTTenS dataset represents a valuable contribution to the field of fair and inclusive machine translation. By providing a standardized, multi-lingual benchmark, the authors enable researchers to systematically evaluate and compare the performance of different translation models in handling gender information.

One potential limitation is that the dataset may not capture all nuances of gender identity and expression, as it focuses primarily on the binary male/female distinction. Further research could explore expanding the dataset to include a wider range of gender identities and expressions.

Additionally, the paper does not delve deeply into the potential societal impacts of misgendering in machine translation. As these systems become more widely used, it is crucial to consider the implications for marginalized communities and ways to ensure equitable representation and treatment.

Conclusion

The MiTTenS dataset represents an important step towards developing more gender-fair and inclusive machine translation systems. By providing a standardized benchmark, the authors enable researchers to systematically evaluate and improve the ability of translation models to handle gender information accurately. As machine translation becomes increasingly ubiquitous, addressing issues of misgendering is crucial for promoting equity and representation in language technology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

MiTTenS: A Dataset for Evaluating Gender Mistranslation

Kevin Robinson, Sneha Kudugunta, Romina Stella, Sunipa Dev, Jasmijn Bastings

Translation systems, including foundation models capable of translation, can produce errors that result in gender mistranslation, and such errors can be especially harmful. To measure the extent of such potential harms when translating into and out of English, we introduce a dataset, MiTTenS, covering 26 languages from a variety of language families and scripts, including several traditionally under-represented in digital resources. The dataset is constructed with handcrafted passages that target known failure patterns, longer synthetically generated passages, and natural passages sourced from multiple domains. We demonstrate the usefulness of the dataset by evaluating both neural machine translation systems and foundation models, and show that all systems exhibit gender mistranslation and potential harm, even in high resource languages.

8/15/2024

Building Bridges: A Dataset for Evaluating Gender-Fair Machine Translation into German

Manuel Lardelli, Giuseppe Attanasio, Anne Lauscher

The translation of gender-neutral person-referring terms (e.g., the students) is often non-trivial. Translating from English into German poses an interesting case -- in German, person-referring nouns are usually gender-specific, and if the gender of the referent(s) is unknown or diverse, the generic masculine (die Studenten (m.)) is commonly used. This solution, however, reduces the visibility of other genders, such as women and non-binary people. To counteract gender discrimination, a societal movement towards using gender-fair language exists (e.g., by adopting neosystems). However, gender-fair German is currently barely supported in machine translation (MT), requiring post-editing or manual translations. We address this research gap by studying gender-fair language in English-to-German MT. Concretely, we enrich a community-created gender-fair language dictionary and sample multi-sentence test instances from encyclopedic text and parliamentary speeches. Using these novel resources, we conduct the first benchmark study involving two commercial systems and six neural MT models for translating words in isolation and natural contexts across two domains. Our findings show that most systems produce mainly masculine forms and rarely gender-neutral variants, highlighting the need for future research. We release code and data at https://github.com/g8a9/building-bridges-gender-fair-german-mt.

6/11/2024

Generating Gender Alternatives in Machine Translation

Sarthak Garg, Mozhdeh Gheini, Clara Emmanuel, Tatiana Likhomanenko, Qin Gao, Matthias Paulik

Machine translation (MT) systems often translate terms with ambiguous gender (e.g., English term the nurse) into the gendered form that is most prevalent in the systems' training data (e.g., enfermera, the Spanish term for a female nurse). This often reflects and perpetuates harmful stereotypes present in society. With MT user interfaces in mind that allow for resolving gender ambiguity in a frictionless manner, we study the problem of generating all grammatically correct gendered translation alternatives. We open source train and test datasets for five language pairs and establish benchmarks for this task. Our key technical contribution is a novel semi-supervised solution for generating alternatives that integrates seamlessly with standard MT models and maintains high performance without requiring additional components or increasing inference overhead.

7/31/2024

The power of Prompts: Evaluating and Mitigating Gender Bias in MT with LLMs

Aleix Sant, Carlos Escolano, Audrey Mash, Francesca De Luca Fornaciari, Maite Melero

This paper studies gender bias in machine translation through the lens of Large Language Models (LLMs). Four widely-used test sets are employed to benchmark various base LLMs, comparing their translation quality and gender bias against state-of-the-art Neural Machine Translation (NMT) models for English to Catalan (En $rightarrow$ Ca) and English to Spanish (En $rightarrow$ Es) translation directions. Our findings reveal pervasive gender bias across all models, with base LLMs exhibiting a higher degree of bias compared to NMT models. To combat this bias, we explore prompting engineering techniques applied to an instruction-tuned LLM. We identify a prompt structure that significantly reduces gender bias by up to 12% on the WinoMT evaluation dataset compared to more straightforward prompts. These results significantly reduce the gender bias accuracy gap between LLMs and traditional NMT systems.

7/29/2024