Building a Japanese Document-Level Relation Extraction Dataset Assisted by Cross-Lingual Transfer

2404.16506

Published 4/26/2024 by Youmi Ma, An Wang, Naoaki Okazaki


Abstract

Document-level Relation Extraction (DocRE) is the task of extracting all semantic relationships from a document. While studies have been conducted on English DocRE, limited attention has been given to DocRE in non-English languages. This work delves into effectively utilizing existing English resources to promote DocRE studies in non-English languages, with Japanese as the representative case. As an initial attempt, we construct a dataset by transferring an English dataset to Japanese. However, models trained on such a dataset suffer from low recalls. We investigate the error cases and attribute the failure to different surface structures and semantics of documents translated from English and those written by native speakers. We thus switch to explore if the transferred dataset can assist human annotation on Japanese documents. In our proposal, annotators edit relation predictions from a model trained on the transferred dataset. Quantitative analysis shows that relation recommendations suggested by the model help reduce approximately 50% of the human edit steps compared with the previous approach. Experiments quantify the performance of existing DocRE models on our collected dataset, portraying the challenges of Japanese and cross-lingual DocRE.


Overview

This paper focuses on the task of Document-level Relation Extraction (DocRE), which involves extracting all semantic relationships from a document.
DocRE research has focused primarily on English. The authors explore how existing English resources can be used effectively to promote DocRE studies in non-English languages, with Japanese as the representative case.
The authors construct a dataset by transferring an English dataset to Japanese, but models trained on this dataset suffer from low recall.
To address this, the authors explore whether the transferred dataset can assist human annotation on Japanese documents, where annotators edit relation predictions from a model trained on the transferred dataset.

Plain English Explanation

The paper examines the challenge of extracting meaningful relationships from documents, a task known as Document-level Relation Extraction (DocRE). While DocRE research has focused mainly on English, the authors investigate how to leverage existing English resources to enable DocRE in other languages, specifically Japanese.

As an initial step, the researchers create a Japanese DocRE dataset by translating an existing English dataset. However, they find that models trained on this translated dataset have low recall: they miss many of the relationships that are actually present. The authors attribute this to differences between the surface structure and semantics of documents written by native Japanese speakers versus those translated from English.

To overcome this challenge, the researchers explore a new approach where human annotators edit the relationship predictions made by a model trained on the translated dataset. Their analysis shows that these model-generated suggestions reduce the number of human edit steps by approximately 50% compared with the previous approach.

The paper also includes experiments that assess the performance of existing DocRE models on the collected Japanese dataset, highlighting the unique challenges of DocRE in this language and across languages (known as cross-lingual DocRE).

Technical Explanation

The authors first construct a Japanese DocRE dataset by transferring an existing English dataset to Japanese using machine translation. However, models trained on this translated dataset suffer from low recall, meaning they fail to identify many of the actual relationships in the data.
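To make the notion of low recall concrete, the following is a minimal sketch of how precision, recall, and F1 can be computed over sets of relation triples. The `Relation` type, the function name, and the toy triples are illustrative assumptions, not the paper's evaluation code or data.

```python
from typing import NamedTuple, Set, Tuple


class Relation(NamedTuple):
    """A hypothetical (head entity, relation label, tail entity) triple."""
    head: str
    relation: str
    tail: str


def precision_recall_f1(
    predicted: Set[Relation], gold: Set[Relation]
) -> Tuple[float, float, float]:
    """Score predicted triples against gold triples by exact match."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy data (an assumption for illustration): four gold relations in a document.
gold = {
    Relation("Tokyo", "country", "Japan"),
    Relation("Tokyo", "capital_of", "Japan"),
    Relation("Kyoto", "country", "Japan"),
    Relation("Osaka", "country", "Japan"),
}
# A model that predicts only one relation, and gets it right.
predicted = {Relation("Tokyo", "country", "Japan")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the single prediction is correct, so precision is high, but three of the four gold relations are missed, so recall is low. This is the failure pattern the summary describes for models trained on the translated dataset.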

To investigate this issue, the authors analyze the error cases and find that the low performance is due to differences in the surface structures and semantics of documents written by native Japanese speakers versus those translated from English. The translated documents often have more complex sentence structures and different ways of expressing relationships compared to original Japanese texts.

To address this challenge, the authors explore an approach where human annotators edit relation predictions made by a model trained on the transferred dataset. They find that these model-generated suggestions reduce the number of human edit steps by approximately 50% compared to the previous approach of having annotators create the relationships from scratch.
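One simple way to picture "edit steps" is sketched below, under the assumption that each relation an annotator adds or deletes counts as one step. The paper's exact definition may differ, and the function name and toy triples are hypothetical, not taken from the collected dataset.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation label, tail entity)


def edit_steps(start: Set[Triple], gold: Set[Triple]) -> int:
    """Count additions plus deletions needed to turn `start` into `gold`.

    Assumption: one added or removed triple equals one edit step.
    """
    additions = gold - start   # relations the annotator must add
    deletions = start - gold   # wrong suggestions the annotator must remove
    return len(additions) + len(deletions)


# Toy gold annotation for one document (illustrative values only).
gold: Set[Triple] = {
    ("Tokyo", "country", "Japan"),
    ("Tokyo", "capital_of", "Japan"),
    ("Kyoto", "country", "Japan"),
    ("Osaka", "country", "Japan"),
}

# Hypothetical model recommendations: two correct, one wrong.
recommendations: Set[Triple] = {
    ("Tokyo", "country", "Japan"),
    ("Kyoto", "country", "Japan"),
    ("Osaka", "capital_of", "Japan"),
}

steps_from_scratch = edit_steps(set(), gold)
steps_from_recommendations = edit_steps(recommendations, gold)
reduction = 1 - steps_from_recommendations / steps_from_scratch

print(steps_from_scratch, steps_from_recommendations, f"{reduction:.0%}")
```

Under this counting scheme, recommendations only save effort when the correct suggestions outnumber the wrong ones, since every wrong suggestion costs a deletion. The toy numbers here are not meant to reproduce the roughly 50% reduction reported in the paper.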

The paper also includes experiments that quantify the performance of existing DocRE models on the collected Japanese dataset. These results highlight the difficulties of DocRE in the Japanese language and the challenges of cross-lingual DocRE, where models must handle documents in different languages.

Critical Analysis

The paper presents a valuable contribution to the field of DocRE by exploring the challenges of applying this task to non-English languages, specifically Japanese. The authors' insights into the differences between translated and native-written documents provide important context for understanding the limitations of the transferred dataset approach.

However, the paper could be strengthened by a more in-depth discussion of the potential biases or errors introduced by the machine translation process. It would also be interesting to see the authors explore other techniques for cross-lingual knowledge transfer, such as multilingual pre-training or zero-shot learning, to improve the performance of DocRE models on Japanese data.

Additionally, the authors could delve deeper into the specific types of relationships that proved most challenging to extract in the Japanese dataset. This level of granularity could provide valuable guidance for future work in improving DocRE models for non-English languages.

Conclusion

This paper tackles the important challenge of expanding Document-level Relation Extraction (DocRE) research beyond the English language. By focusing on Japanese as a representative case, the authors demonstrate the limitations of simply transferring existing English datasets and models to non-English contexts.

The authors' proposed approach of using model-generated suggestions to assist human annotators represents a promising direction for leveraging existing resources to accelerate DocRE studies in underrepresented languages. The insights gained from this work can inform future efforts to develop robust, multilingual DocRE capabilities and promote more inclusive advancements in natural language processing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
