ACE-2005-PT: Corpus for Event Extraction in Portuguese

Read original: arXiv:2408.16928 - Published 9/2/2024 by Luís Filipe Cunha, Purificação Silvano, Ricardo Campos, Alípio Jorge

Overview

The paper presents a new corpus for event extraction in Portuguese, called ACE-2005-PT.
The corpus is an automatic translation of the English ACE-2005 corpus into Portuguese, in European and Brazilian variants, with the original annotations carried over by an alignment pipeline.
The corpus includes annotations for event mentions, their arguments, and other related entities.

Plain English Explanation

The researchers have created a new dataset called ACE-2005-PT that can be used to train and evaluate systems for extracting events from Portuguese text. This dataset is based on the existing ACE-2005 corpus, which was originally developed for English. The team translated the documents from English to Portuguese with automatic translators, then used an alignment pipeline to locate each original annotation in the translated text: the events mentioned, the participants or "arguments" of those events, and other relevant entities.

Having a high-quality, annotated dataset like this is important for advancing natural language processing capabilities in Portuguese. It provides a common benchmark that researchers and developers can use to measure progress on the task of event extraction - identifying who did what, when, and where from textual data. This can have applications in areas like question answering, information retrieval, and knowledge graph construction.

Technical Explanation

The paper describes the process of creating the ACE-2005-PT corpus. The starting point was the existing ACE-2005 corpus, which contains English newswire, broadcast news, and discussion forum documents annotated for various types of events, entities, and relations. The researchers translated these documents into Portuguese with automatic translators. Because a multi-word annotation rarely lands at the same position after translation, they built an alignment pipeline to find each annotation in the translated sentence. The pipeline combines lemmatization, fuzzy matching, synonym matching, multiple translations, and a BERT-based word aligner.
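The paper's abstract lists fuzzy matching among the techniques used to locate an annotation inside the translated sentence. As a rough illustration of that one idea, and not the authors' implementation, the sketch below slides a window over the Portuguese tokens and keeps the window most similar to a separately translated annotation. The function name `fuzzy_align`, the 0.8 threshold, and the example sentence are assumptions made for this sketch; it relies only on Python's standard `difflib` module.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def fuzzy_align(
    candidate: str, target_tokens: List[str], threshold: float = 0.8
) -> Optional[Tuple[int, int]]:
    """Return (start, end) token indices of the window in target_tokens that is
    most similar to candidate, or None if no window reaches the threshold."""
    cand = candidate.lower()
    n = len(candidate.split())
    best_score, best_span = 0.0, None
    # Try windows one token shorter than, equal to, and one token longer than
    # the candidate, since translation can merge or split words.
    for size in range(max(1, n - 1), n + 2):
        for start in range(0, len(target_tokens) - size + 1):
            window = " ".join(target_tokens[start:start + size]).lower()
            score = SequenceMatcher(None, cand, window).ratio()
            if score > best_score:
                best_score, best_span = score, (start, start + size)
    return best_span if best_score >= threshold else None


# Hypothetical example: the annotation was translated on its own as
# "primeiro ministro", but the translated sentence uses the hyphenated form.
tokens = ["O", "primeiro-ministro", "visitou", "Lisboa", "ontem"]
span = fuzzy_align("primeiro ministro", tokens)
no_span = fuzzy_align("banana", tokens)
```

Here `span` identifies the single token "primeiro-ministro", while `no_span` is `None` because nothing in the sentence is similar enough. A real pipeline would fall back to the other listed techniques, such as a word aligner, when fuzzy matching finds nothing.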

The resulting ACE-2005-PT corpus contains 329 Portuguese documents with 4,589 event mentions annotated. The events are categorized into 33 different types, such as "Attack", "Meet", and "Transfer-Ownership". For each event, the corpus also identifies the participants or "arguments" (e.g. the perpetrator, target, time, and location of an attack event).
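To make the annotation structure concrete, here is a minimal sketch of how one event mention with its trigger and arguments could be represented in memory. The class names, the character-offset convention, the role labels, and the example sentence are all assumptions chosen for illustration; they are not the corpus's actual file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """A piece of text with character offsets (end is exclusive)."""
    text: str
    start: int
    end: int


@dataclass
class Argument:
    role: str  # illustrative role label, e.g. "Attacker" or "Target"
    span: Span


@dataclass
class EventMention:
    event_type: str  # e.g. "Attack"
    trigger: Span    # the word that most clearly expresses the event
    arguments: List[Argument] = field(default_factory=list)


def span_of(sentence: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase in sentence and wrap it as a Span."""
    start = sentence.index(phrase)
    return Span(phrase, start, start + len(phrase))


# Made-up sentence: "Rebels attacked the village on Tuesday."
sentence = "Rebeldes atacaram a aldeia na terça-feira."
event = EventMention(
    event_type="Attack",
    trigger=span_of(sentence, "atacaram"),
    arguments=[
        Argument("Attacker", span_of(sentence, "Rebeldes")),
        Argument("Target", span_of(sentence, "a aldeia")),
        Argument("Time", span_of(sentence, "terça-feira")),
    ],
)
```

Storing offsets alongside the text is what makes translation hard: every offset has to be recomputed for the Portuguese sentence, which is the job of the alignment step.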

To measure how well the alignment works, a linguist manually aligned a subset of the annotations, and the pipeline's output was compared against this manual subset. The pipeline achieved an exact match score of 70.55% and a relaxed match score of 87.55%.
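The paper's abstract reports the alignment quality as exact and relaxed match scores against a manually aligned subset. The sketch below shows one plausible way to compute such scores. It assumes that "exact" means the predicted span equals the manual span and that "relaxed" means the two spans overlap by at least one token; the paper's precise definitions may differ, and the function names and sample spans are hypothetical.

```python
from typing import List, Optional, Tuple

TokenSpan = Tuple[int, int]  # (start, end) token offsets, end exclusive


def overlaps(a: TokenSpan, b: TokenSpan) -> bool:
    """True if the two spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def match_scores(
    gold: List[TokenSpan], predicted: List[Optional[TokenSpan]]
) -> Tuple[float, float]:
    """Return (exact, relaxed) match rates as fractions of the gold annotations.
    A prediction of None means the pipeline found no alignment."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0, 0.0
    exact = sum(p == g for g, p in zip(gold, predicted))
    relaxed = sum(p is not None and overlaps(g, p) for g, p in zip(gold, predicted))
    return exact / len(gold), relaxed / len(gold)


# Hypothetical manual alignments and pipeline outputs for four annotations.
gold_spans = [(0, 2), (3, 4), (5, 8), (9, 10)]
pipeline_spans = [(0, 2), (3, 5), None, (9, 10)]
exact_rate, relaxed_rate = match_scores(gold_spans, pipeline_spans)
print(exact_rate, relaxed_rate)  # prints: 0.5 0.75
```

Under these assumptions the relaxed score is always at least as high as the exact score, which matches the ordering of the reported figures.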

Critical Analysis

The creation of the ACE-2005-PT corpus is a valuable contribution to the field of natural language processing for Portuguese. Having a high-quality, annotated dataset for event extraction provides an important benchmark that can drive progress in areas like question answering and knowledge graph construction.

However, one potential limitation of the corpus is that it is based on news-related text, which may not be representative of all types of Portuguese language use. It would be helpful to have additional Portuguese language corpora covering a broader range of domains and genres to further advance NLP capabilities.

Additionally, the corpus only covers a subset of the event types and relations annotated in the original English ACE-2005 corpus. Expanding the coverage to match the full set of annotations in the English version could make the Portuguese corpus even more useful for cross-lingual comparisons and transfer learning.

Conclusion

The ACE-2005-PT corpus represents an important step forward in developing high-quality natural language processing resources for the Portuguese language. By providing a common benchmark for event extraction, this dataset can help drive progress in areas like question answering, information retrieval, and knowledge graph construction. Further expansion of the corpus to cover additional genres and event types could make it an even more valuable resource for the NLP community.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ACE-2005-PT: Corpus for Event Extraction in Portuguese

Luís Filipe Cunha, Purificação Silvano, Ricardo Campos, Alípio Jorge

Event extraction is an NLP task that commonly involves identifying the central word (trigger) for an event and its associated arguments in text. ACE-2005 is widely recognised as the standard corpus in this field. While other corpora, like PropBank, primarily focus on annotating predicate-argument structure, ACE-2005 provides comprehensive information about the overall event structure and semantics. However, its limited language coverage restricts its usability. This paper introduces ACE-2005-PT, a corpus created by translating ACE-2005 into Portuguese, with European and Brazilian variants. To speed up the process of obtaining ACE-2005-PT, we rely on automatic translators. This, however, poses some challenges related to automatically identifying the correct alignments between multi-word annotations in the original text and in the corresponding translated sentence. To achieve this, we developed an alignment pipeline that incorporates several alignment techniques: lemmatization, fuzzy matching, synonym matching, multiple translations and a BERT-based word aligner. To measure the alignment effectiveness, a subset of annotations from the ACE-2005-PT corpus was manually aligned by a linguist expert. This subset was then compared against our pipeline results which achieved exact and relaxed match scores of 70.55% and 87.55% respectively. As a result, we successfully generated a Portuguese version of the ACE-2005 corpus, which has been accepted for publication by LDC.

9/2/2024

Event Extraction for Portuguese: A QA-driven Approach using ACE-2005

Luís Filipe Cunha, Ricardo Campos, Alípio Jorge

Event extraction is an Information Retrieval task that commonly consists of identifying the central word for the event (trigger) and the event's arguments. This task has been extensively studied for English but lags behind for Portuguese, partly due to the lack of task-specific annotated corpora. This paper proposes a framework in which two separate BERT-based models were fine-tuned to identify and classify events in Portuguese documents. We decompose this task into two sub-tasks. First, we use a token classification model to detect event triggers. To extract event arguments, we train a Question Answering model that queries the triggers about their corresponding event argument roles. Given the lack of event-annotated corpora in Portuguese, we translated the original version of the ACE-2005 dataset (a reference in the field) into Portuguese, producing a new corpus for Portuguese event extraction. To accomplish this, we developed an automatic translation pipeline. Our framework obtains F1 scores of 64.4 for trigger classification and 46.7 for argument classification, setting a new state-of-the-art reference for these tasks in Portuguese.

9/2/2024

Event-Arguments Extraction Corpus and Modeling using BERT for Arabic

Alaa Aljabari, Lina Duaibes, Mustafa Jarrar, Mohammed Khalilia

Event-argument extraction is a challenging task, particularly in Arabic due to sparse linguistic resources. To fill this gap, we introduce the hadath corpus (550k tokens) as an extension of Wojood, enriched with event-argument annotations. We used three types of event arguments: agent, location, and date, which we annotated as relation types. Our inter-annotator agreement evaluation resulted in 82.23% Kappa score and 87.2% F1-score. Additionally, we propose a novel method for event relation extraction using BERT, in which we treat the task as text entailment. This method achieves an F1-score of 94.01%. To further evaluate the generalization of our proposed method, we collected and annotated another out-of-domain corpus (about 80k tokens) called testNLI and used it as a second test set, on which our approach achieved promising results (83.59% F1-score). Last but not least, we propose an end-to-end system for event-arguments extraction. This system is implemented as part of SinaTools, and both corpora are publicly available at https://sina.birzeit.edu/wojood

8/1/2024

PORTULAN ExtraGLUE Datasets and Models: Kick-starting a Benchmark for the Neural Processing of Portuguese

Tomás Osório, Bernardo Leite, Henrique Lopes Cardoso, Luís Gomes, João Rodrigues, Rodrigo Santos, António Branco

Leveraging research on the neural modelling of Portuguese, we contribute a collection of datasets for an array of language processing tasks and a corresponding collection of fine-tuned neural language models on these downstream tasks. To align with mainstream benchmarks in the literature, originally developed in English, and to kick start their Portuguese counterparts, the datasets were machine-translated from English with a state-of-the-art translation engine. The resulting PORTULAN ExtraGLUE benchmark is a basis for research on Portuguese whose improvement can be pursued in future work. Similarly, the respective fine-tuned neural language models, developed with a low-rank adaptation approach, are made available as baselines that can stimulate future work on the neural processing of Portuguese. All datasets and models have been developed and are made available for two variants of Portuguese: European and Brazilian.

5/10/2024