PORTULAN ExtraGLUE Datasets and Models: Kick-starting a Benchmark for the Neural Processing of Portuguese

Read original: arXiv:2404.05333 - Published 5/10/2024 by Tomás Osório, Bernardo Leite, Henrique Lopes Cardoso, Luís Gomes, João Rodrigues, Rodrigo Santos, António Branco

Overview

• This paper introduces the PORTULAN ExtraGLUE dataset and models, which aim to kick-start a benchmark for neural processing of the Portuguese language.
• The PORTULAN ExtraGLUE dataset includes various Portuguese language understanding tasks, such as text classification, question answering, and natural language inference.
• The paper also presents several neural models trained on the PORTULAN ExtraGLUE dataset, providing a strong baseline for future research on Portuguese language processing.

Plain English Explanation

The paper presents a new set of datasets and machine learning models focused on the Portuguese language. The datasets, collectively called PORTULAN ExtraGLUE, cover a variety of language understanding tasks, such as classifying the sentiment of a text, answering questions about a given passage, and determining the logical relationship between two sentences.

These datasets and models are designed to serve as a starting point, or benchmark, for researchers and developers working on natural language processing (NLP) systems for the Portuguese language. Similar benchmarks exist for other languages, such as the GLUE benchmark for English.

By providing a standardized set of tasks and datasets, the PORTULAN ExtraGLUE benchmark can help drive progress in Portuguese NLP, enabling researchers to compare the performance of different models and techniques. This can lead to the development of more accurate and capable Portuguese language processing systems, with applications in areas such as machine translation, virtual assistants, and content analysis.

Technical Explanation

The paper introduces the PORTULAN ExtraGLUE dataset, which consists of several Portuguese language understanding tasks, including:

• Text classification: Determining the sentiment (positive, negative, or neutral) of a given text.
• Question answering: Answering questions based on a provided passage of text.
• Natural language inference: Determining the logical relationship (entailment, contradiction, or neutral) between two sentences.
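
To make the three task types concrete, here is a minimal Python sketch of how examples for each might be represented and scored. The class names, field names, example sentences, and predictions are all invented for illustration; they are assumptions, not the schema or data of the actual PORTULAN ExtraGLUE release.

```python
from dataclasses import dataclass


# Hypothetical record types: the field names are illustrative assumptions,
# not the column names used in the released datasets.
@dataclass
class ClassificationExample:
    text: str
    label: str  # e.g. "positive", "negative", or "neutral"


@dataclass
class QAExample:
    passage: str
    question: str
    answer: str


@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # "entailment", "contradiction", or "neutral"


def accuracy(predictions, gold):
    """Return the fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Invented Portuguese sentences, used only to show the shape of the task.
nli_examples = [
    NLIExample("O gato dorme no sofá.", "Um animal está a dormir.", "entailment"),
    NLIExample("O gato dorme no sofá.", "O gato está a correr.", "contradiction"),
    NLIExample("O gato dorme no sofá.", "O sofá é vermelho.", "neutral"),
]

gold = [example.label for example in nli_examples]
predictions = ["entailment", "neutral", "neutral"]  # made-up model output
score = accuracy(predictions, gold)  # two of the three predictions match
```

A shared metric function like `accuracy` is what lets different models be compared on the same task, which is the point of a benchmark.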

The authors also present several neural network models trained on the PORTULAN ExtraGLUE dataset, providing strong baseline performance for these tasks. These models include variants of popular architectures like BERT, RoBERTa, and T5, which have been shown to be effective for a wide range of natural language processing problems.
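
The paper's abstract describes the fine-tuned models as developed with a low-rank adaptation approach. As a rough illustration of that general idea, the sketch below keeps a weight matrix `W` frozen and learns only two small matrices `B` and `A`, so the effective weight is `W + (alpha / rank) * B @ A`. The dimensions, the `alpha` value, and the helper names are assumptions chosen for the example, not the authors' configuration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_param_counts(d_out, d_in, rank):
    """Return (full fine-tuning params, low-rank adapter params) for one matrix."""
    return d_out * d_in, rank * (d_in + d_out)


def lora_effective_weight(w, b, a, alpha, rank):
    """Compute W + (alpha / rank) * B @ A; W stays frozen, only B and A train."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy sizes, assumed for illustration only.
d_out, d_in, rank, alpha = 4, 6, 2, 4
rng = random.Random(0)
w = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
a = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(rank)]   # trainable
b = [[0.0] * rank for _ in range(d_out)]  # trainable, zero-initialised

# With B at zero, the adapted layer starts out identical to the frozen one.
w_eff = lora_effective_weight(w, b, a, alpha, rank)

# Parameter counts for a hypothetical 768 x 768 projection with rank 8.
full_params, lora_params = lora_param_counts(768, 768, 8)
```

For the hypothetical 768 x 768 matrix, the adapter trains 12,288 values instead of 589,824, which is why this family of methods is attractive when fine-tuning many task-specific models.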

The paper's evaluation of the PORTULAN ExtraGLUE models demonstrates their ability to outperform simpler baselines, such as traditional bag-of-words approaches. This suggests that the PORTULAN ExtraGLUE dataset and models can serve as a valuable resource for advancing the state of the art in Portuguese language understanding.
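
For readers unfamiliar with the term, a bag-of-words representation counts the words in a text and discards their order. The sketch below is a generic illustration of that idea, not the baseline used in the paper; the function names and the example sentences are assumptions made up for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, extract word tokens, and count each token."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented sentences: "the dog bit the man" vs "the man bit the dog".
dog_bites = bag_of_words("O cão mordeu o homem")
man_bites = bag_of_words("O homem mordeu o cão")
unrelated = bag_of_words("A chuva parou")

same_words = cosine_similarity(dog_bites, man_bites)
no_overlap = cosine_similarity(dog_bites, unrelated)
```

The first two sentences mean opposite things yet produce identical word counts, which illustrates why order-insensitive representations tend to struggle on tasks such as natural language inference.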

Critical Analysis

The PORTULAN ExtraGLUE dataset and models presented in this paper represent an important step forward in the development of benchmarks and resources for the Portuguese language. By providing a standardized set of tasks and datasets, the paper addresses a significant gap in natural language processing, where most research has focused on English and other widely spoken languages.

However, the paper acknowledges several limitations of the current work, including the relatively small size of the PORTULAN ExtraGLUE dataset and the potential for biases in the data. Additionally, the authors note that the dataset and models may not capture the full linguistic diversity of the Portuguese language, which is spoken in various dialects and regions.

Further research is needed to expand the PORTULAN ExtraGLUE benchmark, incorporate more diverse datasets, and explore the performance of language models on a wider range of Portuguese language understanding tasks. As highlighted in the survey on NLP for dialects and languages, addressing linguistic diversity is a critical challenge in the development of robust and inclusive language processing systems.

Conclusion

The PORTULAN ExtraGLUE dataset and models presented in this paper represent an important contribution to the field of natural language processing, providing a much-needed benchmark for the Portuguese language. By establishing a standardized set of tasks and datasets, the paper lays the groundwork for future research and development in Portuguese language understanding.

The strong baseline performance of the PORTULAN ExtraGLUE models suggests that they can be leveraged to improve a wide range of applications, from machine translation to virtual assistants and content analysis. As highlighted in the paper on large language models for spoken language understanding, advances in language processing can have a significant impact on how people interact with technology.

Overall, the PORTULAN ExtraGLUE project is a promising step towards a more inclusive and representative field of natural language processing, and the authors are to be commended for their efforts in this direction. As the research in this area continues to evolve, it will be essential to further expand and diversify the available resources and benchmarks for the Portuguese language and other underrepresented languages.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

PORTULAN ExtraGLUE Datasets and Models: Kick-starting a Benchmark for the Neural Processing of Portuguese

Tomás Osório, Bernardo Leite, Henrique Lopes Cardoso, Luís Gomes, João Rodrigues, Rodrigo Santos, António Branco

Leveraging research on the neural modelling of Portuguese, we contribute a collection of datasets for an array of language processing tasks and a corresponding collection of fine-tuned neural language models on these downstream tasks. To align with mainstream benchmarks in the literature, originally developed in English, and to kick start their Portuguese counterparts, the datasets were machine-translated from English with a state-of-the-art translation engine. The resulting PORTULAN ExtraGLUE benchmark is a basis for research on Portuguese whose improvement can be pursued in future work. Similarly, the respective fine-tuned neural language models, developed with a low-rank adaptation approach, are made available as baselines that can stimulate future work on the neural processing of Portuguese. All datasets and models have been developed and are made available for two variants of Portuguese: European and Brazilian.

5/10/2024

From Brazilian Portuguese to European Portuguese

João Sanches, Rui Ribeiro, Luísa Coheur

Brazilian Portuguese and European Portuguese are two varieties of the same language and, despite their close similarities, they exhibit several differences. However, there is a significant disproportion in the availability of resources between the two variants, with Brazilian Portuguese having more abundant resources. This inequity can impact the quality of translation services accessible to European Portuguese speakers. To address this issue, we propose the development of a Brazilian Portuguese to European Portuguese translation system, leveraging recent advancements in neural architectures and models. To evaluate the performance of such systems, we manually curated a gold test set comprising 500 sentences across five different topics. Each sentence in the gold test set has two distinct references, facilitating a straightforward evaluation of future translation models. We experimented with various models by fine-tuning existing Large Language Models using parallel data extracted from movie subtitles and TED Talks transcripts in both Brazilian and European Portuguese. Our evaluation involved the use of conventional automatic metrics as well as a human evaluation. In addition, all models were compared against ChatGPT 3.5 Turbo, which currently yields the best results.

8/15/2024

A Legal Framework for Natural Language Processing Model Training in Portugal

Rúben Almeida, Evelin Amorim

Recent advances in deep learning have promoted the advent of many computational systems capable of performing intelligent actions that, until then, were restricted to the human intellect. In the particular case of human languages, these advances allowed the introduction of applications like ChatGPT that are capable of generating coherent text without being explicitly programmed to do so. Instead, these models use large volumes of textual data to learn meaningful representations of human languages. Associated with these advances, concerns about copyright and data privacy infringements caused by these applications have emerged. Despite these concerns, the pace at which new natural language processing applications continued to be developed largely outperformed the introduction of new regulations. Today, communication barriers between legal experts and computer scientists motivate many unintentional legal infringements during the development of such applications. In this paper, a multidisciplinary team intends to bridge this communication gap and promote more compliant Portuguese NLP research by presenting a series of everyday NLP use cases, while highlighting the Portuguese legislation that may arise during its development.

5/2/2024

Open Generative Large Language Models for Galician

Pablo Gamallo, Pablo Rodríguez, Iria de-Dios-Flores, Susana Sotelo, Silvia Paniagua, Daniel Bardanca, José Ramom Pichel, Marcos Garcia

Large language models (LLMs) have transformed natural language processing. Yet, their predominantly English-centric training has led to biases and performance disparities across languages. This imbalance marginalizes minoritized languages, making equitable access to NLP technologies more difficult for languages with lower resources, such as Galician. We present the first two generative LLMs focused on Galician to bridge this gap. These models, freely available as open-source resources, were trained using a GPT architecture with 1.3B parameters on a corpus of 2.1B words. Leveraging continual pretraining, we adapt to Galician two existing LLMs trained on larger corpora, thus mitigating the data constraints that would arise if the training were performed from scratch. The models were evaluated using human judgments and task-based datasets from standardized benchmarks. These evaluations reveal a promising performance, underscoring the importance of linguistic diversity in generative models.

6/21/2024