IT5: Text-to-text Pretraining for Italian Language Understanding and Generation

Read original: arXiv:2203.03759 - Published 5/21/2024 by Gabriele Sarti, Malvina Nissim

Overview

Researchers introduce IT5, the first family of encoder-decoder transformer models specifically trained on Italian language data.
They document a thorough cleaning process for a large Italian corpus, which is then used to pretrain four different sizes of the IT5 model.
The researchers also introduce the ItaGen benchmark, a collection of natural language understanding and generation tasks for evaluating Italian language models.
They find that the monolingual IT5 models outperform multilingual baselines, setting a new state-of-the-art for Italian language generation.

Plain English Explanation

The researchers have developed a new family of language models called IT5, which are designed specifically for the Italian language. Language models are AI systems that can understand and generate human language.

To create IT5, the researchers first compiled a large dataset of Italian text from various sources and carefully cleaned and processed the data. They then used this high-quality Italian data to train four different sizes of the IT5 model, ranging from small to large.

To test the performance of the IT5 models, the researchers created a new benchmark called ItaGen, which includes a variety of tasks that evaluate how well the models can understand and generate Italian text. They compared the IT5 models to other multilingual language models and found that the IT5 models consistently outperformed the competition, setting a new standard for Italian language generation.

Technical Explanation

The researchers introduce IT5, a family of encoder-decoder transformer models similar to the widely used mT5 model but pretrained specifically on Italian. They document and perform a thorough cleaning procedure for a large Italian corpus, then use the cleaned corpus to pretrain four IT5 model sizes.
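To make the "text-to-text pretraining" idea more concrete, here is a minimal sketch of the span-corruption format commonly used by T5-style models. This summary does not spell out IT5's exact objective, so treat this as an assumption based on the T5 family in general. The function name `span_corrupt`, the example sentence, and the hand-picked spans are all hypothetical; real pipelines sample spans randomly over subword tokens.

```python
def span_corrupt(tokens, spans):
    """Build a T5-style (input, target) pair from a token list.

    `spans` is a sorted list of non-overlapping half-open (start, end)
    index pairs. Each span is replaced by a sentinel in the input, and
    the target lists each sentinel followed by the tokens it hides.
    Illustrative only: spans are chosen by hand, not sampled.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the visible tokens
        inputs.append(sentinel)              # hide the span in the input
        targets.append(sentinel)             # announce the span in the target
        targets.extend(tokens[start:end])    # then reveal its tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


tokens = "Il gatto dorme sul divano rosso".split()
source, target = span_corrupt(tokens, [(1, 2), (4, 6)])
print(source)  # Il <extra_id_0> dorme sul <extra_id_1>
print(target)  # <extra_id_0> gatto <extra_id_1> divano rosso <extra_id_2>
```

The encoder reads the corrupted input and the decoder learns to produce the target, which is what lets a single model handle both understanding and generation tasks as string-to-string problems.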

To evaluate the IT5 models, the researchers introduce the ItaGen benchmark, which covers a broad range of natural language understanding and generation tasks for Italian. They evaluate both IT5 and multilingual baselines on it and find that the monolingual IT5 models consistently outperform their multilingual counterparts, offering the best scale-to-performance ratio among the tested models and setting a new state of the art for Italian language generation.
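As a rough illustration of how understanding and generation tasks can share one model interface, the sketch below casts two toy examples into (source, target) string pairs. The task prefixes, the `Example` class, and the helper `to_text_to_text` are hypothetical and are not taken from the paper; this summary does not list the ItaGen tasks or how IT5 formats its inputs.

```python
from dataclasses import dataclass


@dataclass
class Example:
    source: str  # string fed to the encoder
    target: str  # string the decoder should produce


def to_text_to_text(task, text, label):
    """Cast one labelled example into a text-to-text pair.

    The "task: text" prefix convention is an assumption borrowed from
    common T5 usage, not a documented IT5 input format.
    """
    return Example(source=f"{task}: {text}", target=str(label))


pairs = [
    # An understanding task: the label itself becomes the output string.
    to_text_to_text("classifica sentimento", "Questo film è bellissimo", "positivo"),
    # A generation task: the output is free-form text.
    to_text_to_text(
        "riassumi",
        "Il treno per Roma è partito con due ore di ritardo a causa del maltempo.",
        "Treno per Roma in ritardo per maltempo.",
    ),
]

for pair in pairs:
    print(pair.source, "->", pair.target)
```

Because both kinds of task reduce to string pairs, monolingual and multilingual encoder-decoder models can be fine-tuned and compared under the same setup.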

Critical Analysis

The paper provides a robust and well-executed approach to developing high-performing Italian language models. The researchers' careful data cleaning and preprocessing procedures help ensure the quality of the training data, which is crucial for building effective language models.

However, the paper does not address some potential limitations of the IT5 models. For example, it's unclear how well the models would perform on specialized or domain-specific Italian language tasks, or how they would handle rare or unusual Italian vocabulary. Further research on the limits of the IT5 models' capabilities would be valuable.

Additionally, the paper does not explore the potential biases or ethical considerations that may arise from deploying such powerful Italian language models in real-world applications. Addressing these concerns should be a priority for future work in this area.

Conclusion

The introduction of the IT5 family of Italian language models represents an important advancement in the field of natural language processing. By developing high-performing, monolingual models specifically for the Italian language, the researchers have set a new standard for Italian language generation and understanding. The ItaGen benchmark will also serve as a valuable resource for evaluating and comparing future Italian language models. As the field of AI continues to evolve, it will be important to build on this work and address the remaining challenges and limitations to ensure the responsible development and deployment of these powerful technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

IT5: Text-to-text Pretraining for Italian Language Understanding and Generation

Gabriele Sarti, Malvina Nissim

We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We document and perform a thorough cleaning procedure for a large Italian corpus and use it to pretrain four IT5 model sizes. We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian, and use it to evaluate the performance of IT5 models and multilingual baselines. We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for Italian language generation.

5/21/2024

Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain

Iker García-Ferrero, Rodrigo Agerri, Aitziber Atutxa Salazar, Elena Cabrio, Iker de la Iglesia, Alberto Lavelli, Bernardo Magnini, Benjamin Molinet, Johana Ramirez-Romero, German Rigau, Jose Maria Villa-Gonzalez, Serena Villata, Andrea Zaninello

Research on language technology for the development of medical applications is currently a hot topic in Natural Language Understanding and Generation. Thus, a number of large language models (LLMs) have recently been adapted to the medical domain, so that they can be used as a tool for mediating in human-AI interaction. While these LLMs display competitive performance on automated medical texts benchmarks, they have been pre-trained and evaluated with a focus on a single language (English mostly). This is particularly true of text-to-text models, which typically require large amounts of domain-specific pre-training data, often not easily accessible for many languages. In this paper, we address these shortcomings by compiling, to the best of our knowledge, the largest multilingual corpus for the medical domain in four languages, namely English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain. Additionally, we present two new evaluation benchmarks for all four languages with the aim of facilitating multilingual research in this domain. A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models for the Spanish, French, and Italian benchmarks, while being competitive with current state-of-the-art LLMs in English.

4/12/2024

Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian

Serena Auriemma, Martina Miliani, Mauro Madeddu, Alessandro Bondielli, Lucia Passaro, Alessandro Lenci

Addressing the challenge of limited annotated data in specialized fields and low-resource languages is crucial for the effective use of Language Models (LMs). While most Large Language Models (LLMs) are trained on general-purpose English corpora, there is a notable gap in models specifically tailored for Italian, particularly for technical and bureaucratic jargon. This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in these specialized contexts. Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models. We evaluated the models on downstream tasks such as document classification and entity typing and conducted intrinsic evaluations using Pseudo-Log-Likelihood. The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting. Furthermore, the application of calibration techniques and in-domain verbalizers significantly enhances the efficacy of encoder models. These domain-specialized models prove to be particularly advantageous in scenarios where in-domain resources or expertise are scarce. In conclusion, our findings offer new insights into the use of Italian models in specialized contexts, which may have a significant impact on both research and industrial applications in the digital transformation era.

7/31/2024

ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language

Marcos Piau, Roberto Lotufo, Rodrigo Nogueira

Despite advancements in Natural Language Processing (NLP) and the growing availability of pretrained models, the English language remains the primary focus of model development. Continued pretraining on language-specific corpora provides a practical solution for adapting models to other languages. However, the impact of different pretraining settings on downstream tasks remains underexplored. This work introduces ptt5-v2, investigating the continued pretraining of T5 models for Portuguese. We first develop a baseline set of settings and pretrain models with sizes up to 3B parameters. Finetuning on three Portuguese downstream tasks (assin2 STS, assin2 RTE, and TweetSentBR) yields SOTA results on the latter two. We then explore the effects of different pretraining configurations, including quality filters, optimization strategies, and multi-epoch pretraining. Perhaps surprisingly, their impact remains subtle compared to our baseline. We release ptt5-v2 pretrained checkpoints and the finetuned MonoT5 rerankers on HuggingFace at https://huggingface.co/collections/unicamp-dl/ptt5-v2-666538a650188ba00aa8d2d0 and https://huggingface.co/collections/unicamp-dl/monoptt5-66653981877df3ea727f720d.

6/18/2024