Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers

Read original: arXiv:2404.06976 - Published 4/11/2024 by Mirelle Bueno, Eduardo Seiti de Oliveira, Rodrigo Nogueira, Roberto A. Lotufo, Jayr Alencar Pereira

Overview

  • This paper presents Quati, a new Brazilian Portuguese information retrieval dataset created by native speakers.
  • The dataset was designed to address the lack of publicly available resources for evaluating Portuguese language models and information retrieval systems.
  • The dataset consists of a collection of queries and relevant documents, covering a variety of topics and reflecting real-world search behaviors.

Plain English Explanation

The researchers behind this paper recognized a gap in the availability of high-quality datasets for evaluating Portuguese language models and information retrieval systems. To address this, they created a new dataset called Quati, which contains a collection of queries and relevant documents in Brazilian Portuguese.

The Quati dataset was designed to be representative of real-world search behaviors, covering a diverse range of topics that people might search for. By providing this resource to the research community, the authors hope to spur the development of more robust and effective Portuguese language technologies.

The dataset was created by native Brazilian Portuguese speakers, ensuring that the queries and documents reflect the natural language and search patterns of the target user population. This attention to authenticity and relevance sets Quati apart from other Portuguese language datasets that may have been developed without direct input from native speakers.

Technical Explanation

The Quati dataset was constructed by the authors using a multi-step process. First, they recruited a team of native Brazilian Portuguese speakers to generate queries based on their own search experiences. These queries spanned a wide range of topics, from general knowledge to specific domains like politics, sports, and entertainment.

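The documents were drawn from a curated selection of high-quality Brazilian Portuguese websites, chosen because real users are more likely to visit them than randomly scraped pages. Query-document pairs were then labeled for relevance by a state-of-the-art LLM, whose agreement with human annotators was comparable to agreement between humans. The resulting dataset consists of over 10,000 query-document pairs, providing a substantial resource for training and evaluating information retrieval models.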
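Agreement between two sets of relevance labels (for example, human judgments and machine-generated ones) is commonly summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following is a minimal sketch in Python, not the authors' code; the function name `cohen_kappa` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)
    # Fraction of items on which the two annotators give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical relevance labels (0 = not relevant, 1 = relevant).
    human = [0, 1, 1, 0]
    model = [0, 1, 0, 0]
    print(cohen_kappa(human, model))
```

A kappa near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance.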

To validate the quality and relevance of the Quati dataset, the authors conducted a series of experiments. These included measuring the diversity of the query and document topics, as well as assessing the performance of several baseline retrievers, both open-source and commercial, on the dataset. The results demonstrated the suitability of Quati for benchmarking Portuguese language technologies.
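To make the idea of scoring a baseline retriever concrete, here is a minimal sketch of nDCG@k, a widely used ranking metric for datasets with graded relevance labels. It is an assumption that this metric suits the setting described here; the summary does not say which metrics the authors report. The function names, document IDs, and relevance grades below are hypothetical, and the sketch uses the linear-gain variant of DCG.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document IDs in the order the retriever returned them.
    qrels: dict mapping document ID to a graded relevance label (0 = not relevant).
    """
    # Unjudged documents are treated as not relevant.
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking sorts all judged documents by decreasing relevance.
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg


if __name__ == "__main__":
    # Hypothetical judgments for a single query.
    qrels = {"d1": 3, "d2": 1, "d3": 0}
    print(ndcg_at_k(["d1", "d2", "d3"], qrels))  # perfect ordering -> 1.0
    print(ndcg_at_k(["d2", "d1", "d3"], qrels))  # top two swapped, lower score
```

Averaging this per-query score over all queries gives a single number per retriever, which is how baseline systems are typically compared on a benchmark.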

Critical Analysis

The Quati dataset represents a valuable contribution to the field of Portuguese language processing, as it addresses an important gap in the availability of high-quality, publicly accessible resources. By involving native speakers in the dataset creation process, the authors have ensured that the queries and documents reflect authentic Brazilian Portuguese usage and search patterns.

However, the paper does not provide detailed information about the demographic characteristics of the participants who generated the queries. It would be useful to know if the dataset is representative of the broader Brazilian population, or if it skews towards certain age, gender, or socioeconomic groups. This information could help researchers understand the dataset's potential biases and limitations.

Additionally, the paper does not discuss the ethical considerations involved in constructing the dataset, such as how the authors addressed privacy and consent concerns when collecting web-based documents. As language datasets can potentially contain sensitive or personal information, it is important for researchers to be transparent about their data curation practices.

Conclusion

The Quati dataset represents a significant step forward in the development of Portuguese language processing technologies. By providing a high-quality, representative dataset for information retrieval, the authors have created a valuable resource for the research community.

The dataset's potential impact extends beyond information retrieval, as it could also be leveraged for tasks like machine translation, text summarization, and question answering. As the field of Portuguese natural language processing continues to evolve, resources like Quati will be crucial for driving progress and ensuring the development of technology that is responsive to the needs of native speakers.




Related Papers

Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers

Mirelle Bueno, Eduardo Seiti de Oliveira, Rodrigo Nogueira, Roberto A. Lotufo, Jayr Alencar Pereira

Despite Portuguese being one of the most spoken languages in the world, there is a lack of high-quality information retrieval datasets in that language. We present Quati, a dataset specifically designed for the Brazilian Portuguese language. It comprises a collection of queries formulated by native speakers and a curated set of documents sourced from a selection of high-quality Brazilian Portuguese websites. These websites are frequented more likely by real users compared to those randomly scraped, ensuring a more representative and relevant corpus. To label the query-document pairs, we use a state-of-the-art LLM, which shows inter-annotator agreement levels comparable to human performance in our assessments. We provide a detailed description of our annotation methodology to enable others to create similar datasets for other languages, providing a cost-effective way of creating high-quality IR datasets with an arbitrary number of labeled documents per query. Finally, we evaluate a diverse range of open-source and commercial retrievers to serve as baseline systems. Quati is publicly available at https://huggingface.co/datasets/unicamp-dl/quati and all scripts at https://github.com/unicamp-dl/quati .

4/11/2024

NativQA: Multilingual Culturally-Aligned Natural Query for LLMs

Md. Arid Hasan, Maram Hasanain, Fatema Ahmad, Sahinur Rahman Laskar, Sunaya Upadhyay, Vrunda N Sukhadia, Mucahid Kutlu, Shammur Absar Chowdhury, Firoj Alam

Natural Question Answering (QA) datasets play a crucial role in developing and evaluating the capabilities of large language models (LLMs), ensuring their effective usage in real-world applications. Despite the numerous QA datasets that have been developed, there is a notable lack of region-specific datasets generated by native users in their own languages. This gap hinders the effective benchmarking of LLMs for regional and cultural specificities. In this study, we propose a scalable framework, NativQA, to seamlessly construct culturally and regionally aligned QA datasets in native languages, for LLM evaluation and tuning. Moreover, to demonstrate the efficacy of the proposed framework, we designed a multilingual natural QA dataset, MultiNativQA, consisting of ~72K QA pairs in seven languages, ranging from high to extremely low resource, based on queries from native speakers covering 18 topics. We benchmark the MultiNativQA dataset with open- and closed-source LLMs. We made both the framework NativQA and MultiNativQA dataset publicly available for the community. (https://nativqa.gitlab.io)

7/16/2024

TIGQA: An Expert Annotated Question Answering Dataset in Tigrinya

Hailay Teklehaymanot, Dren Fazlija, Niloy Ganguly, Gourab K. Patro, Wolfgang Nejdl

The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources. This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert annotated educational dataset consisting of 2.68K question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the-art MRC methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and juxtapose it with the results obtained from pretrained models. The notable disparities between human performance and best model performance underscore the potential for further enhancements to TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in the Tigrinya MRC.

4/29/2024

PORTULAN ExtraGLUE Datasets and Models: Kick-starting a Benchmark for the Neural Processing of Portuguese

Tomás Osório, Bernardo Leite, Henrique Lopes Cardoso, Luís Gomes, João Rodrigues, Rodrigo Santos, António Branco

Leveraging research on the neural modelling of Portuguese, we contribute a collection of datasets for an array of language processing tasks and a corresponding collection of fine-tuned neural language models on these downstream tasks. To align with mainstream benchmarks in the literature, originally developed in English, and to kick start their Portuguese counterparts, the datasets were machine-translated from English with a state-of-the-art translation engine. The resulting PORTULAN ExtraGLUE benchmark is a basis for research on Portuguese whose improvement can be pursued in future work. Similarly, the respective fine-tuned neural language models, developed with a low-rank adaptation approach, are made available as baselines that can stimulate future work on the neural processing of Portuguese. All datasets and models have been developed and are made available for two variants of Portuguese: European and Brazilian.

5/10/2024