The Evolution of Darija Open Dataset: Introducing Version 2

Read original: arXiv:2405.13016 - Published 5/24/2024 by Aissam Outchakoucht, Hamza Es-Samaali

Overview

The Darija Open Dataset (DODa) is an open-source project aimed at improving Natural Language Processing (NLP) capabilities for the Moroccan dialect, Darija.
DODa contains approximately 100,000 entries, making it the largest collaborative project of its kind for Darija-English translation.
The dataset includes semantic and syntactic categorizations, spelling variations, verb conjugations across multiple tenses, and tens of thousands of translated sentences.
The entries are written in both Latin and Arabic alphabets, reflecting the linguistic variations and preferences found in different sources and applications.

Plain English Explanation

The Darija Open Dataset (DODa) is a valuable resource for improving the understanding and generation of the Moroccan dialect, Darija, in natural language processing applications. With around 100,000 entries, it is the largest collaborative project of its kind, providing Darija-English translations, as well as information on the semantic and grammatical structure of the language.

The dataset includes a wide range of linguistic elements, such as different spellings, verb conjugations in multiple tenses, and thousands of translated sentences. This diversity reflects the real-world variations found in Darija usage across various sources and applications. By having access to this comprehensive dataset, developers can create applications that can accurately understand and generate Darija, meeting the linguistic needs of the Moroccan community and potentially extending to similar dialects in neighboring regions.

Technical Explanation

The Darija Open Dataset (DODa) represents a significant effort to enhance Natural Language Processing (NLP) capabilities for the Moroccan dialect, Darija. With approximately 100,000 entries, DODa is the largest collaborative project of its kind for Darija-English translation. The dataset includes a wide range of linguistic information, such as semantic and syntactic categorizations, spelling variations, verb conjugations across multiple tenses, and tens of thousands of translated sentences.

The entries in DODa are written in both Latin and Arabic alphabets, reflecting the diverse linguistic preferences and sources from which the data was collected. This diversity is crucial for developing applications that can accurately understand and generate Darija, catering to the needs of the Moroccan community and potentially extending to similar dialects in neighboring regions.
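
As a rough illustration of how entries that pair Latin-script and Arabic-script spellings with an English translation might be consumed, the following Python sketch groups spelling variants under a shared English gloss. The column names (darija_latin, darija_arabic, english), the build_lookup helper, and the sample rows are all assumptions made for this example. They are not taken from DODa, whose actual file layout should be checked in the project itself.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed layout. The column names and rows are
# illustrative only and are not copied from DODa.
SAMPLE_CSV = """darija_latin,darija_arabic,english
ktab,كتاب,book
ktaab,كتاب,book
dar,دار,house
"""


def build_lookup(csv_text):
    """Map each English gloss to the Latin and Arabic spellings seen for it."""
    lookup = defaultdict(lambda: {"latin": set(), "arabic": set()})
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = lookup[row["english"]]
        entry["latin"].add(row["darija_latin"])
        entry["arabic"].add(row["darija_arabic"])
    return dict(lookup)


lookup = build_lookup(SAMPLE_CSV)
# Latin-script spelling variants collected under one English gloss.
print(sorted(lookup["book"]["latin"]))
```

Keying on the English gloss is only one possible choice. An application that needs to normalize user input would more likely index by each spelling variant instead.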

Critical Analysis

The Darija Open Dataset (DODa) is a valuable contribution to the field of Natural Language Processing, particularly for the Moroccan dialect of Arabic. By providing a comprehensive dataset with a wide range of linguistic information, DODa aims to address the challenges of accurately understanding and generating Darija in various applications.

One potential limitation of the dataset is coverage. Geographical and cultural diversity within Morocco produces regional variations and nuances in Darija that may not be fully captured in the current version of DODa. As the project continues to grow, it would be beneficial to expand the dataset to include more diverse sources and representations of Darija, so that the linguistic needs of all Moroccan communities are addressed.

Additionally, while DODa focuses on the Moroccan dialect, there may be opportunities to explore the applicability of the dataset to other similar dialects in the region, such as those found in Algeria or Tunisia. Expanding the dataset's scope could further enhance its value and impact in the broader NLP landscape.

Conclusion

The Darija Open Dataset (DODa) represents a significant step forward in improving Natural Language Processing capabilities for the Moroccan dialect of Arabic. By providing a comprehensive dataset with a wide range of linguistic information, DODa aims to support the development of applications that can accurately understand and generate Darija, catering to the needs of the Moroccan community.

The availability of such a dataset is crucial for advancing the field of NLP, particularly in underserved language communities. As the project continues to evolve, incorporating more diverse sources and expanding its scope to similar dialects in the region could further enhance its impact and contribute to the broader goal of multilingual language understanding and generation.

Related Papers

The Evolution of Darija Open Dataset: Introducing Version 2

Aissam Outchakoucht, Hamza Es-Samaali

Darija Open Dataset (DODa) represents an open-source project aimed at enhancing Natural Language Processing capabilities for the Moroccan dialect, Darija. With approximately 100,000 entries, DODa stands as the largest collaborative project of its kind for Darija-English translation. The dataset features semantic and syntactic categorizations, variations in spelling, verb conjugations across multiple tenses, as well as tens of thousands of translated sentences. The dataset includes entries written in both Latin and Arabic alphabets, reflecting the linguistic variations and preferences found in different sources and applications. The availability of such a dataset is critical for developing applications that can accurately understand and generate Darija, thus supporting the linguistic needs of the Moroccan community and potentially extending to similar dialects in neighboring regions. This paper explores the strategic importance of DODa, its current achievements, and the envisioned future enhancements that will continue to promote its use and expansion in the global NLP landscape.

5/24/2024

DarijaBanking: A New Resource for Overcoming Language Barriers in Banking Intent Detection for Moroccan Arabic Speakers

Abderrahman Skiredj, Ferdaous Azhari, Ismail Berrada, Saad Ezzini

Navigating the complexities of language diversity is a central challenge in developing robust natural language processing systems, especially in specialized domains like banking. The Moroccan Dialect (Darija) serves as the common language that blends cultural complexities, historical impacts, and regional differences. The complexities of Darija present a special set of challenges for language models: it differs from Modern Standard Arabic, with strong influence from French, Spanish, and Tamazight, and it requires a specific approach for effective communication. To tackle these challenges, this paper introduces DarijaBanking, a novel Darija dataset aimed at enhancing intent classification in the banking domain, addressing the critical need for automatic banking systems (e.g., chatbots) that communicate in the native language of Moroccan clients. DarijaBanking comprises over 1,800 parallel high-quality queries in Darija, Modern Standard Arabic (MSA), English, and French, organized into 24 intent classes. We experimented with various intent classification methods, including full fine-tuning of monolingual and multilingual models, zero-shot learning, retrieval-based approaches, and Large Language Model prompting. One of the main contributions of this work is BERTouch, our BERT-based language model for intent classification in Darija. BERTouch achieved F1-scores of 0.98 for Darija and 0.96 for MSA on DarijaBanking, outperforming the state-of-the-art alternatives, including GPT-4, and showcasing its effectiveness in the targeted application.

5/28/2024

Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect

Guokan Shang, Hadi Abdine, Yousef Khoubrane, Amr Mohamed, Yassine Abbahaddou, Sofiane Ennadir, Imane Momayiz, Xuguang Ren, Eric Moulines, Preslav Nakov, Michalis Vazirgiannis, Eric Xing

We introduce Atlas-Chat, the first-ever collection of large language models specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-9B and 2B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs like LLaMa, Jais, and AceGPT, e.g., achieving a 13% performance boost over a larger 13B model on DarijaMMLU, in our newly introduced evaluation suite for Darija covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers comprehensive design methodologies of instruction-tuning for low-resource language variants, which are often neglected in favor of data-rich languages by contemporary LLMs.

9/27/2024

101 Billion Arabic Words Dataset

Manel Aloui, Hasna Chouikhi, Ghaith Chaabane, Haithem Kchaou, Chehir Dhaouadi

In recent years, Large Language Models have revolutionized the field of natural language processing, showcasing an impressive rise predominantly in English-centric domains. These advancements have set a global benchmark, inspiring significant efforts toward developing Arabic LLMs capable of understanding and generating the Arabic language with remarkable accuracy. Despite these advancements, a critical challenge persists: the potential bias in Arabic LLMs, primarily attributed to their reliance on datasets comprising English data that has been translated into Arabic. This reliance not only compromises the authenticity of the generated content but also reflects a broader issue: the scarcity of original, high-quality Arabic linguistic data. This study aims to address the data scarcity in the Arab world and to encourage the development of Arabic Language Models that are true to both the linguistic and cultural nuances of the region. We undertook a large-scale data mining project, extracting a substantial volume of text from the Common Crawl WET files, specifically targeting Arabic content. The extracted data underwent a rigorous cleaning and deduplication process, using innovative techniques to ensure the integrity and uniqueness of the dataset. The result is the 101 Billion Arabic Words Dataset, the largest Arabic dataset available to date, which can significantly contribute to the development of authentic Arabic LLMs. This study not only highlights the potential for creating linguistically and culturally accurate Arabic LLMs but also sets a precedent for future research in enhancing the authenticity of Arabic language models.

5/6/2024