DarijaBanking: A New Resource for Overcoming Language Barriers in Banking Intent Detection for Moroccan Arabic Speakers

Read original: arXiv:2405.16482 - Published 5/28/2024 by Abderrahman Skiredj, Ferdaous Azhari, Ismail Berrada, Saad Ezzini

Overview

  • Introduces a new dataset called DarijaBanking for intent detection in Moroccan Arabic (Darija) banking conversations
  • Aims to overcome language barriers and improve natural language understanding for Moroccan Arabic speakers
  • Presents the creation and evaluation of the DarijaBanking dataset, as well as its potential applications

Plain English Explanation

The paper introduces DarijaBanking, a dataset designed to help natural language processing (NLP) systems recognize the intents of Moroccan Arabic (Darija) speakers when they interact with banking services.

One of the key challenges in banking and other industries is overcoming language barriers, as many services and technologies are often only available in the dominant or official languages of a region. This can make it difficult for people who primarily speak other dialects or minority languages to effectively use these services.

The DarijaBanking dataset was created to help address this issue for Moroccan Arabic speakers. By developing a dataset of typical banking-related conversations and queries in Darija, the researchers hope to enable the creation of NLP models that can better understand the intents and needs of this user group. This could lead to the development of banking chatbots, virtual assistants, and other technologies that are more accessible and responsive to Moroccan Arabic speakers.

The paper explains how the DarijaBanking dataset was built and validated, and it reports an initial evaluation of intent detection models on the dataset. The researchers also discuss how the dataset could advance natural language processing for dialects and potentially serve as a foundation for large-scale Arabic language models tailored to specific regional dialects.

Technical Explanation

The paper introduces the DarijaBanking dataset, which is a collection of Moroccan Arabic (Darija) language utterances related to banking tasks and intents. The researchers created this dataset to help overcome language barriers and improve natural language understanding for Moroccan Arabic speakers interacting with banking services.

According to the paper's abstract, DarijaBanking comprises over 1,800 parallel queries written in Darija, Modern Standard Arabic (MSA), English, and French, organized into 24 intent classes. Each query carries an intent label for a common banking task, such as checking an account balance, making a transfer, or applying for a loan.
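The abstract describes the queries as parallel across four languages and grouped into intent classes. A minimal sketch of how one such record could be represented is shown here. The class name `ParallelQuery`, the field names, the intent labels, and the sample rows are all hypothetical; they are not taken from the dataset's actual schema, and the MSA and Darija fields are left as placeholders rather than invented text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelQuery:
    """One banking query in four languages sharing a single intent label.

    Field names are assumptions made for this sketch.
    """
    intent: str
    english: str
    french: str
    msa: str
    darija: str


def group_by_intent(records):
    """Map each intent label to the list of records annotated with it."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.intent].append(record)
    return dict(grouped)


# Illustrative placeholder rows, NOT taken from DarijaBanking.
records = [
    ParallelQuery("check_balance", "What is my account balance?",
                  "Quel est le solde de mon compte ?",
                  "<MSA text>", "<Darija text>"),
    ParallelQuery("check_balance", "How much money do I have?",
                  "Combien d'argent ai-je ?",
                  "<MSA text>", "<Darija text>"),
    ParallelQuery("make_transfer", "I want to make a transfer.",
                  "Je veux faire un virement.",
                  "<MSA text>", "<Darija text>"),
]

by_intent = group_by_intent(records)
for intent, rows in by_intent.items():
    print(intent, len(rows))
```

Keeping all four language versions in one record means a single intent label covers every translation, so the same row can feed a Darija-only, MSA-only, or multilingual experiment.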

The annotated dataset was divided into training, validation, and test sets, which were used to evaluate intent detection models. According to the abstract, the researchers compared full fine-tuning of monolingual and multilingual models, zero-shot learning, retrieval-based approaches, and Large Language Model prompting. Their own BERT-based model, BERTouch, achieved F1-scores of 0.98 for Darija and 0.96 for MSA on DarijaBanking, outperforming the alternatives tested, including GPT-4.
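To make the evaluation setup more concrete, here is a toy sketch of one family of methods the abstract lists, a retrieval-based classifier, together with an F1 computation. It is not the authors' implementation: the character-trigram similarity stands in for whatever sentence encoder a real system would use, macro-averaging is an assumption (the abstract does not say how F1 was averaged), and the English sample queries and intent names are invented for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased, padded text."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def predict_intent(query, labeled_examples):
    """Return the intent of the labeled example most similar to the query."""
    query_vec = char_ngrams(query)
    best = max(labeled_examples,
               key=lambda ex: cosine(query_vec, char_ngrams(ex[0])))
    return best[1]


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in set(gold) | set(predicted):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labeled examples standing in for a training split.
train = [
    ("What is my account balance?", "check_balance"),
    ("I want to transfer money to a friend", "make_transfer"),
    ("How do I apply for a loan?", "apply_loan"),
]

# Hypothetical held-out queries with their gold intents.
test_queries = ["account balance please", "transfer money now"]
gold = ["check_balance", "make_transfer"]
predicted = [predict_intent(q, train) for q in test_queries]
print(predicted, macro_f1(gold, predicted))
```

A retrieval approach like this needs no training step, which is why it is a common baseline next to fine-tuned models; its quality depends almost entirely on the similarity function.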

The paper also discusses the potential applications of the DarijaBanking dataset, such as the development of banking chatbots and virtual assistants that can better understand and respond to Darija speakers. The researchers suggest that the dataset could also be used to build large-scale Arabic language models that are tailored to regional dialects, which could have broader applications in Arabic natural language processing.

Critical Analysis

The DarijaBanking dataset and the research presented in this paper represent a valuable contribution to the field of natural language processing, particularly in the context of improving language accessibility for underserved populations.

One potential limitation of the study is the size and diversity of the dataset. With just over 1,800 queries spread across 24 intent classes, DarijaBanking is small relative to the diversity of Moroccan Arabic, which varies by region and draws on French, Spanish, and Tamazight. Expanding the dataset to include a wider range of banking-related topics, scenarios, and language styles could further improve the performance and robustness of the intent detection models.
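The size concern can be put in rough numbers using the abstract's figures of about 1,800 queries and 24 intent classes. The sketch below works out the average and shows one way to flag thin classes. The function name `underrepresented`, the threshold, and the sample labels are hypothetical, and the real per-class distribution is not given on this page.

```python
from collections import Counter

# Figures from the abstract: roughly 1,800 queries across 24 intents.
average_per_intent = 1800 / 24  # 75.0 queries per intent on average


def underrepresented(labels, min_examples):
    """Return the sorted intents that have fewer than min_examples queries."""
    counts = Counter(labels)
    return sorted(intent for intent, n in counts.items() if n < min_examples)


# Hypothetical label column, NOT the dataset's real distribution.
sample_labels = ["check_balance"] * 5 + ["make_transfer"] * 4 + ["apply_loan"]
print(average_per_intent)
print(underrepresented(sample_labels, min_examples=3))
```

An average of about 75 queries per intent leaves little room for regional or stylistic variation within each class, which is the point the limitation above is making.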

Additionally, the paper does not provide a detailed analysis of the potential biases or limitations of the dataset, such as the demographic or socioeconomic characteristics of the data sources. Addressing these biases and making the dataset representative of the broader Moroccan Arabic-speaking population would be important for the fairness and inclusivity of technology built on this resource.

Overall, the DarijaBanking dataset and the research presented in this paper represent a significant step forward in improving natural language understanding and accessibility for Moroccan Arabic speakers in the banking industry. The dataset and the insights gained from this work could also have broader applications in other domains and industries where language barriers are a challenge.

Conclusion

The DarijaBanking dataset introduced in this paper is a valuable resource for overcoming language barriers and improving natural language understanding for Moroccan Arabic (Darija) speakers interacting with banking services. By developing a dataset of annotated banking-related conversations in Darija, the researchers have laid the groundwork for the creation of more accessible and responsive banking technologies, such as chatbots and virtual assistants.

This research also has the potential to contribute to the broader field of Arabic natural language processing, particularly in the area of processing and understanding regional dialects. The DarijaBanking dataset could serve as a foundation for building large-scale language models that are tailored to specific dialects, which could have far-reaching implications for improving the accessibility and inclusivity of language technologies across the Arabic-speaking world.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DarijaBanking: A New Resource for Overcoming Language Barriers in Banking Intent Detection for Moroccan Arabic Speakers

Abderrahman Skiredj, Ferdaous Azhari, Ismail Berrada, Saad Ezzini

Navigating the complexities of language diversity is a central challenge in developing robust natural language processing systems, especially in specialized domains like banking. The Moroccan Dialect (Darija) serves as the common language that blends cultural complexities, historical impacts, and regional differences. The complexities of Darija present a special set of challenges for language models: it differs from Modern Standard Arabic, with strong influence from French, Spanish, and Tamazight, and so requires a specific approach for effective communication. To tackle these challenges, this paper introduces DarijaBanking, a novel Darija dataset aimed at enhancing intent classification in the banking domain, addressing the critical need for automatic banking systems (e.g., chatbots) that communicate in the native language of Moroccan clients. DarijaBanking comprises over 1,800 parallel high-quality queries in Darija, Modern Standard Arabic (MSA), English, and French, organized into 24 intent classes. We experimented with various intent classification methods, including full fine-tuning of monolingual and multilingual models, zero-shot learning, retrieval-based approaches, and Large Language Model prompting. One of the main contributions of this work is BERTouch, our BERT-based language model for intent classification in Darija. BERTouch achieved F1-scores of 0.98 for Darija and 0.96 for MSA on DarijaBanking, outperforming the state-of-the-art alternatives, including GPT-4, and showcasing its effectiveness in the targeted application.

5/28/2024

AraFinNLP 2024: The First Arabic Financial NLP Shared Task

Sanad Malaysha, Mo El-Haj, Saad Ezzini, Mohammed Khalilia, Mustafa Jarrar, Sultan Almujaiwel, Ismail Berrada, Houda Bouamor

The expanding financial markets of the Arab world require sophisticated Arabic NLP tools. To address this need within the banking domain, the Arabic Financial NLP (AraFinNLP) shared task proposes two subtasks: (i) Multi-dialect Intent Detection and (ii) Cross-dialect Translation and Intent Preservation. This shared task uses the updated ArBanking77 dataset, which includes about 39k parallel queries in MSA and four dialects. Each query is labeled with one or more of 77 common intents in the banking domain. These resources aim to foster the development of robust financial Arabic NLP, particularly in the areas of machine translation and banking chatbots. A total of 45 unique teams registered for this shared task, of which 11 actively participated in the test phase. Specifically, 11 teams participated in Subtask 1, while only 1 team participated in Subtask 2. The winning team of Subtask 1 achieved an F1 score of 0.8773, and the only team that submitted to Subtask 2 achieved a BLEU score of 1.667.

7/16/2024

The Evolution of Darija Open Dataset: Introducing Version 2

Aissam Outchakoucht, Hamza Es-Samaali

Darija Open Dataset (DODa) represents an open-source project aimed at enhancing Natural Language Processing capabilities for the Moroccan dialect, Darija. With approximately 100,000 entries, DODa stands as the largest collaborative project of its kind for Darija-English translation. The dataset features semantic and syntactic categorizations, variations in spelling, verb conjugations across multiple tenses, as well as tens of thousands of translated sentences. The dataset includes entries written in both Latin and Arabic alphabets, reflecting the linguistic variations and preferences found in different sources and applications. The availability of such a dataset is critical for developing applications that can accurately understand and generate Darija, thus supporting the linguistic needs of the Moroccan community and potentially extending to similar dialects in neighboring regions. This paper explores the strategic importance of DODa, its current achievements, and the envisioned future enhancements that will continue to promote its use and expansion in the global NLP landscape.

5/24/2024

Sentiment Analysis Dataset in Moroccan Dialect: Bridging the Gap Between Arabic and Latin Scripted dialect

Mouad Jbel, Mourad Jabrane, Imad Hafidi, Abdulmutallib Metrane

Sentiment analysis, the automated process of determining emotions or opinions expressed in text, has seen extensive exploration in the field of natural language processing. However, one aspect that has remained underrepresented is the sentiment analysis of the Moroccan dialect, which boasts a unique linguistic landscape and the coexistence of multiple scripts. Previous works in sentiment analysis primarily targeted dialects employing Arabic script. While these efforts provided valuable insights, they may not fully capture the complexity of Moroccan web content, which features a blend of Arabic and Latin script. As a result, our study emphasizes the importance of extending sentiment analysis to encompass the entire spectrum of Moroccan linguistic diversity. Central to our research is the creation of the largest public dataset for Moroccan dialect sentiment analysis that incorporates not only Moroccan dialect written in Arabic script but also in Latin letters. By assembling a diverse range of textual data, we were able to construct a dataset of 20,000 manually labeled texts in Moroccan dialect, along with publicly available lists of stop words in Moroccan dialect. To dive into sentiment analysis, we conducted a comparative study of multiple machine learning models to assess their compatibility with our dataset. Experiments were performed using both raw and preprocessed data to show the importance of the preprocessing step. We were able to achieve 92% accuracy with our model, and to further prove its reliability we tested it on smaller publicly available datasets of Moroccan dialect, where the results were favorable.

9/16/2024