SaudiBERT: A Large Language Model Pretrained on Saudi Dialect Corpora

Read original: arXiv:2405.06239 - Published 5/13/2024 by Faisal Qarah

Overview

Introduces SaudiBERT, a language model trained exclusively on Saudi Arabic dialect text
Compares SaudiBERT to six other Arabic language models across 11 evaluation datasets
SaudiBERT outperformed all other models, with average F1-scores of 86.15% on sentiment analysis and 87.86% on text classification
Presents two new large Saudi dialectal corpora used to train SaudiBERT

Plain English Explanation

The researchers developed a new language model called SaudiBERT that is trained specifically on text in the Saudi dialect of Arabic. To test how well SaudiBERT works, they compared it to six other Arabic language models on 11 datasets covering two kinds of task: sentiment analysis and text classification. SaudiBERT significantly outperformed the other models, with average F1-scores of 86.15% and 87.86% on the two task groups, respectively. The researchers also created two large new datasets of Saudi Arabic text, one with over 141 million tweets and one with 15.2 GB of forum posts, which they used to train SaudiBERT. The results suggest that tailoring a language model to a specific dialect can lead to better performance on tasks involving that dialect than more general language models achieve.

Technical Explanation

The researchers introduced a new Arabic language model called SaudiBERT, which was pretrained exclusively on text in the Saudi Arabian dialect of Arabic. This is different from other Arabic language models that are trained on text from multiple dialects. To evaluate SaudiBERT, the researchers compared its performance to six other Arabic language models, including AcegptArabia, ArabicBERT, and AraBERT, across 11 different datasets covering sentiment analysis and text classification tasks. SaudiBERT achieved average F1-scores of 86.15% and 87.86% on these two task groups, significantly outperforming the other models.
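To make the reported metric concrete, the sketch below shows one way a per-group average F1-score could be computed: an F1-score per dataset, then the mean over the datasets in each task group. The summary does not say whether the paper's F1 is macro- or micro-averaged, so macro averaging is an assumption here, and the dataset names, labels, and predictions are made-up placeholders rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def group_averages(results):
    """Average the per-dataset macro F1 within each task group.

    results: iterable of (group, dataset_name, y_true, y_pred) tuples.
    """
    per_group = defaultdict(list)
    for group, _name, y_true, y_pred in results:
        per_group[group].append(macro_f1(y_true, y_pred))
    return {group: mean(scores) for group, scores in per_group.items()}


# Hypothetical toy data for illustration only: NOT taken from the paper.
toy_results = [
    ("sentiment", "toy_sa_1", ["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"]),
    ("sentiment", "toy_sa_2", ["pos", "neg"], ["pos", "neg"]),
    ("classification", "toy_tc_1", ["a", "b", "c"], ["a", "b", "b"]),
]

averages = group_averages(toy_results)
for group, score in averages.items():
    print(f"{group}: average macro F1 = {score:.2%}")
```

Averaging per dataset first gives every dataset equal weight regardless of its size, which is the usual reading of an "average F1-score" across a benchmark suite; pooling all examples before scoring would instead favor the larger datasets.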

The researchers also presented two new large Saudi dialectal corpora used to train SaudiBERT: the Saudi Tweets Mega Corpus (STMC), with over 141 million tweets, and the Saudi Forums Corpus (SFC), with 15.2 GB of text collected from five Saudi online forums. The authors describe these as the largest Saudi dialectal corpora reported in the literature, and both were used to pretrain the specialized SaudiBERT model.

Critical Analysis

The paper provides a thorough evaluation of SaudiBERT's performance compared to other Arabic language models, which is a strength. However, it does not delve into potential limitations or caveats of the research. For example, the paper does not discuss the generalizability of SaudiBERT beyond the specific tasks and datasets evaluated. There may be concerns about how well the model would perform on other types of Saudi Arabic text or in real-world applications.

Additionally, the researchers note that the SaudiBERT model is publicly available (at https://huggingface.co/faisalq/SaudiBERT), but they do not provide details on its storage size, inference speed, or other practical considerations that would matter to potential users. Further analysis of the tradeoffs and practical implications of using SaudiBERT would help readers assess its suitability for their needs.
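Readers who want to check such practical figures themselves could start from a rough sketch like the one below. It assumes the Hugging Face `transformers` and `torch` packages are installed, that the checkpoint published at the URL in the paper's abstract loads through `AutoModel` and `AutoTokenizer` under the id `faisalq/SaudiBERT`, and that weights are held as 32-bit floats; none of this is confirmed by the paper summary. The helper names `estimate_size_mb` and `measure_model` are hypothetical.

```python
import time


def estimate_size_mb(num_params, bytes_per_param=4):
    """Approximate in-memory weight size in MiB (assumes 4-byte float32 weights)."""
    return num_params * bytes_per_param / (1024 ** 2)


def measure_model(model_id="faisalq/SaudiBERT", text="مرحبا", runs=5):
    """Load a checkpoint and report parameter count, estimated size, and mean latency.

    Needs network access plus the `transformers` and `torch` packages, so the
    imports are kept local to this function. The model id is an assumption
    derived from the URL in the abstract.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    num_params = sum(p.numel() for p in model.parameters())
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        model(**inputs)  # warm-up pass, excluded from the timing
        start = time.perf_counter()
        for _ in range(runs):
            model(**inputs)
        elapsed = time.perf_counter() - start

    return {
        "parameters": num_params,
        "size_mb": estimate_size_mb(num_params),
        "mean_latency_s": elapsed / runs,
    }
```

Calling `measure_model()` would download the checkpoint and time a few forward passes on a single short input. The numbers depend heavily on hardware, batch size, and sequence length, so they are only a first impression and not a benchmark.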

Conclusion

This research demonstrates the potential benefits of tailoring language models to specific dialects, rather than relying on more general models. By pretraining SaudiBERT exclusively on Saudi Arabic text, the researchers achieved state-of-the-art results on most of the evaluated tasks involving that dialect. The development of two large Saudi dialectal corpora was also a key contribution.

While the paper provides a strong technical evaluation, further analysis of the model's limitations and practical considerations would help readers better understand its real-world applicability. Overall, this work highlights the value of specialized language models and the importance of investing in linguistic diversity in natural language processing.

Related Papers

SaudiBERT: A Large Language Model Pretrained on Saudi Dialect Corpora

Faisal Qarah

In this paper, we introduce SaudiBERT, a monodialect Arabic language model pretrained exclusively on Saudi dialectal text. To demonstrate the model's effectiveness, we compared SaudiBERT with six different multidialect Arabic language models across 11 evaluation datasets, which are divided into two groups: sentiment analysis and text classification. SaudiBERT achieved average F1-scores of 86.15% and 87.86% in these groups respectively, significantly outperforming all other comparative models. Additionally, we present two novel Saudi dialectal corpora: the Saudi Tweets Mega Corpus (STMC), which contains over 141 million tweets in Saudi dialect, and the Saudi Forums Corpus (SFC), which includes 15.2 GB of text collected from five Saudi online forums. Both corpora are used in pretraining the proposed model, and they are the largest Saudi dialectal corpora ever reported in the literature. The results confirm the effectiveness of SaudiBERT in understanding and analyzing Arabic text expressed in Saudi dialect, achieving state-of-the-art results in most tasks and surpassing other language models included in the study. The SaudiBERT model is publicly available at https://huggingface.co/faisalq/SaudiBERT.

5/13/2024

EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora

Faisal Qarah

This study presents EgyBERT, an Arabic language model pretrained on 10.4 GB of Egyptian dialectal texts. We evaluated EgyBERT's performance by comparing it with five other multidialect Arabic language models across 10 evaluation datasets. EgyBERT achieved the highest average F1-score of 84.25% and an accuracy of 87.33%, significantly outperforming all other comparative models, with MARBERTv2 as the second best model achieving an F1-score of 83.68% and an accuracy of 87.19%. Additionally, we introduce two novel Egyptian dialectal corpora: the Egyptian Tweets Corpus (ETC), containing over 34.33 million tweets (24.89 million sentences) amounting to 2.5 GB of text, and the Egyptian Forums Corpus (EFC), comprising over 44.42 million sentences (7.9 GB of text) collected from various Egyptian online forums. Both corpora are used in pretraining the new model, and they are the largest Egyptian dialectal corpora to date reported in the literature. Furthermore, this is the first study to evaluate the performance of various language models on Egyptian dialect datasets, revealing significant differences in performance that highlight the need for more dialect-specific models. The results confirm the effectiveness of the EgyBERT model in processing and analyzing Arabic text expressed in Egyptian dialect, surpassing other language models included in the study. The EgyBERT model is publicly available at https://huggingface.co/faisalq/EgyBERT.

8/9/2024

AlcLaM: Arabic Dialectal Language Model

Murtadha Ahmed, Saghir Alfasly, Bo Wen, Jamaal Qasem, Mohammed Ahmed, Yunfeng Liu

Pre-trained Language Models (PLMs) are integral to many modern natural language processing (NLP) systems. Although multilingual models cover a wide range of languages, they often grapple with challenges like high inference costs and a lack of diverse non-English training data. Arabic-specific PLMs are trained predominantly on modern standard Arabic, which compromises their performance on regional dialects. To tackle this, we construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms. We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch. Named AlcLaM, our model was trained using only 13 GB of text, which represents a fraction of the data used by existing models such as CAMeL, MARBERT, and ArBERT, compared to 7.8%, 10.2%, and 21.3%, respectively. Remarkably, AlcLaM demonstrates superior performance on a variety of Arabic NLP tasks despite the limited training data. AlcLaM is available at GitHub https://github.com/amurtadha/Alclam and HuggingFace https://huggingface.co/rahbi.

7/19/2024

101 Billion Arabic Words Dataset

Manel Aloui, Hasna Chouikhi, Ghaith Chaabane, Haithem Kchaou, Chehir Dhaouadi

In recent years, Large Language Models have revolutionized the field of natural language processing, showcasing an impressive rise predominantly in English-centric domains. These advancements have set a global benchmark, inspiring significant efforts toward developing Arabic LLMs capable of understanding and generating the Arabic language with remarkable accuracy. Despite these advancements, a critical challenge persists: the potential bias in Arabic LLMs, primarily attributed to their reliance on datasets comprising English data that has been translated into Arabic. This reliance not only compromises the authenticity of the generated content but also reflects a broader issue: the scarcity of original quality Arabic linguistic data. This study aims to address the data scarcity in the Arab world and to encourage the development of Arabic Language Models that are true to both the linguistic and cultural nuances of the region. We undertook a large-scale data mining project, extracting a substantial volume of text from the Common Crawl WET files, specifically targeting Arabic content. The extracted data underwent a rigorous cleaning and deduplication process, using innovative techniques to ensure the integrity and uniqueness of the dataset. The result is the 101 Billion Arabic Words Dataset, the largest Arabic dataset available to date, which can significantly contribute to the development of authentic Arabic LLMs. This study not only highlights the potential for creating linguistically and culturally accurate Arabic LLMs but also sets a precedent for future research in enhancing the authenticity of Arabic language models.

5/6/2024