Which Nigerian-Pidgin does Generative AI speak?: Issues about Representativeness and Bias for Multilingual and Low Resource Languages

Read original: arXiv:2404.19442 - Published 5/1/2024 by David Ifeoluwa Adelani, A. Seza Doğruöz, Iyanuoluwa Shode, Anuoluwapo Aremu

Overview

  • Naija, the Nigerian Pidgin spoken by approximately 120 million people in Nigeria, is a mixed language that combines elements of English, Portuguese, and indigenous languages.
  • Until recently Naija was mainly a spoken language, but there are now two written genres: BBC and Wikipedia.
  • The research paper investigates the linguistic differences between these two written genres and how they are represented in Generative AI models.

Plain English Explanation

Naija is a common language used by many people in Nigeria. It's a mix of English, Portuguese, and local Nigerian languages. For a long time, Naija was mainly a spoken language, but now there are two main written forms: one used by the BBC and one used on Wikipedia.

The researchers in this paper looked at the differences between these two written versions of Naija. They found that the language used in the BBC articles and the language used on Wikipedia differ in word order and vocabulary. They also found that Generative AI models, which are used to create new text, operate only on the Naija style used in the BBC articles. The Naija style used on Wikipedia is not represented in these AI models.

Technical Explanation

The researchers conducted statistical analyses and machine translation experiments to compare the linguistic characteristics of the two written genres of Naija - the BBC genre and the Wikipedia genre. They found that these two genres do not represent each other well, as there are differences in word order and vocabulary between them.
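As a rough illustration of what a genre comparison of this kind can look like, the sketch below computes two simple statistics between two small corpora: vocabulary overlap (Jaccard similarity of the word sets) and the share of word bigrams in one corpus that also occur in the other, a crude proxy for word-order similarity. This is not the authors' method; the function names, the whitespace tokenizer, and the toy English sentences are all assumptions made for illustration, and no real BBC or Wikipedia text is used.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def vocab_jaccard(corpus_a, corpus_b):
    """Jaccard similarity between the vocabularies of two corpora."""
    vocab_a = {tok for sent in corpus_a for tok in tokenize(sent)}
    vocab_b = {tok for sent in corpus_b for tok in tokenize(sent)}
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


def bigram_counts(corpus):
    """Count adjacent word pairs within each sentence of a corpus."""
    counts = Counter()
    for sent in corpus:
        toks = tokenize(sent)
        counts.update(zip(toks, toks[1:]))
    return counts


def bigram_overlap(corpus_a, corpus_b):
    """Fraction of corpus_a's bigram occurrences that also appear in corpus_b."""
    a, b = bigram_counts(corpus_a), bigram_counts(corpus_b)
    total = sum(a.values())
    shared = sum((a & b).values())  # multiset intersection of bigram counts
    return shared / total if total else 0.0


# Toy placeholder sentences (hypothetical, not drawn from either genre).
genre_a = ["the cat sat", "the dog sat"]
genre_b = ["the cat ran"]

print(vocab_jaccard(genre_a, genre_b))
print(bigram_overlap(genre_a, genre_b))
```

Low values on both measures would be consistent with two genres that share little vocabulary and few word sequences, though a real analysis would need much larger corpora and a tokenizer suited to Naija.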

Importantly, the researchers also found that Generative AI models operate only on Naija written in the BBC genre. Naija written in the Wikipedia genre is not represented in these models.

This suggests that the linguistic diversity of Naija is not fully captured in current Generative AI systems. The differences between the two written genres (word order, vocabulary, and orthographic variation) likely contribute to this issue, as AI models may struggle to generalize across the varied representations of the language.

Critical Analysis

The research provides valuable insights into the linguistic landscape of Naija and the limitations of current Generative AI models in capturing this diversity. However, the study covers only two written genres of Naija, and other varieties of the language, including its spoken forms, may not be represented.

Further research is needed to better understand the challenges of code-switching and data scarcity in the context of Naija and other Nigerian languages, and to develop AI systems that can more effectively handle the complexity of African languages.

Additionally, the potential biases and issues in AI-generated content based on these limited representations of Naija should be carefully examined and addressed.

Conclusion

This research highlights the importance of considering linguistic diversity and the limitations of current AI models in representing and generating text in languages like Naija. By acknowledging these gaps and accounting for more than one written variety of a language, the research community can work towards more inclusive and accurate AI systems that better serve diverse language communities.





Related Papers

Which Nigerian-Pidgin does Generative AI speak?: Issues about Representativeness and Bias for Multilingual and Low Resource Languages

David Ifeoluwa Adelani, A. Seza Doğruöz, Iyanuoluwa Shode, Anuoluwapo Aremu

Naija is the Nigerian-Pidgin spoken by approx. 120M speakers in Nigeria and it is a mixed language (e.g., English, Portuguese and Indigenous languages). Although it has mainly been a spoken language until recently, there are currently two written genres (BBC and Wikipedia) in Naija. Through statistical analyses and Machine Translation experiments, we prove that these two genres do not represent each other (i.e., there are linguistic differences in word order and vocabulary) and Generative AI operates only based on Naija written in the BBC genre. In other words, Naija written in Wikipedia genre is not represented in Generative AI.

5/1/2024

The Ghanaian NLP Landscape: A First Look

Sheriff Issaka, Zhaoyi Zhang, Mihir Heda, Keyi Wang, Yinka Ajibola, Ryan DeMar, Xuefeng Du

Despite comprising one-third of global languages, African languages are critically underrepresented in Artificial Intelligence (AI), threatening linguistic diversity and cultural heritage. Ghanaian languages, in particular, face an alarming decline, with documented extinction and several at risk. This study pioneers a comprehensive survey of Natural Language Processing (NLP) research focused on Ghanaian languages, identifying methodologies, datasets, and techniques employed. Additionally, we create a detailed roadmap outlining challenges, best practices, and future directions, aiming to improve accessibility for researchers. This work serves as a foundational resource for Ghanaian NLP research and underscores the critical need for integrating global linguistic diversity into AI development.

5/14/2024

Implicit Discourse Relation Classification For Nigerian Pidgin

Muhammed Saeed, Peter Bourgonje, Vera Demberg

Despite attempts to make Large Language Models multi-lingual, many of the world's languages are still severely under-resourced. This widens the performance gap between NLP and AI applications aimed at well-financed, and those aimed at less-resourced languages. In this paper, we focus on Nigerian Pidgin (NP), which is spoken by nearly 100 million people, but has comparatively very few NLP resources and corpora. We address the task of Implicit Discourse Relation Classification (IDRC) and systematically compare an approach translating NP data to English and then using a well-resourced IDRC tool and back-projecting the labels versus creating a synthetic discourse corpus for NP, in which we translate PDTB and project PDTB labels, and then train an NP IDR classifier. The latter approach of learning a native NP classifier outperforms our baseline by 13.27% and 33.98% in F1 score for 4-way and 11-way classification, respectively.

6/28/2024

Linguistic Landscape of Generative AI Perception: A Global Twitter Analysis Across 14 Languages

Taichi Murayama, Kunihiro Miyazaki, Yasuko Matsubara, Yasushi Sakurai

The advent of generative AI tools has had a profound impact on societies globally, transcending geographical boundaries. Understanding these tools' global reception and utilization is crucial for service providers and policymakers in shaping future policies. Therefore, to unravel the perceptions and engagements of individuals within diverse linguistic communities with regard to generative AI tools, we extensively analyzed over 6.8 million tweets in 14 different languages. Our findings reveal a global trend in the perception of generative AI, accompanied by language-specific nuances. While sentiments toward these tools vary significantly across languages, there is a prevalent positive inclination toward Image tools and a negative one toward Chat tools. Notably, the ban of ChatGPT in Italy led to a sentiment decline and initiated discussions across languages. Furthermore, we established a taxonomy for interactions with chatbots, creating a framework for social analysis underscoring variations in generative AI usage among linguistic communities. We find that the Chinese community predominantly employs chatbots as substitutes for search, while the Italian community tends to present more intricate prompts. Our research provides a robust foundation for further explorations of the social dynamics surrounding generative AI tools and offers invaluable insights for decision-makers in policy, technology, and education.

5/31/2024