Orphan Articles: The Dark Matter of Wikipedia

    Read original: arXiv:2306.03940 - Published 10/8/2024 by Akhil Arora, Robert West, Martin Gerlach

    Overview

    • Wikipedia is the largest platform for open and freely accessible knowledge, with over 60 million articles in more than 300 language versions.
    • The available content has been growing continuously at a rate of around 200,000 new articles each month.
    • However, little attention has been paid to the accessibility of this content, specifically whether articles are integrated into the hyperlink network so that readers navigating Wikipedia can reach them.

    Plain English Explanation

    The researchers conducted a study on orphan articles, which are Wikipedia articles that do not have any incoming links from other Wikipedia articles. This means that these articles are essentially invisible to readers who are navigating through Wikipedia, as they cannot be easily discovered or accessed.

    The researchers found that a surprisingly large portion of Wikipedia's content, around 15% (8.8 million articles), is made up of these orphan articles. They describe this as the "dark matter of Wikipedia," highlighting the fact that a significant amount of the platform's knowledge is effectively hidden from readers.

    The researchers also provided causal evidence, through a quasi-experiment, that adding new incoming links to orphan articles (a process they call "de-orphanization") leads to a statistically significant increase in the visibility of these articles, measured as the number of pageviews.

    The researchers also discussed the challenges faced by editors in de-orphanizing articles and the need to support them in addressing this problem. They suggested potential solutions, such as the development of automated tools based on cross-lingual approaches, to help improve the integration of orphan articles into the Wikipedia network.

    Technical Explanation

    The researchers conducted a systematic study of orphan articles across 319 language versions of Wikipedia. They found that a surprisingly large share of content, roughly 15% (8.8 million) of all articles, is effectively invisible to readers navigating Wikipedia because no other article links to it.
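
    The definition of an orphan (an article with no incoming links from other articles) can be illustrated with a minimal sketch. The link graph, the article titles, and the function name `find_orphans` are hypothetical and only serve the illustration; this is not the authors' pipeline.

```python
from typing import Dict, Iterable, Set


def find_orphans(outlinks: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the articles that no *other* article links to."""
    articles = set(outlinks)
    linked_to: Set[str] = set()
    for source, targets in outlinks.items():
        for target in targets:
            # A self-link is not an incoming link from another article.
            if target != source and target in articles:
                linked_to.add(target)
    return articles - linked_to


# Toy link graph (hypothetical titles): each key links to the listed articles.
toy_wiki = {
    "Physics": ["Energy", "Matter"],
    "Energy": ["Physics"],
    "Matter": ["Physics", "Energy"],
    "Obscure village": ["Physics"],  # links out, but nothing links in
    "Forgotten poet": [],            # no links in either direction
}

orphans = find_orphans(toy_wiki)
orphan_share = len(orphans) / len(toy_wiki)
print(sorted(orphans), f"{orphan_share:.0%}")
```

    Note that an article can have outgoing links and still be an orphan: only incoming links determine whether readers can reach it by following links.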

    To understand the impact of this issue, the researchers used a quasi-experiment to measure how the number of pageviews changes when orphan articles gain new incoming links. The findings showed a statistically significant increase in the visibility of these "de-orphanized" articles.
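
    This summary does not say which estimator the quasi-experiment used. As a rough sketch of how such a comparison can work, the example below assumes a difference-in-differences design: the pageview change of de-orphanized articles is compared with the change of comparable articles that stayed orphans over the same period. All numbers and names are made up for illustration, and no significance test is included.

```python
from statistics import mean
from typing import List, Tuple

# One (pageviews_before, pageviews_after) pair per article.
Pair = Tuple[float, float]


def mean_change(pairs: List[Pair]) -> float:
    """Average change in pageviews from the 'before' to the 'after' period."""
    return mean(after - before for before, after in pairs)


def diff_in_diff(treated: List[Pair], control: List[Pair]) -> float:
    """Change in the treated group minus change in the control group."""
    return mean_change(treated) - mean_change(control)


# Hypothetical data: articles that gained an incoming link (treated)
# versus comparable articles that remained orphans (control).
de_orphanized = [(10, 18), (4, 9), (7, 12)]
still_orphan = [(9, 10), (5, 5), (8, 10)]

effect = diff_in_diff(de_orphanized, still_orphan)
print(f"Estimated extra pageviews per article: {effect}")
```

    Subtracting the control group's change removes trends that affect all articles alike, such as seasonal traffic, so the remainder can be more plausibly attributed to the new links.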

    The researchers also highlighted the challenges faced by editors in de-orphanizing articles, such as the need to identify suitable source articles for adding links, and the lack of automated tools to support this process. They suggested the development of cross-lingual approaches to help address these challenges and improve the integration of orphan articles into the Wikipedia network.
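
    The summary only mentions "cross-lingual approaches" without further detail. One plausible heuristic, sketched below as an assumption rather than as the authors' tool, is this: if an article is an orphan in one language but has incoming links in another, the articles that link to it there are candidate link sources, provided they also exist in the orphan's language. The concept IDs, titles, and the function name `suggest_link_sources` are hypothetical.

```python
from typing import Dict, Set, Tuple

# links[lang] holds (source_concept, target_concept) pairs for that language.
Links = Dict[str, Set[Tuple[str, str]]]
# titles[concept][lang] is the article title for that concept in that language.
Titles = Dict[str, Dict[str, str]]


def suggest_link_sources(orphan: str, lang: str, links: Links, titles: Titles) -> Set[str]:
    """Suggest titles in `lang` that could link to `orphan`,
    based on which articles link to it in other languages."""
    candidates: Set[str] = set()
    for other_lang, edges in links.items():
        if other_lang == lang:
            continue
        for source, target in edges:
            # Keep sources that link to the orphan elsewhere
            # and that also have an article in the orphan's language.
            if target == orphan and source != orphan and lang in titles.get(source, {}):
                candidates.add(titles[source][lang])
    return candidates


# Hypothetical data: the poet is an orphan in German but linked in English.
links = {
    "en": {("C_city", "C_poet"), ("C_movement", "C_poet")},
    "de": {("C_city", "C_movement")},
}
titles = {
    "C_poet": {"en": "Forgotten poet", "de": "Vergessener Dichter"},
    "C_city": {"en": "Some city", "de": "Eine Stadt"},
    "C_movement": {"en": "Some movement"},  # no German article
}

suggestions = suggest_link_sources("C_poet", "de", links, titles)
print(suggestions)
```

    Such suggestions would still need review by an editor, since a link that fits one language version does not necessarily fit the text of another.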

    Critical Analysis

    The researchers acknowledged that their study focused on the quantitative assessment of the orphan article problem and its impact, rather than proposing comprehensive solutions. They noted that further research is needed to explore the underlying causes of the high proportion of orphan articles and to develop more effective strategies for addressing this issue.

    One potential limitation of the study is that the researchers did not explore the quality or importance of the orphan articles themselves. It is possible that a significant portion of these articles may contain valuable information that is being overlooked by readers due to their lack of visibility within the network.

    Additionally, the researchers did not investigate the potential challenges or biases that may arise when relying on automated tools for de-orphanizing articles, such as the possibility of introducing errors or unintended consequences. Careful consideration should be given to the development and implementation of such tools to ensure they do not exacerbate the problem.

    Conclusion

    This study highlights a significant limitation in the link structure of Wikipedia, where a substantial portion of the platform's content is effectively hidden from readers due to a lack of incoming links. The researchers quantified the extent of this problem and provided evidence that addressing it through de-orphanization can lead to increased visibility and accessibility of the content.

    The findings of this study have implications for the ongoing maintenance and development of Wikipedia, as they suggest the need for more coordinated efforts to integrate orphan articles into the network and support editors in this process. Developing automated tools and cross-lingual approaches may be a promising avenue for addressing this challenge and ensuring that Wikipedia's wealth of knowledge is equally accessible to all readers.



    This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


    Related Papers

    Orphan Articles: The Dark Matter of Wikipedia

    Akhil Arora, Robert West, Martin Gerlach

    With 60M articles in more than 300 language versions, Wikipedia is the largest platform for open and freely accessible knowledge. While the available content has been growing continuously at a rate of around 200K new articles each month, very little attention has been paid to the accessibility of the content. One crucial aspect of accessibility is the integration of hyperlinks into the network so the articles are visible to readers navigating Wikipedia. In order to understand this phenomenon, we conduct the first systematic study of orphan articles, which are articles without any incoming links from other Wikipedia articles, across 319 different language versions of Wikipedia. We find that a surprisingly large extent of content, roughly 15% (8.8M) of all articles, is de facto invisible to readers navigating Wikipedia, and thus, rightfully term orphan articles as the dark matter of Wikipedia. We also provide causal evidence through a quasi-experiment that adding new incoming links to orphans (de-orphanization) leads to a statistically significant increase of their visibility in terms of the number of pageviews. We further highlight the challenges faced by editors for de-orphanizing articles, demonstrate the need to support them in addressing this issue, and provide potential solutions for developing automated tools based on cross-lingual approaches. Overall, our work not only unravels a key limitation in the link structure of Wikipedia and quantitatively assesses its impact, but also provides a new perspective on the challenges of maintenance associated with content creation at scale in Wikipedia.

    10/8/2024

    An Open Multilingual System for Scoring Readability of Wikipedia

    Mykola Trokhymovych, Indira Sen, Martin Gerlach

    With over 60M articles, Wikipedia has become the largest platform for open and freely accessible knowledge. While it has more than 15B monthly visits, its content is believed to be inaccessible to many readers due to the lack of readability of its text. However, previous investigations of the readability of Wikipedia have been restricted to English only, and there are currently no systems supporting the automatic readability assessment of the 300+ languages in Wikipedia. To bridge this gap, we develop a multilingual model to score the readability of Wikipedia articles. To train and evaluate this model, we create a novel multilingual dataset spanning 14 languages, by matching articles from Wikipedia to simplified Wikipedia and online children encyclopedias. We show that our model performs well in a zero-shot scenario, yielding a ranking accuracy of more than 80% across 14 languages and improving upon previous benchmarks. These results demonstrate the applicability of the model at scale for languages in which there is no ground-truth data available for model fine-tuning. Furthermore, we provide the first overview on the state of readability in Wikipedia beyond English.

    6/5/2024

    Language-Agnostic Modeling of Wikipedia Articles for Content Quality Assessment across Languages

    Paramita Das, Isaac Johnson, Diego Saez-Trumper, Pablo Aragón

    Wikipedia is the largest web repository of free knowledge. Volunteer editors devote time and effort to creating and expanding articles in more than 300 language editions. As content quality varies from article to article, editors also spend substantial time rating articles with specific criteria. However, keeping these assessments complete and up-to-date is largely impossible given the ever-changing nature of Wikipedia. To overcome this limitation, we propose a novel computational framework for modeling the quality of Wikipedia articles. State-of-the-art approaches to model Wikipedia article quality have leveraged machine learning techniques with language-specific features. In contrast, our framework is based on language-agnostic structural features extracted from the articles, a set of universal weights, and a language version-specific normalization criterion. Therefore, we ensure that all language editions of Wikipedia can benefit from our framework, even those that do not have their own quality assessment scheme. Using this framework, we have built datasets with the feature values and quality scores of all revisions of all articles in the existing language versions of Wikipedia. We provide a descriptive analysis of these resources and a benchmark of our framework. In addition, we discuss possible downstream tasks to be addressed with these datasets, which are released for public use.

    4/16/2024

    Low-resourced Languages and Online Knowledge Repositories: A Need-Finding Study

    Hellina Hailu Nigatu, John Canny, Sarah E. Chasins

    Online Knowledge Repositories (OKRs) like Wikipedia offer communities a way to share and preserve information about themselves and their ways of living. However, for communities with low-resourced languages -- including most African communities -- the quality and volume of content available are often inadequate. One reason for this lack of adequate content could be that many OKRs embody Western ways of knowledge preservation and sharing, requiring many low-resourced language communities to adapt to new interactions. To understand the challenges faced by low-resourced language contributors on the popular OKR Wikipedia, we conducted (1) a thematic analysis of Wikipedia forum discussions and (2) a contextual inquiry study with 14 novice contributors. We focused on three Ethiopian languages: Afan Oromo, Amharic, and Tigrinya. Our analysis revealed several recurring themes; for example, contributors struggle to find resources to corroborate their articles in low-resourced languages, and language technology support, like translation systems and spellcheck, result in several errors that waste contributors' time. We hope our study will support designers in making online knowledge repositories accessible to low-resourced language speakers.

    5/28/2024