Automatic Generation of Web Censorship Probe Lists

Read original: arXiv:2407.08185 - Published 7/12/2024 by Jenny Tang, Leo Alvarez, Arjun Brar, Nguyen Phong Hoang, Nicolas Christin

Overview

Describes a method for automatically generating lists of URLs to probe for web censorship
Aims to keep probe lists comprehensive and up to date, since manual and crowdsourced lists are slow to build, error-prone, and scale poorly
Starts from existing test lists, extracts topics and keywords from their pages, expands those topics, and feeds them to search engines to find new candidate URLs

Plain English Explanation

This paper presents a technique for automatically generating lists of websites to test for web censorship. The idea is to create a comprehensive and representative set of URLs that can be used to detect censorship in different regions and on a wide range of topics.

The approach involves several steps. First, the researchers analyze the content of pages drawn from existing test lists (139,957 unique URLs in a variety of languages) and extract topics and keywords. They then expand these topics and feed them to search engines to find new candidate pages. Finally, they try to access each candidate URL from servers in eleven locations around the world over a span of four months, checking connectivity and looking for signs of censorship.

The goal is to give researchers and activists an up-to-date probe list for uncovering web censorship around the world. Probing a broader set of URLs gives a more complete picture of what content is being blocked and where.

Technical Explanation

The paper describes a three-step process for automatically generating web censorship probe lists:

<a href="https://aimodels.fyi/papers/arxiv/pathfinder-exploring-path-diversity-assessing-internet-censorship">Keyword extraction</a>: The researchers extract relevant keywords from a large corpus of web content using techniques like term frequency-inverse document frequency (TF-IDF) and latent Dirichlet allocation (LDA).
URL generation: They then expand the extracted topics and use them as a feed to search engines, collecting the returned pages as new candidate URLs.
<a href="https://aimodels.fyi/papers/arxiv/misinformation-resilient-search-rankings-webgraph-based-interventions">Diversity optimization</a>: Finally, the researchers optimize the list of URLs to maximize coverage of different topics and perspectives. This involves techniques like clustering and importance sampling to identify a representative set of URLs.
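The keyword extraction step, and the search-engine feed mentioned in the paper's abstract, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it scores terms with a hand-rolled TF-IDF and joins the top terms into query strings that could be sent to a search engine. The sample page texts, the function names (`tokenize`, `top_keywords`, `build_queries`), and the choice of two keywords per page are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a page's text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(docs, k=2):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by score (descending), then alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results


def build_queries(keywords_per_doc):
    """Join each document's keywords into one search query string."""
    return [" ".join(kws) for kws in keywords_per_doc]


# Hypothetical page texts standing in for content fetched from seed URLs.
pages = [
    "independent news about election protests and election monitors",
    "news about circumvention tools and vpn circumvention guides",
    "news about football scores",
]
queries = build_queries(top_keywords(pages))
```

Terms that appear in every page, such as "news" and "about" here, get an IDF of zero and drop out, so each query is built from the terms that distinguish a page from the rest of the corpus.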

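The resulting probe lists are designed to be comprehensive and up to date. According to the paper's abstract, the method produced 119,255 new URLs across 35,147 domains, and measurements from eleven global locations over four months surfaced over 1,400 domains, not present in the original dataset, that the authors suspect to be blocked.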
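The paper's abstract reports its results per domain and singles out domains that were not present in the original dataset. A minimal sketch of that bookkeeping follows, assuming hostnames are compared after stripping a leading `www.`; the helper names (`domain_of`, `new_domains`) and the example URLs are hypothetical, and a real pipeline would more likely group hosts by registrable domain using the Public Suffix List.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercase hostname of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def new_domains(seed_urls, candidate_urls):
    """Hostnames that appear among the candidates but not in the seed list."""
    seed = {domain_of(u) for u in seed_urls}
    found = {domain_of(u) for u in candidate_urls}
    # Drop empty hostnames produced by malformed URLs.
    return sorted(d for d in found - seed if d)


# Hypothetical seed list and search-engine results.
seed_urls = ["https://www.example.org/news", "http://example.net/"]
candidate_urls = [
    "https://example.org/other-page",
    "https://blog.example.com/post/1",
    "https://EXAMPLE.com/about",
]
fresh = new_domains(seed_urls, candidate_urls)
```

In this sketch `example.org` is filtered out because it already appears in the seed list, while the two `example.com` hosts are kept as new.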

Critical Analysis

The paper presents a novel and promising approach to generating web censorship probe lists. By automating the process, the researchers aim to create more comprehensive and representative lists than could be produced manually.

One potential limitation is the reliance on keyword extraction and URL generation heuristics, which may not capture the full complexity of web content and censorship patterns. <a href="https://aimodels.fyi/papers/arxiv/finding-fake-news-websites-wild">Further research</a> could explore more advanced techniques, such as incorporating user feedback or leveraging semantic analysis.

Additionally, the paper does not address potential <a href="https://aimodels.fyi/papers/arxiv/sinbad-saliency-informed-detection-breakage-caused-by">ethical concerns</a> around web censorship measurement, such as the risk of probing sensitive content or drawing unwanted attention to individuals or organizations. Careful consideration of these issues will be important as this technology is developed further.

Conclusion

This paper presents an approach to automatically generating web censorship probe lists, with the goal of keeping the set of probed URLs comprehensive and up to date. By combining topic and keyword extraction, topic expansion, and search-engine queries, the researchers discovered over 1,400 domains they suspect to be blocked that were absent from the existing lists they started from.

While the paper has some limitations, it represents an important step towards <a href="https://aimodels.fyi/papers/arxiv/watching-watchers-comparative-fairness-audit-cloud-based">improving our understanding and monitoring of web censorship</a>. As this research continues to evolve, it could have significant implications for internet freedom and the global fight against online repression.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Automatic Generation of Web Censorship Probe Lists

Jenny Tang, Leo Alvarez, Arjun Brar, Nguyen Phong Hoang, Nicolas Christin

Domain probe lists--used to determine which URLs to probe for Web censorship--play a critical role in Internet censorship measurement studies. Indeed, the size and accuracy of the domain probe list limits the set of censored pages that can be detected; inaccurate lists can lead to an incomplete view of the censorship landscape or biased results. Previous efforts to generate domain probe lists have been mostly manual or crowdsourced. This approach is time-consuming, prone to errors, and does not scale well to the ever-changing censorship landscape. In this paper, we explore methods for automatically generating probe lists that are both comprehensive and up-to-date for Web censorship measurement. We start from an initial set of 139,957 unique URLs from various existing test lists consisting of pages from a variety of languages to generate new candidate pages. By analyzing content from these URLs (i.e., performing topic and keyword extraction), expanding these topics, and using them as a feed to search engines, our method produces 119,255 new URLs across 35,147 domains. We then test the new candidate pages by attempting to access each URL from servers in eleven different global locations over a span of four months to check for their connectivity and potential signs of censorship. Our measurements reveal that our method discovered over 1,400 domains--not present in the original dataset--we suspect to be blocked. In short, automatically updating probe lists is possible, and can help further automate censorship measurements at scale.

7/12/2024

Pathfinder: Exploring Path Diversity for Assessing Internet Censorship Inconsistency

Xiaoqin Liang, Guannan Liu, Lin Jin, Shuai Hao, Haining Wang

Internet censorship is typically enforced by authorities to achieve information control for a certain group of Internet users. So far existing censorship studies have primarily focused on country-level characterization because (1) in many cases, censorship is enabled by governments with nationwide policies and (2) it is usually hard to control how the probing packets are routed to trigger censorship in different networks inside a country. However, the deployment and implementation of censorship could be highly diverse at the ISP level. In this paper, we investigate Internet censorship from a different perspective by scrutinizing the diverse censorship deployment inside a country. Specifically, by leveraging an end-to-end measurement framework, we deploy multiple geo-distributed back-end control servers to explore various paths from one single vantage point. The generated traffic with the same domain but different control servers' IPs could be forced to traverse different transit networks, thereby being examined by different censorship devices if present. Through our large-scale experiments and in-depth investigation, we reveal that the diversity of Internet censorship caused by different routing paths inside a country is prevalent, implying that (1) the implementations of centralized censorship are commonly incomplete or flawed and (2) decentralized censorship is also common. Moreover, we identify that different hosting platforms also result in inconsistent censorship activities due to different peering relationships with the ISPs in a country. Finally, we present extensive case studies in detail to illustrate the configurations that lead to censorship inconsistency and explore the causes.

7/8/2024

Finding Fake News Websites in the Wild

Leandro Araujo, Joao M. M. Couto, Luiz Felipe Nery, Isadora C. Rodrigues, Jussara M. Almeida, Julio C. S. Reis, Fabricio Benevenuto

The battle against the spread of misinformation on the Internet is a daunting task faced by modern society. Fake news content is primarily distributed through digital platforms, with websites dedicated to producing and disseminating such content playing a pivotal role in this complex ecosystem. Therefore, these websites are of great interest to misinformation researchers. However, obtaining a comprehensive list of websites labeled as producers and/or spreaders of misinformation can be challenging, particularly in developing countries. In this study, we propose a novel methodology for identifying websites responsible for creating and disseminating misinformation content, which are closely linked to users who share confirmed instances of fake news on social media. We validate our approach on Twitter by examining various execution modes and contexts. Our findings demonstrate the effectiveness of the proposed methodology in identifying misinformation websites, which can aid in gaining a better understanding of this phenomenon and enabling competent entities to tackle the problem in various areas of society.

7/16/2024

Misinformation Resilient Search Rankings with Webgraph-based Interventions

Peter Carragher, Evan M. Williams, Kathleen M. Carley

The proliferation of unreliable news domains on the internet has had wide-reaching negative impacts on society. We introduce and evaluate interventions aimed at reducing traffic to unreliable news domains from search engines while maintaining traffic to reliable domains. We build these interventions on the principles of fairness (penalize sites for what is in their control), generality (label/fact-check agnostic), targeted (increase the cost of adversarial behavior), and scalability (works at webscale). We refine our methods on small-scale webdata as a testbed and then generalize the interventions to a large-scale webgraph containing 93.9M domains and 1.6B edges. We demonstrate that our methods penalize unreliable domains far more than reliable domains in both settings and we explore multiple avenues to mitigate unintended effects on both the small-scale and large-scale webgraph experiments. These results indicate the potential of our approach to reduce the spread of misinformation and foster a more reliable online information ecosystem. This research contributes to the development of targeted strategies to enhance the trustworthiness and quality of search engine results, ultimately benefiting users and the broader digital community.

4/16/2024