Navigating the Post-API Dilemma | Search Engine Results Pages Present a Biased View of Social Media Data

Read original: arXiv:2401.15479 - Published 4/3/2024 by Amrit Poudel, Tim Weninger

Overview

Recent decisions to discontinue social media APIs are having detrimental effects on Internet research and computational social science.
This lack of access to data has been dubbed the "Post-API era" of Internet research.
Search engines like Google may provide a solution by crawling and surfacing social media data on their Search Engine Results Pages (SERP).
The paper investigates whether SERPs provide a complete and unbiased sample of social media data, and whether they are a viable alternative to direct API access.

Plain English Explanation

Researchers rely on data from social media platforms like Twitter and Reddit to study how people behave and interact online. However, many platforms have recently restricted access to this data, making it much harder for researchers to do their work. This is known as the "Post-API era" of Internet research.

Fortunately, popular search engines like Google can find and display social media content on their search result pages, which could potentially solve this problem. The researchers in this paper wanted to know whether the social media data shown in search results is truly representative and unbiased, or whether it has significant gaps and is skewed toward certain types of content.

To figure this out, the researchers compared the search results to nonsampled data from Twitter and Reddit. They found that the search results were heavily biased: they favored popular and positive content, and tended to exclude political, pornographic, or vulgar posts. There were also major gaps in the topics covered compared to the full social media data.

Overall, the researchers concluded that using search engine results is not a good replacement for direct access to social media data. The results are just too limited and skewed to give researchers an accurate picture of what's really happening online.

Technical Explanation

The paper conducts a comparative analysis between data obtained directly from Reddit and Twitter/X, and the social media content surfaced on Google's Search Engine Results Pages (SERP).

The researchers first obtained nonsampled data from Reddit and Twitter/X. This "ground truth" dataset serves as the reference against which the search results are judged.

They then collected the top search results from Google for a set of carefully chosen queries designed to surface relevant social media content. This SERP dataset was then compared to the ground truth in several ways:
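The query-and-match step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `site:` restriction, the URL patterns, and the helper names `build_query`, `extract_post_id`, and `match_serp_to_ground_truth` are all assumptions made for the example, and fetching the results themselves is left out.

```python
import re
from typing import Iterable, Optional, Set

# Assumed permalink shapes; real result URLs may differ.
REDDIT_POST = re.compile(r"(?:^|[/.])reddit\.com/r/[^/]+/comments/([a-z0-9]+)", re.I)
TWEET = re.compile(r"(?:^|[/.])(?:twitter|x)\.com/[^/]+/status/(\d+)", re.I)


def build_query(keyword: str, site: str) -> str:
    """Return a site-restricted query string (hypothetical query design)."""
    return f"site:{site} {keyword}"


def extract_post_id(url: str) -> Optional[str]:
    """Pull a Reddit post ID or tweet ID out of a result URL, if present."""
    for pattern in (REDDIT_POST, TWEET):
        match = pattern.search(url)
        if match:
            return match.group(1).lower()
    return None


def match_serp_to_ground_truth(
    serp_urls: Iterable[str], ground_truth_ids: Iterable[str]
) -> Set[str]:
    """Return the ground-truth post IDs that also appear among the SERP URLs."""
    found = {pid for url in serp_urls if (pid := extract_post_id(url))}
    return found & set(ground_truth_ids)
```

Matching on post IDs rather than on titles or text keeps the comparison exact: a post either appears in the search results or it does not.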

Popularity Bias: The SERP results were found to be heavily skewed towards highly popular social media posts, neglecting less viral content.
Sentiment Bias: The sentiment expressed in the SERP results was more positive on average compared to the ground truth dataset.
Topical Gaps: Significant gaps were observed in the topics covered by the SERP results versus the full social media data.
Content Filtering: The SERP results were found to systematically exclude political, pornographic, and profane content that was present in the ground truth data.
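To make the four comparisons above concrete, here is a small sketch of how such measurements could be computed once both datasets are in hand. The metrics and function names are illustrative assumptions, not the statistics used in the paper; sentiment scores and topic or content labels are assumed to have been computed beforehand by some other tool.

```python
from statistics import mean, median
from typing import Dict, Sequence, Set


def popularity_skew(gt_scores: Sequence[float], serp_scores: Sequence[float]) -> float:
    """Fraction of SERP posts scoring above the ground-truth median.

    About 0.5 suggests no skew; values near 1.0 suggest a bias toward popular posts.
    """
    cutoff = median(gt_scores)
    return sum(score > cutoff for score in serp_scores) / len(serp_scores)


def sentiment_gap(gt_sentiment: Sequence[float], serp_sentiment: Sequence[float]) -> float:
    """Mean SERP sentiment minus mean ground-truth sentiment (positive = rosier SERP)."""
    return mean(serp_sentiment) - mean(gt_sentiment)


def coverage_by_label(gt_labels: Dict[str, str], serp_ids: Set[str]) -> Dict[str, float]:
    """Share of ground-truth posts per label (topic or content type) found in SERP.

    Labels with coverage near zero point to topical gaps or filtered content.
    """
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for post_id, label in gt_labels.items():
        totals[label] = totals.get(label, 0) + 1
        if post_id in serp_ids:
            hits[label] = hits.get(label, 0) + 1
    return {label: hits.get(label, 0) / count for label, count in totals.items()}
```

The same `coverage_by_label` helper serves both the topical-gap and content-filtering checks, since each reduces to asking what share of posts carrying a given label survives into the search results.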

Through this comprehensive analysis, the researchers conclude that SERP is not a viable replacement for direct access to social media data via APIs. The biases and gaps present in the SERP results make it an incomplete and unreliable source for conducting computational social science research.

Critical Analysis

The paper provides a thorough and rigorous analysis of the limitations of using search engine results as a proxy for social media data. The researchers carefully designed their experiment to enable a direct comparison between SERP and the ground truth datasets.

However, one potential limitation of the study is the use of a single search engine (Google) and the reliance on the top search results. It's possible that other search engines or different ways of surfacing social media content on SERPs could yield different results.

Additionally, the paper does not delve into the specific mechanisms or algorithms used by search engines to select and rank social media content. A deeper understanding of these processes could provide further insights into the origins of the observed biases.

Another area for further research could be investigating ways to mitigate the biases, such as techniques for retrieving a more representative sample of social media data from search engines. This could potentially make SERP a more useful resource for computational social science research.

Conclusion

The discontinuation of social media APIs has created significant challenges for Internet researchers, but using search engine results as an alternative is not a viable solution. This paper demonstrates that the social media content surfaced on SERPs is heavily biased, with significant gaps and systematic exclusion of certain types of content.

These findings highlight the critical importance of maintaining direct access to social media data for researchers in the field of computational social science. Without unbiased data, our understanding of online behavior and social dynamics will be incomplete and potentially skewed.

The issues raised in this paper underscore the need for search engines, social media platforms, and the research community to work together to find new ways to provide researchers with the data they need to advance our knowledge of the digital world.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Navigating the Post-API Dilemma | Search Engine Results Pages Present a Biased View of Social Media Data

Amrit Poudel, Tim Weninger

Recent decisions to discontinue access to social media APIs are having detrimental effects on Internet research and the field of computational social science as a whole. This lack of access to data has been dubbed the Post-API era of Internet research. Fortunately, popular search engines have the means to crawl, capture, and surface social media data on their Search Engine Results Pages (SERP) if provided the proper search query, and may provide a solution to this dilemma. In the present work we ask: does SERP provide a complete and unbiased sample of social media data? Is SERP a viable alternative to direct API-access? To answer these questions, we perform a comparative analysis between (Google) SERP results and nonsampled data from Reddit and Twitter/X. We find that SERP results are highly biased in favor of popular posts; against political, pornographic, and vulgar posts; are more positive in their sentiment; and have large topical gaps. Overall, we conclude that SERP is not a viable alternative to social media API access.

4/3/2024

Cognitively Biased Users Interacting with Algorithmically Biased Results in Whole-Session Search on Debated Topics

Ben Wang, Jiqun Liu

When interacting with information retrieval (IR) systems, users, affected by confirmation biases, tend to select search results that confirm their existing beliefs on socially significant contentious issues. To understand the judgments and attitude changes of users searching online, our study examined how cognitively biased users interact with algorithmically biased search engine result pages (SERPs). We designed three-query search sessions on debated topics under various bias conditions. We recruited 1,321 crowdsourcing participants and explored their attitude changes, search interactions, and the effects of confirmation bias. Three key findings emerged: 1) most attitude changes occur in the initial query of a search session; 2) Confirmation bias and result presentation on SERPs affect the number and depth of clicks in the current query and perceived familiarity with clicked results in subsequent queries; 3) The bias position also affects attitude changes of users with lower perceived openness to conflicting opinions. Our study goes beyond traditional simulation-based evaluation settings and simulated rational users, sheds light on the mixed effects of human biases and algorithmic biases in information retrieval tasks on debated topics, and can inform the design of bias-aware user models, human-centered bias mitigation techniques, and socially responsible intelligent IR systems.

6/10/2024

RIP Twitter API: A eulogy to its vast research contributions

Ryan Murtfeldt, Naomi Alterman, Ihsan Kahveci, Jevin D. West

Since 2006, Twitter's Application Programming Interface (API) has been a treasure trove of high-quality data for researchers studying everything from the spread of misinformation, to social psychology and emergency management. However, in the spring of 2023, Twitter (now called X) began charging $42,000/month for its Enterprise access level, an essential death knell for researcher use. Lacking sufficient funds to pay this monthly fee, academics are now scrambling to continue their research without this important data source. This study collects and tabulates the number of studies, number of citations, dates, major disciplines, and major topic areas of studies that used Twitter data between 2006 and 2023. While we cannot know for certain what will be lost now that Twitter data is cost prohibitive, we can illustrate its research value during the time it was available. A search of 8 databases and 3 related APIs found that since 2006, a total of 27,453 studies have been published in 7,432 publication venues, with 1,303,142 citations, across 14 disciplines. Major disciplines include: computational social science, engineering, data science, social media studies, public health, and medicine. Major topics include: information dissemination, assessing the credibility of tweets, strategies for conducting data research, detecting and analyzing major events, and studying human behavior. Twitter data studies have increased every year since 2006, but following Twitter's decision to begin charging for data in the spring of 2023, the number of studies published in 2023 decreased by 13% compared to 2022. We assume that much of the data used for studies published in 2023 were collected prior to Twitter's shutdown, and thus the number of new studies is likely to decline further in subsequent years.

4/12/2024

Algorithmic Misjudgement in Google Search Results: Evidence from Auditing the US Online Electoral Information Environment

Brooke Perreault, Johanna Lee, Ropafadzo Shava, Eni Mustafaraj

Google Search is an important way that people seek information about politics, and Google states that it is "committed to providing timely and authoritative information on Google Search to help voters understand, navigate, and participate in democratic processes." This paper studies the extent to which government-maintained web domains are represented in the online electoral information environment, as captured through 3.45 million Google Search result pages collected during the 2022 US midterm elections for 786 locations across the United States. Focusing on state, county, and local government domains that provide locality-specific information, we study not only the extent to which these sources appear in organic search results, but also the extent to which these sources are correctly targeted to their respective constituents. We label misalignment between the geographic area that non-federal domains serve and the locations for which they appear in search results as algorithmic mistargeting, a subtype of algorithmic misjudgement in which the search algorithm targets locality-specific information to users in different (incorrect) locations. In the context of the 2022 US midterm elections, we find that 71% of all occurrences of state, county, and local government sources were mistargeted, with some domains appearing disproportionately often among organic results despite providing locality-specific information that may not be relevant to all voters. However, we also find that mistargeting often occurs in low ranks. We conclude by considering the potential consequences of extensive mistargeting of non-federal government sources and argue that ensuring the correct targeting of these sources to their respective constituents is a critical part of Google's role in facilitating access to authoritative and locally-relevant electoral information.

6/18/2024