Accuracy and Political Bias of News Source Credibility Ratings by Large Language Models

Read original: arXiv:2304.00228 - Published 8/14/2024 by Kai-Cheng Yang, Filippo Menczer

Overview

Large language models (LLMs) are increasingly used in search engines and AI chatbots to generate direct answers and access fresh data from the internet.
As curators of information for billions of users, LLMs must assess the accuracy and reliability of different sources.
This paper evaluates the ability of eight widely used LLMs from major providers (OpenAI, Google, and Meta) to discern credible and high-quality information sources from low-credibility ones.

Plain English Explanation

Large language models are powerful AI systems that can understand and generate human-like text. They are now being used in search engines and AI chatbots to provide direct answers to users' questions and access the latest information from the internet.

Since these LLMs are responsible for curating information for billions of people, it's important that they can accurately assess the credibility and reliability of different sources, such as news outlets. This paper looks at how well eight popular LLMs from leading tech companies (OpenAI, Google, and Meta) are able to evaluate the quality of various information sources.

The researchers found that while the LLMs generally agree with each other on the credibility of most news outlets, their ratings align only moderately with evaluations made by human experts. Additionally, the LLMs exhibited a liberal bias in their credibility ratings in their default configurations. When the LLMs were assigned partisan identities, their ratings shifted strongly in favor of sources that matched the assigned identity.

These findings have significant implications for how LLMs are used to curate news and political information, as they suggest these models may not be fully objective and could potentially spread misinformation or reinforce political biases.

Technical Explanation

The researchers audited eight widely used LLMs from three major providers - OpenAI, Google, and Meta - to evaluate their ability to assess the credibility and reliability of different information sources. They tested the LLMs' performance on rating the credibility of a set of news outlets, including sources with varying political leanings in the US.

The key findings include:

LLMs exhibit a high level of agreement with each other in their credibility ratings (average Spearman's correlation of 0.81), but their ratings only moderately align with human expert evaluations (average correlation of 0.59).
Larger LLMs more frequently refused to provide ratings due to insufficient information, while smaller models were more prone to hallucination (generating unreliable ratings).
All LLMs in their default configurations showed a liberal bias in their credibility ratings of news sources.
Assigning partisan identities to the LLMs consistently resulted in strong politically congruent bias in their ratings.
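
The agreement figures above are Spearman rank correlations. As a rough illustration of how such a number can be computed for one model against expert ratings, here is a minimal sketch in plain Python. The function names, the use of `None` to mark a refusal, and the sample ratings are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Pairs where either rating is None (assumed here to mean the rater
    declined to rate the source) are dropped before ranking.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two rated sources")
    rx = average_ranks([p[0] for p in pairs])
    ry = average_ranks([p[1] for p in pairs])
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / (var_x * var_y) ** 0.5


# Made-up ratings for four hypothetical outlets; None marks a refusal.
llm_ratings = [0.9, 0.2, None, 0.5]
expert_ratings = [0.8, 0.1, 0.7, 0.6]
print(spearman_rho(llm_ratings, expert_ratings))
```

Because the measure depends only on rank order, a model can score highly even when its absolute ratings differ from the experts', as long as it orders the sources the same way.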

These findings suggest that while LLMs can be useful tools for assessing information sources, their biases and limitations must be carefully considered when using them to curate news and political content for the public.

Critical Analysis

The paper highlights important caveats and areas for further research regarding the use of LLMs in information curation. For example, the researchers note that larger LLMs more frequently refuse to provide credibility ratings due to insufficient information, while smaller models are more prone to hallucination in their ratings.

Additionally, the observed political biases in the LLMs' credibility ratings are concerning and raise questions about the suitability of using these models as sole arbiters of information quality, especially for sensitive topics like news and politics. The paper also finds that assigning partisan identities to the LLMs consistently produces strong politically congruent bias, which could lead to the amplification of misinformation or the suppression of certain viewpoints.

While the paper provides valuable insights, further research is needed to better understand the underlying causes of these biases and develop strategies to mitigate them. Exploring the use of ensemble methods or incorporating additional safeguards and oversight mechanisms could help improve the reliability and objectivity of LLMs in information curation.
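
To make the ensemble idea concrete, the sketch below combines ratings from several models by taking the median and abstains when too few models answered. This is an illustration only and not a method from the paper: the function name `ensemble_rating`, the `min_votes` parameter, the model labels, and the sample values are all hypothetical.

```python
from statistics import median


def ensemble_rating(ratings_by_model, min_votes=2):
    """Return the median rating across models, or None to abstain.

    ratings_by_model maps a model label to its rating for one source;
    None is assumed to mean that model refused to rate the source.
    """
    votes = [r for r in ratings_by_model.values() if r is not None]
    if len(votes) < min_votes:
        # Too few models answered, so the ensemble abstains as well.
        return None
    return median(votes)


# Made-up ratings for one hypothetical outlet.
ratings = {"model_a": 0.9, "model_b": 0.5, "model_c": 0.2, "model_d": None}
print(ensemble_rating(ratings))
```

The median limits the influence of a single outlying or hallucinated rating. It would likely not remove a bias that every model shares in the same direction, which is what the paper reports for the default configurations.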

Conclusion

This paper highlights the challenges and risks associated with using large language models (LLMs) to curate information sources, particularly for news and political content. While LLMs show a high level of agreement with each other in their credibility assessments, their ratings align only moderately with human expert evaluations and exhibit concerning political biases.

As LLMs become more integrated into search engines and AI chatbots, it is crucial to carefully consider their limitations and potential for introducing misinformation or reinforcing political biases. Ongoing research and the development of robust safeguards will be essential to ensuring that these powerful language models are used responsibly and equitably to inform and empower, rather than mislead, the public.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Accuracy and Political Bias of News Source Credibility Ratings by Large Language Models

Kai-Cheng Yang, Filippo Menczer

Search engines increasingly leverage large language models (LLMs) to generate direct answers, and AI chatbots now access the Internet for fresh data. As information curators for billions of users, LLMs must assess the accuracy and reliability of different sources. This paper audits eight widely used LLMs from three major providers -- OpenAI, Google, and Meta -- to evaluate their ability to discern credible and high-quality information sources from low-credibility ones. We find that while LLMs can rate most tested news outlets, larger models more frequently refuse to provide ratings due to insufficient information, whereas smaller models are more prone to hallucination in their ratings. For sources where ratings are provided, LLMs exhibit a high level of agreement among themselves (average Spearman's $\rho = 0.81$), but their ratings align only moderately with human expert evaluations (average $\rho = 0.59$). Analyzing news sources with different political leanings in the US, we observe a liberal bias in credibility ratings yielded by all LLMs in default configurations. Additionally, assigning partisan identities to LLMs consistently results in strong politically congruent bias in the ratings. These findings have important implications for the use of LLMs in curating news and political information.

8/14/2024

Assessing Political Bias in Large Language Models

Luca Rettenberger, Markus Reischl, Mark Schutera

The assessment of bias within Large Language Models (LLMs) has emerged as a critical concern in the contemporary discourse surrounding Artificial Intelligence (AI) in the context of their potential impact on societal dynamics. Recognizing and considering political bias within LLM applications is especially important when closing in on the tipping point toward performative prediction. Then, being educated about potential effects and the societal behavior LLMs can drive at scale due to their interplay with human operators. In this way, the upcoming elections of the European Parliament will not remain unaffected by LLMs. We evaluate the political bias of the currently most popular open-source LLMs (instruct or assistant models) concerning political issues within the European Union (EU) from a German voter's perspective. To do so, we use the Wahl-O-Mat, a voting advice application used in Germany. From the voting advice of the Wahl-O-Mat we quantize the degree of alignment of LLMs with German political parties. We show that larger models, such as Llama3-70B, tend to align more closely with left-leaning political parties, while smaller models often remain neutral, particularly when prompted in English. The central finding is that LLMs are similarly biased, with low variances in the alignment concerning a specific party. Our findings underline the importance of rigorously assessing and making bias transparent in LLMs to safeguard the integrity and trustworthiness of applications that employ the capabilities of performative prediction and the invisible hand of machine learning prediction and language generation.

6/6/2024

Quantifying Generative Media Bias with a Corpus of Real-world and Generated News Articles

Filip Trhlik, Pontus Stenetorp

Large language models (LLMs) are increasingly being utilised across a range of tasks and domains, with a burgeoning interest in their application within the field of journalism. This trend raises concerns due to our limited understanding of LLM behaviour in this domain, especially with respect to political bias. Existing studies predominantly focus on LLMs undertaking political questionnaires, which offers only limited insights into their biases and operational nuances. To address this gap, our study establishes a new curated dataset that contains 2,100 human-written articles and utilises their descriptions to generate 56,700 synthetic articles using nine LLMs. This enables us to analyse shifts in properties between human-authored and machine-generated articles, with this study focusing on political bias, detecting it using both supervised models and LLMs. Our findings reveal significant disparities between base and instruction-tuned LLMs, with instruction-tuned models exhibiting consistent political bias. Furthermore, we are able to study how LLMs behave as classifiers, observing their display of political bias even in this role. Overall, for the first time within the journalistic domain, this study outlines a framework and provides a structured dataset for quantifiable experiments, serving as a foundation for further research into LLM political bias and its implications.

6/18/2024

Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Chenglei Si, Navita Goyal, Sherry Tongshuang Wu, Chen Zhao, Shi Feng, Hal Daumé III, Jordan Boyd-Graber

Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. To help users make the right decisions about the information they get, LLMs should not only provide information but also help users fact-check it. Our experiments with 80 crowdworkers compare language models with search engines (information retrieval systems) at facilitating fact-checking. We prompt LLMs to validate a given claim and provide corresponding explanations. Users reading LLM explanations are significantly more efficient than those using search engines while achieving similar accuracy. However, they over-rely on the LLMs when the explanation is wrong. To reduce over-reliance on LLMs, we ask LLMs to provide contrastive information - explain both why the claim is true and false, and then we present both sides of the explanation to users. This contrastive explanation mitigates users' over-reliance on LLMs, but cannot significantly outperform search engines. Further, showing both search engine results and LLM explanations offers no complementary benefits compared to search engines alone. Taken together, our study highlights that natural language explanations by LLMs may not be a reliable replacement for reading the retrieved passages, especially in high-stakes settings where over-relying on wrong AI explanations could lead to critical consequences.

4/3/2024