Large Language Models Reflect Human Citation Patterns with a Heightened Citation Bias

2405.15739

Published 5/30/2024 by Andres Algaba, Carmen Mazijn, Vincent Holst, Floriano Tori, Sylvia Wenmackers, Vincent Ginis

Abstract

Citation practices are crucial in shaping the structure of scientific knowledge, yet they are often influenced by contemporary norms and biases. The emergence of Large Language Models (LLMs) like GPT-4 introduces a new dynamic to these practices. Interestingly, the characteristics and potential biases of references recommended by LLMs that entirely rely on their parametric knowledge, and not on search or retrieval-augmented generation, remain unexplored. Here, we analyze these characteristics in an experiment using a dataset of 166 papers from AAAI, NeurIPS, ICML, and ICLR, published after GPT-4's knowledge cut-off date, encompassing 3,066 references in total. In our experiment, GPT-4 was tasked with suggesting scholarly references for the anonymized in-text citations within these papers. Our findings reveal a remarkable similarity between human and LLM citation patterns, but with a more pronounced high citation bias in GPT-4, which persists even after controlling for publication year, title length, number of authors, and venue. Additionally, we observe a large consistency between the characteristics of GPT-4's existing and non-existent generated references, indicating the model's internalization of citation patterns. By analyzing citation graphs, we show that the references recommended by GPT-4 are embedded in the relevant citation context, suggesting an even deeper conceptual internalization of the citation networks. While LLMs can aid in citation generation, they may also amplify existing biases and introduce new ones, potentially skewing scientific knowledge dissemination. Our results underscore the need for identifying the model's biases and for developing balanced methods to interact with LLMs in general.

Overview

This paper examines how large language models (LLMs) can generate citations, and how those citations reflect the biases present in the underlying human-written citation patterns.
The researchers find that LLMs tend to amplify existing citation biases, such as preferential attachment and the Matthew effect, leading to a heightened citation bias in the generated citations.
The paper provides insights into the behavior of LLMs when it comes to scholarly citation practices, with potential implications for the use of these models in academic writing and research.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can generate human-like text on a wide range of topics. These models are trained on vast amounts of online data, including academic papers and their citations.

When LLMs are tasked with generating new citations, they tend to mimic the citation patterns they've learned from the training data. This means that the biases present in the original human-written citations, such as a tendency to cite well-known or highly cited papers, are amplified in the model's outputs.

This research paper explores this phenomenon in detail. The researchers find that LLMs are more likely to cite papers that are already highly cited, a phenomenon known as the "Matthew effect." They also observe a "preferential attachment" bias, where the model is more likely to cite papers that already have many citations.

These biases can have important implications for the use of LLMs in academic writing and research. If LLMs are used to generate citations, they may perpetuate existing biases in the scholarly literature, making it harder for new or lesser-known work to gain recognition. This could potentially skew the direction of research and limit the diversity of ideas being explored.

Technical Explanation

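The researchers designed an experiment to investigate how LLMs generate citations and which biases appear in their citation patterns. They used GPT-4, relying only on its parametric knowledge (no search or retrieval-augmented generation), to suggest scholarly references for the anonymized in-text citations in 166 papers from AAAI, NeurIPS, ICML, and ICLR. All of these papers were published after GPT-4's knowledge cut-off date and together contain 3,066 references. The researchers then analyzed the properties of the suggested references.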
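To make the anonymization step concrete, here is a minimal sketch of how in-text citations could be masked before a passage is shown to a model. The two regular expressions, the `mask_citations` helper, and the `[CITATION]` placeholder are assumptions made for illustration; the paper's actual preprocessing is not described in this summary and may differ.

```python
import re

# Hypothetical patterns (not from the paper): numeric markers such as
# [12] or [3, 7], and author-year markers such as (Smith et al., 2020).
NUMERIC = re.compile(r"\[\d+(?:\s*[,;-]\s*\d+)*\]")
AUTHOR_YEAR = re.compile(
    r"\([A-Z][A-Za-z-]+(?: et al\.| and [A-Z][A-Za-z-]+)?,? \d{4}[a-z]?\)"
)


def mask_citations(text, placeholder="[CITATION]"):
    """Replace each in-text citation with a neutral placeholder."""
    masked = NUMERIC.sub(placeholder, text)
    return AUTHOR_YEAR.sub(placeholder, masked)


sample = "Transformers [12] scale predictably (Kaplan et al., 2020)."
print(mask_citations(sample))
```

Real citation styles vary far more than these two patterns cover, so a production pipeline would more likely work from the paper's source markup than from regular expressions over plain text.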

Their analysis revealed two key biases in the model's citation behavior:

Preferential Attachment: The LLM was more likely to cite papers that already had a large number of citations, reflecting a "rich-get-richer" dynamic.
Matthew Effect: The model also exhibited a tendency to cite well-known, highly cited papers more frequently than lesser-known work, even when the lesser-known papers were more relevant to the target paper.

These biases were more pronounced in the LLM-generated citations than in the human-written citations for the same target papers, and the gap persisted after controlling for publication year, title length, number of authors, and venue. In other words, the model amplified the citation biases already present in its training data.
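One simple way to quantify such a gap is to compare the citation counts of the two reference sets. The sketch below uses made-up numbers and an assumed helper, `prob_superiority`, which estimates how often a randomly chosen LLM-suggested reference has more citations than a randomly chosen human-chosen one. It is an illustration only, not the statistical analysis used in the paper, and it does not control for any covariates.

```python
from statistics import median


def prob_superiority(a, b):
    """Share of (x, y) pairs with x > y, counting ties as half a win."""
    wins = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    return wins / (len(a) * len(b))


# Made-up citation counts, for illustration only (not data from the paper).
human_refs = [12, 40, 85, 150, 900]
llm_refs = [60, 200, 450, 1200, 5000]

print("median citations (human):", median(human_refs))
print("median citations (LLM):  ", median(llm_refs))
print("P(LLM ref > human ref):  ", prob_superiority(llm_refs, human_refs))
```

A value near 0.5 would indicate no systematic difference, while values closer to 1 would indicate that the LLM-suggested references tend to be more highly cited.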

The researchers also explored potential remedies for these biases, such as incorporating citation network information or using weighting schemes to encourage more diverse citations. However, they acknowledge that these biases are deeply rooted in the underlying data and may be challenging to mitigate fully.

Critical Analysis

The researchers provide a thorough and well-designed study that sheds light on an important issue in the use of large language models for academic tasks. By analyzing the citation patterns generated by GPT-4, they have uncovered biases that could have significant implications for the use of these models in scholarly writing and research.

One potential limitation of the study is its reliance on a single language model (GPT-4) and a specific set of target papers from four machine learning venues. While the findings may generalize to other LLMs, it would be valuable to explore how these biases manifest in a broader range of models, fields, and tasks.

Additionally, while the researchers suggest potential remedies for the biases, such as incorporating citation network information, it remains to be seen how effective these approaches would be in practice. Further research is needed to develop robust strategies for mitigating citation biases in LLM-generated content.

Another area for further exploration could be the potential impact of these biases on the broader scholarly ecosystem, such as the way research is discovered, evaluated, and cited. Understanding the systemic implications of heightened citation biases in LLM-assisted research is crucial for ensuring the responsible development and deployment of these powerful AI tools.

Conclusion

This research paper provides valuable insights into the citation biases inherent in large language models. By demonstrating how LLMs can amplify existing biases in human-written citations, the study highlights the need for careful consideration of these models' behaviors and potential pitfalls when used in academic contexts.

As LLMs continue to be explored for a wide range of applications, including academic writing and research support, understanding and mitigating the biases they exhibit will be crucial for ensuring the integrity and diversity of scholarly discourse. This paper lays the groundwork for further research and development in this important area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language Models

Sowmya S. Sundaram, Benjamin Solomon, Avani Khatri, Anisha Laumas, Purvesh Khatri, Mark A. Musen

Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets. This paper investigates the potential of large language models (LLMs), specifically GPT-4, to improve adherence to metadata standards. We conducted experiments on 200 random data records describing human samples relating to lung cancer from the NCBI BioSample repository, evaluating GPT-4's ability to suggest edits for adherence to metadata standards. We computed the adherence accuracy of field name-field value pairs through a peer review process, and we observed a marginal average improvement in adherence to the standard data dictionary from 79% to 80% (p<0.01). We then prompted GPT-4 with domain information in the form of the textual descriptions of CEDAR templates and recorded a significant improvement to 97% from 79% (p<0.01). These results indicate that, while LLMs may not be able to correct legacy metadata to ensure satisfactory adherence to standards when unaided, they do show promise for use in automated metadata curation when integrated with a structured knowledge base.

4/10/2024

cs.AI cs.CL cs.IR

An Empirical Analysis on Large Language Models in Debate Evaluation

Xinyi Liu, Pinxin Liu, Hangfeng He

In this study, we investigate the capabilities and inherent biases of advanced large language models (LLMs) such as GPT-3.5 and GPT-4 in the context of debate evaluation. We discover that LLM's performance exceeds humans and surpasses the performance of state-of-the-art methods fine-tuned on extensive datasets in debate evaluation. We additionally explore and analyze biases present in LLMs, including positional bias, lexical bias, order bias, which may affect their evaluative judgments. Our findings reveal a consistent bias in both GPT-3.5 and GPT-4 towards the second candidate response presented, attributed to prompt design. We also uncover lexical biases in both GPT-3.5 and GPT-4, especially when label sets carry connotations such as numerical or sequential, highlighting the critical need for careful label verbalizer selection in prompt design. Additionally, our analysis indicates a tendency of both models to favor the debate's concluding side as the winner, suggesting an end-of-discussion bias.

6/5/2024

cs.CL cs.AI

Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts

Fan Gao, Hang Jiang, Rui Yang, Qingcheng Zeng, Jinghui Lu, Moritz Blum, Dairui Liu, Tianwei She, Yuang Jiang, Irene Li

Educational materials such as survey articles in specialized fields like computer science traditionally require tremendous expert inputs and are therefore expensive to create and update. Recently, Large Language Models (LLMs) have achieved significant success across various general tasks. However, their effectiveness and limitations in the education domain are yet to be fully explored. In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science, focusing on a curated list of 99 topics. Automated benchmarks reveal that GPT-4 surpasses its predecessors, including GPT-3.5, PaLM2, and LLaMa2 by margins ranging from 2% to 20% in comparison to the established ground truth. We compare both human and GPT-based evaluation scores and provide in-depth analysis. While our findings suggest that GPT-created surveys are more contemporary and accessible than human-authored ones, certain limitations were observed. Notably, GPT-4, despite often delivering outstanding content, occasionally exhibited lapses like missing details or factual errors. At last, we compared the rating behavior between humans and GPT-4 and found systematic bias in using GPT evaluation.

5/24/2024

cs.CL

Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study

Lena Schmidt, Kaitlyn Hair, Sergio Graziozi, Fiona Campbell, Claudia Kapp, Alireza Khanteymoori, Dawn Craig, Mark Engelbert, James Thomas

This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Despite the recent surge of interest in LLMs there is still a lack of understanding of how to design LLM-based automation tools and how to robustly evaluate their performance. During the 2023 Evidence Synthesis Hackathon we conducted two feasibility studies. Firstly, to automatically extract study characteristics from human clinical, animal, and social science domain studies. We used two studies from each category for prompt-development; and ten for evaluation. Secondly, we used the LLM to predict Participants, Interventions, Controls and Outcomes (PICOs) labelled within 100 abstracts in the EBM-NLP dataset. Overall, results indicated an accuracy of around 80%, with some variability between domains (82% for human clinical, 80% for animal, and 72% for studies of human social sciences). Causal inference methods and study design were the data extraction items with the most errors. In the PICO study, participants and intervention/control showed high accuracy (>80%), outcomes were more challenging. Evaluation was done manually; scoring methods such as BLEU and ROUGE showed limited value. We observed variability in the LLMs predictions and changes in response quality. This paper presents a template for future evaluations of LLMs in the context of data extraction for systematic review automation. Our results show that there might be value in using LLMs, for example as second or third reviewers. However, caution is advised when integrating models such as GPT-4 into tools. Further research on stability and reliability in practical settings is warranted for each type of data that is processed by the LLM.

5/24/2024

cs.CL cs.AI