Efficacy of Large Language Models in Systematic Reviews

Read original: arXiv:2408.04646 - Published 8/12/2024 by Aaditya Shah, Shridhar Mehendale, Siddha Kanthi

Efficacy of Large Language Models in Systematic Reviews

Overview

Large language models (LLMs) have shown promising potential in various applications, including systematic reviews and scientific literature analysis.
This paper explores the efficacy of using LLMs for tasks in systematic reviews, such as document screening, data extraction, and quality assessment.
The researchers conducted experiments to evaluate the performance of LLMs compared to human experts in these systematic review tasks.

Plain English Explanation

Large language models (LLMs) are powerful artificial intelligence systems that can understand and generate human-like text. Researchers have been exploring how these LLMs can be used to streamline the process of conducting systematic reviews, which are in-depth analyses of all the available research on a particular topic.

In this paper, the researchers wanted to see how well LLMs could perform tasks that are typically done by human experts when conducting systematic reviews. These tasks include:

Document screening: Quickly identifying which research papers are relevant to the review based on their titles and abstracts.
Data extraction: Pulling key information from the full text of the relevant papers, such as the research methods and findings.
Quality assessment: Evaluating the overall quality and reliability of the included studies.

The researchers set up experiments to compare the performance of LLMs to that of human experts on these systematic review tasks. They wanted to see if the LLMs could match or even outperform the humans, which could make the systematic review process more efficient and scalable.

Technical Explanation

The researchers conducted a series of experiments to evaluate the efficacy of using large language models (LLMs) for tasks in systematic reviews. They compared the performance of LLMs to that of human experts in document screening, data extraction, and quality assessment.

For document screening, the researchers used an LLM to classify the relevance of article titles and abstracts, and compared its performance to that of human reviewers. They found that the LLM was able to achieve comparable accuracy in identifying relevant studies while significantly reducing the time and effort required.

In the data extraction task, the LLM was used to automatically extract key information from the full-text of the included studies, such as research methods and findings. The extracted data was then compared to the manually curated data, and the LLM demonstrated high accuracy in capturing the most salient details.

Finally, for quality assessment, the researchers had the LLM evaluate the methodological rigor and reliability of the included studies, and compared its assessments to those made by human experts. The results showed that the LLM was able to provide quality ratings that were well-aligned with the human judgments.

Critical Analysis

The researchers acknowledge several limitations and caveats in their study. First, the LLM models used were pre-trained on general text data, and the researchers note that further fine-tuning on domain-specific systematic review data could potentially improve the model's performance.

Additionally, the study was conducted on a relatively small set of systematic reviews in the environmental, social, and governance (ESG) domain. More research is needed to evaluate the generalizability of these findings to other domains and types of systematic reviews.

The researchers also highlight the importance of maintaining human oversight and input, even when utilizing powerful LLM tools. They emphasize that the LLM should be used as an augmentative tool to assist human experts, rather than as a replacement for their expertise and judgment.

Finally, the researchers caution that the use of LLMs in systematic reviews raises important ethical considerations, such as the potential for biases and the need for transparency and accountability in the model's decision-making processes.

Conclusion

This paper provides promising evidence for the potential of large language models to streamline and enhance the systematic review process. The LLMs demonstrated strong performance in tasks like document screening, data extraction, and quality assessment, suggesting they could significantly reduce the time and effort required by human experts.

However, the researchers also emphasize the importance of thoughtful implementation and ongoing human oversight. LLMs should be used as complementary tools to augment, rather than replace, the expertise of human reviewers. Additionally, further research is needed to explore the broader applicability of these findings and address potential ethical concerns.

Overall, this study highlights the exciting possibilities of leveraging advanced AI technologies like LLMs to improve the efficiency and rigor of systematic reviews, with important implications for evidence-based decision-making across various domains.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Efficacy of Large Language Models in Systematic Reviews

Aaditya Shah, Shridhar Mehendale, Siddha Kanthi

This study investigates the effectiveness of Large Language Models (LLMs) in interpreting existing literature through a systematic review of the relationship between Environmental, Social, and Governance (ESG) factors and financial performance. The primary objective is to assess how LLMs can replicate a systematic review on a corpus of ESG-focused papers. We compiled and hand-coded a database of 88 relevant papers published from March 2020 to May 2024. Additionally, we used a set of 238 papers from a previous systematic review of ESG literature from January 2015 to February 2020. We evaluated two current state-of-the-art LLMs, Meta AI's Llama 3 8B and OpenAI's GPT-4o, on the accuracy of their interpretations relative to human-made classifications on both sets of papers. We then compared these results to a Custom GPT and a fine-tuned GPT-4o Mini model using the corpus of 238 papers as training data. The fine-tuned GPT-4o Mini model outperformed the base LLMs by 28.3% on average in overall accuracy on prompt 1. At the same time, the Custom GPT showed a 3.0% and 15.7% improvement on average in overall accuracy on prompts 2 and 3, respectively. Our findings reveal promising results for investors and agencies to leverage LLMs to summarize complex evidence related to ESG investing, thereby enabling quicker decision-making and a more efficient market.

8/12/2024

💬

The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review

Dmitry Scherbakov, Nina Hubig, Vinita Jansari, Alexander Bakumenko, Leslie A. Lenert

Objective: This study aims to summarize the usage of Large Language Models (LLMs) in the process of creating a scientific review. We look at the range of stages in a review that can be automated and assess the current state-of-the-art research projects in the field. Materials and Methods: The search was conducted in June 2024 in PubMed, Scopus, Dimensions, and Google Scholar databases by human reviewers. Screening and extraction process took place in Covidence with the help of LLM add-on which uses OpenAI gpt-4o model. ChatGPT was used to clean extracted data and generate code for figures in this manuscript, ChatGPT and Scite.ai were used in drafting all components of the manuscript, except the methods and discussion sections. Results: 3,788 articles were retrieved, and 172 studies were deemed eligible for the final review. ChatGPT and GPT-based LLM emerged as the most dominant architecture for review automation (n=126, 73.2%). A significant number of review automation projects were found, but only a limited number of papers (n=26, 15.1%) were actual reviews that used LLM during their creation. Most citations focused on automation of a particular stage of review, such as Searching for publications (n=60, 34.9%), and Data extraction (n=54, 31.4%). When comparing pooled performance of GPT-based and BERT-based models, the former were better in data extraction with mean precision 83.0% (SD=10.4), and recall 86.0% (SD=9.8), while being slightly less accurate in title and abstract screening stage (Maccuracy=77.3%, SD=13.0). Discussion/Conclusion: Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. The results looked promising, and we anticipate that LLMs will change in the near future the way the scientific reviews are conducted.

9/10/2024

💬

From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management

Ning Li, Huaikang Zhou, Mingze Xu

This study explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance objectivity in organizational task performance evaluations. Through comparative analyses across two studies, including various task performance outputs, we demonstrate that LLMs can serve as a reliable and even superior alternative to human raters in evaluating knowledge-based performance outputs, which are a key contribution of knowledge workers. Our results suggest that GPT ratings are comparable to human ratings but exhibit higher consistency and reliability. Additionally, combined multiple GPT ratings on the same performance output show strong correlations with aggregated human performance ratings, akin to the consensus principle observed in performance evaluation literature. However, we also find that LLMs are prone to contextual biases, such as the halo effect, mirroring human evaluative biases. Our research suggests that while LLMs are capable of extracting meaningful constructs from text-based data, their scope is currently limited to specific forms of performance evaluation. By highlighting both the potential and limitations of LLMs, our study contributes to the discourse on AI role in management studies and sets a foundation for future research to refine AI theoretical and practical applications in management.

8/13/2024

Cutting Through the Clutter: The Potential of LLMs for Efficient Filtration in Systematic Literature Reviews

Lucas Joos, Daniel A. Keim, Maximilian T. Fischer

In academic research, systematic literature reviews are foundational and highly relevant, yet tedious to create due to the high volume of publications and labor-intensive processes involved. Systematic selection of relevant papers through conventional means like keyword-based filtering techniques can sometimes be inadequate, plagued by semantic ambiguities and inconsistent terminology, which can lead to sub-optimal outcomes. To mitigate the required extensive manual filtering, we explore and evaluate the potential of using Large Language Models (LLMs) to enhance the efficiency, speed, and precision of literature review filtering, reducing the amount of manual screening required. By using models as classification agents acting on a structured database only, we prevent common problems inherent in LLMs, such as hallucinations. We evaluate the real-world performance of such a setup during the construction of a recent literature survey paper with initially more than 8.3k potentially relevant articles under consideration and compare this with human performance on the same dataset. Our findings indicate that employing advanced LLMs like GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash, or Llama3 with simple prompting can significantly reduce the time required for literature filtering - from usually weeks of manual research to only a few minutes. Simultaneously, we crucially show that false negatives can indeed be controlled through a consensus scheme, achieving recalls >98.8% at or even beyond the typical human error threshold, thereby also providing for more accurate and relevant articles selected. Our research not only demonstrates a substantial improvement in the methodology of literature reviews but also sets the stage for further integration and extensive future applications of responsible AI in academic research practices.

7/16/2024