The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review

Read original: arXiv:2409.04600 - Published 9/10/2024 by Dmitry Scherbakov, Nina Hubig, Vinita Jansari, Alexander Bakumenko, Leslie A. Lenert

Overview

This study aims to summarize the use of Large Language Models (LLMs) in the process of creating a scientific review.
It looks at the different stages of a review that can be automated and assesses the current state-of-the-art research projects in this field.

Plain English Explanation

The paper explores how Large Language Models (LLMs) can be used to help streamline the scientific review process. Scientific reviews are an important part of advancing knowledge, but they can be time-consuming and labor-intensive to create.

The researchers wanted to see which parts of the review process could be automated using LLMs, which are powerful AI models trained on vast amounts of text data. They looked at things like searching for relevant publications, extracting key data, and even generating draft text for the review itself.

The results suggest that LLMs can streamline many aspects of the review process and could make reviews more efficient and less resource-intensive to produce. However, the technology is still relatively new, and more research is needed to fully assess the efficacy of LLMs for systematic reviews.

Technical Explanation

The researchers conducted a systematic search in June 2024 across several academic databases (PubMed, Scopus, Dimensions, Google Scholar) to identify research projects related to using LLMs for scientific review automation.

Screening and data extraction took place in Covidence with the help of an LLM add-on that uses OpenAI's gpt-4o model. ChatGPT was used to clean the extracted data and to generate code for the figures, and ChatGPT and Scite.ai were used to draft all components of the manuscript except the methods and discussion sections.

Out of 3,788 articles retrieved, 172 studies were deemed eligible for the final review. ChatGPT and other GPT-based LLMs were the dominant architecture for review automation (n=126, 73.2%). Although many review automation projects were found, only a limited number of papers (n=26, 15.1%) were actual reviews that used an LLM during their creation. Most studies focused on automating a particular stage of the review process, such as searching for publications (n=60, 34.9%) and data extraction (n=54, 31.4%).

When the pooled performance of GPT-based and BERT-based models was compared, the GPT-based models were better at data extraction (mean precision 83.0%, SD=10.4; mean recall 86.0%, SD=9.8) but slightly less accurate at title and abstract screening (mean accuracy 77.3%, SD=13.0).
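
The figures above are means of metrics reported across studies. As a rough illustration of how such numbers could be derived, the sketch below computes precision, recall, and accuracy from confusion-matrix counts and then averages them across studies. The counts and the helper names (`precision`, `recall`, `accuracy`, `pooled`) are hypothetical and not taken from the paper, whose actual pooling procedure may differ.

```python
from statistics import mean, stdev


def precision(tp: int, fp: int) -> float:
    """Share of items the model flagged that were truly relevant."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of truly relevant items that the model flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all decisions (include or exclude) that were correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def pooled(values: list[float]) -> tuple[float, float]:
    """Unweighted mean and sample standard deviation across studies."""
    return mean(values), stdev(values)


# Hypothetical confusion counts for three studies (illustration only).
studies = [
    {"tp": 80, "fp": 20, "fn": 10, "tn": 90},
    {"tp": 45, "fp": 5, "fn": 15, "tn": 135},
    {"tp": 30, "fp": 10, "fn": 10, "tn": 50},
]

precisions = [precision(s["tp"], s["fp"]) for s in studies]
recalls = [recall(s["tp"], s["fn"]) for s in studies]
accuracies = [accuracy(s["tp"], s["fp"], s["fn"], s["tn"]) for s in studies]

mean_precision, sd_precision = pooled(precisions)
mean_recall, sd_recall = pooled(recalls)
mean_accuracy, sd_accuracy = pooled(accuracies)

print(f"precision: mean={mean_precision:.3f}, SD={sd_precision:.3f}")
print(f"recall:    mean={mean_recall:.3f}, SD={sd_recall:.3f}")
print(f"accuracy:  mean={mean_accuracy:.3f}, SD={sd_accuracy:.3f}")
```

An unweighted mean treats every study equally regardless of its size; weighting by sample size is a common alternative and would give different pooled values.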

Critical Analysis

The study provides a comprehensive overview of the current state of research on using LLMs for scientific review automation. The authors acknowledge that, while the results are promising, the technology is still relatively new and more work is needed to fully assess its efficacy for systematic reviews.

One potential limitation is that the search was conducted in June 2024, so the findings may not fully reflect the latest developments in this rapidly evolving field. Additionally, the study focused on a broad range of review automation projects, rather than deeper dives into the performance and limitations of specific LLM-based approaches.

Further research could explore the impact of different LLM architectures, as well as how LLMs perform compared to human reviewers across the various stages of the review process. Investigating the potential biases or errors introduced by LLMs in the review process would also be an important area for future study.

Conclusion

This study provides a valuable snapshot of the current state of research on using Large Language Models (LLMs) to automate various stages of the scientific review process. The results suggest that LLMs have significant potential to streamline reviews and make them more efficient, but more work is needed to fully assess their efficacy and limitations.

As LLM research continues to evolve rapidly, researchers and practitioners will need to follow the latest developments and explore how these models can make the scientific review process more efficient and effective.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review

Dmitry Scherbakov, Nina Hubig, Vinita Jansari, Alexander Bakumenko, Leslie A. Lenert

Objective: This study aims to summarize the usage of Large Language Models (LLMs) in the process of creating a scientific review. We look at the range of stages in a review that can be automated and assess the current state-of-the-art research projects in the field. Materials and Methods: The search was conducted in June 2024 in PubMed, Scopus, Dimensions, and Google Scholar databases by human reviewers. Screening and extraction process took place in Covidence with the help of LLM add-on which uses OpenAI gpt-4o model. ChatGPT was used to clean extracted data and generate code for figures in this manuscript, ChatGPT and Scite.ai were used in drafting all components of the manuscript, except the methods and discussion sections. Results: 3,788 articles were retrieved, and 172 studies were deemed eligible for the final review. ChatGPT and GPT-based LLM emerged as the most dominant architecture for review automation (n=126, 73.2%). A significant number of review automation projects were found, but only a limited number of papers (n=26, 15.1%) were actual reviews that used LLM during their creation. Most citations focused on automation of a particular stage of review, such as Searching for publications (n=60, 34.9%), and Data extraction (n=54, 31.4%). When comparing pooled performance of GPT-based and BERT-based models, the former were better in data extraction with mean precision 83.0% (SD=10.4), and recall 86.0% (SD=9.8), while being slightly less accurate in title and abstract screening stage (mean accuracy 77.3%, SD=13.0). Discussion/Conclusion: Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. The results looked promising, and we anticipate that LLMs will change in the near future the way the scientific reviews are conducted.

9/10/2024

Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study

Lena Schmidt, Kaitlyn Hair, Sergio Graziozi, Fiona Campbell, Claudia Kapp, Alireza Khanteymoori, Dawn Craig, Mark Engelbert, James Thomas

This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Despite the recent surge of interest in LLMs there is still a lack of understanding of how to design LLM-based automation tools and how to robustly evaluate their performance. During the 2023 Evidence Synthesis Hackathon we conducted two feasibility studies. Firstly, to automatically extract study characteristics from human clinical, animal, and social science domain studies. We used two studies from each category for prompt-development; and ten for evaluation. Secondly, we used the LLM to predict Participants, Interventions, Controls and Outcomes (PICOs) labelled within 100 abstracts in the EBM-NLP dataset. Overall, results indicated an accuracy of around 80%, with some variability between domains (82% for human clinical, 80% for animal, and 72% for studies of human social sciences). Causal inference methods and study design were the data extraction items with the most errors. In the PICO study, participants and intervention/control showed high accuracy (>80%), outcomes were more challenging. Evaluation was done manually; scoring methods such as BLEU and ROUGE showed limited value. We observed variability in the LLMs predictions and changes in response quality. This paper presents a template for future evaluations of LLMs in the context of data extraction for systematic review automation. Our results show that there might be value in using LLMs, for example as second or third reviewers. However, caution is advised when integrating models such as GPT-4 into tools. Further research on stability and reliability in practical settings is warranted for each type of data that is processed by the LLM.

5/24/2024

Cutting Through the Clutter: The Potential of LLMs for Efficient Filtration in Systematic Literature Reviews

Lucas Joos, Daniel A. Keim, Maximilian T. Fischer

In academic research, systematic literature reviews are foundational and highly relevant, yet tedious to create due to the high volume of publications and labor-intensive processes involved. Systematic selection of relevant papers through conventional means like keyword-based filtering techniques can sometimes be inadequate, plagued by semantic ambiguities and inconsistent terminology, which can lead to sub-optimal outcomes. To mitigate the required extensive manual filtering, we explore and evaluate the potential of using Large Language Models (LLMs) to enhance the efficiency, speed, and precision of literature review filtering, reducing the amount of manual screening required. By using models as classification agents acting on a structured database only, we prevent common problems inherent in LLMs, such as hallucinations. We evaluate the real-world performance of such a setup during the construction of a recent literature survey paper with initially more than 8.3k potentially relevant articles under consideration and compare this with human performance on the same dataset. Our findings indicate that employing advanced LLMs like GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash, or Llama3 with simple prompting can significantly reduce the time required for literature filtering - from usually weeks of manual research to only a few minutes. Simultaneously, we crucially show that false negatives can indeed be controlled through a consensus scheme, achieving recalls >98.8% at or even beyond the typical human error threshold, thereby also providing for more accurate and relevant articles selected. Our research not only demonstrates a substantial improvement in the methodology of literature reviews but also sets the stage for further integration and extensive future applications of responsible AI in academic research practices.

7/16/2024

Automated Review Generation Method Based on Large Language Models

Shican Wu, Xiao Ma, Dehui Luo, Lulu Li, Xiangcheng Shi, Xin Chang, Xiaoyun Lin, Ran Luo, Chunlei Pei, Zhi-Jian Zhao, Jinlong Gong

Literature research, vital for scientific advancement, is overwhelmed by the vast ocean of available information. Addressing this, we propose an automated review generation method based on Large Language Models (LLMs) to streamline literature processing and reduce cognitive load. In case study on propane dehydrogenation (PDH) catalysts, our method swiftly generated comprehensive reviews from 343 articles, averaging seconds per article per LLM account. Extended analysis of 1041 articles provided deep insights into catalysts' composition, structure, and performance. Recognizing LLMs' hallucinations, we employed a multi-layered quality control strategy, ensuring our method's reliability and effective hallucination mitigation. Expert verification confirms the accuracy and citation integrity of generated reviews, demonstrating LLM hallucination risks reduced to below 0.5% with over 95% confidence. Released Windows application enables one-click review generation, aiding researchers in tracking advancements and recommending literature. This approach showcases LLMs' role in enhancing scientific research productivity and sets the stage for further exploration.

7/31/2024