The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews

2404.15667

Published 5/9/2024 by Aleksi Huotala, Miikka Kuutila, Paul Ralph, Mika Mantyla

Abstract

Systematic review (SR) is a popular research method in software engineering (SE). However, conducting an SR takes an average of 67 weeks. Thus, automating any step of the SR process could reduce the effort associated with SRs. Our objective is to investigate if Large Language Models (LLMs) can accelerate title-abstract screening by simplifying abstracts for human screeners, and automating title-abstract screening. We performed an experiment where humans screened titles and abstracts for 20 papers with both original and simplified abstracts from a prior SR. The experiment with human screeners was reproduced with GPT-3.5 and GPT-4 LLMs to perform the same screening tasks. We also studied if different prompting techniques (Zero-shot (ZS), One-shot (OS), Few-shot (FS), and Few-shot with Chain-of-Thought (FS-CoT)) improve the screening performance of LLMs. Lastly, we studied if redesigning the prompt used in the LLM reproduction of screening leads to improved performance. Text simplification did not increase the screeners' screening performance, but reduced the time used in screening. Screeners' scientific literacy skills and researcher status predict screening performance. Some LLM and prompt combinations perform as well as human screeners in the screening tasks. Our results indicate that the GPT-4 LLM is better than its predecessor, GPT-3.5. Additionally, Few-shot and One-shot prompting outperform Zero-shot prompting. Using LLMs for text simplification in the screening process does not significantly improve human performance. Using LLMs to automate title-abstract screening seems promising, but current LLMs are not significantly more accurate than human screeners. To recommend the use of LLMs in the screening process of SRs, more research is needed. We recommend future SR studies publish replication packages with screening data to enable more conclusive experimenting with LLM screening.

Overview

The paper explores the promise and challenges of using large language models (LLMs) to accelerate the title-abstract screening step of systematic reviews in software engineering.
Systematic reviews are a popular research method, but they are slow: the paper cites an average of 67 weeks to conduct one.
The researchers test two uses of GPT-3.5 and GPT-4: simplifying abstracts for human screeners, and automating the screening decisions themselves.

Plain English Explanation

The paper looks at how AI language models, specifically GPT-3.5 and GPT-4, could help speed up systematic reviews. Systematic reviews are a popular research method in software engineering, but the initial step of screening candidate studies by title and abstract is time-consuming and tedious work.

The researchers tested two ways the models could help. In the first, the model rewrites each abstract in simpler language so that human screeners can read it faster. In the second, the model makes the include-or-exclude decision itself, and the researchers tried several ways of wording the request, with and without worked examples.

The goal is to see if these AI tools can make screening faster without sacrificing quality. The results are mixed: simplified abstracts saved screeners time but did not make their decisions more accurate, and while some model setups screened as well as humans, current models were not significantly more accurate than the human screeners.

Technical Explanation

The paper investigates whether large language models (LLMs), specifically GPT-3.5 and GPT-4, can accelerate title-abstract screening in systematic reviews (SRs). In the experiment, human screeners assessed the titles and abstracts of 20 papers from a prior SR, using both original and LLM-simplified abstracts. The same screening tasks were then reproduced with GPT-3.5 and GPT-4.

The researchers explore two ways LLMs could assist in this process:

Simplifying study abstracts so that human screeners can assess relevance more quickly
Classifying studies as includable or excludable from the title and abstract, in place of a human screener

For the automated screening, they compare four prompting techniques: Zero-shot (ZS), One-shot (OS), Few-shot (FS), and Few-shot with Chain-of-Thought (FS-CoT). They also test whether redesigning the screening prompt improves performance.
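The prompting techniques named in the abstract differ mainly in how many already-screened examples are placed in front of the paper to be judged: none for zero-shot, one for one-shot, several for few-shot, with a short written rationale added to each example for chain-of-thought. The Python sketch below illustrates that difference by building the prompt text only. It is not the prompt used in the paper: the wording, the `LabelledExample` structure, the `build_screening_prompt` function, and the sample papers are all assumptions made for illustration, and no model is called.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class LabelledExample:
    """A previously screened paper used as an in-prompt example (hypothetical structure)."""
    title: str
    abstract: str
    decision: str                    # "include" or "exclude"
    rationale: Optional[str] = None  # only shown in chain-of-thought prompts


def format_paper(title: str, abstract: str) -> str:
    return f"Title: {title}\nAbstract: {abstract}"


def build_screening_prompt(
    criteria: str,
    title: str,
    abstract: str,
    examples: Sequence[LabelledExample] = (),
    chain_of_thought: bool = False,
) -> str:
    """Build a screening prompt; the number of examples decides ZS / OS / FS."""
    parts = [
        "You are screening papers for a systematic review.",
        f"Inclusion criteria: {criteria}",
    ]
    if chain_of_thought:
        parts.append("Explain your reasoning briefly, then give the decision.")
    else:
        parts.append("Answer with exactly one word: include or exclude.")
    # Each labelled example is shown in the same layout as the target paper.
    for ex in examples:
        parts.append(format_paper(ex.title, ex.abstract))
        if chain_of_thought and ex.rationale:
            parts.append(f"Reasoning: {ex.rationale}")
        parts.append(f"Decision: {ex.decision}")
    # The paper to be screened comes last, with the answer slot left open.
    parts.append(format_paper(title, abstract))
    parts.append("Reasoning:" if chain_of_thought else "Decision:")
    return "\n\n".join(parts)


# Made-up sample data, for illustration only.
criteria = "Empirical studies of software testing practices."
examples = [
    LabelledExample("A survey of unit testing habits", "We surveyed 200 developers.",
                    "include", "It reports empirical data on testing."),
    LabelledExample("A new sorting algorithm", "We propose a faster sort.",
                    "exclude", "It is not about software testing."),
]
target = ("Flaky tests in CI pipelines", "We analyse 1,000 builds for flaky tests.")

zero_shot = build_screening_prompt(criteria, *target)
one_shot = build_screening_prompt(criteria, *target, examples=examples[:1])
few_shot_cot = build_screening_prompt(criteria, *target, examples=examples,
                                      chain_of_thought=True)
```

The resulting string would be sent to whichever model is under test. Keeping prompt construction separate from the model call makes it straightforward to run the same papers through every technique and compare the decisions.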

According to the abstract, text simplification did not improve the screeners' performance, but it did reduce the time they spent screening. Screeners' scientific literacy skills and researcher status predicted their screening performance. On the automation side, some LLM and prompt combinations performed as well as human screeners, GPT-4 outperformed GPT-3.5, and Few-shot and One-shot prompting outperformed Zero-shot prompting. Current LLMs were not, however, significantly more accurate than human screeners.

To further the research in this area, the authors recommend that future SR studies publish replication packages with screening data, which would enable more conclusive experiments with LLM screening. They state that more research is needed before LLMs can be recommended for the screening process.
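Evaluating an LLM screener comes down to comparing its include/exclude decisions with reference decisions on the same papers. The Python sketch below computes a few standard agreement measures for that comparison. Which metrics the paper actually reports is not stated in this summary, so the choice of accuracy, precision, recall, and Cohen's kappa is an assumption, as are the `screening_metrics` function and the made-up decision lists.

```python
from typing import Dict, Sequence


def screening_metrics(
    gold: Sequence[str],
    predicted: Sequence[str],
    positive: str = "include",
) -> Dict[str, float]:
    """Compare predicted screening decisions with reference decisions."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    n = len(gold)
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "kappa": kappa}


# Made-up decisions for eight papers, for illustration only.
gold = ["include", "include", "exclude", "exclude",
        "exclude", "include", "exclude", "exclude"]
llm = ["include", "exclude", "exclude", "exclude",
       "include", "include", "exclude", "exclude"]

metrics = screening_metrics(gold, llm)
```

In screening, recall on the "include" class usually matters most, because a relevant paper that is wrongly excluded is lost to the review, whereas a wrongly included paper can still be dropped at a later stage.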

Critical Analysis

The paper provides an empirical look at the promise and challenges of using LLMs to accelerate the screening step of systematic reviews. Its conclusions are measured: the researchers report both what worked and what did not.

One strength of the paper is that it compares LLMs against human screeners on the same task rather than evaluating the models in isolation. The authors acknowledge that current LLMs are not significantly more accurate than human screeners and that more research is needed before LLM screening can be recommended.

However, the evidence base is small. The experiment covers 20 papers from a prior SR, so it is unclear how well the results generalize to other review topics or to larger sets of candidate studies. The authors' own call for replication packages with screening data suggests that they see the same gap.

Additionally, the paper does not address the ethical implications of using AI systems in a domain as critical as evidence-based decision-making. Issues around transparency, accountability, and the need for human oversight should be more thoroughly discussed.

Overall, the paper serves as a valuable starting point for further research in this area. By highlighting both the promise and the challenges, it encourages readers to think critically about the use of LLMs in systematic reviews and to approach this technology with appropriate caution and diligence.

Conclusion

This paper examines whether large language models (LLMs) can accelerate the title-abstract screening step of systematic reviews. The researchers find promising signs: simplified abstracts reduced screening time, and some LLM and prompt combinations performed as well as human screeners.

The limits are equally clear. Text simplification did not improve screening performance, and current LLMs are not significantly more accurate than human screeners, so the authors stop short of recommending LLMs for screening. They call for more research and recommend that future SR studies publish replication packages with screening data to enable more conclusive experiments.

Overall, the paper serves as a valuable resource for researchers and practitioners interested in leveraging AI to enhance evidence-based decision-making. By thoughtfully considering both the upsides and the downsides, it encourages the scientific community to approach this technology with the appropriate level of caution and diligence. As the field of AI-assisted research synthesis continues to evolve, this paper provides a solid foundation for further exploration and development.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Automating Research Synthesis with Domain-Specific Large Language Model Fine-Tuning

Teo Susnjak, Peter Hwang, Napoleon H. Reyes, Andre L. C. Barczak, Timothy R. McIntosh, Surangika Ranathunga

This research pioneers the use of fine-tuned Large Language Models (LLMs) to automate Systematic Literature Reviews (SLRs), presenting a significant and novel contribution in integrating AI to enhance academic research methodologies. Our study employed the latest fine-tuning methodologies together with open-sourced LLMs, and demonstrated a practical and efficient approach to automating the final execution stages of an SLR process that involves knowledge synthesis. The results maintained high fidelity in factual accuracy in LLM responses, and were validated through the replication of an existing PRISMA-conforming SLR. Our research proposed solutions for mitigating LLM hallucination and proposed mechanisms for tracking LLM responses to their sources of information, thus demonstrating how this approach can meet the rigorous demands of scholarly research. The findings ultimately confirmed the potential of fine-tuned LLMs in streamlining various labor-intensive processes of conducting literature reviews. Given the potential of this approach and its applicability across all research domains, this foundational study also advocated for updating PRISMA reporting guidelines to incorporate AI-driven processes, ensuring methodological transparency and reliability in future SLRs. This study broadens the appeal of AI-enhanced tools across various academic and research fields, setting a new standard for conducting comprehensive and accurate literature reviews with more efficiency in the face of ever-increasing volumes of academic studies.

4/16/2024

cs.CL cs.DL cs.IR

Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study

Lena Schmidt, Kaitlyn Hair, Sergio Graziozi, Fiona Campbell, Claudia Kapp, Alireza Khanteymoori, Dawn Craig, Mark Engelbert, James Thomas

This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Despite the recent surge of interest in LLMs there is still a lack of understanding of how to design LLM-based automation tools and how to robustly evaluate their performance. During the 2023 Evidence Synthesis Hackathon we conducted two feasibility studies. Firstly, to automatically extract study characteristics from human clinical, animal, and social science domain studies. We used two studies from each category for prompt-development; and ten for evaluation. Secondly, we used the LLM to predict Participants, Interventions, Controls and Outcomes (PICOs) labelled within 100 abstracts in the EBM-NLP dataset. Overall, results indicated an accuracy of around 80%, with some variability between domains (82% for human clinical, 80% for animal, and 72% for studies of human social sciences). Causal inference methods and study design were the data extraction items with the most errors. In the PICO study, participants and intervention/control showed high accuracy (>80%), outcomes were more challenging. Evaluation was done manually; scoring methods such as BLEU and ROUGE showed limited value. We observed variability in the LLM's predictions and changes in response quality. This paper presents a template for future evaluations of LLMs in the context of data extraction for systematic review automation. Our results show that there might be value in using LLMs, for example as second or third reviewers. However, caution is advised when integrating models such as GPT-4 into tools. Further research on stability and reliability in practical settings is warranted for each type of data that is processed by the LLM.

5/24/2024

cs.CL cs.AI

A Comparative Study of Quality Evaluation Methods for Text Summarization

Huyen Nguyen, Haihua Chen, Lavanya Pobbathi, Junhua Ding

Evaluating text summarization has been a challenging task in natural language processing (NLP). Automatic metrics which heavily rely on reference summaries are not suitable in many situations, while human evaluation is time-consuming and labor-intensive. To bridge this gap, this paper proposes a novel method based on large language models (LLMs) for evaluating text summarization. We also conduct a comparative study on eight automatic metrics, human evaluation, and our proposed LLM-based method. Seven different types of state-of-the-art (SOTA) summarization models were evaluated. We perform extensive experiments and analysis on datasets with patent documents. Our results show that LLMs evaluation aligns closely with human evaluation, while widely-used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency. Based on the empirical comparison, we propose an LLM-powered framework for automatically evaluating and improving text summarization, which is beneficial and could attract wide attention among the community.

7/2/2024

cs.CL cs.AI

Using LLMs in Software Requirements Specifications: An Empirical Evaluation

Madhava Krishna, Bhagesh Gaur, Arsh Verma, Pankaj Jalote

The creation of a Software Requirements Specification (SRS) document is important for any software development project. Given the recent prowess of Large Language Models (LLMs) in answering natural language queries and generating sophisticated textual outputs, our study explores their capability to produce accurate, coherent, and structured drafts of these documents to accelerate the software development lifecycle. We assess the performance of GPT-4 and CodeLlama in drafting an SRS for a university club management system and compare it against human benchmarks using eight distinct criteria. Our results suggest that LLMs can match the output quality of an entry-level software engineer to generate an SRS, delivering complete and consistent drafts. We also evaluate the capabilities of LLMs to identify and rectify problems in a given requirements document. Our experiments indicate that GPT-4 is capable of identifying issues and giving constructive feedback for rectifying them, while CodeLlama's results for validation were not as encouraging. We repeated the generation exercise for four distinct use cases to study the time saved by employing LLMs for SRS generation. The experiment demonstrates that LLMs may facilitate a significant reduction in development time for entry-level software engineers. Hence, we conclude that the LLMs can be gainfully used by software engineers to increase productivity by saving time and effort in generating, validating and rectifying software requirements.

4/30/2024

cs.SE cs.AI