Performance Evaluation of Lightweight Open-source Large Language Models in Pediatric Consultations: A Comparative Analysis

Read original: arXiv:2407.15862 - Published 7/24/2024 by Qiuhong Wei, Ying Cui, Mengwei Ding, Yanqin Wang, Lingling Xiang, Zhengxiong Yao, Ceran Chen, Ying Long, Zhezhen Jin, Ximing Xu

Overview

Large language models (LLMs) have potential applications in medicine, but data privacy and computational requirements limit their use in healthcare settings.
Lightweight, open-source LLMs could provide a solution, but their performance in pediatric settings has not been extensively studied.
This study compared the performance of several LLMs, including lightweight and large-scale models, in answering pediatric healthcare questions.

Plain English Explanation

Open-source and lightweight language models are emerging as potential alternatives to large, proprietary language models like ChatGPT. These smaller models could be more practical for use in healthcare institutions, which often have concerns about data privacy and the computational power required for large language models.

In this study, the researchers wanted to see how well these lightweight models answer questions about pediatric healthcare compared with both a larger open-source model and the widely used ChatGPT. They randomly selected 250 questions from an online medical forum, 10 from each of 25 pediatric departments, and had four language models (the lightweight ChatGLM3-6B and Vicuna-7B, the larger Vicuna-13B, and ChatGPT-3.5) provide answers in Chinese.

The researchers found that the lightweight ChatGLM3-6B model was more accurate and complete than both Vicuna models, including the larger Vicuna-13B, with over 98.4% of its responses rated as safe. Asking each question a second time confirmed these results.

This suggests that lightweight language models could be a promising option for use in pediatric healthcare settings, where data privacy and computational constraints are important considerations. However, the study also highlighted the need for continued development to narrow the performance gap between these lightweight models and larger, proprietary alternatives like ChatGPT.

Technical Explanation

This cross-sectional study evaluated four language models (two lightweight open-source models, one larger open-source model, and one proprietary model) on pediatric healthcare questions. The researchers randomly selected 250 patient consultation questions from a public online medical forum, 10 from each of 25 pediatric departments, covering the period from December 1, 2022, to October 30, 2023.
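
The sampling design described above (an equal number of questions drawn at random from every department) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_questions`, the `forum` dictionary, and the placeholder question strings are all hypothetical, and the fixed seed is an assumption added only so the draw is repeatable.

```python
import random

def sample_questions(questions_by_department, per_department=10, seed=0):
    """Draw the same number of questions at random from every department."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the draw can be repeated
    sample = []
    for department in sorted(questions_by_department):
        pool = questions_by_department[department]
        if len(pool) < per_department:
            raise ValueError(f"{department} has only {len(pool)} questions")
        # rng.sample draws without replacement, so no question is picked twice
        for question in rng.sample(pool, per_department):
            sample.append((department, question))
    return sample

# Hypothetical forum data: 25 departments with 40 placeholder questions each.
forum = {
    f"dept_{i:02d}": [f"question_{i:02d}_{j}" for j in range(40)]
    for i in range(25)
}
selected = sample_questions(forum)
print(len(selected))  # 25 departments x 10 questions = 250
```

Sampling per department rather than from the pooled forum keeps every department equally represented, which matches the 10-per-department design reported in the study.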

The language models tested were the lightweight open-source ChatGLM3-6B and Vicuna-7B, the larger Vicuna-13B, and the widely-used proprietary ChatGPT-3.5. These models independently provided answers to the questions in Chinese between November 1, 2023, and November 7, 2023. To assess reproducibility, each inquiry was replicated once.
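
One simple way to look at reproducibility when every question is asked twice is the share of questions that receive the same rating in both rounds. The sketch below assumes this percent-agreement measure purely for illustration; the summary does not say which reproducibility statistic the authors used, and the function name `agreement_rate` and the example ratings are hypothetical.

```python
def agreement_rate(first_round, second_round):
    """Fraction of questions whose rating is identical in both rounds."""
    if len(first_round) != len(second_round):
        raise ValueError("both rounds must rate the same questions")
    if not first_round:
        raise ValueError("at least one rating is required")
    matches = sum(a == b for a, b in zip(first_round, second_round))
    return matches / len(first_round)

# Hypothetical ratings for five questions, first and repeated inquiry.
first = [5, 4, 4, 3, 5]
second = [5, 4, 3, 3, 5]
print(agreement_rate(first, second))  # 4 of 5 ratings match -> 0.8
```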

The researchers found that the ChatGLM3-6B model demonstrated higher accuracy and completeness than the Vicuna-13B and Vicuna-7B models (p < .05), with over 98.4% of its responses being rated as safe. The repetition of inquiries confirmed these findings.
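
The safety figure is a plain proportion: the number of responses rated safe divided by the number of responses rated. A minimal sketch, assuming each response carries a single label; the label strings `"safe"` and `"unsafe"`, the function name `safe_rate`, and the example counts are hypothetical and are not taken from the paper.

```python
from collections import Counter

def safe_rate(labels):
    """Return the fraction of responses labeled 'safe'."""
    if not labels:
        raise ValueError("at least one label is required")
    return Counter(labels)["safe"] / len(labels)

# Hypothetical example: 247 of 250 responses rated safe.
labels = ["safe"] * 247 + ["unsafe"] * 3
print(f"{safe_rate(labels):.1%}")  # 98.8%
```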

These results suggest that lightweight, open-source language models can be a promising solution for pediatric healthcare applications, where data privacy and computational burden are important considerations. However, the observed gap in performance between the lightweight models and the larger-scale proprietary ChatGPT-3.5 underscores the need for continued development efforts to further improve the capabilities of these open-source alternatives.

Critical Analysis

The study provides encouraging evidence that lightweight, open-source language models can be a viable option for pediatric healthcare applications, where data privacy and computational constraints are key concerns. The researchers' finding that the ChatGLM3-6B model outperformed the larger Vicuna models in terms of accuracy and completeness is particularly notable.

However, the study also highlights the ongoing need for further development and improvement of these open-source models. The observed performance gap between the lightweight models and the larger, proprietary ChatGPT-3.5 suggests that there is still room for advancement in the capabilities of the open-source alternatives.

Additionally, the study was limited to a specific set of pediatric healthcare questions and a relatively short time period. Further research would be needed to assess the models' performance across a wider range of medical domains and over a longer time frame to ensure the reliability and generalizability of the findings.

It would also be valuable to explore the potential biases or limitations of the language models, particularly in the context of sensitive healthcare information and vulnerable pediatric populations. Ensuring the safety and ethical use of these technologies in medical settings should be a key priority.

Conclusion

This study demonstrates the potential for lightweight, open-source language models to play a role in pediatric healthcare, where data privacy and computational burden are significant concerns. ChatGLM3-6B outperformed the larger open-source Vicuna-13B as well as Vicuna-7B, which suggests that lightweight options could be a promising solution.

However, the observed gap in capabilities between the open-source and proprietary models underscores the need for continued development and improvement of the open-source alternatives. Ongoing research and collaboration between the technology and healthcare sectors will be crucial to further refine these models and ensure their safe and effective deployment in clinical settings.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Performance Evaluation of Lightweight Open-source Large Language Models in Pediatric Consultations: A Comparative Analysis

Qiuhong Wei, Ying Cui, Mengwei Ding, Yanqin Wang, Lingling Xiang, Zhengxiong Yao, Ceran Chen, Ying Long, Zhezhen Jin, Ximing Xu

Large language models (LLMs) have demonstrated potential applications in medicine, yet data privacy and computational burden limit their deployment in healthcare institutions. Open-source and lightweight versions of LLMs emerge as potential solutions, but their performance, particularly in pediatric settings, remains underexplored. In this cross-sectional study, 250 patient consultation questions were randomly selected from a public online medical forum, with 10 questions from each of 25 pediatric departments, spanning from December 1, 2022, to October 30, 2023. Two lightweight open-source LLMs, ChatGLM3-6B and Vicuna-7B, along with a larger-scale model, Vicuna-13B, and the widely used proprietary ChatGPT-3.5, independently answered these questions in Chinese between November 1, 2023, and November 7, 2023. To assess reproducibility, each inquiry was replicated once. We found that ChatGLM3-6B demonstrated higher accuracy and completeness than Vicuna-13B and Vicuna-7B (P < .05), with over 98.4% of responses being rated as safe. Repetition of inquiries confirmed these findings. In conclusion, lightweight LLMs demonstrate promising application in pediatric healthcare. However, the observed gap between lightweight and large-scale proprietary LLMs underscores the need for continued development efforts.

7/24/2024

Comparative Analysis of Open-Source Language Models in Summarizing Medical Text Data

Yuhao Chen, Zhimu Wang, Bo Wen, Farhana Zulkernine

Unstructured text in medical notes and dialogues contains rich information. Recent advancements in Large Language Models (LLMs) have demonstrated superior performance in question answering and summarization tasks on unstructured text data, outperforming traditional text analysis approaches. However, there is a lack of scientific studies in the literature that methodically evaluate and report on the performance of different LLMs, specifically for domain-specific data such as medical chart notes. We propose an evaluation approach to analyze the performance of open-source LLMs such as Llama2 and Mistral for medical summarization tasks, using GPT-4 as an assessor. Our innovative approach to quantitative evaluation of LLMs can enable quality control, support the selection of effective LLMs for specific tasks, and advance knowledge discovery in digital health.

5/31/2024

Large Language Models in Healthcare: A Comprehensive Benchmark

Andrew Liu, Hongjian Zhou, Yining Hua, Omid Rohanian, Anshul Thakur, Lei Clifton, David A. Clifton

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and complex clinical tasks that are close to real-world practice, i.e., referral QA, treatment recommendation, hospitalization (long document) summarization, patient education, pharmacology QA and drug interaction for emerging drugs. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs.

6/27/2024

Answering real-world clinical questions using large language model based systems

Yen Sia Low (Atropos Health, New York NY, USA), Michael L. Jackson (Atropos Health, New York NY, USA), Rebecca J. Hyde (Atropos Health, New York NY, USA), Robert E. Brown (Atropos Health, New York NY, USA), Neil M. Sanghavi (Atropos Health, New York NY, USA), Julian D. Baldwin (Atropos Health, New York NY, USA), C. William Pike (Atropos Health, New York NY, USA), Jananee Muralidharan (Atropos Health, New York NY, USA), Gavin Hui (Atropos Health, New York NY, USA, Department of Medicine, University of California, Los Angeles CA, USA), Natasha Alexander (Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada), Hadeel Hassan (Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada), Rahul V. Nene (Department of Emergency Medicine, University of California, San Diego CA, USA), Morgan Pike (Department of Emergency Medicine, University of Michigan, Ann Arbor MI, USA), Courtney J. Pokrzywa (Department of Surgery, Columbia University, New York NY, USA), Shivam Vedak (Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA), Adam Paul Yan (Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada), Dong-han Yao (Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA), Amy R. Zipursky (Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada), Christina Dinh (Atropos Health, New York NY, USA), Philip Ballentine (Atropos Health, New York NY, USA), Dan C. Derieg (Atropos Health, New York NY, USA), Vladimir Polony (Atropos Health, New York NY, USA), Rehan N. Chawdry (Atropos Health, New York NY, USA), Jordan Davies (Atropos Health, New York NY, USA), Brigham B. Hyde (Atropos Health, New York NY, USA), Nigam H. Shah (Atropos Health, New York NY, USA, Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA), Saurabh Gombar (Atropos Health, New York NY, USA, Department of Pathology, Stanford University, Stanford CA, USA)

Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions compared to other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence working synergistically would improve availability of pertinent evidence for patient care.

7/2/2024