Can Public LLMs be used for Self-Diagnosis of Medical Conditions?

2405.11407

Published 6/27/2024 by Nikil Sharan Prabahar Balasubramanian, Sagnik Dakshit

Abstract

Advancements in deep learning have generated a large-scale interest in the development of foundational deep learning models. The development of Large Language Models (LLM) has evolved as a transformative paradigm in conversational tasks, which has led to its integration and extension even in the critical domain of healthcare. With LLMs becoming widely popular and their public access through open-source models and integration with other applications, there is a need to investigate their potential and limitations. One such crucial task where LLMs are applied but require a deeper understanding is that of self-diagnosis of medical conditions based on bias-validating symptoms in the interest of public health. The widespread integration of Gemini with Google search and GPT-4.0 with Bing search has led to a shift in the trend of self-diagnosis using search engines to conversational LLM models. Owing to the critical nature of the task, it is prudent to investigate and understand the potential and limitations of public LLMs in the task of self-diagnosis. In this study, we prepare a prompt-engineered dataset of 10,000 samples and test the performance on the general task of self-diagnosis. We compared the performance of both the state-of-the-art GPT-4.0 and the free Gemini model on the task of self-diagnosis and recorded contrasting accuracies of 63.07% and 6.01%, respectively. We also discuss the challenges, limitations, and potential of both Gemini and GPT-4.0 for the task of self-diagnosis to facilitate future research and towards the broader impact of general public knowledge. Furthermore, we demonstrate the potential and improvement in performance for the task of self-diagnosis using Retrieval Augmented Generation.

Overview

The paper explores the potential and limitations of using Large Language Models (LLMs) for the task of self-diagnosis of medical conditions.
The researchers prepare a dataset of 10,000 samples and compare the performance of GPT-4.0 and Gemini models on the self-diagnosis task.
The paper also discusses the challenges, limitations, and potential of using LLMs for self-diagnosis, and demonstrates the potential for improvement using Retrieval Augmented Generation.

Plain English Explanation

The paper examines the use of advanced language models, known as Large Language Models (LLMs), in the field of healthcare, particularly for the task of self-diagnosis of medical conditions. As LLMs become more widely available and integrated into popular search engines like Google and Bing, there is a growing trend of people using these models for self-diagnosis, which raises concerns about the accuracy and reliability of such diagnoses.

To investigate this, the researchers created a dataset of 10,000 samples and tested the performance of two LLMs, GPT-4.0 and Gemini, on the task of self-diagnosis. The results showed that the GPT-4.0 model achieved an accuracy of 63.07%, while the Gemini model had an accuracy of only 6.01%. This suggests that the GPT-4.0 model may be more suitable for self-diagnosis tasks, but there are still significant limitations and challenges that need to be addressed.

The paper also discusses the potential and limitations of using LLMs for self-diagnosis, highlighting the need for further research and development to ensure the reliability and safety of these systems, especially in the context of public health.

Furthermore, the researchers demonstrate the potential for improving the performance of self-diagnosis using a technique called Retrieval Augmented Generation, which combines the power of LLMs with additional information retrieval to enhance the accuracy of the diagnoses.

Technical Explanation

The paper investigates the use of Large Language Models (LLMs) for the task of self-diagnosis of medical conditions, which has become increasingly prevalent due to the widespread integration of these models in popular search engines like Google and Bing.

The researchers prepared a prompt-engineered dataset of 10,000 samples to test the performance of two LLMs, GPT-4.0 and Gemini, on the self-diagnosis task. The GPT-4.0 model achieved an accuracy of 63.07%, while the Gemini model had an accuracy of only 6.01%.
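As a rough illustration of how an accuracy figure like these could be computed, the Python sketch below scores model responses against expected diagnoses. The substring-matching rule, the `ask_model` callable, and the toy samples are all assumptions made for illustration; this summary does not describe the paper's actual scoring procedure.

```python
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def is_correct(response: str, expected: str) -> bool:
    """Assumed scoring rule: the expected diagnosis appears in the response."""
    return normalize(expected) in normalize(response)


def accuracy(samples: Iterable[Tuple[str, str]],
             ask_model: Callable[[str], str]) -> float:
    """Return the percentage of (prompt, expected) pairs answered correctly."""
    samples = list(samples)
    if not samples:
        raise ValueError("at least one sample is required")
    correct = sum(is_correct(ask_model(prompt), expected)
                  for prompt, expected in samples)
    return 100.0 * correct / len(samples)


# Hypothetical toy data standing in for the 10,000-sample dataset.
TOY_SAMPLES = [
    ("I have a fever, cough and body aches. What do I have?", "influenza"),
    ("I have a throbbing headache and nausea. What do I have?", "migraine"),
    ("I keep sneezing and my eyes itch. What do I have?", "allergic rhinitis"),
    ("I have burning pain when I urinate. What do I have?", "urinary tract infection"),
]

# Hypothetical canned responses standing in for a real LLM call.
CANNED = {
    TOY_SAMPLES[0][0]: "This sounds like Influenza.",
    TOY_SAMPLES[1][0]: "You may be experiencing a migraine.",
    TOY_SAMPLES[2][0]: "Possibly allergic  rhinitis (hay fever).",
    TOY_SAMPLES[3][0]: "I cannot provide medical advice.",
}


def toy_model(prompt: str) -> str:
    """Look up the canned response for a prompt."""
    return CANNED[prompt]


if __name__ == "__main__":
    print(f"accuracy: {accuracy(TOY_SAMPLES, toy_model):.2f}%")
```

If the reported accuracies were computed over all 10,000 samples, 63.07% would correspond to 6,307 correct diagnoses and 6.01% to 601. A refusal such as the last canned response counts as incorrect under this assumed rule.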

The paper discusses the challenges and limitations of using LLMs for self-diagnosis, including the potential for misdiagnosis, the need for reliable medical knowledge, and the potential for these systems to be misused or abused. The researchers also highlight the importance of considering the broader implications of LLMs in the healthcare domain, particularly in terms of public health.

To address these limitations, the researchers demonstrate the potential for improving the performance of self-diagnosis using Retrieval Augmented Generation, a technique that combines the power of LLMs with additional information retrieval to enhance the accuracy of the diagnoses.
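To make the idea concrete, here is a minimal Python sketch of the retrieval step in Retrieval Augmented Generation: relevant passages are selected for a symptom description and prepended to the prompt before it is sent to a model. The tiny knowledge base, the word-overlap scoring, and the prompt wording are hypothetical choices for illustration, not the retrieval setup used in the paper.

```python
import re
from typing import List, Set

# Hypothetical reference passages; a real system would index a vetted medical corpus.
KNOWLEDGE_BASE = [
    "Influenza: fever, cough, sore throat, muscle aches, fatigue.",
    "Migraine: throbbing headache, nausea, sensitivity to light.",
    "Allergic rhinitis: sneezing, runny nose, itchy eyes.",
]


def tokenize(text: str) -> Set[str]:
    """Split text into a set of lowercase alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return up to k documents sharing at least one word with the query,
    ranked by the number of shared words (an assumed, very simple scorer)."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:k] if score > 0]


def build_prompt(symptoms: str, documents: List[str], k: int = 2) -> str:
    """Assemble an augmented prompt: retrieved passages, then the question."""
    passages = retrieve(symptoms, documents, k)
    context = "\n".join(f"- {p}" for p in passages) or "- (no relevant passages found)"
    return (
        f"Reference passages:\n{context}\n\n"
        f"Patient-reported symptoms: {symptoms}\n"
        "Most likely condition:"
    )


if __name__ == "__main__":
    print(build_prompt("I have a fever, cough and muscle aches", KNOWLEDGE_BASE))
```

The resulting string would then be passed to the language model in place of the bare symptom description. Production systems typically replace the word-overlap scorer with embedding-based similarity search.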

Critical Analysis

The paper provides valuable insights into the potential and limitations of using Large Language Models (LLMs) for the task of self-diagnosis of medical conditions. The researchers' approach of creating a prompt-engineered dataset and testing the performance of two LLMs is a solid experimental design.

However, the paper does raise some concerns and areas for further research. The Gemini model's accuracy of 6.01%, roughly a tenth of GPT-4.0's 63.07%, suggests that not all LLMs are equally suitable for self-diagnosis tasks, and more research is needed to understand the specific requirements and limitations of these models in the healthcare domain.

Additionally, the paper acknowledges the potential for misdiagnosis and the need for reliable medical knowledge, which raises questions about the safety and trustworthiness of using LLMs for self-diagnosis, especially in the context of public health. The researchers' suggestion to explore Retrieval Augmented Generation is a promising approach, but further research is needed to fully understand its potential and limitations.

It would also be valuable for the researchers to explore the potential biases and ethical implications of using LLMs for self-diagnosis, as these models may perpetuate or amplify existing biases in the healthcare system, which could have serious consequences for patients.

Conclusion

The paper provides a valuable contribution to the ongoing research on the use of Large Language Models (LLMs) in the healthcare domain, particularly for the task of self-diagnosis of medical conditions. The researchers' findings suggest that while LLMs like GPT-4.0 have potential for self-diagnosis, there are significant limitations and challenges that need to be addressed.

The paper highlights the need for further research and development to ensure the reliability and safety of these systems, especially in the context of public health. The researchers' exploration of Retrieval Augmented Generation as a potential solution is a promising direction that could lead to improved performance and greater trust in LLM-based self-diagnosis systems.

Overall, this paper contributes to the growing body of research on the application of large language models in medicine and highlights the need for careful consideration of the ethical and societal implications of these technologies in the healthcare domain.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses

Gaurav Kumar Gupta, Aditi Singh, Sijo Valayakkad Manikandan, Abul Ehtesham

The recent swift development of LLMs like GPT-4, Gemini, and GPT-3.5 offers a transformative opportunity in medicine and healthcare, especially in digital diagnostics. This study evaluates each model's diagnostic abilities by interpreting a user's symptoms and determining diagnoses that fit well with common illnesses, and it demonstrates how each of these models could significantly increase diagnostic accuracy and efficiency. Through a series of diagnostic prompts based on symptoms from medical databases, GPT-4 demonstrates higher diagnostic accuracy from its deep and complete history of training on medical data. Meanwhile, Gemini performs with high precision as a critical tool in disease triage, demonstrating its potential to be a reliable model when physicians are trying to make high-risk diagnoses. GPT-3.5, though slightly less advanced, is a good tool for medical diagnostics. This study highlights the need to study LLMs for healthcare and clinical practices with more care and attention, ensuring that any system utilizing LLMs promotes patient privacy and complies with health information privacy laws such as HIPAA, and accounts for the social consequences that affect the varied individuals in complex healthcare contexts. This study marks the start of a larger future effort to study the various ways in which assigning ethical concerns to LLMs' task of learning from human biases could unearth new ways to apply AI in complex medical settings.

5/14/2024

cs.CL cs.AI

A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations

Jinqiang Wang, Huansheng Ning, Yi Peng, Qikai Wei, Daniel Tesfai, Wenwei Mao, Tao Zhu, Runhe Huang

Large Language Models (LLMs) have demonstrated surprising performance across various natural language processing tasks. Recently, medical LLMs enhanced with domain-specific knowledge have exhibited excellent capabilities in medical consultation and diagnosis. These models can smoothly simulate doctor-patient dialogues and provide professional medical advice. Most medical LLMs are developed through continued training of open-source general LLMs, which requires significantly fewer computational resources than training LLMs from scratch. Additionally, this approach offers better protection of patient privacy compared to API-based solutions. This survey systematically explores how to train medical LLMs based on general LLMs. It covers: (a) how to acquire training corpus and construct customized medical training sets, (b) how to choose an appropriate training paradigm, (c) how to choose a suitable evaluation benchmark, and (d) existing challenges and promising future research directions. This survey can provide guidance for the development of LLMs focused on various medical applications, such as medical education, diagnostic planning, and clinical assistants.

6/18/2024

cs.CL cs.AI

D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models

Duygu Altinok

Large language models (LLMs) have garnered significant attention and widespread usage due to their impressive performance in various tasks. However, they are not without their own set of challenges, including issues such as hallucinations, factual inconsistencies, and limitations in numerical-quantitative reasoning. Evaluating LLMs in miscellaneous reasoning tasks remains an active area of research. Prior to the breakthrough of LLMs, Transformers had already proven successful in the medical domain, effectively employed for various natural language understanding (NLU) tasks. Following this trend, LLMs have also been trained and utilized in the medical domain, raising concerns regarding factual accuracy, adherence to safety protocols, and inherent limitations. In this paper, we focus on evaluating the natural language inference capabilities of popular open-source and closed-source LLMs using clinical trial reports as the dataset. We present the performance results of each LLM and further analyze their performance on a development set, particularly focusing on challenging instances that involve medical abbreviations and require numerical-quantitative reasoning. Gemini, our leading LLM, achieved a test set F1-score of 0.748, securing the ninth position on the task scoreboard. Our work is the first of its kind, offering a thorough examination of the inference capabilities of LLMs within the medical domain.

5/8/2024

cs.CL

An Early Investigation into the Utility of Multimodal Large Language Models in Medical Imaging

Sulaiman Khan, Md. Rafiul Biswas, Alina Murad, Hazrat Ali, Zubair Shah

Recent developments in multimodal large language models (MLLMs) have spurred significant interest in their potential applications across various medical imaging domains. On the one hand, there is a temptation to use these generative models to synthesize realistic-looking medical image data, while on the other hand, the ability to identify synthetic image data in a pool of data is also significantly important. In this study, we explore the potential of the Gemini (gemini-1.0-pro-vision-latest) and GPT-4V (gpt-4-vision-preview) models for medical image analysis using two modalities of medical image data. Utilizing synthetic and real imaging data, both Gemini AI and GPT-4V are first used to classify real versus synthetic images, followed by an interpretation and analysis of the input images. Experimental results demonstrate that both Gemini and GPT-4 could perform some interpretation of the input images. In this specific experiment, Gemini was able to perform slightly better than the GPT-4V on the classification task. In contrast, responses associated with GPT-4V were mostly generic in nature. Our early investigation presented in this work provides insights into the potential of MLLMs to assist with the classification and interpretation of retinal fundoscopy and lung X-ray images. We also identify key limitations associated with the early investigation study on MLLMs for specialized tasks in medical image analysis.

6/4/2024

eess.IV cs.AI cs.CV cs.LG