A Literature Review and Framework for Human Evaluation of Generative Large Language Models in Healthcare

Read original: arXiv:2405.02559 - Published 9/25/2024 by Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V Stolyar, Katelyn Polanska, Karleigh R McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, Yanshan Wang

Overview

This paper reviews the existing literature on human evaluation methodologies for large language models (LLMs) in healthcare applications.
The authors highlight the need for a standardized and consistent approach to human evaluation, as the current methods are often cumbersome, time-consuming, and non-standardized.
The review covers various aspects of human evaluation, including evaluation dimensions, sample types and sizes, evaluator selection and recruitment, frameworks and metrics, the evaluation process, and statistical analysis.
The authors propose a comprehensive framework called QUEST (Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence) to improve the reliability, generalizability, and applicability of human evaluation of generative LLMs in healthcare.

Plain English Explanation

As generative AI and LLMs become more widely used in healthcare, it's crucial to have human experts evaluate the generated texts in addition to automated evaluations. This helps ensure the safety, reliability, and effectiveness of these AI systems. However, the current human evaluation methods can be time-consuming and inconsistent, which makes it difficult to widely adopt LLMs in healthcare.

This study reviews the existing research on how humans evaluate LLMs in healthcare. The authors found that there is a need for a standardized and consistent approach to human evaluation. They looked at studies published from January 2018 to February 2024 and examined how these evaluations were conducted, including factors like the medical specialties covered, the dimensions used to evaluate the LLMs, the selection of evaluators, and the statistical analysis of the results.

Based on the diverse evaluation strategies used in these studies, the authors propose a new framework called QUEST. This framework covers five key areas: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence. The goal is to make human evaluation of generative LLMs in healthcare more reliable, consistent, and applicable across different medical applications.

Technical Explanation

The paper reviews the existing literature on human evaluation methodologies for large language models (LLMs) in healthcare applications. The authors conducted an extensive literature search, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, to identify relevant studies published from January 2018 to February 2024.

The review examines the human evaluation of LLMs across various medical specialties, addressing factors such as:

Evaluation dimensions: The specific aspects of the LLMs that are evaluated, such as the quality of information, understanding and reasoning, expression style and persona, safety and harm, and trust and confidence.
Sample types and sizes: The types of medical texts or tasks used for evaluation and the number of samples evaluated.
Evaluator selection and recruitment: How the human evaluators were chosen and their expertise or background.
Frameworks and metrics: The evaluation frameworks and metrics used to assess the LLM performance.
Evaluation process: The specific steps and procedures involved in the human evaluation.
Statistical analysis: The methods used to analyze the results of the human evaluation.
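As one illustration of the statistical analysis factor, agreement between two evaluators who label the same LLM responses could be measured with Cohen's kappa. This is a minimal sketch: the summary does not say which statistics the reviewed studies used, so the choice of Cohen's kappa, the function name `cohens_kappa`, and the example labels are all assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    Assumption: each rater assigns exactly one categorical label per item.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: derived from each rater's label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two clinicians judging four LLM answers.
clinician_1 = ["safe", "safe", "unsafe", "unsafe"]
clinician_2 = ["safe", "unsafe", "unsafe", "unsafe"]
kappa = cohens_kappa(clinician_1, clinician_2)
```

A kappa of 1 means perfect agreement and a value near 0 means agreement no better than chance. Other statistics, such as weighted kappa for ordinal scales, may suit a given study better.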

Drawing from the diverse evaluation strategies highlighted in these studies, the authors propose the QUEST framework as a comprehensive and practical approach for human evaluation of generative LLMs in healthcare. This framework aims to improve the reliability, generalizability, and applicability of human evaluation by defining clear evaluation dimensions and offering detailed guidelines.
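To make the five QUEST dimensions concrete, the sketch below records evaluator ratings against them and averages the scores per dimension. Only the dimension names come from the paper. The `Rating` class, the `mean_score_by_dimension` helper, and the 1 to 5 scoring scale are hypothetical choices for illustration, not part of the framework as described here.

```python
from dataclasses import dataclass
from statistics import mean

# Dimension names as listed in the QUEST framework.
QUEST_DIMENSIONS = (
    "Quality of Information",
    "Understanding and Reasoning",
    "Expression Style and Persona",
    "Safety and Harm",
    "Trust and Confidence",
)


@dataclass(frozen=True)
class Rating:
    """One evaluator's score for one LLM response on one dimension."""

    evaluator: str
    response_id: str
    dimension: str
    score: int  # assumed 1-5 scale (hypothetical, not specified by the summary)

    def __post_init__(self):
        if self.dimension not in QUEST_DIMENSIONS:
            raise ValueError(f"Unknown QUEST dimension: {self.dimension}")
        if not 1 <= self.score <= 5:
            raise ValueError("Score must be between 1 and 5")


def mean_score_by_dimension(ratings):
    """Average the scores per dimension; unrated dimensions are omitted."""
    grouped = {}
    for rating in ratings:
        grouped.setdefault(rating.dimension, []).append(rating.score)
    return {dimension: mean(scores) for dimension, scores in grouped.items()}


# Hypothetical ratings from two evaluators for the same response.
ratings = [
    Rating("evaluator_1", "response_1", "Safety and Harm", 4),
    Rating("evaluator_2", "response_1", "Safety and Harm", 2),
    Rating("evaluator_1", "response_1", "Trust and Confidence", 5),
]
summary = mean_score_by_dimension(ratings)
```

Validating the dimension name at construction time keeps ratings from different studies comparable, which is the kind of consistency the framework is meant to encourage.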

Critical Analysis

The paper's strength lies in its comprehensive review of the existing literature on human evaluation methodologies for LLMs in healthcare. The authors have identified a significant need for a standardized and consistent approach, which is a crucial step in ensuring the safe and effective deployment of these AI systems in the healthcare domain.

One potential limitation of the study is the relatively short time frame of the literature review, which covers only publications from January 2018 to February 2024. While this is understandable given the rapid pace of development in this field, the review may have overlooked earlier relevant works that could have provided additional insights.

Additionally, the authors do not delve into the potential biases or limitations inherent in human evaluation, which could be an area for further exploration. The proposed QUEST framework, while comprehensive, may also benefit from empirical validation and refinement based on real-world deployments and feedback from healthcare practitioners.

Moreover, the paper does not address the potential challenges in implementing a standardized human evaluation approach, such as the resources and expertise required, the integration with existing healthcare workflows, and the potential resistance to change from healthcare professionals.

Despite these minor limitations, the paper provides a valuable contribution to the field by highlighting the importance of human evaluation in the context of LLMs in healthcare and proposing a promising framework to address the current challenges.

Conclusion

This paper reviews the existing literature on human evaluation methodologies for large language models (LLMs) in healthcare applications. The authors identify a significant need for a standardized and consistent approach to human evaluation, as the current methods are often cumbersome, time-consuming, and non-standardized.

The proposed QUEST framework aims to improve the reliability, generalizability, and applicability of human evaluation of generative LLMs in healthcare. By defining clear evaluation dimensions and offering detailed guidelines, the QUEST framework has the potential to facilitate the safe and effective deployment of LLMs in various medical specialties.

Overall, this study provides a valuable contribution to the field by highlighting the importance of human evaluation in the context of LLMs in healthcare and proposing a comprehensive framework to address the current challenges in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Literature Review and Framework for Human Evaluation of Generative Large Language Models in Healthcare

Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V Stolyar, Katelyn Polanska, Karleigh R McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, Yanshan Wang

With generative artificial intelligence (AI), particularly large language models (LLMs), continuing to make inroads in healthcare, it is critical to supplement traditional automated evaluations with human evaluations. Understanding and evaluating the output of LLMs is essential to assuring safety, reliability, and effectiveness. However, human evaluation's cumbersome, time-consuming, and non-standardized nature presents significant obstacles to comprehensive evaluation and widespread adoption of LLMs in practice. This study reviews existing literature on human evaluation methodologies for LLMs in healthcare. We highlight a notable need for a standardized and consistent human evaluation approach. Our extensive literature search, adhering to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, includes publications from January 2018 to February 2024. The review examines the human evaluation of LLMs across various medical specialties, addressing factors such as evaluation dimensions, sample types and sizes, selection, and recruitment of evaluators, frameworks and metrics, evaluation process, and statistical analysis type. Drawing on the diverse evaluation strategies employed in these studies, we propose a comprehensive and practical framework for human evaluation of LLMs: QUEST: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence. This framework aims to improve the reliability, generalizability, and applicability of human evaluation of LLMs in different healthcare applications by defining clear evaluation dimensions and offering detailed guidelines.

9/25/2024

Applying and Evaluating Large Language Models in Mental Health Care: A Scoping Review of Human-Assessed Generative Tasks

Yining Hua, Hongbin Na, Zehan Li, Fenglin Liu, Xiao Fang, David Clifton, John Torous

Large language models (LLMs) are emerging as promising tools for mental health care, offering scalable support through their ability to generate human-like responses. However, the effectiveness of these models in clinical settings remains unclear. This scoping review aimed to assess the current generative applications of LLMs in mental health care, focusing on studies where these models were tested with human participants in real-world scenarios. A systematic search across APA PsycNet, Scopus, PubMed, and Web of Science identified 726 unique articles, of which 17 met the inclusion criteria. These studies encompassed applications such as clinical assistance, counseling, therapy, and emotional support. However, the evaluation methods were often non-standardized, with most studies relying on ad hoc scales that limit comparability and robustness. Privacy, safety, and fairness were also frequently underexplored. Moreover, reliance on proprietary models, such as OpenAI's GPT series, raises concerns about transparency and reproducibility. While LLMs show potential in expanding mental health care access, especially in underserved areas, the current evidence does not fully support their use as standalone interventions. More rigorous, standardized evaluations and ethical oversight are needed to ensure these tools can be safely and effectively integrated into clinical practice.

8/22/2024

ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models

Aparna Elangovan, Ling Liu, Lei Xu, Sravan Bodapati, Dan Roth

In this position paper, we argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. The conclusions from these evaluations, thus, must consider factors such as usability, aesthetics, and cognitive biases. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. Furthermore, the evaluation should differentiate the capabilities and weaknesses of increasingly powerful large language models -- which requires effective test sets. The scalability of human evaluation is also crucial to wider adoption. Hence, to design an effective human evaluation system in the age of generative NLP, we propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.

9/4/2024

A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics

Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, Erik Cambria

The utilization of large language models (LLMs) in the Healthcare domain has generated both excitement and concern due to their ability to effectively respond to free-text queries with certain professional knowledge. This survey outlines the capabilities of the currently developed LLMs for Healthcare and explicates their development process, with the aim of providing an overview of the development roadmap from traditional Pretrained Language Models (PLMs) to LLMs. Specifically, we first explore the potential of LLMs to enhance the efficiency and effectiveness of various Healthcare applications, highlighting both the strengths and limitations. Secondly, we conduct a comparison between the previous PLMs and the latest LLMs, as well as comparing various LLMs with each other. Then we summarize related Healthcare training data, training methods, optimization strategies, and usage. Finally, the unique concerns associated with deploying LLMs in Healthcare settings are investigated, particularly regarding fairness, accountability, transparency and ethics. Our survey provides a comprehensive investigation from the perspectives of both computer science and the Healthcare specialty. Besides the discussion about Healthcare concerns, we support the computer science community by compiling a collection of open source resources, such as accessible datasets, the latest methodologies, code implementations, and evaluation benchmarks on GitHub. In summary, we contend that a significant paradigm shift is underway, transitioning from PLMs to LLMs. This shift encompasses a move from discriminative AI approaches to generative AI approaches, as well as a shift from model-centered methodologies to data-centered methodologies. Also, we determine that the biggest obstacles to using LLMs in Healthcare are fairness, accountability, transparency and ethics.

6/12/2024