Evaluating Large Language Models for Public Health Classification and Extraction Tasks

Read original: arXiv:2405.14766 - Published 5/24/2024 by Joshua Harris, Timothy Laurence, Leo Loman, Fan Grayson, Toby Nonnenmacher, Harry Long, Loes WalsGriffith, Amy Douglas, Holly Fountain, Stelios Georgiou and 10 others

Overview

This paper explores the potential of large language models (LLMs) to support public health tasks, such as classifying and extracting information from free text.
The researchers evaluated the performance of several open-weight LLMs (7-70 billion parameters) on a range of public health-related tasks, including assessing health burden, epidemiological risk factors, and public health interventions.
The study found that Llama-3-70B-Instruct performed best overall, achieving the highest micro-F1 score on 15 of the 17 tasks.
The researchers also evaluated GPT-4 on a subset of 12 tasks and found its results comparable to those of Llama-3-70B-Instruct, which matched or outperformed GPT-4 on 6 of the 12.

Plain English Explanation

Large language models (LLMs) are powerful artificial intelligence systems that can be used to process and understand natural language. Researchers have been exploring how these models can be applied to support public health efforts, such as analyzing text-based information related to health issues, disease outbreaks, and public health interventions.

In this study, the researchers evaluated the performance of several LLMs on a variety of public health-related tasks. They used a combination of existing datasets and newly annotated datasets to test the models' ability to classify and extract information from free-form text. For example, they asked the models to identify whether a piece of text was discussing the health burden of a particular disease, the risk factors associated with an illness, or the details of a public health intervention.

The researchers found that the Llama-3-70B-Instruct model performed the best overall, outperforming the other models on the majority of the tasks. However, they also noted that the models struggled with some more challenging tasks, such as the Contact Classification task.

When the researchers compared the open-weight models to GPT-4 on a subset of the tasks, they found that the best open-weight model, Llama-3-70B-Instruct, achieved comparable results. This suggests that these large language models could be useful tools for public health experts who need to extract information from a wide variety of text-based sources, such as news reports, research papers, and social media posts.

Overall, this research provides promising evidence that LLMs may be able to support public health surveillance, research, and interventions by automating the analysis of large amounts of textual data. However, the researchers also acknowledge that there is still room for improvement, particularly on more complex tasks.

Technical Explanation

The researchers in this study evaluated the performance of several open-weight large language models (LLMs) on a range of public health-related tasks. They combined six externally annotated datasets and seven new internally annotated datasets to create a comprehensive evaluation framework covering three main areas: health burden, epidemiological risk factors, and public health interventions.

The researchers initially tested five open-weight LLMs, ranging from 7 to 70 billion parameters, using a zero-shot in-context learning approach, in which the model receives a task description but no worked examples. They found that the Llama-3-70B-Instruct model outperformed the other LLMs, achieving the best results on 15 of the 17 tasks based on micro-F1 scores.
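To make the zero-shot setup concrete, here is a minimal sketch of how a classification prompt might be assembled and how a model's reply might be mapped back to a label. The prompt wording, the function names `build_zero_shot_prompt` and `parse_label`, and the example labels are all assumptions for illustration; this summary does not reproduce the prompts used in the paper.

```python
def build_zero_shot_prompt(task_description, labels, text):
    """Assemble a zero-shot prompt: instructions and the input text only,
    with no labelled examples (hypothetical template, not the paper's)."""
    label_list = ", ".join(labels)
    return (
        f"{task_description}\n"
        f"Answer with exactly one of: {label_list}.\n\n"
        f"Text: {text}\n"
        "Answer:"
    )


def parse_label(response, labels):
    """Map a raw model reply to one of the allowed labels.

    Returns None when the reply is not exactly one allowed label
    (ignoring case, surrounding whitespace, and trailing punctuation).
    """
    cleaned = response.strip().strip(".!\"'").lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    return None


if __name__ == "__main__":
    labels = ["yes", "no"]  # hypothetical label set
    prompt = build_zero_shot_prompt(
        "Does the text describe a gastrointestinal illness?",
        labels,
        "Patient reports vomiting and diarrhoea after a restaurant meal.",
    )
    print(prompt)
    print(parse_label(" Yes.", labels))
```

Returning None for replies that do not match a label keeps unparseable outputs visible, so they can be counted as errors rather than silently assigned to a class.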

The performance of the LLMs varied significantly across the different tasks. While all the models scored above 80% micro-F1 on some tasks, such as classifying gastrointestinal illness, they all scored below 60% micro-F1 on more challenging tasks, such as contact classification.
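For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives across all classes before computing a single F1 score. The sketch below assumes single-label classification (one gold label and one predicted label per text); the function name `micro_f1` and the toy labels are illustrative, and the paper's exact scoring code, particularly for extraction tasks, may differ.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 for single-label classification.

    Counts are pooled over every class, then
    F1 = 2*TP / (2*TP + FP + FN).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = fp = fn = 0
    for label in set(gold) | set(predicted):
        for g, p in zip(gold, predicted):
            if p == label and g == label:
                tp += 1          # predicted this class, and it was correct
            elif p == label:
                fp += 1          # predicted this class, but gold differs
            elif g == label:
                fn += 1          # gold is this class, but it was missed

    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


if __name__ == "__main__":
    gold = ["gi", "gi", "other", "other"]        # toy labels, not paper data
    predicted = ["gi", "other", "other", "other"]
    print(micro_f1(gold, predicted))
```

In the single-label case every wrong prediction counts once as a false positive and once as a false negative, so micro-F1 coincides with plain accuracy.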

For a subset of 12 tasks, the researchers also evaluated GPT-4. They found its results comparable to those of Llama-3-70B-Instruct, which performed as well as or better than GPT-4 on 6 of the 12 tasks.
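A head-to-head tally of this kind can be sketched as follows. The function name `count_equal_or_better`, the task names, and the scores are made up for illustration and are not results from the paper; the sketch only shows the counting logic over the tasks that both models were evaluated on.

```python
def count_equal_or_better(scores_a, scores_b):
    """Count shared tasks where model A scores at least as high as model B.

    Both arguments map task name -> score (for example micro-F1).
    Returns (number of tasks where A >= B, number of shared tasks).
    """
    shared_tasks = scores_a.keys() & scores_b.keys()
    wins_or_ties = sum(1 for task in shared_tasks
                       if scores_a[task] >= scores_b[task])
    return wins_or_ties, len(shared_tasks)


if __name__ == "__main__":
    # Hypothetical scores, NOT taken from the paper.
    model_a = {"task_1": 0.80, "task_2": 0.50, "task_3": 0.70}
    model_b = {"task_1": 0.80, "task_2": 0.60, "task_4": 0.90}
    print(count_equal_or_better(model_a, model_b))
```

Restricting the comparison to shared tasks mirrors the study's design, where GPT-4 was run on only 12 of the 17 tasks.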

Overall, the results of this study suggest that LLMs may be valuable tools for public health experts to extract information from a wide variety of free-text sources, supporting efforts in areas such as public health surveillance, research, and interventions. However, the researchers also acknowledge that there is still room for improvement, particularly on more complex tasks.

Critical Analysis

The researchers in this study provide a comprehensive evaluation of LLMs for public health-related tasks, which is an important contribution to the growing body of research on the use of large language models in healthcare and medicine.

One strength of the study is the diverse range of tasks and datasets used to assess the LLMs' performance. By combining existing and newly annotated datasets, the researchers were able to create a robust evaluation framework that covered a variety of public health-related topics and challenges.

However, the study also highlights the limitations of current LLMs, particularly on more complex tasks. The researchers noted that all the models struggled with certain tasks, such as contact classification, which may require more specialized domain knowledge or reasoning capabilities.

Additionally, the researchers did not provide detailed explanations for the performance differences between the LLMs and GPT-4. It would be interesting to know more about the specific strengths and weaknesses of each model, as well as the potential reasons behind their relative performance on the different tasks.

Another area for further research could be exploring ways to fine-tune or adapt the LLMs to improve their performance on public health-related tasks. The survey of large language models in medicine highlights the potential benefits of specialized training or fine-tuning for specific medical applications.

Overall, this study provides a valuable contribution to the ongoing research on the use of large language models in healthcare and public health. While the results are promising, they also underscore the need for continued development and refinement of these models to fully realize their potential in supporting critical public health efforts.

Conclusion

This study presents a comprehensive evaluation of the performance of large language models (LLMs) on a range of public health-related tasks, including classifying and extracting information from free-form text. The researchers found that Llama-3-70B-Instruct outperformed the other open-weight LLMs on 15 of the 17 tasks and matched or outperformed GPT-4 on 6 of the 12 tasks on which both were evaluated.

The results suggest that LLMs may be valuable tools for public health experts to help them process and analyze large amounts of text-based information, supporting efforts in areas such as disease surveillance, risk factor identification, and intervention planning. However, the researchers also acknowledge the limitations of current LLMs, particularly on more complex tasks, and emphasize the need for continued development and refinement of these models to fully realize their potential in the public health domain.

Overall, this study provides an important contribution to the ongoing research on the use of large language models in healthcare and medicine, and it highlights the promising opportunities as well as the challenges that lie ahead in leveraging these powerful AI systems to support critical public health initiatives.

Related Papers

Evaluating Large Language Models for Public Health Classification and Extraction Tasks

Joshua Harris, Timothy Laurence, Leo Loman, Fan Grayson, Toby Nonnenmacher, Harry Long, Loes WalsGriffith, Amy Douglas, Holly Fountain, Stelios Georgiou, Jo Hardstaff, Kathryn Hopkins, Y-Ling Chi, Galena Kuyumdzhieva, Lesley Larkin, Samuel Collins, Hamish Mohammed, Thomas Finnie, Luke Hounsome, Steven Riley

Advances in Large Language Models (LLMs) have led to significant interest in their potential to support human experts across a range of domains, including public health. In this work we present automated evaluations of LLMs for public health tasks involving the classification and extraction of free text. We combine six externally annotated datasets with seven new internally annotated datasets to evaluate LLMs for processing text related to: health burden, epidemiological risk factors, and public health interventions. We initially evaluate five open-weight LLMs (7-70 billion parameters) across all tasks using zero-shot in-context learning. We find that Llama-3-70B-Instruct is the highest performing model, achieving the best results on 15/17 tasks (using micro-F1 scores). We see significant variation across tasks with all open-weight LLMs scoring below 60% micro-F1 on some challenging tasks, such as Contact Classification, while all LLMs achieve greater than 80% micro-F1 on others, such as GI Illness Classification. For a subset of 12 tasks, we also evaluate GPT-4 and find comparable results to Llama-3-70B-Instruct, which scores equally or outperforms GPT-4 on 6 of the 12 tasks. Overall, based on these initial results we find promising signs that LLMs may be useful tools for public health experts to extract information from a wide variety of free text sources, and support public health surveillance, research, and interventions.

5/24/2024

A Comprehensive Evaluation of Large Language Models on Mental Illnesses

Abdelrahman Hanafi, Mohammed Saad, Noureldin Zahran, Radwa J. Hanafy, Mohammed E. Fouda

Large language models have shown promise in various domains, including healthcare. In this study, we conduct a comprehensive evaluation of LLMs in the context of mental health tasks using social media data. We explore the zero-shot (ZS) and few-shot (FS) capabilities of various LLMs, including GPT-4, Llama 3, Gemini, and others, on tasks such as binary disorder detection, disorder severity evaluation, and psychiatric knowledge assessment. Our evaluation involved 33 models testing 9 main prompt templates across the tasks. Key findings revealed that models like GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets. Moreover, prompt engineering played a crucial role in enhancing model performance. Notably, the Mixtral 8x22b model showed an improvement of over 20%, while Gemma 7b experienced a similar boost in performance. In the task of disorder severity evaluation, we observed that FS learning significantly improved the model's accuracy, highlighting the importance of contextual examples in complex assessments. Notably, the Phi-3-mini model exhibited a substantial increase in performance, with balanced accuracy improving by over 6.80% and mean average error dropping by nearly 1.3 when moving from ZS to FS learning. In the psychiatric knowledge task, recent models generally outperformed older, larger counterparts, with the Llama 3.1 405b achieving an accuracy of 91.2%. Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering. Furthermore, the ethical guards imposed by many LLM providers hamper the ability to accurately evaluate their performance, due to tendency to not respond to potentially sensitive queries.

9/25/2024

Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data

Yubin Kim, Xuhai Xu, Daniel McDuff, Cynthia Breazeal, Hae Won Park

Large language models (LLMs) are capable of many natural language tasks, yet they are far from perfect. In health applications, grounding and interpreting domain-specific and non-linguistic data is crucial. This paper investigates the capacity of LLMs to make inferences about health based on contextual information (e.g. user demographics, health knowledge) and physiological data (e.g. resting heart rate, sleep minutes). We present a comprehensive evaluation of 12 state-of-the-art LLMs with prompting and fine-tuning techniques on four public health datasets (PMData, LifeSnaps, GLOBEM and AW_FB). Our experiments cover 10 consumer health prediction tasks in mental health, activity, metabolic, and sleep assessment. Our fine-tuned model, HealthAlpaca exhibits comparable performance to much larger models (GPT-3.5, GPT-4 and Gemini-Pro), achieving the best performance in 8 out of 10 tasks. Ablation studies highlight the effectiveness of context enhancement strategies. Notably, we observe that our context enhancement can yield up to 23.8% improvement in performance. While constructing contextually rich prompts (combining user context, health knowledge and temporal information) exhibits synergistic improvement, the inclusion of health knowledge context in prompts significantly enhances overall performance.

4/30/2024

Large language models in healthcare and medical domain: A review

Zabir Al Nazi, Wei Peng

The deployment of large language models (LLMs) within the healthcare sector has sparked both enthusiasm and apprehension. These models exhibit the remarkable capability to provide proficient responses to free-text queries, demonstrating a nuanced understanding of professional medical knowledge. This comprehensive survey delves into the functionalities of existing LLMs designed for healthcare applications, elucidating the trajectory of their development, starting from traditional Pretrained Language Models (PLMs) to the present state of LLMs in healthcare sector. First, we explore the potential of LLMs to amplify the efficiency and effectiveness of diverse healthcare applications, particularly focusing on clinical language understanding tasks. These tasks encompass a wide spectrum, ranging from named entity recognition and relation extraction to natural language inference, multi-modal medical applications, document classification, and question-answering. Additionally, we conduct an extensive comparison of the most recent state-of-the-art LLMs in the healthcare domain, while also assessing the utilization of various open-source LLMs and highlighting their significance in healthcare applications. Furthermore, we present the essential performance metrics employed to evaluate LLMs in the biomedical domain, shedding light on their effectiveness and limitations. Finally, we summarize the prominent challenges and constraints faced by large language models in the healthcare sector, offering a holistic perspective on their potential benefits and shortcomings. This review provides a comprehensive exploration of the current landscape of LLMs in healthcare, addressing their role in transforming medical applications and the areas that warrant further research and development.

7/9/2024