Human-AI collectives produce the most accurate differential diagnoses

Read original: arXiv:2406.14981 - Published 6/24/2024 by N. Zoller, J. Berger, I. Lin, N. Fu, J. Komarneni, G. Barabucci, K. Laskowski, V. Shia, B. Harack, E. A. Chu and 3 others

Overview

This paper investigates the accuracy of differential diagnoses produced by human-AI collectives compared with those of individual doctors or large language models (LLMs) alone.
The researchers gathered medical case data and responses from both doctors and LLMs, then developed methods to harmonize and aggregate the open-ended answers into a unified differential diagnosis.
The results show that human-AI collectives outperformed single doctors, doctor-only collectives, single LLMs, and LLM ensembles in diagnostic accuracy, providing evidence for the benefits of human-AI collaboration in healthcare.

Plain English Explanation

The paper explores a way to combine the knowledge and reasoning abilities of humans and artificial intelligence (AI) systems to make more accurate medical diagnoses. The researchers collected real-world medical case data and recorded the differential diagnoses (lists of possible conditions) provided by both human doctors and large language models (LLMs).

They then developed methods to take these open-ended responses from the humans and the LLMs, identify where they overlap, and integrate them into a unified differential diagnosis. The key insight is that doctors and LLMs have complementary strengths: the combined human-AI collective diagnosed more accurately than either the doctors or the AI systems working alone.

This suggests that human-AI collaboration in healthcare could be a powerful approach, leveraging the distinct capabilities of each to improve medical decision-making. It also suggests that fully autonomous AI agents may not match human-AI teams in sensitive domains like healthcare.

Technical Explanation

The researchers first gathered a dataset of 2,133 real medical cases, including detailed case histories, test results, and final diagnoses. They combined 40,762 open-ended differential diagnoses made by physicians for these cases with differential diagnoses generated by five state-of-the-art LLMs from the same case information.

To harmonize and aggregate these diverse responses, the researchers developed novel multi-agent systems that could identify common themes, consolidate overlapping insights, and produce a unified differential diagnosis. This combined human-AI output was then evaluated against the ground truth diagnoses to assess its accuracy.
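The harmonization and aggregation step described above can be illustrated with a minimal sketch. This is not the authors' pipeline, which this summary does not specify. It assumes a hypothetical synonym table (`synonyms`) for mapping free-text answers to one canonical label, and it assumes reciprocal-rank scoring as the pooling rule; the function names are illustrative only.

```python
from collections import defaultdict


def normalize(term, synonyms):
    """Map a free-text diagnosis to a canonical label.

    `synonyms` is a hypothetical lookup table; a real system would more
    likely map terms to a clinical terminology.
    """
    key = term.strip().lower()
    return synonyms.get(key, key)


def aggregate_differentials(ranked_lists, synonyms=None, top_n=5):
    """Pool ranked differentials from several raters (doctors or LLMs).

    Each rater contributes 1/rank to a diagnosis, where rank is the
    1-based position in that rater's list. This scoring rule is an
    assumption made for illustration.
    """
    synonyms = synonyms or {}
    scores = defaultdict(float)
    for diagnoses in ranked_lists:
        seen = set()  # count each diagnosis once per rater
        for rank, term in enumerate(diagnoses, start=1):
            label = normalize(term, synonyms)
            if label in seen:
                continue
            seen.add(label)
            scores[label] += 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered[:top_n]]


if __name__ == "__main__":
    synonyms = {"heart attack": "myocardial infarction",
                "mi": "myocardial infarction"}
    responses = [
        ["Heart attack", "Pulmonary embolism"],           # doctor 1
        ["Pulmonary embolism", "MI", "Pneumonia"],        # doctor 2
        ["myocardial infarction", "Aortic dissection"],   # LLM
    ]
    print(aggregate_differentials(responses, synonyms, top_n=3))
```

In this toy input, the two doctors and the LLM use three different spellings for the same condition; after normalization their votes pool onto one label, which is what lets a mixed collective outrank any single rater's list.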

The results showed that the human-AI collectives outperformed single doctors and doctor-only collectives, as well as single LLMs and LLM ensembles, in diagnostic accuracy. The authors attribute this to the complementary contributions of humans and LLMs, which make different kinds of errors, so combining them leads to better medical decision-making than either achieves alone.
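One common way to score a ranked differential against a known final diagnosis is top-k accuracy: a case counts as correct if the true diagnosis appears among the first k entries. The sketch below assumes that metric for illustration; this summary does not state which metric the paper used, and the function name and toy data are hypothetical.

```python
def top_k_accuracy(predictions, truths, k=3):
    """Fraction of cases whose true diagnosis is in the first k predictions.

    `predictions` is a list of ranked diagnosis lists, one per case;
    `truths` is the matching list of ground-truth diagnoses.
    Matching is by case-insensitive exact string, a simplification.
    """
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        return 0.0
    hits = 0
    for ranked, truth in zip(predictions, truths):
        top = [d.strip().lower() for d in ranked[:k]]
        if truth.strip().lower() in top:
            hits += 1
    return hits / len(truths)


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    predictions = [
        ["pneumonia", "bronchitis"],
        ["migraine", "tension headache"],
        ["appendicitis"],
    ]
    truths = ["bronchitis", "stroke", "appendicitis"]
    print(top_k_accuracy(predictions, truths, k=1))
    print(top_k_accuracy(predictions, truths, k=2))
```

Running the same function on a single doctor's lists, a single LLM's lists, and the pooled collective's lists would give directly comparable numbers for each k.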

Critical Analysis

The paper provides a rigorous and well-designed study demonstrating the potential benefits of human-AI collaboration in healthcare. However, the authors acknowledge several caveats and limitations:

The medical cases used were relatively straightforward, and the results may not generalize to more complex or ambiguous cases.
The study only looked at the final diagnostic accuracy, not the process or reasoning behind the decisions.
The LLM systems used were state-of-the-art at the time, but AI capabilities are rapidly evolving, and future systems may perform differently.
The abstract reports that the result holds across a range of professional experience, but the size of the benefit may still vary with clinician skill.

Additionally, while the harmonization and aggregation methods were innovative, their inner workings were not fully explained, making it difficult to assess their generalizability or potential biases. Further research is needed to explore the nuances of human-AI interactions and the mechanisms underlying the improved diagnostic accuracy.

Conclusion

This paper provides compelling evidence that human-AI collectives can outperform either doctors or AI systems working alone in the task of differential diagnosis. By combining the unique strengths of human medical expertise and AI reasoning, the researchers demonstrated a path towards enhancing diagnostic accuracy and improving patient outcomes in healthcare.

These findings suggest that the future of medical decision-making may lie in effective human-AI collaboration, rather than relying solely on either human clinicians or autonomous AI agents. As AI capabilities continue to advance, understanding how to best integrate these technologies with human expertise will be a crucial area of research and development.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Human-AI collectives produce the most accurate differential diagnoses

N. Zoller, J. Berger, I. Lin, N. Fu, J. Komarneni, G. Barabucci, K. Laskowski, V. Shia, B. Harack, E. A. Chu, V. Trianni, R. H. J. M. Kurvers, S. M. Herzog

Artificial intelligence systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate, lack common sense, and are biased - shortcomings that may reflect LLMs' inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We apply our method to open-ended medical diagnostics, combining 40,762 differential diagnoses made by physicians with the diagnoses of five state-of-the-art LLMs across 2,133 medical cases. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and professional experience, and can be attributed to humans' and LLMs' complementary contributions that lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains like medical diagnostics.

6/24/2024

Towards Human-AI Collaboration in Healthcare: Guided Deferral Systems with Large Language Models

Joshua Strong, Qianhui Men, Alison Noble

Large language models (LLMs) present a valuable technology for various applications in healthcare, but their tendency to hallucinate introduces unacceptable uncertainty in critical decision-making situations. Human-AI collaboration (HAIC) can mitigate this uncertainty by combining human and AI strengths for better outcomes. This paper presents a novel guided deferral system that provides intelligent guidance when AI defers cases to human decision-makers. We leverage LLMs' verbalisation capabilities and internal states to create this system, demonstrating that fine-tuning small-scale LLMs with data from large-scale LLMs greatly enhances performance while maintaining computational efficiency and data privacy. A pilot study showcases the effectiveness of our proposed deferral system.

7/4/2024

Autonomous Artificial Intelligence Agents for Clinical Decision Making in Oncology

Dyke Ferber, Omar S. M. El Nahhas, Georg Wolflein, Isabella C. Wiest, Jan Clusmann, Marie-Elisabeth Leßman, Sebastian Foersch, Jacqueline Lammert, Maximilian Tschochohei, Dirk Jager, Manuel Salto-Tellez, Nikolaus Schultz, Daniel Truhn, Jakob Nikolas Kather

Multimodal artificial intelligence (AI) systems have the potential to enhance clinical decision-making by interpreting various types of medical data. However, the effectiveness of these models across all medical fields is uncertain. Each discipline presents unique challenges that need to be addressed for optimal performance. This complexity is further increased when attempting to integrate different fields into a single model. Here, we introduce an alternative approach to multimodal medical AI that utilizes the generalist capabilities of a large language model (LLM) as a central reasoning engine. This engine autonomously coordinates and deploys a set of specialized medical AI tools. These tools include text, radiology and histopathology image interpretation, genomic data processing, web searches, and document retrieval from medical guidelines. We validate our system across a series of clinical oncology scenarios that closely resemble typical patient care workflows. We show that the system has a high capability in employing appropriate tools (97%), drawing correct conclusions (93.6%), and providing complete (94%) and helpful (89.2%) recommendations for individual patient cases while consistently referencing relevant literature (82.5%) upon instruction. This work provides evidence that LLMs can effectively plan and execute domain-specific models to retrieve or synthesize new information when used as autonomous agents. This enables them to function as specialist, patient-tailored clinical assistants. It also simplifies regulatory compliance by allowing each component tool to be individually validated and approved. We believe that our work can serve as a proof-of-concept for more advanced LLM-agents in the medical domain.

4/9/2024

Confidence-weighted integration of human and machine judgments for superior decision-making

Felipe Yáñez, Xiaoliang Luo, Omar Valerio Minero, Bradley C. Love

Large language models (LLMs) have emerged as powerful tools in various domains. Recent studies have shown that LLMs can surpass humans in certain tasks, such as predicting the outcomes of neuroscience studies. What role does this leave for humans in the overall decision process? One possibility is that humans, despite performing worse than LLMs, can still add value when teamed with them. A human and machine team can surpass each individual teammate when team members' confidence is well-calibrated and team members diverge in which tasks they find difficult (i.e., calibration and diversity are needed). We simplified and extended a Bayesian approach to combining judgments using a logistic regression framework that integrates confidence-weighted judgments for any number of team members. Using this straightforward method, we demonstrated in a neuroscience forecasting task that, even when humans were inferior to LLMs, their combination with one or more LLMs consistently improved team performance. Our hope is that this simple and effective strategy for integrating the judgments of humans and machines will lead to productive collaborations.

8/16/2024