Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs

Read original: arXiv:2311.17371 - Published 7/19/2024 by Andries Smit, Paul Duckworth, Nathan Grinsztajn, Thomas D. Barrett, Arnu Pretorius

Overview

This paper investigates the use of multi-agent debate between large language models (LLMs) for improving medical question-answering (Q&A) performance.
The authors benchmark different multi-agent debate setups and compare them to single-agent baselines to assess the potential benefits of this approach.
The paper explores the tradeoffs between having agents with different biases or the same bias, as well as the impact of allowing agents to interact freely or with preset stances.

Plain English Explanation

The researchers wanted to see if having multiple AI language models [<a href="https://aimodels.fyi/papers/arxiv/encouraging-divergent-thinking-large-language-models-through">1</a>, <a href="https://aimodels.fyi/papers/arxiv/learning-to-break-knowledge-enhanced-reasoning-multi">2</a>, <a href="https://aimodels.fyi/papers/arxiv/multiagent-collaboration-attack-investigating-adversarial-attacks-large">3</a>] debate each other could improve their ability to answer medical questions accurately. They set up different scenarios where the AI agents had either the same biases or different biases, and where the agents could either freely debate the answers or were given predetermined stances to argue for or against [<a href="https://aimodels.fyi/papers/arxiv/counterfactual-debating-preset-stances-hallucination-elimination-llms">4</a>, <a href="https://aimodels.fyi/papers/arxiv/debating-more-persuasive-llms-leads-to-more">5</a>]. The goal was to see which setup led to the best performance on medical Q&A tasks.

Technical Explanation

The authors designed a multi-agent debate framework where two AI agents with access to a shared knowledge base would argue for or against potential answers to medical questions. They experimented with different configurations, including:

Agents with the same bias vs. agents with different biases
Agents freely debating the answers vs. agents assigned preset stances to argue for or against

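The researchers evaluated the agents' performance on a suite of medical Q&A tasks and compared the multi-agent setups to single-agent baselines. According to the paper's abstract, multi-agent debate in its current form does not reliably outperform other prompting strategies, such as self-consistency and ensembling over multiple reasoning paths. With hyperparameter tuning, however, several debate systems, such as Multi-Persona, perform better. This suggests that debate protocols are not inherently worse than other approaches, but are more sensitive to hyperparameter settings and harder to optimize. Adjusting agent agreement levels in particular can significantly improve performance, to the point of surpassing the non-debate protocols the authors evaluated.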
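To make the debate setup concrete, here is a minimal sketch of one possible debate loop, not the authors' implementation. It assumes an "agent" is just a function from a question and its peers' latest answers to an answer, and that the final answer is chosen by majority vote. The names `run_debate`, `stubborn`, and `follower` are hypothetical stand-ins for real LLM calls.

```python
from collections import Counter
from typing import Callable, List

# Assumption: an agent maps (question, peer_answers) -> answer.
# In a real system this would wrap an LLM call; here it is a plain function.
Agent = Callable[[str, List[str]], str]


def run_debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Run a simple debate and return the majority answer."""
    # Round 0: every agent answers independently, with no peer answers visible.
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the other agents' previous answers and may revise.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    # Aggregate by majority vote; on ties the earliest-seen answer wins.
    return Counter(answers).most_common(1)[0][0]


def stubborn(answer: str) -> Agent:
    """Hypothetical agent that never changes its answer."""
    return lambda question, peers: answer


def follower(initial: str) -> Agent:
    """Hypothetical agent that adopts the most common peer answer."""
    def agent(question: str, peers: List[str]) -> str:
        return Counter(peers).most_common(1)[0][0] if peers else initial
    return agent


if __name__ == "__main__":
    agents = [stubborn("B"), stubborn("B"), follower("A")]
    print(run_debate("Which option is correct?", agents))
```

A single-agent baseline in this sketch is simply `agent(question, [])`, which makes the comparison between the two setups easy to run on the same questions.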

Critical Analysis

The paper provides a valuable exploration of the potential benefits and tradeoffs of using multi-agent debate for medical Q&A. However, the authors acknowledge several limitations, such as the need for more diverse datasets and the potential for the debate process to introduce new biases or errors.

Additionally, the paper does not delve deeply into the underlying mechanisms or reasoning processes of the agents. Further research could examine how the agents' internal representations and decision-making evolve during the debate process, and how this impacts the final outputs.

Overall, the paper presents multi-agent debate as a promising but not yet reliable approach for improving the answers of large language models in medical and other domains. More work is needed to fully understand the strengths, weaknesses, and practical applications of this technique.

Conclusion

This paper investigates the use of multi-agent debate between large language models as a means of improving question-answering performance, including on medical tasks. The results suggest that debate systems in their current form do not reliably outperform simpler prompting strategies such as self-consistency, but that they can do better once their hyperparameters are tuned, for example by adjusting how readily agents agree with one another. This work highlights both the potential of introducing more diverse perspectives into language model reasoning and the difficulty of optimizing debate protocols in practice.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs

Andries Smit, Paul Duckworth, Nathan Grinsztajn, Thomas D. Barrett, Arnu Pretorius

Recent advancements in large language models (LLMs) underscore their potential for responding to inquiries in various domains. However, ensuring that generative agents provide accurate and reliable answers remains an ongoing challenge. In this context, multi-agent debate (MAD) has emerged as a promising strategy for enhancing the truthfulness of LLMs. We benchmark a range of debating and prompting strategies to explore the trade-offs between cost, time, and accuracy. Importantly, we find that multi-agent debating systems, in their current form, do not reliably outperform other proposed prompting strategies, such as self-consistency and ensembling using multiple reasoning paths. However, when performing hyperparameter tuning, several MAD systems, such as Multi-Persona, perform better. This suggests that MAD protocols might not be inherently worse than other approaches, but that they are more sensitive to different hyperparameter settings and difficult to optimize. We build on these results to offer insights into improving debating strategies, such as adjusting agent agreement levels, which can significantly enhance performance and even surpass all other non-debate protocols we evaluated. We provide an open-source repository to the community with several state-of-the-art protocols together with evaluation scripts to benchmark across popular research datasets.

7/19/2024
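The abstract above names self-consistency as one of the non-debate baselines. As a rough illustration of the general idea (sample several answers to the same question and keep the most frequent one), here is a minimal sketch. The function names are hypothetical, and `scripted_sampler` is a deterministic stand-in for a stochastic LLM call; none of this is taken from the authors' repository.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Iterable


def self_consistency(
    sample_answer: Callable[[str], str], question: str, n_samples: int = 5
) -> str:
    """Sample n_samples answers and return the most frequent one."""
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    # On ties, the answer that was sampled first wins.
    return votes.most_common(1)[0][0]


def scripted_sampler(answers: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical sampler that replays a fixed sequence of answers.

    A real sampler would call an LLM with a non-zero temperature.
    """
    stream = cycle(answers)
    return lambda question: next(stream)


if __name__ == "__main__":
    sampler = scripted_sampler(["A", "B", "A"])
    # Five samples are drawn: A, B, A, A, B, so "A" wins 3 votes to 2.
    print(self_consistency(sampler, "Which option is correct?"))
```

Because each sample is independent, this baseline parallelizes trivially, which is one reason it is a natural point of comparison for the cost and time trade-offs the abstract mentions.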

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi

Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of tit for tat and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of tit for tat state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Code is available at https://github.com/Skytliang/Multi-Agents-Debate.

7/18/2024
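The abstract above describes two debaters arguing in turn while a judge manages the debate and can end it early (the "adaptive break"). The sketch below shows one way such a loop could be organized. It is an assumption-laden illustration rather than the code from the linked repository: the type aliases, the `debate_with_judge` function, and the choice to return `None` when no verdict is reached are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Assumptions: a debater maps (topic, transcript) -> argument, and a judge
# maps (topic, transcript) -> final answer, or None to continue the debate.
Debater = Callable[[str, List[str]], str]
Judge = Callable[[str, List[str]], Optional[str]]


def debate_with_judge(
    topic: str,
    affirmative: Debater,
    negative: Debater,
    judge: Judge,
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Alternate two debaters; stop as soon as the judge gives a verdict."""
    transcript: List[str] = []
    for round_no in range(1, max_rounds + 1):
        transcript.append("AFF: " + affirmative(topic, transcript))
        transcript.append("NEG: " + negative(topic, transcript))
        verdict = judge(topic, transcript)
        if verdict is not None:
            # Adaptive break: the judge ends the debate early.
            return verdict, round_no
    # No verdict within max_rounds; how to resolve this is left open here.
    return None, max_rounds


if __name__ == "__main__":
    aff: Debater = lambda topic, transcript: "the answer is 3"
    neg: Debater = lambda topic, transcript: "the answer is 4"
    # Hypothetical judge that decides once two full rounds have been heard.
    judge: Judge = lambda topic, transcript: "4" if len(transcript) >= 4 else None
    print(debate_with_judge("toy arithmetic question", aff, neg, judge))
```

Keeping the judge as a separate callable makes it easy to swap in a different model for judging, which matters given the abstract's caution that an LLM may not be a fair judge when the debaters use different models.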

Learning to Break: Knowledge-Enhanced Reasoning in Multi-Agent Debate System

Haotian Wang, Xiyuan Du, Weijiang Yu, Qianglong Chen, Kun Zhu, Zheng Chu, Lian Yan, Yi Guan

Multi-agent debate system (MAD) imitating the process of human discussion in pursuit of truth, aims to align the correct cognition of different agents for the optimal solution. It is challenging to make various agents perform right and highly consistent cognition due to their limited and different knowledge backgrounds (i.e., cognitive islands), which hinders the search for the optimal solution. To address the challenge, we propose a novel Multi-Agent Debate with Knowledge-Enhanced framework (MADKE) to promote the system to find the solution. First, we involve a shared retrieval knowledge pool in the debate process to solve the problem of limited and different knowledge backgrounds. Then, we propose an adaptive knowledge selection method to guarantee the accuracy and personalization of knowledge. This method allows agents to choose whether to use external knowledge in each conversation round according to their own needs. Our experimental results on six datasets show that our method achieves state-of-the-art results compared to existing single-agent and multi-agent methods. Further analysis reveals that the introduction of retrieval knowledge can help the agent to break cognitive islands in the debate process and effectively improve the consistency and correctness of the model. Moreover, MADKE using Qwen1.5-72B-Chat surpasses GPT-4 by +1.26% on average in six datasets, which validates that our method can help open-source LLMs achieve or even surpass the performance of GPT-4. Our code is available at https://github.com/FutureForMe/MADKE.

7/12/2024

Can LLMs Beat Humans in Debating? A Dynamic Multi-agent Framework for Competitive Debate

Yiqun Zhang, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, Kaisong Song

Competitive debate is a complex task of computational argumentation. Large Language Models (LLMs) suffer from hallucinations and lack competitiveness in this field. To address these challenges, we introduce Agent for Debate (Agent4Debate), a dynamic multi-agent framework based on LLMs designed to enhance their capabilities in competitive debate. Drawing inspiration from human behavior in debate preparation and execution, Agent4Debate employs a collaborative architecture where four specialized agents, involving Searcher, Analyzer, Writer, and Reviewer, dynamically interact and cooperate. These agents work throughout the debate process, covering multiple stages from initial research and argument formulation to rebuttal and summary. To comprehensively evaluate framework performance, we construct the Competitive Debate Arena, comprising 66 carefully selected Chinese debate motions. We recruit ten experienced human debaters and collect records of 200 debates involving Agent4Debate, baseline models, and humans. The evaluation employs the Debatrix automatic scoring system and professional human reviewers based on the established Debatrix-Elo and Human-Elo ranking. Experimental results indicate that the state-of-the-art Agent4Debate exhibits capabilities comparable to those of humans. Furthermore, ablation studies demonstrate the effectiveness of each component in the agent structure.

8/21/2024
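The abstract above ranks debaters with Elo-based scores (Debatrix-Elo and Human-Elo). It does not spell out the exact variant used, so the following is only a sketch of the standard Elo update, with the common scale of 400 and a K-factor of 32 chosen here as assumptions.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings after one debate.

    score_a is 1.0 if A wins, 0.5 for a draw, and 0.0 if A loses.
    k = 32 is an assumed K-factor, not a value taken from the paper.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated debaters; A wins, so A gains 16 points and B loses 16.
    print(update_elo(1500.0, 1500.0, score_a=1.0))
```

The update is zero-sum: whatever one debater gains the other loses, so the average rating across the pool of agents and human debaters stays constant.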