MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models

Read original: arXiv:2402.01620 - Published 6/11/2024 by Justin Chih-Yao Chen, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal

Overview

Researchers propose a method called "MAGDi" for distilling reasoning knowledge from multi-agent discussions into a base student model.
The process involves having multiple teacher language models engage in a multi-round discussion on a reasoning problem, creating a multi-agent interaction graph (MAG).
MAGDi then distills the reasoning captured in these graphs into a smaller student model, which yields a single model for efficient inference instead of a costly multi-model discussion.

Plain English Explanation

The researchers have developed a new way to teach AI models how to reason more effectively. The core idea is to have multiple AI "teachers" engage in a back-and-forth discussion about a problem that requires reasoning. As the teachers discuss, the researchers capture the flow of their interaction in a special diagram called a "multi-agent interaction graph" (MAG).

The researchers then take the information stored in this MAG and use a technique called "structured distillation" to transfer the reasoning knowledge to a single "student" AI model. This allows the student model to learn complex reasoning skills by observing the collaborative exchange between the teacher models, rather than having to learn everything from scratch on its own.

The key benefit of this approach is that it can help AI models become better at generalizing their reasoning abilities to new, unseen problems. By distilling the high-level reasoning strategies captured in the MAG, the student model can develop a more robust and flexible problem-solving capacity.

Technical Explanation

The researchers' method, called "MAGDi" (Multi-Agent interaction Graph Distillation), involves several key steps:

Multi-Agent Discussion: Multiple pre-trained language models are prompted to engage in a multi-round discussion about a given reasoning problem. This back-and-forth interaction is captured in a multi-agent interaction graph (MAG).
Structured Distillation: MAGDi then distills the reasoning knowledge encoded in the MAG into a smaller student model. According to the paper's abstract, the student is augmented with a graph encoder and trained with three objectives: next-token prediction, a contrastive loss between correct and incorrect reasoning, and a graph-based objective that models the interaction structure.
Improved Generalization: The paper reports that the MAGDi-trained student outperforms several methods that distill from a single teacher or from multiple teachers, and that it generalizes better to out-of-domain tasks.
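To make the idea of a multi-agent interaction graph concrete, here is a minimal sketch of one way such a graph could be stored. It is an illustration under stated assumptions, not the authors' implementation: the names `Node`, `InteractionGraph`, `add_round`, and the teacher names are hypothetical, and the rule that every response in one round links to every response in the next round is an assumed simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    agent: str      # hypothetical teacher name
    round: int      # discussion round index, starting at 0
    response: str   # the agent's reasoning text for this round
    correct: bool   # whether the response reaches the gold answer


@dataclass
class InteractionGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (source index, target index)

    def add_round(self, round_idx, responses):
        """Add one discussion round.

        responses maps agent name -> (text, correct).
        Returns the indices of the newly added nodes.
        """
        previous = [i for i, n in enumerate(self.nodes) if n.round == round_idx - 1]
        new = []
        for agent, (text, correct) in responses.items():
            self.nodes.append(Node(agent, round_idx, text, correct))
            new.append(len(self.nodes) - 1)
        # Assumption: each new response is conditioned on all responses
        # from the previous round, so we connect them all.
        self.edges.extend((p, n) for p in previous for n in new)
        return new


def build_example_graph():
    """Two hypothetical teachers discussing a toy problem for two rounds."""
    g = InteractionGraph()
    g.add_round(0, {"teacher_a": ("2 + 3 = 5", True),
                    "teacher_b": ("2 + 3 = 6", False)})
    g.add_round(1, {"teacher_a": ("Still 5.", True),
                    "teacher_b": ("I agree, 5.", True)})
    return g
```

Calling `build_example_graph()` gives a graph with four response nodes and four edges running from round 0 to round 1. The `correct` flag on each node is what would let a training procedure distinguish correct from incorrect reasoning.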

The researchers evaluate the approach on seven widely used commonsense and math reasoning benchmarks. They also report that the distilled student is an order of magnitude more efficient than its teachers, and that the gains grow with the size and strength of the base student model.
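The paper's abstract names three training objectives for the distillation step: next-token prediction, a contrastive loss between correct and incorrect reasoning, and a graph-based objective. The sketch below shows one plausible way to combine three such terms into a single loss. The exact forms are assumptions made for illustration (a hinge-style margin loss for the contrastive term, binary cross-entropy over nodes for the graph term), and the function names and the weights `alpha` and `beta` are hypothetical rather than taken from the paper.

```python
import math


def next_token_loss(token_probs):
    """Mean negative log-likelihood the student assigns to the gold tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def contrastive_margin_loss(score_correct, score_incorrect, margin=1.0):
    """Hinge loss (assumed form): zero once the correct reasoning chain
    outscores the incorrect one by at least `margin`."""
    return max(0.0, margin - (score_correct - score_incorrect))


def node_classification_loss(pred_correct_probs, labels):
    """Binary cross-entropy over graph nodes (assumed graph objective):
    predict whether each node's response is correct."""
    total = 0.0
    for p, y in zip(pred_correct_probs, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def combined_loss(token_probs, score_correct, score_incorrect,
                  node_probs, node_labels, alpha=1.0, beta=1.0):
    """Weighted sum of the three terms; alpha and beta are hypothetical weights."""
    return (next_token_loss(token_probs)
            + alpha * contrastive_margin_loss(score_correct, score_incorrect)
            + beta * node_classification_loss(node_probs, node_labels))
```

In a real training setup these quantities would come from model outputs and be computed with a tensor library. Plain Python is used here only to keep the arithmetic visible.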

Critical Analysis

The researchers acknowledge several limitations of their work. First, the method relies on having access to multiple pre-trained teacher models, which may not always be available. Additionally, the process of constructing the MAG and performing the structured distillation can be computationally expensive, which may limit the scalability of the approach.

Furthermore, the researchers do not explore the potential biases or inconsistencies that may arise from aggregating the reasoning strategies of multiple teacher models. There may be cases where the teachers' outputs conflict with each other, and the distillation process may not be able to resolve these disagreements effectively.

Finally, the researchers do not provide a detailed analysis of the types of reasoning skills that are most effectively transferred from the MAG to the student model. It would be valuable to understand the specific strengths and weaknesses of the MAGDi approach compared to related methods such as MMIDR or MIDGARD (the latter aggregates sampled reasoning graphs via self-consistency rather than distilling them).

Conclusion

The researchers have proposed an innovative approach, called MAGDi, for distilling reasoning knowledge from multi-agent discussions into a base student model. By capturing the collaborative exchange between multiple teacher models in a multi-agent interaction graph, the researchers are able to transfer complex reasoning strategies to the student, resulting in improved generalization of its problem-solving abilities.

While the method has some limitations, it represents an important step forward in the field of AI reasoning and knowledge transfer. By leveraging the collective wisdom of multiple expert models, the MAGDi approach could potentially help create AI systems that are more flexible, adaptable, and capable of tackling a wide range of reasoning challenges.

This summary was produced with help from an AI and may contain inaccuracies. Consult the original paper (arXiv:2402.01620) to verify details.

Related Papers

MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models

Justin Chih-Yao Chen, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal

Multi-agent interactions between Large Language Model (LLM) agents have shown major improvements on diverse reasoning tasks. However, these involve long generations from multiple models across several rounds, making them expensive. Moreover, these multi-agent approaches fail to provide a final, single model for efficient inference. To address this, we introduce MAGDi, a new method for structured distillation of the reasoning interactions between multiple LLMs into smaller LMs. MAGDi teaches smaller models by representing multi-agent interactions as graphs, augmenting a base student model with a graph encoder, and distilling knowledge using three objective functions: next-token prediction, a contrastive loss between correct and incorrect reasoning, and a graph-based objective to model the interaction structure. Experiments on seven widely used commonsense and math reasoning benchmarks show that MAGDi improves the reasoning capabilities of smaller models, outperforming several methods that distill from a single teacher and multiple teachers. Moreover, MAGDi also demonstrates an order of magnitude higher efficiency over its teachers. We conduct extensive analyses to show that MAGDi (1) enhances the generalizability to out-of-domain tasks, (2) scales positively with the size and strength of the base student model, and (3) obtains larger improvements (via our multi-teacher training) when applying self-consistency -- an inference technique that relies on model diversity.

6/11/2024

MAGIC: Meta-Ability Guided Interactive Chain-of-Distillation for Effective-and-Efficient Vision-and-Language Navigation

Liuyi Wang, Zongtao He, Mengjiao Shen, Jingwei Yang, Chengju Liu, Qijun Chen

Despite the remarkable developments of recent large models in Embodied Artificial Intelligence (E-AI), their integration into robotics is hampered by their excessive parameter sizes and computational demands. Towards the Vision-and-Language Navigation (VLN) task, a core task in E-AI, this paper reveals the great potential of using knowledge distillation for obtaining lightweight student models by proposing a Meta-Ability Guided Interactive Chain-of-distillation (MAGIC) method. Specifically, a Meta-Ability Knowledge Distillation (MAKD) framework is proposed for decoupling and refining the necessary meta-abilities of VLN agents. A Meta-Knowledge Randomization Weighting (MKRW) and a Meta-Knowledge Transferable Determination (MKTD) module are incorporated to dynamically adjust aggregation weights at the meta-ability and sample levels, respectively. Moving beyond the traditional one-step unidirectional distillation, an Interactive Chain-of-Distillation (ICoD) learning strategy is proposed to allow students to give feedback to teachers, forming a new multi-step teacher-student co-evolution pipeline. Remarkably, on the R2R test unseen public leaderboard, our smallest model, MAGIC-S, with only 5% (11M) of the teacher's size, outperforms all previous methods under the same training data. Additionally, our largest model, MAGIC-L, surpasses the previous state-of-the-art by 5.84% in SPL and 3.18% in SR. Furthermore, a new dataset was collected and annotated from our living environments, where MAGIC-S demonstrated superior performance and real-time efficiency. Our code is publicly available on https://github.com/CrystalSixone/VLN-MAGIC.

6/27/2024

MIDGARD: Self-Consistency Using Minimum Description Length for Structured Commonsense Reasoning

Inderjeet Nair, Lu Wang

We study the task of conducting structured reasoning as generating a reasoning graph from natural language input using large language models (LLMs). Previous approaches have explored various prompting schemes, yet they suffer from error propagation due to the autoregressive nature and single-pass-based decoding, which lack error correction capability. Additionally, relying solely on a single sample may result in the omission of true nodes and edges. To counter this, we draw inspiration from self-consistency (SC), which involves sampling a diverse set of reasoning chains and taking the majority vote as the final answer. To tackle the substantial challenge of applying SC on generated graphs, we propose MIDGARD (MInimum Description length Guided Aggregation of Reasoning in Directed acyclic graph) that leverages Minimum Description Length (MDL)-based formulation to identify consistent properties among the different graph samples generated by an LLM. This formulation helps reject properties that appear in only a few samples, which are likely to be erroneous, while enabling the inclusion of missing elements without compromising precision. Our method demonstrates superior performance than comparisons across various structured reasoning tasks, including argument structure extraction, explanation graph generation, inferring dependency relations among actions for everyday tasks, and semantic graph generation from natural texts.

6/4/2024

Sub-goal Distillation: A Method to Improve Small Language Agents

Maryam Hashemzadeh, Elias Stengel-Eskin, Sarath Chandar, Marc-Alexandre Cote

While Large Language Models (LLMs) have demonstrated significant promise as agents in interactive tasks, their substantial computational requirements and restricted number of calls constrain their practical utility, especially in long-horizon interactive tasks such as decision-making or in scenarios involving continuous ongoing tasks. To address these constraints, we propose a method for transferring the performance of an LLM with billions of parameters to a much smaller language model (770M parameters). Our approach involves constructing a hierarchical agent comprising a planning module, which learns through Knowledge Distillation from an LLM to generate sub-goals, and an execution module, which learns to accomplish these sub-goals using elementary actions. In detail, we leverage an LLM to annotate an oracle path with a sequence of sub-goals towards completing a goal. Subsequently, we utilize this annotated data to fine-tune both the planning and execution modules. Importantly, neither module relies on real-time access to an LLM during inference, significantly reducing the overall cost associated with LLM interactions to a fixed cost. In ScienceWorld, a challenging and multi-task interactive text environment, our method surpasses standard imitation learning based solely on elementary actions by 16.7% (absolute). Our analysis highlights the efficiency of our approach compared to other LLM-based methods. Our code and annotated data for distillation can be found on GitHub.

5/7/2024