MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic

2406.11385

Published 6/28/2024 by Yuyan Zhou, Liang Song, Bingning Wang, Weipeng Chen

MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic

Abstract

The advent of large language models (LLMs) like GPT-4 has catalyzed the exploration of multi-task learning (MTL), in which a single model demonstrates proficiency across diverse tasks. Task arithmetic has emerged as a cost-effective approach for MTL. It enables performance enhancement across multiple tasks by adding their corresponding task vectors to a pre-trained model. However, the current lack of a method that can simultaneously achieve optimal performance, computational efficiency, and data privacy limits their application to LLMs. In this paper, we propose textbf{M}odel textbf{E}xclusive textbf{T}ask textbf{A}rithmetic for merging textbf{GPT}-scale models, which formalizes the objective of model merging into a multi-task learning framework, aiming to minimize the average loss difference between the merged model and each individual task model. Since data privacy limits the use of multi-task training data, we leverage LLMs' local linearity and task vectors' orthogonality to separate the data term and scaling coefficients term and derive a model-exclusive task arithmetic method. Our proposed MetaGPT is data-agnostic and bypasses the heavy search process, making it cost-effective and easy to implement for LLMs.Extensive experiments demonstrate that MetaGPT leads to improvements in task arithmetic and achieves state-of-the-art performance on multiple tasks.

Create account to get full access

Overview

This paper proposes a novel approach called MetaGPT for merging large language models (LLMs) using "model exclusive task arithmetic."
The key idea is to leverage the specialized capabilities of individual LLMs to create a more capable "merged" model, without requiring full fine-tuning or retraining of the base models.
The authors demonstrate that MetaGPT can outperform individual LLMs and other model merging techniques on a variety of downstream tasks.

Plain English Explanation

Large language models (LLMs) like GPT-3 have become incredibly powerful at tasks like text generation, question answering, and language understanding. However, each model has its own specialized strengths and weaknesses - some may excel at certain tasks, while others may be better at different ones.

The researchers behind this paper wanted to find a way to combine the unique capabilities of multiple LLMs into a single, more capable "merged" model. Their approach, called MetaGPT, does this without requiring the models to be fully retrained or fine-tuned. Instead, they use a technique called "model exclusive task arithmetic" to essentially add up the specialized skills of the individual models.

For example, if one model is great at answering science questions and another is great at creative writing, MetaGPT can blend those capabilities together to create a model that excels at both. This allows the merged model to take advantage of the best parts of each individual model, rather than just averaging their overall performance.

The researchers show that MetaGPT can outperform the individual LLMs, as well as other model merging techniques, on a variety of different tasks. This suggests that their approach is an effective way to combine the strengths of multiple large language models into a single, more capable system.

Technical Explanation

The key innovation in this paper is the MetaGPT framework for merging large language models using "model exclusive task arithmetic." Rather than simply averaging the weights or outputs of multiple LLMs, MetaGPT leverages the specialized capabilities of each individual model to create a more capable merged model.

The core idea is to first evaluate the performance of each LLM on a diverse set of tasks. This allows the researchers to identify the specific strengths and weaknesses of each model. They then use this information to define "model exclusive tasks" - tasks where one LLM significantly outperforms the others.

These model exclusive tasks are used to define a set of "task embeddings" that capture the unique capabilities of each LLM. MetaGPT then combines these task embeddings using a simple additive operation to create a merged model that can take advantage of the specialized skills of the individual LLMs.

The authors demonstrate the effectiveness of MetaGPT through extensive experiments on a range of downstream tasks, including language modeling, question answering, and natural language inference. They show that the MetaGPT model outperforms both the individual LLMs as well as other model merging techniques like AdaMerging and AM-GPT.

Critical Analysis

One potential limitation of the MetaGPT approach is that it relies on the ability to accurately identify the specialized capabilities of each LLM. If the task-level performance evaluation is not comprehensive or representative, the resulting task embeddings may not fully capture the models' strengths and weaknesses.

Additionally, the authors note that the effectiveness of MetaGPT may depend on the degree of overlap or complementarity between the individual LLMs. If the models have highly redundant capabilities, the benefits of merging may be less significant. Further research is needed to understand how MetaGPT performs in more diverse model ensembles, such as those spanning different architectures or training datasets.

It would also be interesting to see how MetaGPT compares to more advanced model merging techniques like those explored in MegaVerse and Multilingual Machine Translation, which leverage cross-model interactions and knowledge transfer.

Conclusion

Overall, the MetaGPT framework represents a novel and promising approach to combining the specialized capabilities of multiple large language models. By using "model exclusive task arithmetic," the researchers have demonstrated a way to create a more capable merged model without requiring full retraining or fine-tuning of the base models.

The potential implications of this work are significant, as it suggests that we can leverage the diverse strengths of different LLMs to tackle increasingly complex and varied tasks. As the field of large language models continues to evolve, techniques like MetaGPT may play an important role in unlocking the full potential of these powerful AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

📈

AdaMerging: Adaptive Model Merging for Multi-Task Learning

Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, Dacheng Tao

Multi-task learning (MTL) aims to empower a model to tackle multiple tasks simultaneously. A recent development known as task arithmetic has revealed that several models, each fine-tuned for distinct tasks, can be directly merged into a single model to execute MTL without necessitating a retraining process using the initial training data. Nevertheless, this direct addition of models often leads to a significant deterioration in the overall performance of the merged model. This decline occurs due to potential conflicts and intricate correlations among the multiple tasks. Consequently, the challenge emerges of how to merge pre-trained models more effectively without using their original training data. This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging). This approach aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data. Specifically, our AdaMerging method operates as an automatic, unsupervised task arithmetic scheme. It leverages entropy minimization on unlabeled test samples from the multi-task setup as a surrogate objective function to iteratively refine the merging coefficients of the multiple models. Our experimental findings across eight tasks demonstrate the efficacy of the AdaMerging scheme we put forth. Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance. Notably, AdaMerging also exhibits superior generalization capabilities when applied to unseen downstream tasks. Furthermore, it displays a significantly enhanced robustness to data distribution shifts that may occur during the testing phase.

5/29/2024

cs.LG cs.CV

AMGPT: a Large Language Model for Contextual Querying in Additive Manufacturing

Achuth Chandrasekhar, Jonathan Chan, Francis Ogoke, Olabode Ajenifujah, Amir Barati Farimani

Generalized large language models (LLMs) such as GPT-4 may not provide specific answers to queries formulated by materials science researchers. These models may produce a high-level outline but lack the capacity to return detailed instructions on manufacturing and material properties of novel alloys. Enhancing a smaller model with specialized domain knowledge may provide an advantage over large language models which cannot be retrained quickly enough to keep up with the rapid pace of research in metal additive manufacturing (AM). We introduce AMGPT, a specialized LLM text generator designed for metal AM queries. The goal of AMGPT is to assist researchers and users in navigating the extensive corpus of literature in AM. Instead of training from scratch, we employ a pre-trained Llama2-7B model from Hugging Face in a Retrieval-Augmented Generation (RAG) setup, utilizing it to dynamically incorporate information from $sim$50 AM papers and textbooks in PDF format. Mathpix is used to convert these PDF documents into TeX format, facilitating their integration into the RAG pipeline managed by LlamaIndex. Expert evaluations of this project highlight that specific embeddings from the RAG setup accelerate response times and maintain coherence in the generated text.

6/4/2024

cs.CL cs.LG

💬

MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram

There has been a surge in LLM evaluation research to understand LLM capabilities and limitations. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. This study aims to perform a thorough evaluation of the non-English capabilities of SoTA LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Gemini-Pro, Mistral, Llama2, and Gemma) by comparing them on the same set of multilingual datasets. Our benchmark comprises 22 datasets covering 83 languages, including low-resource African languages. We also include two multimodal datasets in the benchmark and compare the performance of LLaVA models, GPT-4-Vision and Gemini-Pro-Vision. Our experiments show that larger models such as GPT-4, Gemini-Pro and PaLM2 outperform smaller models on various tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 and Gemini-Pro on more datasets. We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks, necessitating approaches to detect and handle contamination while assessing the multilingual performance of LLMs.

4/4/2024

cs.CL

💬

Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis

Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, Lei Li

Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT). In this paper, we systematically investigate the advantages and challenges of LLMs for MMT by answering two questions: 1) How well do LLMs perform in translating massive languages? 2) Which factors affect LLMs' performance in translation? We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4. Our empirical results show that translation capabilities of LLMs are continually involving. GPT-4 has beat the strong supervised baseline NLLB in 40.91% of translation directions but still faces a large gap towards the commercial translation system like Google Translate, especially on low-resource languages. Through further analysis, we discover that LLMs exhibit new working patterns when used for MMT. First, LLM can acquire translation ability in a resource-efficient way and generate moderate translation even on zero-resource languages. Second, instruction semantics can surprisingly be ignored when given in-context exemplars. Third, cross-lingual exemplars can provide better task guidance for low-resource translation than exemplars in the same language pairs. Code will be released at: https://github.com/NJUNLP/MMT-LLM.

6/17/2024

cs.CL