Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch

Read original: arXiv:2311.03099 - Published 6/14/2024 by Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li

Overview

This paper introduces DARE (Drop And REscale), a technique that lets language models (LMs) acquire new capabilities by assimilating parameters from homologous models, that is, models fine-tuned from the same pre-trained backbone, without retraining or GPUs.
The authors show that the differences (delta parameters) between fine-tuned and pre-trained LMs are typically small and highly redundant, and that DARE can eliminate 90% or even 99% of them without affecting the model's abilities.
DARE can be used as a versatile plug-in to merge multiple task-specific LMs into a single model with diverse capabilities, and the effect is more pronounced in large-scale LMs.
The merged LM can sometimes surpass the performance of any of its source models.

Plain English Explanation

Acquiring New Capabilities Without Retraining: The paper explains how language models can learn new skills by incorporating parameters from similar models, without going through retraining. This is done with a technique called DARE, which removes most of the differences (delta parameters) between the fine-tuned and pre-trained versions of a model without affecting its performance.

Merging Multiple Language Models: The researchers also show how DARE can be used to combine several task-specific language models into a single model with a diverse set of capabilities. This works particularly well for large-scale language models, where the merged model can sometimes outperform any of the individual source models.

Potential for Efficient Model Scaling: This discovery suggests that there may be an efficient way to extend language models by merging specialized models, rather than retraining a single large model from scratch. This could improve the capabilities of AI systems without massive computational resources.

Technical Explanation

The paper introduces DARE (Drop And REscale), a technique that allows language models (LMs) to acquire new capabilities by assimilating parameters from similar, or "homologous," models without retraining or GPUs.

The authors first show that the delta parameters, the differences between fine-tuned and pre-trained weights, are typically small (value ranges within 0.002) and extremely redundant. They then propose DARE, which randomly Drops delta parameters with a ratio p And REscales the remaining ones by 1 / (1 - p) to approximate the original embeddings. This eliminates 90% or even 99% of the delta parameters without affecting the model's abilities.
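The drop-and-rescale step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `dare`, the toy weight matrix, and the uniformly sampled delta are all assumptions made for the example. Because each entry survives with probability 1 - p and is scaled by 1 / (1 - p), the expected value of every sparsified delta entry equals its original value.

```python
import numpy as np

def dare(delta, p, rng):
    """Randomly zero a fraction p of the delta entries and rescale
    the surviving entries by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep_mask = rng.random(delta.shape) >= p  # True with probability 1 - p
    return np.where(keep_mask, delta / (1.0 - p), 0.0)

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): one "pre-trained" weight matrix and a small
# fine-tuning delta whose values stay within the 0.002 range noted above.
pretrained = rng.normal(size=(512, 512))
delta = rng.uniform(-0.002, 0.002, size=pretrained.shape)

sparse_delta = dare(delta, p=0.9, rng=rng)

# Compare layer outputs with the full delta and with the sparsified delta.
x = rng.normal(size=512)
full_out = (pretrained + delta) @ x
dare_out = (pretrained + sparse_delta) @ x
rel_err = np.linalg.norm(dare_out - full_out) / np.linalg.norm(full_out)

print("fraction of delta entries dropped:", np.mean(sparse_delta == 0.0))
print("relative output error:", rel_err)
```

In this toy setting the output of the layer barely changes even though most delta entries are zeroed, which mirrors the redundancy the paper reports; real models would need to be evaluated on their actual tasks.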

The researchers then use DARE as a versatile plug-in to sparsify the delta parameters of multiple task-specific SFT (Supervised Fine-Tuning) homologous models, which mitigates parameter interference, and merge them into a single model by parameter fusing.
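A hedged sketch of how sparsified deltas might be fused follows. It assumes the simplest fusing rule, adding the DARE-processed deltas back onto the shared pre-trained weights with a scaling coefficient; the names `dare_merge` and `scale`, and the toy "math" and "code" models, are hypothetical and not taken from the paper. DARE is described as a plug-in, so other merging methods could take the place of the plain sum used here.

```python
import numpy as np

def dare(delta, p, rng):
    """Drop a fraction p of delta entries at random, rescale the rest by 1 / (1 - p)."""
    keep_mask = rng.random(delta.shape) >= p
    return np.where(keep_mask, delta / (1.0 - p), 0.0)

def dare_merge(pretrained, finetuned_models, p, scale, rng):
    """Merge homologous models: sparsify each model's delta with DARE,
    sum the sparsified deltas, and add them to the shared pre-trained weights.
    `scale` is a hypothetical coefficient controlling the size of the merged delta."""
    merged_delta = np.zeros_like(pretrained)
    for finetuned in finetuned_models:
        merged_delta += dare(finetuned - pretrained, p, rng)
    return pretrained + scale * merged_delta

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): two models fine-tuned from the same backbone.
pretrained = rng.normal(size=(64, 64))
math_model = pretrained + rng.uniform(-0.002, 0.002, size=pretrained.shape)
code_model = pretrained + rng.uniform(-0.002, 0.002, size=pretrained.shape)

merged = dare_merge(pretrained, [math_model, code_model], p=0.9, scale=1.0, rng=rng)
print("merged weight shape:", merged.shape)
```

Sparsifying before summing means that, at a high drop rate, few positions receive non-zero updates from more than one model, which is one way to read the claim that DARE mitigates parameter interference.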

The experiments show that the benefit of merging is more pronounced in large-scale LMs, where the merged model can sometimes surpass the performance of any source model. The authors also use DARE to create a merged LM that ranks first among 7-billion-parameter models on the Open LLM Leaderboard.

Critical Analysis

The paper presents an intriguing approach for efficiently scaling up language models by merging specialized models, rather than retraining a single large model from scratch. This could lead to significant improvements in the capabilities of AI systems without the need for massive computational resources.

However, the authors do not address potential limitations or caveats of their approach. For example, it is unclear how well the merged model would perform across a wide range of tasks compared with a model trained from scratch on a diverse dataset. Additionally, the paper does not explore the effects of merging on model robustness, fairness, or safety.

Further research is needed to understand the broader implications and potential issues with this technique, as well as its applicability to other types of AI models beyond language models. It will be important for the research community to critically examine the findings and consider the long-term consequences of such model merging approaches.

Conclusion

This paper introduces a novel technique called DARE that enables language models to acquire new capabilities by assimilating parameters from similar models, without the need for retraining or specialized hardware. The authors demonstrate that DARE can effectively merge multiple task-specific language models into a single model with diverse capabilities, particularly for large-scale language models.

This discovery suggests that there may be an efficient way to scale up language models by leveraging existing specialized models, rather than having to retrain a single large model from scratch. If further research can address the potential limitations and implications of this approach, it could lead to significant advancements in the capabilities and accessibility of AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch

Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li

In this paper, we unveil that Language Models (LMs) can acquire new capabilities by assimilating parameters from homologous models without retraining or GPUs. We first introduce DARE to set most delta parameters (i.e., the disparity between fine-tuned and pre-trained parameters) to zeros without affecting the abilities of Supervised Fine-Tuning (SFT) LMs, which randomly Drops delta parameters with a ratio $p$ And REscales the remaining ones by $1 / (1 - p)$ to approximate the original embeddings. Then, we use DARE as a versatile plug-in to sparsify delta parameters of multiple SFT homologous models for mitigating parameter interference and merge them into a single model by parameter fusing. We experiment with encoder- and decoder-based LMs, showing that: (1) SFT delta parameter value ranges are typically small (within 0.002) with extreme redundancy, and DARE can effortlessly eliminate 90% or even 99% of them; (2) DARE can merge multiple task-specific LMs into one LM with diverse capabilities. Notably, this phenomenon is more pronounced in large-scale LMs, where the merged LM reveals the potential to surpass the performance of any source LM, providing a new discovery. We also utilize DARE to create a merged LM that ranks first among models with 7 billion parameters on the Open LLM Leaderboard.

6/14/2024

Unlocking the Potential of Model Merging for Low-Resource Languages

Mingxu Tao, Chen Zhang, Quzhe Huang, Tianyao Ma, Songfang Huang, Dongyan Zhao, Yansong Feng

Adapting large language models (LLMs) to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT). However, this CT-then-SFT approach struggles with limited data in the context of low-resource languages, failing to balance language modeling and task-solving capabilities. We thus propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training. We use model merging to develop task-solving LLMs for low-resource languages without SFT data in the target languages. Our experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data. Observing performance saturation in model merging with more training tokens, we further analyze the merging process and introduce a slack variable to the model merging algorithm to mitigate the loss of important parameters, thereby enhancing performance. We hope that model merging can benefit more human languages suffering from data scarcity with its higher data efficiency.

7/10/2024

Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities

Wei Lu, Rachel K. Luu, Markus J. Buehler

The advancement of Large Language Models (LLMs) for domain applications in fields such as materials science and engineering depends on the development of fine-tuning strategies that adapt models for specialized, technical capabilities. In this work, we explore the effects of Continued Pretraining (CPT), Supervised Fine-Tuning (SFT), and various preference-based optimization approaches, including Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO), on fine-tuned LLM performance. Our analysis shows how these strategies influence model outcomes and reveals that the merging of multiple fine-tuned models can lead to the emergence of capabilities that surpass the individual contributions of the parent models. We find that model merging leads to new functionalities that neither parent model could achieve alone, leading to improved performance in domain-specific assessments. Experiments with different model architectures are presented, including Llama 3.1 8B and Mistral 7B models, where similar behaviors are observed. Exploring whether the results hold also for much smaller models, we use a tiny LLM with 1.7 billion parameters and show that very small LLMs do not necessarily feature emergent capabilities under model merging, suggesting that model scaling may be a key component. In open-ended yet consistent chat conversations between a human and AI models, our assessment reveals detailed insights into how different model variants perform and show that the smallest model achieves a high intelligence score across key criteria including reasoning depth, creativity, clarity, and quantitative precision. Other experiments include the development of image generation prompts based on disparate biological material design concepts, to create new microstructures, architectural concepts, and urban design based on biological materials-inspired construction principles.

9/6/2024

SambaLingo: Teaching Large Language Models New Languages

Zoltan Csaki, Bo Li, Jonathan Li, Qiantong Xu, Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao, Changran Hu, Urmish Thakker

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.

7/19/2024