Localize-and-Stitch: Efficient Model Merging via Sparse Task Arithmetic

Read original: arXiv:2408.13656 - Published 8/27/2024 by Yifei He, Yuzheng Hu, Yong Lin, Tong Zhang, Han Zhao

Overview

This paper introduces Localize-and-Stitch, a method for merging multiple fine-tuned machine learning models into a single model.
The key idea is to identify the small set of task-relevant parameters in each fine-tuned model (about 1% of the total, according to the abstract) and to merge only those into the pre-trained model, rather than performing arithmetic across all parameters.
The authors demonstrate the effectiveness of their approach through experiments on various vision and language tasks.

Plain English Explanation

When copies of the same pre-trained model are fine-tuned on different tasks, each copy picks up its own specialized skill. Localize-and-Stitch aims to combine these fine-tuned models into a single, more capable model without the skills interfering with one another.

The key insight is that not all parts of a model are equally important for each task. Localize-and-Stitch identifies the small region of parameters that carries each task's skill and reintegrates only those regions into the pre-trained model. Everything else keeps its pre-trained value, so the merge does not require re-training the entire network from scratch.

By stitching in only the important regions, Localize-and-Stitch can create a single, unified model that is capable of performing multiple tasks. This can be particularly useful in real-world applications where a model needs to handle a variety of different scenarios.

Technical Explanation

The Localize-and-Stitch approach works by first identifying the task-relevant parameters in each fine-tuned model (the localization step). An importance measure is computed for every parameter, for example from the gradients of the model outputs with respect to that parameter, and only a tiny, sparse region is kept for each task: roughly 1% of the total parameters, according to the abstract.
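To make the localization idea concrete, here is a minimal sketch. It swaps in a simpler importance score than the one described above: the magnitude of each parameter's change from its pre-trained value (the "task vector"). That scoring rule, the function names `task_vector` and `localize_top_k`, and the toy weights are all assumptions made for illustration, not the authors' implementation.

```python
def task_vector(pretrained, finetuned):
    """Per-parameter change introduced by fine-tuning."""
    return [f - p for p, f in zip(pretrained, finetuned)]


def localize_top_k(tau, keep_fraction=0.01):
    """Return a 0/1 mask keeping the largest-magnitude entries of tau.

    Magnitude-based scoring is an assumption of this sketch.
    """
    k = max(1, round(keep_fraction * len(tau)))
    # Indices of the k entries with the largest absolute change.
    top = sorted(range(len(tau)), key=lambda i: abs(tau[i]), reverse=True)[:k]
    mask = [0] * len(tau)
    for i in top:
        mask[i] = 1
    return mask


# Toy flattened weights: fine-tuning mostly changed positions 2 and 7.
pretrained = [0.10, -0.20, 0.30, 0.05, -0.40, 0.00, 0.25, -0.15, 0.60, 0.35]
finetuned = [0.11, -0.20, 0.95, 0.05, -0.41, 0.00, 0.25, -0.85, 0.60, 0.36]

tau = task_vector(pretrained, finetuned)
mask = localize_top_k(tau, keep_fraction=0.2)
print(mask)  # [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
```

With `keep_fraction=0.01` on a real model, the mask would select about 1% of the parameters, matching the sparsity level quoted in the abstract.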

Once the task-relevant parameters have been identified, the authors use a sparse task arithmetic operation to combine the models (the stitching step). Only the localized regions of each fine-tuned model are reintegrated into the pre-trained model; all other parameters keep their pre-trained values. Because each model contributes only a small, sparse set of changes, the merge is cheap to compute and store, and it avoids much of the task interference that arises when arithmetic is performed across all parameters.
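The stitching step can be sketched in a few lines. The example below assumes each task is given as a task vector (fine-tuned minus pre-trained weights) together with a 0/1 mask marking its localized region. Averaging the changes wherever two masks select the same position is an assumption of this sketch, as are the function name `stitch` and the toy values.

```python
def stitch(pretrained, task_vectors, masks):
    """Add each task's masked change to the pre-trained weights.

    Where several masks select the same position, the changes are
    averaged (an assumption of this sketch).
    """
    merged = list(pretrained)
    for i in range(len(pretrained)):
        selected = [tau[i] for tau, m in zip(task_vectors, masks) if m[i]]
        if selected:
            merged[i] += sum(selected) / len(selected)
    return merged


pretrained = [1.0, 1.0, 1.0, 1.0]
tau_a = [0.5, 0.125, 0.5, 0.0]    # task A's change per parameter
tau_b = [0.25, -0.5, 0.25, 0.0]   # task B's change per parameter
mask_a = [1, 0, 1, 0]             # task A's localized region
mask_b = [0, 1, 1, 0]             # task B's localized region

merged = stitch(pretrained, [tau_a, tau_b], [mask_a, mask_b])
print(merged)  # [1.5, 0.5, 1.375, 1.0]
```

Position 3 is selected by neither mask and stays at its pre-trained value, while position 2 is selected by both and receives the average of the two changes. A global merge would instead add every entry of both task vectors, including the small unmasked changes at positions 0 and 1.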

The authors evaluate their Localize-and-Stitch approach on a variety of vision and language benchmarks. They report that their method outperforms existing model merging methods under different data availability scenarios, while keeping storage and computational overhead low.

Critical Analysis

One potential limitation of the Localize-and-Stitch approach is that it assumes each task's skill can be localized to a small region and that the regions from different fine-tuned models combine cleanly. If the models were fine-tuned on very different tasks, their task-relevant regions may relate to one another differently, for example overlapping less or conflicting where they do overlap, which could affect how well the merging process works.
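One way to probe this concern is to measure how much two tasks' localized regions overlap. The sketch below uses the Jaccard index (shared positions divided by positions selected by either mask); both the choice of metric and the helper name `jaccard_overlap` are assumptions for illustration and are not taken from the paper.

```python
def jaccard_overlap(mask_a, mask_b):
    """Fraction of selected positions that both 0/1 masks share."""
    both = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    either = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return both / either if either else 0.0


mask_a = [1, 0, 1, 0, 0, 1]
mask_b = [0, 1, 1, 0, 0, 1]

# Positions 2 and 5 are shared; positions 0, 1, 2 and 5 are selected overall.
print(jaccard_overlap(mask_a, mask_b))  # 0.5
```

A value near 0 means the tasks use almost disjoint parameters, so their regions can be stitched with little conflict; a value near 1 means they compete for the same parameters.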

Additionally, merging can cause negative transfer, where combining models degrades performance on certain tasks. The abstract presents this kind of task interference as the main weakness of global merging and as the motivation for merging in a localized way, but it would be valuable to explore how much interference remains when the localized regions of many tasks overlap.

Despite these potential limitations, the Localize-and-Stitch approach represents an interesting and promising direction for efficient model merging. By focusing on the task-relevant parameters, the authors have developed a method that can effectively combine fine-tuned models while minimizing computational overhead.

Conclusion

The Localize-and-Stitch paper presents a novel approach for efficiently merging fine-tuned machine learning models. By identifying the task-relevant parameters and stitching only those back into the pre-trained model, the authors demonstrate a more efficient and effective model merging process, with applications in a variety of real-world scenarios.

While the approach has some potential limitations, the core ideas behind Localize-and-Stitch represent an important step forward in the field of model compression and transfer learning. As machine learning models continue to grow in complexity and capability, techniques like this will become increasingly valuable for streamlining model deployment and maintenance.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Localize-and-Stitch: Efficient Model Merging via Sparse Task Arithmetic

Yifei He, Yuzheng Hu, Yong Lin, Tong Zhang, Han Zhao

Model merging offers an effective strategy to combine the strengths of multiple finetuned models into a unified model that preserves the specialized capabilities of each. Existing methods merge models in a global manner, performing arithmetic operations across all model parameters. However, such global merging often leads to task interference, degrading the performance of the merged model. In this work, we introduce Localize-and-Stitch, a novel approach that merges models in a localized way. Our algorithm works in two steps: i) Localization: identify tiny (1% of the total parameters) localized regions in the finetuned models containing essential skills for the downstream tasks, and ii) Stitching: reintegrate only these essential regions back into the pretrained model for task synergy. We demonstrate that our approach effectively locates sparse regions responsible for finetuned performance, and the localized regions could be treated as compact and interpretable representations of the finetuned models (tasks). Empirically, we evaluate our method on various vision and language benchmarks, showing that it outperforms existing model merging methods under different data availability scenarios. Beyond strong empirical performance, our algorithm also facilitates model compression and preserves pretrained knowledge, enabling flexible and continual skill composition from multiple finetuned models with minimal storage and computational overhead. Our code is available at https://github.com/yifei-he/Localize-and-Stitch.

8/27/2024

Localizing Task Information for Improved Model Merging and Compression

Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jimenez, François Fleuret, Pascal Frossard

Model merging and task arithmetic have emerged as promising scalable approaches to merge multiple single-task checkpoints to one multi-task model, but their applicability is reduced by significant performance loss. Previous works have linked these drops to interference in the weight space and erasure of important task-specific features. Instead, in this work we show that the information required to solve each task is still preserved after merging as different tasks mostly use non-overlapping sets of weights. We propose TALL-masks, a method to identify these task supports given a collection of task vectors and show that one can retrieve >99% of the single task accuracy by applying our masks to the multi-task vector, effectively compressing the individual checkpoints. We study the statistics of intersections among constructed masks and reveal the existence of selfish and catastrophic weights, i.e., parameters that are important exclusively to one task and irrelevant to all tasks but detrimental to multi-task fusion. For this reason, we propose Consensus Merging, an algorithm that eliminates such weights and improves the general performance of existing model merging approaches. Our experiments in vision and NLP benchmarks with up to 20 tasks, show that Consensus Merging consistently improves existing approaches. Furthermore, our proposed compression scheme reduces storage from 57Gb to 8.2Gb while retaining 99.7% of original performance.

5/14/2024

Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging

Zhenyi Lu, Chenghao Fan, Wei Wei, Xiaoye Qu, Dangyang Chen, Yu Cheng

In the era of large language models, model merging is a promising way to combine multiple task-specific models into a single multitask model without extra training. However, two challenges remain: (a) interference between different models and (b) heterogeneous data during testing. Traditional model merging methods often show significant performance gaps compared to fine-tuned models due to these issues. Additionally, a one-size-fits-all model lacks flexibility for diverse test data, leading to performance degradation. We show that both shared and exclusive task-specific knowledge are crucial for merging performance, but directly merging exclusive knowledge hinders overall performance. In view of this, we propose Twin-Merging, a method that encompasses two principal stages: (1) modularizing knowledge into shared and exclusive components, with compression to reduce redundancy and enhance efficiency; (2) dynamically merging shared and task-specific knowledge based on the input. This approach narrows the performance gap between merged and fine-tuned models and improves adaptability to heterogeneous data. Extensive experiments on 12 datasets for both discriminative and generative tasks demonstrate the effectiveness of our method, showing an average improvement of 28.34% in absolute normalized score for discriminative tasks and even surpassing the fine-tuned upper bound on the generative tasks. (Our implementation is available in https://github.com/LZY-the-boys/Twin-Mergin.)

6/26/2024

Efficient Stitchable Task Adaptation

Haoyu He, Zizheng Pan, Jing Liu, Jianfei Cai, Bohan Zhuang

The paradigm of pre-training and fine-tuning has laid the foundation for deploying deep learning models. However, most fine-tuning methods are designed to meet a specific resource budget. Recently, considering diverse deployment scenarios with various resource budgets, SN-Net is introduced to quickly obtain numerous new networks (stitches) from the pre-trained models (anchors) in a model family via model stitching. Although promising, SN-Net confronts new challenges when adapting it to new target domains, including huge memory and storage requirements and a long and sub-optimal multistage adaptation process. In this work, we present a novel framework, Efficient Stitchable Task Adaptation (ESTA), to efficiently produce a palette of fine-tuned models that adhere to diverse resource constraints. Specifically, we first tailor parameter-efficient fine-tuning to share low-rank updates among the stitches while maintaining independent bias terms. In this way, we largely reduce fine-tuning memory burdens and mitigate the interference among stitches that arises in task adaptation. Furthermore, we streamline a simple yet effective one-stage deployment pipeline, which estimates the important stitches to deploy with training-time gradient statistics. By assigning higher sampling probabilities to important stitches, we also get a boosted Pareto frontier. Extensive experiments on 25 downstream visual recognition tasks demonstrate that our ESTA is capable of generating stitches with smooth accuracy-efficiency trade-offs and surpasses the direct SN-Net adaptation by remarkable margins with significantly lower training time and fewer trainable parameters. Furthermore, we demonstrate the flexibility and scalability of our ESTA framework by stitching LLMs from LLaMA family, obtaining chatbot stitches of assorted sizes. Source code is available at https://github.com/ziplab/Stitched_LLaMA

7/10/2024