CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models

Read original: arXiv:2407.17467 - Published 7/25/2024 by Jiawei Gu, Zacc Yang, Chuanghao Ding, Rui Zhao, Fei Tan

Overview

The paper proposes the "CMR Scaling Law," which predicts the critical mixture ratio (CMR) for continual pre-training (CPT) of language models.
The law is meant to replace heuristic choices of how much general data and how much domain-specific data to mix during CPT, so that the model gains domain knowledge without losing general ability.
The authors report extensive experiments showing that the CMR is predictable and that the scaling law generalizes.

Plain English Explanation

The paper introduces a mathematical relationship, the "CMR Scaling Law," that predicts the critical mixture ratio (CMR) for continual pre-training of language models. The CMR is the balance point between general data, which is replayed to preserve what the model already knows, and domain-specific data, which teaches it something new. It is the mix that achieves the desired domain transfer while keeping the model's general ability intact.

A language model trained on a large general dataset becomes very capable at understanding and generating human-like text. If it is then trained further on new, specialized data alone, it can lose some of its original general ability, a problem known as catastrophic forgetting. The CMR Scaling Law aims to help researchers and engineers choose the right mix of general and domain-specific data when they keep training the model, so that general performance is maintained while domain performance improves.

The researchers tested their scaling law on several models and datasets and report that it predicted the CMR accurately in those settings. If that holds more broadly, the CMR Scaling Law could be a useful tool for adapting language models to new domains efficiently, without losing the knowledge they have already acquired.

Technical Explanation

The paper proposes the "CMR Scaling Law" to predict the critical mixture ratio (CMR) for continual pre-training (CPT) of language models. The authors formalize the trade-off between general and domain-specific capabilities, and the CMR is the mixture ratio of general and domain data that strikes this balance: it maintains the model's general ability while achieving the desired domain transfer.
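The trade-off described above can be sketched in code. This is a minimal illustration, assuming the CMR is read as the largest domain-data share whose general loss stays within a chosen tolerance of the pre-CPT baseline. The function name, the tolerance value, and the loss measurements are all hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def critical_mixture_ratio(
    domain_ratios: Sequence[float],
    general_losses: Sequence[float],
    baseline_general_loss: float,
    tolerance: float,
) -> Optional[float]:
    """Return the largest domain-data ratio whose general loss stays within
    `tolerance` of the baseline, or None if no ratio qualifies.

    Assumption: "maintaining general ability" is modeled as a bounded
    increase in general-corpus loss. This is an illustrative reading only.
    """
    feasible = [
        ratio
        for ratio, loss in zip(domain_ratios, general_losses)
        if loss - baseline_general_loss <= tolerance
    ]
    return max(feasible) if feasible else None


# Hypothetical measurements: share of domain data in the mix, and the
# general-corpus validation loss observed after training with that mix.
ratios = [0.1, 0.25, 0.5, 0.75, 0.9]
gen_loss = [2.50, 2.51, 2.54, 2.62, 2.80]

cmr = critical_mixture_ratio(
    ratios, gen_loss, baseline_general_loss=2.50, tolerance=0.05
)
print(cmr)
```

A grid like this requires one training run per candidate ratio, which is the cost a predictive scaling law is meant to avoid.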

The researchers revisit the scaling behavior of LLMs under CPT, building on ideas from data-mixing scaling laws. Through extensive experiments they show that the CMR is predictable and that the resulting scaling law generalizes, which makes it possible to choose a mixture ratio without heuristic trial and error when adapting a model to a new domain.

The key insight is a power-law relationship between loss, mixture ratio, and the scale of training tokens. By fitting this relationship, the CMR Scaling Law can be used to determine the mixture of general and domain-specific data for continued pre-training, without catastrophic forgetting of the model's existing capabilities.
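To show what fitting a power law involves, here is a minimal sketch that fits loss against training tokens at a single fixed mixture ratio. It assumes the simplest form, loss = a * tokens^(-alpha), with no irreducible-loss term and no mixture-ratio term. This is not the paper's functional form, which the summary does not give, and the data points are synthetic.

```python
import numpy as np


def fit_power_law(tokens, losses):
    """Fit loss ~ a * tokens**(-alpha) by least squares in log-log space.

    Taking logs gives log(loss) = log(a) - alpha * log(tokens), a straight
    line whose slope and intercept np.polyfit can estimate.
    """
    slope, intercept = np.polyfit(np.log(tokens), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic losses generated from a known power law (illustrative only).
tokens = np.array([1e8, 2e8, 5e8, 1e9, 2e9])
true_a, true_alpha = 30.0, 0.12
losses = true_a * tokens ** (-true_alpha)

a, alpha = fit_power_law(tokens, losses)

# Extrapolate to a larger token budget than any used in the fit.
predicted_loss = a * (5e9) ** (-alpha)
print(f"a={a:.3f}, alpha={alpha:.3f}, predicted loss at 5e9 tokens={predicted_loss:.3f}")
```

The appeal of such a fit is extrapolation: a handful of small runs gives parameters that predict loss at larger scale. Handling mixture ratio as a second variable would need a richer functional form and a nonlinear fit.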

Critical Analysis

The CMR Scaling Law presented in this paper provides a principled, data-driven approach to managing the trade-off between learning from new domain data and retaining general knowledge when continually pre-training language models. This is an important practical challenge in continual learning and domain adaptation.

However, the paper does not address several potential limitations and areas for further research. For example, the scaling law is evaluated only on textual datasets, and it's unclear whether it would generalize to other modalities like images or speech. Additionally, the experiments focus on relatively small-scale datasets and models, so the validity of the scaling law for truly massive language models and datasets remains to be seen.

Another open question is how the CMR Scaling Law would perform in more complex fine-tuning scenarios, such as when the new data comes from a significantly different distribution than the original pre-training data. The paper's findings may not hold up in such cases, and further research would be needed to understand the limits of the approach.

Overall, the CMR Scaling Law is a promising contribution, but additional work is required to fully understand its capabilities and limitations in real-world continual learning applications for language models.

Conclusion

The CMR Scaling Law introduced in this paper provides a novel way to predict the critical mixture ratio for continual pre-training of language models. By determining the balance between domain-specific data and replayed general data, the scaling law can help improve the efficiency and effectiveness of adapting language models to new domains.

The empirical validation of the scaling law across different models and datasets suggests it could be a valuable tool for the machine learning community. Further research is needed to explore its generalization to other modalities and more complex fine-tuning scenarios, but this work represents an important step forward in addressing the challenge of continual learning for large-scale language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models

Jiawei Gu, Zacc Yang, Chuanghao Ding, Rui Zhao, Fei Tan

Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary corpus. Continual pre-training (CPT) enhances LLM capabilities by imbuing new domain-specific or proprietary knowledge while replaying general corpus to prevent catastrophic forgetting. The data mixture ratio of general corpus and domain-specific corpus, however, has been chosen heuristically, leading to sub-optimal training efficiency in practice. In this context, we attempt to re-visit the scaling behavior of LLMs under the hood of CPT, and discover a power-law relationship between loss, mixture ratio, and training tokens scale. We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data. By striking the balance, CMR maintains the model's general ability and achieves the desired domain transfer, ensuring the highest utilization of available resources. Therefore, if we value the balance between efficiency and effectiveness, CMR can be considered as the optimal mixture ratio. Through extensive experiments, we ascertain the predictability of CMR, and propose CMR scaling law and have substantiated its generalization. These findings offer practical guidelines for optimizing LLM training in specialized domains, ensuring both general and domain-specific performance while efficiently managing training resources.

7/25/2024

D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, Xu Tan, Jie Fu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng

Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For the CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general-corpus (e.g., Dolma, Slim-pajama) and the downstream domain-corpus. Existing methods usually adopt laborious human efforts by grid-searching on a set of mixture ratios, which require high GPU training consumption costs. Besides, we cannot guarantee the selected ratio is optimal for the specific domain. To address the limitations of existing methods, inspired by the Scaling Law for performance prediction, we propose to investigate the Scaling Law of the Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes using small-scale training costs on limited experiments. Moreover, we also extend our standard D-CPT Law on cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT law of target domains, where very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of our proposed D-CPT Law and Cross-Domain D-CPT Law.

6/4/2024

A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Peng Yu, Jinxian Qu, Chenxi Liu, Zhonglin Jiang, Yong Chen, Luo Ji

Large Language Models (LLMs) often need to be Continual Pre-Trained (CPT) to obtain unfamiliar language skills or adapt to new domains. The huge training cost of CPT often calls for a cautious choice of key hyper-parameters such as the mixture ratio of extra language or domain corpus. However, there is no systematic study which bridges the gap between the optimal mixture ratio and the actual model performance, and the gap between experimental scaling law and the actual deployment in the full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance its Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) on the 8B size, which directly indicates the optimal experimental setup. By thorough choice of hyper-parameters, and subsequent fine-tuning, the model capability is improved not only on the Chinese-related benchmark, but also in some specific domains including math, coding and emotional intelligence. We deploy the final 70B version of the LLM on a real-life chat system, which obtains satisfying performance.

9/11/2024

Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining

Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding

Large language models exhibit exceptional generalization capabilities, primarily attributed to the utilization of diversely sourced data. However, conventional practices in integrating this diverse data heavily rely on heuristic schemes, lacking theoretical guidance. This research tackles these limitations by investigating strategies based on low-cost proxies for data mixtures, with the aim of streamlining data curation to enhance training efficiency. Specifically, we propose a unified scaling law, termed BiMix, which accurately models the bivariate scaling behaviors of both data quantity and mixing proportions. We conduct systematic experiments and provide empirical evidence for the predictive power and fundamental principles of BiMix. Notably, our findings reveal that entropy-driven training-free data mixtures can achieve comparable or even better performance than more resource-intensive methods. We hope that our quantitative insights can shed light on further judicious research and development in cost-effective language modeling.

7/12/2024