Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training

Read original: arXiv:2406.18820 - Published 7/1/2024 by Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang

Overview

This paper presents a novel approach called "Universal Checkpointing" to efficiently and flexibly checkpoint large-scale distributed training for machine learning models.
Checkpointing is the process of saving the current state of a training process so that it can be resumed from that point, which is crucial for long-running, large-scale training.
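The proposed method aims to address the limitations of existing checkpointing techniques: consolidating sharded model state into a single checkpoint is slow and impractical at scale, while distributed checkpoints can only be reloaded on the parallelism and hardware configuration that produced them.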

Plain English Explanation

Training a large machine learning model can take a very long time, and the run is often interrupted along the way, for example by a system failure. To recover from such interruptions, researchers use a technique called "checkpointing," which saves the current state of the training process so that it can be restarted from that point later.
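
To make the idea concrete, here is a minimal Python sketch of checkpoint-and-resume for a toy training loop. It is not taken from the paper: the function names (save_checkpoint, load_checkpoint, train), the JSON file layout, and the toy weight update are all assumptions chosen for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temporary file, then rename it, so that a crash
    mid-write cannot leave a half-written checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, stop_at=None, every=2):
    """Toy training loop that checkpoints every `every` steps.
    `stop_at` simulates an interruption at that step (hypothetical knob)."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] == stop_at:
            return state  # simulated failure: unsaved progress is lost
        state["weight"] += 0.5 * (state["step"] + 1)  # stand-in for a real update
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "ckpt.json")
    train(path, 8, stop_at=5)      # interrupted at step 5; last save was step 4
    resumed = train(path, 8)       # restarts from the step-4 checkpoint
    baseline = train(os.path.join(d, "baseline.json"), 8)  # uninterrupted run
```

In this sketch the interrupted-then-resumed run ends in the same state as the uninterrupted one, at the cost of redoing the single step that had not yet been saved.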

However, existing checkpointing techniques have several drawbacks. Gathering the state of a model that is spread across many accelerators into a single checkpoint is slow, and becomes impractical as the training scale increases. Saving each accelerator's piece separately avoids that cost, but the resulting checkpoint is tied to the exact hardware and parallelism setup of the original run, so it cannot be loaded on a different setup.

The paper introduces a new approach called "Universal Checkpointing" that aims to address these limitations. The key idea is to save checkpoints in the distributed form and convert them into a consolidated form for loading, one that can be mapped onto any setup. This could help researchers and engineers continue training on whatever healthy hardware remains after a failure, or take advantage of extra capacity when it becomes available.

Technical Explanation

The paper targets training runs in which model state is sharded across many accelerators. Its abstract identifies two problems with existing approaches: consolidating the distributed state into a single checkpoint slows training unacceptably and is impractical at extreme scales, while distributed checkpoints are tightly coupled to the parallelism strategy and hardware configuration of the run that produced them. The key technical contributions include:

Phase-Specific Representation: The system uses a distributed representation for saving and a consolidated representation for loading, choosing the representation best suited to each phase of the checkpointing life cycle.
Universal Checkpoint Format: Each model parameter is stored in consolidated form, together with metadata for mapping parameter fragments onto the training ranks of an arbitrary model-parallelism configuration.
Universal Checkpoint Language: A simple specification language describes how to convert distributed checkpoints into the universal checkpoint format.
Flexible Resumption: A run can resume on a different parallelism strategy or hardware configuration, which allows training to continue on the remaining healthy hardware after a failure and to exploit elastic capacity when it becomes available.
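
The save-distributed, load-consolidated idea described in the paper's abstract can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the paper's implementation: parameters are flat lists split into contiguous fragments, the helper names (shard, consolidate, load_for_rank) are hypothetical, and the fixed contiguous split stands in for the mapping metadata that the real format stores.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def shard(values: List[float], world_size: int) -> List[List[float]]:
    """Split one flat parameter into contiguous, near-equal fragments,
    one per rank (earlier ranks absorb any remainder)."""
    base, extra = divmod(len(values), world_size)
    fragments, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        fragments.append(values[start:start + size])
        start += size
    return fragments


def consolidate(rank_ckpts: List[Params]) -> Params:
    """Merge per-rank fragments, in rank order, into one full copy
    of each parameter (the consolidated representation)."""
    return {
        name: [x for ckpt in rank_ckpts for x in ckpt[name]]
        for name in rank_ckpts[0]
    }


def load_for_rank(universal: Params, rank: int, world_size: int) -> Params:
    """Cut out the fragment of every parameter that `rank` owns
    under the new world size."""
    return {
        name: shard(values, world_size)[rank]
        for name, values in universal.items()
    }


# A toy model: two flat parameters.
params: Params = {"w": [float(i) for i in range(10)], "b": [0.5, 1.5]}

# Save phase: each of 4 ranks writes only its own fragments.
saved = [
    {name: shard(values, 4)[rank] for name, values in params.items()}
    for rank in range(4)
]

# Conversion: build the consolidated representation once, offline.
universal = consolidate(saved)

# Load phase: resume on 3 ranks instead of 4.
resumed = [load_for_rank(universal, rank, 3) for rank in range(3)]
```

Because loading starts from the consolidated copy, the number of ranks at resume time (3 here) does not have to match the number used for saving (4).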

The paper evaluates the proposed system on state-of-the-art model architectures and a wide range of parallelism techniques, and reports that the approach is both effective and general.

Critical Analysis

The paper presents a well-designed and thorough evaluation of the proposed Universal Checkpointing system, comparing it to state-of-the-art approaches. However, there are a few potential areas for further research and improvement:

Heterogeneous Environments: The paper focuses on homogeneous distributed training environments, but many real-world scenarios involve a mix of hardware and software resources. It would be interesting to see how the system performs in more heterogeneous settings, such as those explored in HetHub.
Data Consistency: The paper does not delve deeply into the implications of checkpointing on data consistency, which is an important consideration for long-running, large-scale training. Techniques like those explored in Training through Failure could be relevant.
Compression Techniques: The paper does not provide a detailed comparison with advanced checkpoint compression methods, such as those used in EXCP. Incorporating such techniques could further improve the efficiency of the checkpointing system.

Overall, the Universal Checkpointing system presented in this paper represents a significant advancement in the field of large-scale distributed training, with a strong focus on flexibility, efficiency, and scalability. The critical analysis highlights areas for potential future research to further improve and expand the capabilities of the system.

Conclusion

The "Universal Checkpointing" system introduced in this paper offers a novel approach to efficiently and flexibly checkpoint large-scale distributed training for machine learning models. By addressing the limitations of existing checkpointing techniques, the proposed system can help researchers and engineers more reliably train complex models without having to worry as much about interruptions or system failures.

The key technical contributions include the universal checkpoint format, the universal checkpoint language, and the choice of a distributed representation for saving and a consolidated representation for loading. The evaluation demonstrates the effectiveness and generality of the approach on state-of-the-art model architectures and a wide range of parallelism techniques.

While the paper presents a robust and well-designed system, the critical analysis suggests opportunities for further research, such as improving the system's performance in heterogeneous environments, addressing data consistency concerns, and incorporating more advanced compression techniques. Addressing these areas could further enhance the capabilities and real-world applicability of the Universal Checkpointing system.

Overall, this paper represents an important step forward in the field of large-scale distributed training, and the ideas and techniques presented here could have a lasting impact on the way researchers and practitioners approach the challenges of training complex machine learning models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training

Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang

Existing checkpointing approaches seem ill-suited for distributed training even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training, and is impractical at extreme scales. Distributed checkpoints, in contrast, are tightly coupled to the model parallelism and hardware configurations of the training run, and thus unusable on different configurations. To address this problem, we propose Universal Checkpointing, a technique that enables efficient checkpoint creation while providing the flexibility of resuming on arbitrary parallelism strategy and hardware configurations. Universal Checkpointing unlocks unprecedented capabilities for large-scale training such as improved resilience to hardware failures through continued training on remaining healthy hardware, and reduced training time through opportunistic exploitation of elastic capacity. The key insight of Universal Checkpointing is the selection of the optimal representation in each phase of the checkpointing life cycle: distributed representation for saving, and consolidated representation for loading. This is achieved using two key mechanisms. First, the universal checkpoint format, which consists of a consolidated representation of each model parameter and metadata for mapping parameter fragments into training ranks of arbitrary model-parallelism configuration. Second, the universal checkpoint language, a simple but powerful specification language for converting distributed checkpoints into the universal checkpoint format. Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures and a wide range of parallelism techniques.

7/1/2024

ByteCheckpoint: A Unified Checkpointing System for LLM Development

Borui Wan, Mingji Han, Yiyao Sheng, Zhichao Lai, Mofan Zhang, Junda Zhang, Yanghua Peng, Haibin Lin, Xin Liu, Chuan Wu

The development of real-world Large Language Models (LLMs) necessitates checkpointing of training states in persistent storage to mitigate potential software and hardware failures, as well as to facilitate checkpoint transferring within the training pipeline and across various tasks. Due to the immense size of LLMs, saving and loading checkpoints often incur intolerable minute-level stalls, significantly diminishing training efficiency. Besides, when transferring checkpoints across tasks, checkpoint resharding, defined as loading checkpoints into parallel configurations differing from those used for saving, is often required according to the characteristics and resource quota of specific tasks. Previous checkpointing systems [16,3,33,6] assume consistent parallel configurations, failing to address the complexities of checkpoint transformation during resharding. Furthermore, in the industry platform, developers create checkpoints from different training frameworks [23,36,21,11], each with its own unique storage and I/O logic. This diversity complicates the implementation of unified checkpoint management and optimization. To address these challenges, we introduce ByteCheckpoint, a PyTorch-native multi-framework LLM checkpointing system that supports automatic online checkpoint resharding. ByteCheckpoint employs a data/metadata disaggregated storage architecture, decoupling checkpoint storage from the adopted parallelism strategies and training frameworks. We design an efficient asynchronous tensor merging technique to settle the irregular tensor sharding problem and propose several I/O performance optimizations to significantly enhance the efficiency of checkpoint saving and loading. Experimental results demonstrate ByteCheckpoint's substantial advantages in reducing checkpoint saving (by up to 529.22X) and loading (by up to 3.51X) costs, compared to baseline methods.

7/30/2024

Towards Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing

Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu

To efficiently scale large model (LM) training, researchers transition from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory checkpointing optimizations that snapshot parameters to device memory for rapid failure recovery. However, these methods introduce severe resource competition between checkpointing and training, which can work under DP but can hardly scale under resource-intensive HP. To ensure low checkpointing overhead for hybrid-parallel training, this paper introduces a distributed in-memory checkpointing system with near-zero in-memory saving overhead. It strives from two aspects to mitigate the on-host resource competition caused by in-memory checkpointing: (1) It introduces Hierarchical Asynchronous Snapshotting Coordination in the checkpoint saving stage. This approach uses three-level asynchronous on-device scheduling to enhance parallelism between snapshotting and training, thereby minimizing snapshotting overhead. (2) It proposes Hybrid In-memory Checkpoint Protection to enhance checkpoint completeness during hardware failures. Unlike methods that require inter-node communications, which may block training under HP, it creates intra-node redundancy with efficient resource utilization, protecting training against hardware failures with minimal overhead. With these methods, this work enables fast restart for failed HP training with Distributed In-memory Checkpoint Loading, bypassing inefficiencies in NFS reads. In our evaluation, we achieve zero in-memory checkpoint saving overhead on Frontier while training Llama-2-34B on 256 MI250X devices (512 GPUs).

8/20/2024

DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., failures of components, instability of the software, undesirable learning patterns, etc.), are frequent and typically impact the training in a negative fashion. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system), incurs significant I/O overheads. To address this challenge, in this paper we study how to reduce the I/O overheads for enabling fast and scalable checkpointing for LLMs that can be applied at high frequency (up to the granularity of individual iterations) without significant impact on the training process. Specifically, we introduce a lazy asynchronous multi-level approach that takes advantage of the fact that the tensors making up the model and optimizer state shards remain immutable for extended periods of time, which makes it possible to copy their content in the background with minimal interference during the training process. We evaluate our approach at scales of up to 180 GPUs using different model sizes, parallelism settings, and checkpointing frequencies. The results show up to 48× faster checkpointing and 2.2× faster end-to-end training runtime compared with the state-of-art checkpointing approaches.

6/18/2024