Towards Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing

Read original: arXiv:2310.12670 - Published 8/20/2024 by Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu

Overview

The paper presents a reliable and efficient in-memory fault tolerance solution for large language model pretraining.
It introduces a novel "3D-Parallel" pretraining approach that can tolerate hardware failures during training.
The solution leverages a combination of fast in-memory checkpointing, dynamic data remapping, and redundant computations to provide fault tolerance with minimal performance overhead.

Plain English Explanation

The paper describes a way to make the training of large language models, such as GPT-3, more reliable and efficient. Training these models can take weeks or months on powerful computing hardware. Over such a long run, hardware can fail, and without a recent checkpoint the progress made so far is lost and training has to start over from the beginning.

The researchers have developed a new approach called "3D-Parallel" pretraining that can tolerate these hardware failures. The key ideas are:

Fast In-Memory Checkpointing: Instead of saving the training progress to disk, which is slow, the system keeps checkpoints of the training in fast computer memory. This allows the training to be resumed quickly if a failure occurs.
Dynamic Data Remapping: If a piece of hardware fails, the system can dynamically remap the data and computations to other available hardware, allowing the training to continue without interruption.
Redundant Computations: The system performs some extra computations in parallel to provide backup in case of failures. This adds a small amount of overhead but greatly improves reliability.

By combining these techniques, the researchers built a pretraining system that is reliable, in that it tolerates hardware failures, and efficient, in that it adds minimal performance overhead compared to training without fault tolerance.

Technical Explanation

The paper introduces a novel "3D-Parallel" pretraining approach that can tolerate hardware failures during the training of large language models. The key components of this approach are:

Fast In-Memory Checkpointing: The system uses an efficient in-memory checkpointing mechanism to capture the state of the training process at regular intervals. This allows the training to be resumed quickly from the last checkpoint in the event of a hardware failure, without the need to restart from the beginning.
Dynamic Data Remapping: If a piece of hardware fails during training, the system can dynamically remap the data and computations to other available hardware resources. This allows the training to continue without interruption, rather than having to start over.
Redundant Computations: To provide additional fault tolerance, the system performs some redundant computations in parallel. This adds a small amount of overhead but greatly improves the reliability of the training process, as the redundant computations can be used to recover from failures.
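The checkpointing idea in the list above can be illustrated with a minimal sketch. It assumes a toy training loop in plain Python: `InMemoryCheckpointer`, `train_step`, `train`, and the `fail_at` parameter are hypothetical names introduced here for illustration and are not the paper's system or API. The sketch copies the training state into host memory on a background thread, so the training loop keeps running while the snapshot is taken, and it rolls back to the last completed snapshot when a simulated failure occurs.

```python
import copy
import threading


class InMemoryCheckpointer:
    """Keeps the latest training snapshot in host memory (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = None   # (step, state) of the last completed snapshot
        self._worker = None   # background thread of the snapshot in flight

    def save_async(self, step, state):
        """Start copying `state` in the background and return immediately."""
        self.wait()  # allow at most one snapshot in flight
        self._worker = threading.Thread(target=self._copy, args=(step, state))
        self._worker.start()

    def _copy(self, step, state):
        # deepcopy stands in for a device-to-host memory copy.
        snapshot = copy.deepcopy(state)
        with self._lock:
            self._latest = (step, snapshot)

    def wait(self):
        """Block until the snapshot in flight (if any) has completed."""
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    def load(self):
        """Return (step, state) of the last completed snapshot."""
        self.wait()
        with self._lock:
            if self._latest is None:
                raise RuntimeError("no snapshot available")
            step, snapshot = self._latest
            return step, copy.deepcopy(snapshot)


def train_step(state):
    """Toy update. It returns a NEW state and never mutates the old one,
    so a background copy of the old state stays consistent (assumption)."""
    return {"weights": [w + 1.0 for w in state["weights"]]}


def train(total_steps, checkpoint_every, fail_at=None):
    """Run the toy loop; optionally simulate one failure before step `fail_at`."""
    ckpt = InMemoryCheckpointer()
    state = {"weights": [0.0, 0.0]}
    ckpt.save_async(0, state)
    step, executed, failed = 0, 0, False
    while step < total_steps:
        if fail_at is not None and step == fail_at and not failed:
            failed = True
            step, state = ckpt.load()  # roll back to the last snapshot
            continue
        state = train_step(state)
        step += 1
        executed += 1
        if step % checkpoint_every == 0:
            ckpt.save_async(step, state)
    ckpt.wait()
    return state, executed


if __name__ == "__main__":
    print(train(10, 3))             # run without failures
    print(train(10, 3, fail_at=8))  # run with one simulated failure
```

With a snapshot every 3 steps and a failure injected at step 8, the loop resumes from the step-6 snapshot and repeats only two steps instead of restarting from step 0. A real system would also have to handle state that is updated in place on accelerators, which this sketch sidesteps by having `train_step` return a fresh object.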

The combination of these techniques allows the 3D-Parallel pretraining approach to achieve reliable and efficient large language model training, with minimal performance impact compared to training without fault tolerance.

Critical Analysis

The paper presents a well-designed and comprehensive solution for providing fault tolerance in large language model pretraining. The authors have identified a critical issue in the field and developed a novel approach to address it.

One potential limitation of the 3D-Parallel approach is the additional overhead introduced by the redundant computations. While the authors claim this overhead is minimal, it would be important to understand the exact performance impact in different scenarios, especially for very large and resource-intensive models.
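One way to reason about this overhead question is a standard first-order checkpointing model, which is not taken from the paper. If saving a checkpoint costs C seconds, checkpoints are taken every T seconds, and failures arrive on average every M seconds (the mean time between failures), the fraction of time lost is roughly C/T for saving plus T/(2M) for recomputing work after a failure. The interval that minimizes this estimate is Young's approximation, T = sqrt(2CM). The sketch below assumes C is much smaller than T and T much smaller than M, ignores restart time, and uses made-up illustrative numbers; the function names are hypothetical.

```python
import math


def overhead_fraction(save_cost_s, interval_s, mtbf_s):
    """First-order estimate of the fraction of time lost to checkpointing.

    save_cost_s / interval_s  : time spent saving checkpoints
    interval_s / (2 * mtbf_s) : expected recomputation after a failure
    """
    return save_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


def young_interval(save_cost_s, mtbf_s):
    """Young's approximation of the interval that minimizes the estimate."""
    return math.sqrt(2.0 * save_cost_s * mtbf_s)


if __name__ == "__main__":
    mtbf_s = 6 * 3600  # illustrative assumption: one failure every 6 hours
    # Illustrative save costs only; they are not measurements from the paper.
    for label, save_cost_s in [("slow save (60 s)", 60.0), ("fast save (1 s)", 1.0)]:
        interval_s = young_interval(save_cost_s, mtbf_s)
        lost = overhead_fraction(save_cost_s, interval_s, mtbf_s)
        print(f"{label}: interval {interval_s:.0f} s, estimated overhead {lost:.2%}")
```

Under this model a cheaper save lowers the estimated overhead in two ways: each save costs less, and checkpoints can be taken more often, so less work is repeated after a failure.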

Additionally, it remains unclear how the approach scales as the size of the language model or the training dataset increases. The abstract reports an evaluation with Llama-2-34B on 256 MI250X devices (512 GPUs) on Frontier; it would be valuable to understand how the fault tolerance mechanisms scale beyond that configuration and whether there are any practical limits to the size of models that can be trained using this approach.

Finally, the paper does not discuss the potential impact of hardware failures on the final model quality or performance. It would be interesting to understand whether the fault tolerance mechanisms introduced in this work could have any influence on the final model capabilities, beyond just ensuring the training completes successfully.

Conclusion

The paper presents a reliable and efficient in-memory fault tolerance solution for large language model pretraining, which is a critical issue in the field of large-scale AI model training. The 3D-Parallel approach, with its combination of fast in-memory checkpointing, dynamic data remapping, and redundant computations, provides a comprehensive solution to this problem.

While the paper highlights the potential benefits of this approach, further research is needed to fully understand its scalability and the impact on final model quality. Nonetheless, the techniques introduced in this work represent an important step forward in making large language model training more robust and reliable, which could have significant implications for the development of advanced AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Towards Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing

Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu

To efficiently scale large model (LM) training, researchers transition from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory checkpointing optimizations that snapshot parameters to device memory for rapid failure recovery. However, these methods introduce severe resource competition between checkpointing and training, which can work under DP but can hardly scale under resource-intensive HP. To ensure low checkpointing overhead for hybrid-parallel training, this paper introduces a distributed in-memory checkpointing system with near-zero in-memory saving overhead. It strives from two aspects to mitigate the on-host resource competition caused by in-memory checkpointing: (1) It introduces Hierarchical Asynchronous Snapshotting Coordination in the checkpoint saving stage. This approach uses three-level asynchronous on-device scheduling to enhance parallelism between snapshotting and training, thereby minimizing snapshotting overhead. (2) It proposes Hybrid In-memory Checkpoint Protection to enhance checkpoint completeness during hardware failures. Unlike methods that require inter-node communications, which may block training under HP, it creates intra-node redundancy with efficient resource utilization, protecting training against hardware failures with minimal overhead. With these methods, this work enables fast restart for failed HP training with Distributed In-memory Checkpoint Loading, bypassing inefficiencies in NFS reads. In our evaluation, we achieve zero in-memory checkpoint saving overhead on Frontier while training Llama-2-34B on 256 MI250X devices (512 GPUs).

8/20/2024

Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training

Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang

Existing checkpointing approaches seem ill-suited for distributed training even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training, and is impractical at extreme scales. Distributed checkpoints, in contrast, are tightly coupled to the model parallelism and hardware configurations of the training run, and thus unusable on different configurations. To address this problem, we propose Universal Checkpointing, a technique that enables efficient checkpoint creation while providing the flexibility of resuming on arbitrary parallelism strategy and hardware configurations. Universal Checkpointing unlocks unprecedented capabilities for large-scale training such as improved resilience to hardware failures through continued training on remaining healthy hardware, and reduced training time through opportunistic exploitation of elastic capacity. The key insight of Universal Checkpointing is the selection of the optimal representation in each phase of the checkpointing life cycle: distributed representation for saving, and consolidated representation for loading. This is achieved using two key mechanisms. First, the universal checkpoint format, which consists of a consolidated representation of each model parameter and metadata for mapping parameter fragments into training ranks of arbitrary model-parallelism configuration. Second, the universal checkpoint language, a simple but powerful specification language for converting distributed checkpoints into the universal checkpoint format. Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures and a wide range of parallelism techniques.

7/1/2024

DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., failures of components, instability of the software, undesirable learning patterns, etc.), are frequent and typically impact the training in a negative fashion. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system), incurs significant I/O overheads. To address this challenge, in this paper we study how to reduce the I/O overheads for enabling fast and scalable checkpointing for LLMs that can be applied at high frequency (up to the granularity of individual iterations) without significant impact on the training process. Specifically, we introduce a lazy asynchronous multi-level approach that takes advantage of the fact that the tensors making up the model and optimizer state shards remain immutable for extended periods of time, which makes it possible to copy their content in the background with minimal interference during the training process. We evaluate our approach at scales of up to 180 GPUs using different model sizes, parallelism settings, and checkpointing frequencies. The results show up to 48$\times$ faster checkpointing and 2.2$\times$ faster end-to-end training runtime compared with the state-of-art checkpointing approaches.

6/18/2024

ByteCheckpoint: A Unified Checkpointing System for LLM Development

Borui Wan, Mingji Han, Yiyao Sheng, Zhichao Lai, Mofan Zhang, Junda Zhang, Yanghua Peng, Haibin Lin, Xin Liu, Chuan Wu

The development of real-world Large Language Models (LLMs) necessitates checkpointing of training states in persistent storage to mitigate potential software and hardware failures, as well as to facilitate checkpoint transferring within the training pipeline and across various tasks. Due to the immense size of LLMs, saving and loading checkpoints often incur intolerable minute-level stalls, significantly diminishing training efficiency. Besides, when transferring checkpoints across tasks, checkpoint resharding, defined as loading checkpoints into parallel configurations differing from those used for saving, is often required according to the characteristics and resource quota of specific tasks. Previous checkpointing systems [16,3,33,6] assume consistent parallel configurations, failing to address the complexities of checkpoint transformation during resharding. Furthermore, in the industry platform, developers create checkpoints from different training frameworks [23,36,21,11], each with its own unique storage and I/O logic. This diversity complicates the implementation of unified checkpoint management and optimization. To address these challenges, we introduce ByteCheckpoint, a PyTorch-native multi-framework LLM checkpointing system that supports automatic online checkpoint resharding. ByteCheckpoint employs a data/metadata disaggregated storage architecture, decoupling checkpoint storage from the adopted parallelism strategies and training frameworks. We design an efficient asynchronous tensor merging technique to settle the irregular tensor sharding problem and propose several I/O performance optimizations to significantly enhance the efficiency of checkpoint saving and loading. Experimental results demonstrate ByteCheckpoint's substantial advantages in reducing checkpoint saving (by up to 529.22X) and loading (by up to 3.51X) costs, compared to baseline methods.

7/30/2024