Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training

Read original: arXiv:2404.14632 - Published 4/24/2024 by Muhammad Adnan, Amar Phanishayee, Janardhan Kulkarni, Prashant J. Nair, Divya Mahajan


Overview

Presents a novel technique to search for hardware architectures of accelerators optimized for end-to-end training of deep neural networks (DNNs)
Addresses both single-device and distributed pipeline and tensor model parallel training scenarios
Optimizes accelerators for training metrics like throughput/TDP under fixed area and power constraints
Tackles the large design space exploration problem using a heuristic-based critical path approach

Plain English Explanation

The paper introduces a new method to design specialized hardware, called accelerators, that can efficiently train deep learning models. Deep learning models are complex artificial intelligence algorithms that require a lot of computing power to train.

The researchers' approach can optimize accelerators for different training scenarios, including when the training is done on a single device or distributed across multiple devices. It focuses on maximizing the performance of the accelerator, such as how much data it can process per unit of power, while staying within fixed limits on physical size and power consumption.

Designing the best hardware for training deep learning models is challenging because there are many possible architectural choices. Prior work has tried to simplify this by only looking at one type of deep learning layer or just focusing on running the trained model, not the full training process. In contrast, this paper takes a more comprehensive approach that considers the entire training workflow.

The key innovation is a heuristic-based method that prioritizes allocating the accelerator's resources to the most critical computations in the deep learning model. This allows the system to efficiently explore the vast design space and find high-performing accelerator designs. For distributed training, the method selects the best combination of accelerator designs across the different stages of the training pipeline.

Technical Explanation

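The paper presents a novel search technique for finding hardware accelerator architectures optimized for end-to-end training of deep neural networks. The search is over accelerator designs for a given set of workloads, not over the neural network architectures themselves.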

The approach addresses both single-device and distributed pipeline and tensor model parallel training scenarios. It optimizes the accelerator designs for training-relevant metrics such as throughput per unit of thermal design power (throughput/TDP), under fixed area and power constraints. An efficient search matters here because the design space for specialized DNN training accelerators is extremely large, with many possible architectural choices.
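As a rough illustration of that objective, the sketch below filters candidate designs by area and power limits and ranks the survivors by throughput/TDP. The `Design` record, its field names, and all numbers are hypothetical; in the paper the throughput would come from the authors' own performance estimation, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """Hypothetical candidate accelerator; all values are made up for illustration."""
    name: str
    area_mm2: float     # estimated die area
    tdp_watts: float    # thermal design power
    throughput: float   # training samples per second from some cost model


def is_feasible(d, max_area_mm2, max_tdp_watts):
    # A design is only considered if it fits both fixed budgets.
    return d.area_mm2 <= max_area_mm2 and d.tdp_watts <= max_tdp_watts


def score(d):
    # The optimization target named in the paper: throughput per unit of TDP.
    return d.throughput / d.tdp_watts


def best_design(designs, max_area_mm2, max_tdp_watts):
    feasible = [d for d in designs if is_feasible(d, max_area_mm2, max_tdp_watts)]
    # Returns None when no candidate satisfies the constraints.
    return max(feasible, key=score, default=None)


candidates = [
    Design("A", area_mm2=400.0, tdp_watts=200.0, throughput=1000.0),
    Design("B", area_mm2=380.0, tdp_watts=150.0, throughput=900.0),
    Design("C", area_mm2=600.0, tdp_watts=100.0, throughput=900.0),  # exceeds the area limit
]
winner = best_design(candidates, max_area_mm2=500.0, max_tdp_watts=250.0)
print(winner.name)
```

Design C has the best ratio on paper but is rejected for exceeding the area limit, which is why the constraints are checked before ranking.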

The key innovation is a heuristic-based critical path approach to efficiently explore this large design space. First, it performs a local search to determine the best architecture for each stage of the training pipeline or tensor model parallel computation. This uses a novel technique that prioritizes allocating accelerator resources to the most critical operators in the DNN workload.
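The summary does not spell out the heuristic, so the following is only a toy sketch of the general idea of spending resources on critical operators. It assumes a hypothetical operator graph in which every operator runs on either "tensor" or "vector" units, a simple latency model (work divided by the number of units of that kind), and an abstract area budget. It ignores contention between operators that run concurrently, and it is not the paper's actual algorithm.

```python
from graphlib import TopologicalSorter

# Hypothetical workload: operator -> (unit kind, amount of work).
OPS = {
    "matmul1": ("tensor", 8.0),
    "softmax": ("vector", 2.0),
    "matmul2": ("tensor", 8.0),
    "bias": ("vector", 1.0),
}
# operator -> operators it depends on.
DEPS = {"softmax": ["matmul1"], "matmul2": ["softmax"], "bias": ["matmul1"]}
# Assumed area cost of adding one more unit of each kind.
UNIT_COST = {"tensor": 2, "vector": 1}


def critical_path(ops, deps, units):
    """Return (length, path) of the longest-latency chain in the operator DAG."""
    latency = {op: work / units[kind] for op, (kind, work) in ops.items()}
    graph = {op: deps.get(op, ()) for op in ops}
    finish, prev = {}, {}
    for op in TopologicalSorter(graph).static_order():
        # An operator starts once its slowest predecessor has finished.
        pred = max(graph[op], key=finish.get, default=None)
        finish[op] = latency[op] + (finish[pred] if pred is not None else 0.0)
        prev[op] = pred
    node = max(finish, key=finish.get)
    length, path = finish[node], []
    while node is not None:
        path.append(node)
        node = prev[node]
    return length, path[::-1]


def allocate_units(ops, deps, unit_cost, budget):
    """Greedily add units of whichever affordable kind dominates the critical path."""
    units = {kind: 1 for kind in unit_cost}  # start from one unit of each kind
    while True:
        _, path = critical_path(ops, deps, units)
        time_by_kind = {kind: 0.0 for kind in unit_cost}
        for op in path:
            kind, work = ops[op]
            time_by_kind[kind] += work / units[kind]
        affordable = [k for k in unit_cost if unit_cost[k] <= budget]
        if not affordable:
            return units
        kind = max(affordable, key=time_by_kind.get)
        if time_by_kind[kind] == 0.0:
            return units  # nothing affordable would shorten the critical path
        units[kind] += 1
        budget -= unit_cost[kind]


before, path = critical_path(OPS, DEPS, {"tensor": 1, "vector": 1})
units = allocate_units(OPS, DEPS, UNIT_COST, budget=5)
after, _ = critical_path(OPS, DEPS, units)
print(path, before, units, round(after, 2))
```

In this example the off-path "bias" operator never influences the allocation: the budget goes to tensor units first because the two matrix multiplications dominate the critical path, and the leftover budget buys one vector unit.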

To handle the complexities of distributed training, the local search selects multiple top candidate designs per stage. Then a global search identifies the best overall accelerator configuration by optimizing the training throughput across all the stages.
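A minimal sketch of this two-level search, under stated assumptions: per-stage execution times come from a hypothetical cost table, a single accelerator design is used for every stage, and pipeline throughput is approximated as the inverse of the slowest stage's time. The design names and numbers are invented for illustration.

```python
import heapq

# Hypothetical cost table: design -> time each pipeline stage takes on that design.
STAGE_TIME = {
    "D1": [1.0, 4.0, 3.0],
    "D2": [4.0, 1.0, 3.5],
    "D3": [2.0, 2.0, 2.0],
    "D4": [3.0, 3.0, 1.0],
}


def top_k_per_stage(stage_time, k):
    """Local search result: the k fastest designs for each stage, considered in isolation."""
    num_stages = len(next(iter(stage_time.values())))
    return [
        heapq.nsmallest(k, stage_time, key=lambda d: stage_time[d][s])
        for s in range(num_stages)
    ]


def global_search(stage_time, k):
    """Pick the candidate whose slowest stage is fastest (assumed bottleneck model)."""
    candidates = set().union(*top_k_per_stage(stage_time, k))
    best = min(sorted(candidates), key=lambda d: max(stage_time[d]))
    return best, 1.0 / max(stage_time[best])


print(global_search(STAGE_TIME, k=1))
print(global_search(STAGE_TIME, k=2))
```

With k=1 only the per-stage winners D1, D2, and D4 are considered, and each of them is slow on some other stage. With k=2 the balanced design D3 enters the candidate set and wins, which illustrates why keeping several designs per stage can matter for the global result.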

The paper evaluates this approach on 11 different DNN models. Compared to a recent inference-focused work called Spotlight, the proposed method converges to a design 31x faster on average and offers 12x higher training throughput. The generated accelerator designs also achieve 12% higher throughput than Google's TPU architecture.

Critical Analysis

The paper presents a comprehensive and innovative approach to hardware architecture search for DNN training accelerators. By considering the full end-to-end training workflow, including both single-device and distributed scenarios, the researchers tackle a more realistic and challenging problem than prior work.

The heuristic-based critical path optimization technique is a clever way to navigate the vast design space efficiently. This builds on previous research in this area, but applies it in a novel context. The ability to select multiple top candidate designs per stage is also an important capability for handling the complexities of distributed training.

However, the paper does not provide much detail on the specific heuristics used or how they were developed. It would be helpful to understand the reasoning behind the prioritization of critical operators and other design choices. Additionally, the evaluation is limited to a set of 11 DNN models, so further testing on a wider range of workloads would strengthen the claims about the generalizability of the approach.

Another potential limitation is that the method still relies on simulations and modeling, rather than real hardware implementation. As noted in related work, there can be gaps between simulated and actual performance, so physical prototyping and testing would be an important next step.

Overall, this paper presents a promising technique that could significantly advance the state-of-the-art in hardware architecture search for DNN training accelerators. Further research to refine the heuristics and validate the approach on real hardware would be valuable contributions to the field.

Conclusion

This paper introduces a novel search technique that finds optimized hardware accelerator designs for end-to-end training of deep neural networks. The key innovation is a heuristic-based critical path approach that efficiently explores the vast design space, addressing both single-device and distributed training scenarios.

The proposed method demonstrates significant improvements over prior work. Compared with Spotlight, it converges to a design in 31x less time on average and delivers 12x higher throughput, and its generated designs achieve 12% higher throughput than the TPU architecture. This research represents an important step forward in developing specialized hardware to make deep learning more computationally efficient and accessible.

Further development and physical validation of this approach could lead to transformative advances in the underlying hardware that powers modern artificial intelligence systems. By co-designing the algorithms and the underlying computational substrate, researchers can unlock new capabilities and push the boundaries of what is possible with machine learning.



Related Papers


Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training

Muhammad Adnan, Amar Phanishayee, Janardhan Kulkarni, Prashant J. Nair, Divya Mahajan

In this paper, we present a novel technique to search for hardware architectures of accelerators optimized for end-to-end training of deep neural networks (DNNs). Our approach addresses both single-device and distributed pipeline and tensor model parallel scenarios, the latter being addressed for the first time. The search optimizes accelerators for training-relevant metrics such as throughput/TDP under fixed area and power constraints. However, with the proliferation of specialized architectures and complex distributed training mechanisms, the design space of hardware accelerators is very large. Prior work in this space has tried to tackle this by reducing the search space either to a single-accelerator execution, and even then only for inference, or to tuning the architecture for specific layers (e.g., convolution). Instead, we take a unique heuristic-based, critical-path-based approach to determine the best use of available resources (power and area), either for a set of DNN workloads or for each workload individually. First, we perform local search to determine the architecture for each pipeline and tensor model stage. Specifically, the system iteratively generates architectural configurations and tunes the design using a novel heuristic-based approach that prioritizes accelerator resources and scheduling to critical operators in a machine learning workload. Second, to address the complexities of distributed training, the local search selects multiple (k) designs per stage. A global search then identifies an accelerator from the top-k sets to optimize training throughput across the stages. We evaluate this work on 11 different DNN models. Compared to a recent inference-only work, Spotlight, our method converges to a design in, on average, 31x less time and offers 12x higher throughput. Moreover, designs generated using our method achieve a 12% throughput improvement over the TPU architecture.

4/24/2024

Integrated Hardware Architecture and Device Placement Search

Irene Wang, Jakub Tarnawski, Amar Phanishayee, Divya Mahajan

Distributed execution of deep learning training involves a dynamic interplay between hardware accelerator architecture and device placement strategy. This is the first work to explore the co-optimization of determining the optimal architecture and device placement strategy through novel algorithms, improving the balance of computational resources, memory usage, and data distribution. Our architecture search leverages tensor and vector units, determining their quantity and dimensionality, and on-chip and off-chip memory configurations. It also determines the microbatch size and decides whether to recompute or stash activations, balancing the memory footprint of training and storage size. For each explored architecture configuration, we use an Integer Linear Program (ILP) to find the optimal schedule for executing operators on the accelerator. The ILP results then integrate with a dynamic programming solution to identify the most effective device placement strategy, combining data, pipeline, and tensor model parallelism across multiple accelerators. Our approach achieves higher throughput on large language models compared to the state-of-the-art TPUv4 and the Spotlight accelerator search framework. The entire source code of PHAZE is available at https://github.com/msr-fiddle/phaze.

7/19/2024

HASS: Hardware-Aware Sparsity Search for Dataflow DNN Accelerator

Zhewen Yu, Sudarshan Sreeram, Krish Agrawal, Junyi Wu, Alexander Montgomerie-Corcoran, Cheng Zhang, Jianyi Cheng, Christos-Savvas Bouganis, Yiren Zhao

Deep Neural Networks (DNNs) excel in learning hierarchical representations from raw data, such as images, audio, and text. To compute these DNN models with high performance and energy efficiency, these models are usually deployed onto customized hardware accelerators. Among various accelerator designs, dataflow architecture has shown promising performance due to its layer-pipelined structure and its scalability in data parallelism. Exploiting weights and activations sparsity can further enhance memory storage and computation efficiency. However, existing approaches focus on exploiting sparsity in non-dataflow accelerators, which cannot be applied onto dataflow accelerators because of the large hardware design space introduced. As such, this could miss opportunities to find an optimal combination of sparsity features and hardware designs. In this paper, we propose a novel approach to exploit unstructured weights and activations sparsity for dataflow accelerators, using software and hardware co-optimization. We propose a Hardware-Aware Sparsity Search (HASS) to systematically determine an efficient sparsity solution for dataflow accelerators. Over a set of models, we achieve an efficiency improvement ranging from 1.3× to 4.2× compared to existing sparse designs, which are either non-dataflow or non-hardware-aware. Particularly, the throughput of MobileNetV3 can be optimized to 4895 images per second. HASS is open-source: https://github.com/Yu-Zhewen/HASS

6/6/2024

Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators

Konstantin Lubeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Muller, Federico Nicolás Peccia, Felix Thommes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann

Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models to accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, Plasticine-derived, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance for 4.19 billion instructions achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several magnitudes faster than an RTL simulation.

9/16/2024