TSB: Tiny Shared Block for Efficient DNN Deployment on NVCIM Accelerators

Read original: arXiv:2406.06544 - Published 8/23/2024 by Yifan Qin, Zheyu Yan, Zixuan Pan, Wujie Wen, Xiaobo Sharon Hu, Yiyu Shi

Overview

This paper introduces TSB (Tiny Shared Block), a technique for deploying deep neural networks (DNNs) on NVCIM accelerators: compute-in-memory hardware built from non-volatile memory devices, which promises energy-efficient, low-latency inference.
According to the paper's abstract, the obstacle TSB targets is device variation. Weights programmed into the devices deviate from their intended values, which raises training overhead, inflates the time and energy spent mapping weights to device states, and lowers inference accuracy.
TSB integrates a small shared 1x1 convolution block into the DNN architecture. The authors report over 20x improvement in the inference accuracy gap, over 5x training speedup, and lower weight-mapping cost, with less than 0.4% of the original weights write-verified during programming.

Plain English Explanation

Deep neural networks (DNNs) are powerful machine learning models that have achieved remarkable success in a wide range of applications, from computer vision to natural language processing. However, deploying these complex models on edge devices, such as smartphones or IoT sensors, can be a significant challenge due to the limited memory and computing resources available on these devices.

The paper introduces a technique called Tiny Shared Block (TSB) for one kind of low-power hardware: NVCIM accelerators, which compute inside non-volatile memory. These devices are imperfect, so a weight written to the hardware can differ from the value the model was trained with. The key idea behind TSB is to integrate a small block of parameters, a 1x1 convolution block, that is shared within the network rather than relying only on separate parameters for each layer.

According to the abstract, this shared block stabilizes feature processing across the network and reduces the impact of device variation. The reported payoff is a much smaller inference accuracy gap, faster training, and cheaper programming of the hardware, since only a tiny fraction of the original weights needs to be write-verified.

Related work on efficient DNN deployment includes block-selective reprogramming for device training and memory-aware compressed multimodal deep learning. TSB adds the concept of a tiny shared block to this line of research.

Technical Explanation

The paper presents Tiny Shared Block (TSB), a method for deploying DNNs on non-volatile computing-in-memory (NVCIM) accelerators, which run inference inside arrays of non-volatile memory (NVM) devices.

The key idea is to integrate a small shared 1x1 convolution block into the DNN architecture. The abstract describes this block as stabilizing feature processing across the network, which reduces the impact of device variation on the deployed model.

To implement TSB, the authors propose a DNN architecture that combines a shared block of parameters, reused across multiple layers, with layer-specific parameters that capture the unique characteristics of each layer. This hybrid approach is meant to preserve the model's expressive power while keeping the shared portion small relative to the full set of weights.
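The weight-reuse idea described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' architecture: the class name `SharedBlockNet`, the choice to apply the shared block after every layer, and the use of 1x1 convolutions for the layer-specific weights are all assumptions made for the example.

```python
import random


def conv1x1(x, weight):
    """Apply a 1x1 convolution: a per-pixel linear map across channels.

    x: list of C_in channels, each a flat list of pixel values.
    weight: C_out x C_in matrix given as a list of rows.
    """
    n_pixels = len(x[0])
    return [
        [sum(w * x[c][p] for c, w in enumerate(row)) for p in range(n_pixels)]
        for row in weight
    ]


def relu(x):
    return [[max(0.0, v) for v in channel] for channel in x]


class SharedBlockNet:
    """Toy network (hypothetical): per-layer weights plus ONE shared 1x1 block."""

    def __init__(self, channels, depth, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.uniform(-0.5, 0.5) for _ in range(channels)]
                    for _ in range(channels)]

        self.layer_weights = [init() for _ in range(depth)]
        self.shared_weight = init()  # stored once, reused after every layer

    def forward(self, x):
        for w in self.layer_weights:
            x = relu(conv1x1(x, w))
            x = conv1x1(x, self.shared_weight)  # same parameters each time
        return x

    def num_parameters(self):
        per_matrix = len(self.shared_weight) * len(self.shared_weight[0])
        return per_matrix * (len(self.layer_weights) + 1)

    def num_parameters_unshared(self):
        """Count if every layer had its own copy of the extra block."""
        per_matrix = len(self.shared_weight) * len(self.shared_weight[0])
        return per_matrix * 2 * len(self.layer_weights)


if __name__ == "__main__":
    net = SharedBlockNet(channels=4, depth=3)
    x = [[float(c + p) for p in range(6)] for c in range(4)]  # 4 channels, 6 pixels
    y = net.forward(x)
    print(len(y), len(y[0]))              # output keeps the 4 x 6 shape
    print(net.num_parameters())           # 16 * (3 + 1) = 64
    print(net.num_parameters_unshared())  # 16 * (2 * 3) = 96
```

Because the shared block is stored once, its parameter cost does not grow with depth, whereas a separate block per layer would.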

The authors evaluate TSB on several popular DNN models, including ResNet and MobileNet. Compared with state-of-the-art baseline solutions, the abstract reports over 20x improvement in the inference accuracy gap, over 5x training speedup, and reduced weights-to-device mapping cost, with less than 0.4% of the original weights write-verified during programming.
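The paper's abstract refers to device variation and to write-verifying only a small fraction of weights during programming. The sketch below shows why that fraction matters for mapping cost. It is a toy simulation under stated assumptions, not the paper's procedure: variation is modeled as Gaussian noise on each write, write-verify is modeled as re-programming until the read-back value falls within a tolerance, and the function names, `sigma`, and `tol` are all hypothetical.

```python
import random


def program_device(target, sigma, rng):
    """One write pulse: the stored value lands near the target, off by noise."""
    return target + rng.gauss(0.0, sigma)


def write_verify(target, sigma, tol, rng, max_pulses=1000):
    """Re-program until the read-back value is within tol of the target.

    Returns (stored_value, pulses_used).
    """
    for pulses in range(1, max_pulses + 1):
        stored = program_device(target, sigma, rng)
        if abs(stored - target) <= tol:
            return stored, pulses
    return stored, max_pulses


def map_weights(weights, verify_mask, sigma=0.1, tol=0.02, seed=0):
    """Map weights to simulated devices, verifying only where the mask is True."""
    rng = random.Random(seed)
    stored, total_pulses = [], 0
    for w, verify in zip(weights, verify_mask):
        if verify:
            value, pulses = write_verify(w, sigma, tol, rng)
        else:
            value, pulses = program_device(w, sigma, rng), 1  # single blind write
        stored.append(value)
        total_pulses += pulses
    return stored, total_pulses


if __name__ == "__main__":
    weights = [0.1 * i for i in range(100)]
    verify_all = [True] * 100
    verify_few = [i < 4 for i in range(100)]  # verify only 4 of 100 weights

    _, pulses_all = map_weights(weights, verify_all)
    _, pulses_few = map_weights(weights, verify_few)
    print("pulses when verifying everything:", pulses_all)
    print("pulses when verifying a small subset:", pulses_few)
```

In this toy model, every unverified weight costs exactly one write pulse, so the total programming cost is dominated by how many weights must be verified.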

Critical Analysis

The paper presents a promising approach for deploying DNNs on NVCIM accelerators. The authors report that their Tiny Shared Block (TSB) technique narrows the accuracy gap caused by device variation while cutting training and weight-mapping cost.

One potential limitation of the TSB approach is that it may not be suitable for all types of DNN architectures. The authors have primarily evaluated TSB on relatively simple models like ResNet and MobileNet, and it's unclear how well it would perform on more complex or specialized DNN architectures. Further research would be needed to explore the broader applicability of TSB across a wider range of DNN models.

Additionally, the authors' experiments were conducted on NVCIM accelerators, which are specialized hardware for DNN workloads. It would be interesting to see how well TSB performs on more generic edge devices, such as microcontrollers or embedded systems, which may have different hardware characteristics and constraints.

Another area for future research is the trade-off between the degree of parameter sharing (i.e., the size of the shared block) and the model's accuracy. The reported results are strong, but shrinking the shared block further could lead to more substantial accuracy degradation.

Overall, the paper presents a valuable contribution to the field of efficient DNN deployment on edge devices. The TSB technique merits further investigation and refinement as demand grows for accurate, efficient machine learning models on resource-constrained platforms.

Conclusion

The paper introduces Tiny Shared Block (TSB), a technique for deploying deep neural networks (DNNs) on NVCIM accelerators.

The key innovation of TSB is a small shared 1x1 convolution block integrated into the DNN architecture, which the authors design to stabilize feature processing and reduce the impact of device variation.

Relative to state-of-the-art baselines, the authors report a far smaller inference accuracy gap and faster training while write-verifying only a tiny fraction of the original weights. This makes TSB a promising technique for deploying robust DNN models on NVCIM hardware, with potential applications ranging from computer vision to natural language processing.

While the paper focuses on NVCIM accelerators, the TSB approach could potentially be extended to other hardware platforms and edge devices. Further research is needed to explore the broader applicability of TSB, as well as potential trade-offs between the degree of parameter sharing and model accuracy. Nevertheless, the paper represents an important step toward bringing deep learning to the edge.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TSB: Tiny Shared Block for Efficient DNN Deployment on NVCIM Accelerators

Yifan Qin, Zheyu Yan, Zixuan Pan, Wujie Wen, Xiaobo Sharon Hu, Yiyu Shi

Compute-in-memory (CIM) accelerators using non-volatile memory (NVM) devices offer promising solutions for energy-efficient and low-latency Deep Neural Network (DNN) inference execution. However, practical deployment is often hindered by the challenge of dealing with the massive amount of model weight parameters impacted by the inherent device variations within non-volatile computing-in-memory (NVCIM) accelerators. This issue significantly offsets their advantages by increasing training overhead, the time and energy needed for mapping weights to device states, and diminishing inference accuracy. To mitigate these challenges, we propose the Tiny Shared Block (TSB) method, which integrates a small shared 1x1 convolution block into the DNN architecture. This block is designed to stabilize feature processing across the network, effectively reducing the impact of device variation. Extensive experimental results show that TSB achieves over 20x inference accuracy gap improvement, over 5x training speedup, and weights-to-device mapping cost reduction while requiring less than 0.4% of the original weights to be write-verified during programming, when compared with state-of-the-art baseline solutions. Our approach provides a practical and efficient solution for deploying robust DNN models on NVCIM accelerators, making it a valuable contribution to the field of energy-efficient AI hardware.

8/23/2024

Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators

Paolo D'Alberto, Taehee Jeong, Akshai Jain, Shreyas Manjunath, Mrinal Sarmah, Samuel Hsu, Yaswanth Raparti, Nitesh Pipralia

Nowadays, increasingly larger Deep Neural Networks (DNNs) are being developed, trained, and utilized. These networks require significant computational resources, putting a strain on both advanced and limited devices. Our solution is to implement weight block sparsity, which is a structured sparsity that is friendly to hardware. By zeroing certain sections of the convolution and fully connected layers parameters of pre-trained DNN models, we can efficiently speed up the DNN's inference process. This results in a smaller memory footprint, faster communication, and fewer operations. Our work presents a vertical system that allows for the training of convolution and matrix multiplication weights to exploit 8x8 block sparsity on a single GPU within a reasonable amount of time. Compilers recognize this sparsity and use it for both data compaction and computation splitting into threads. Blocks like these take full advantage of both spatial and temporal locality, paving the way for fast vector operations and memory reuse. By using this system on a Resnet50 model, we were able to reduce the weight by half with minimal accuracy loss, resulting in a two-times faster inference speed. We will present performance estimates using accurate and complete code generation for AIE2 configuration sets (AMD Versal FPGAs) with Resnet50, Inception V3, and VGG16 to demonstrate the necessary synergy between hardware overlay designs and software stacks for compiling and executing machine learning applications.

7/15/2024

Tiled Bit Networks: Sub-Bit Neural Network Compression Through Reuse of Learnable Binary Vectors

Matt Gorbett, Hossein Shirazi, Indrakshi Ray

Binary Neural Networks (BNNs) enable efficient deep learning by saving on storage and computational costs. However, as the size of neural networks continues to grow, meeting computational requirements remains a challenge. In this work, we propose a new form of quantization to tile neural network layers with sequences of bits to achieve sub-bit compression of binary-weighted neural networks. The method learns binary vectors (i.e. tiles) to populate each layer of a model via aggregation and reshaping operations. During inference, the method reuses a single tile per layer to represent the full tensor. We employ the approach to both fully-connected and convolutional layers, which make up the breadth of space in most neural architectures. Empirically, the approach achieves near full-precision performance on a diverse range of architectures (CNNs, Transformers, MLPs) and tasks (classification, segmentation, and time series forecasting) with up to an 8x reduction in size compared to binary-weighted models. We provide two implementations for Tiled Bit Networks: 1) we deploy the model to a microcontroller to assess its feasibility in resource-constrained environments, and 2) a GPU-compatible inference kernel to facilitate the reuse of a single tile per layer in memory.

7/18/2024

On-Device Training Under 256KB Memory

Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, Song Han

On-device training enables the model to adapt to new data collected from the sensors by fine-tuning a pre-trained model. Users can benefit from customized AI models without having to transfer the data to the cloud, protecting the privacy. However, the training memory consumption is prohibitive for IoT devices that have tiny memory resources. We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. On-device training faces two unique challenges: (1) the quantized graphs of neural networks are hard to optimize due to low bit-precision and the lack of normalization; (2) the limited hardware resource does not allow full back-propagation. To cope with the optimization difficulty, we propose Quantization-Aware Scaling to calibrate the gradient scales and stabilize 8-bit quantized training. To reduce the memory footprint, we propose Sparse Update to skip the gradient computation of less important layers and sub-tensors. The algorithm innovation is implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offload the runtime auto-differentiation to compile time. Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash without auxiliary memory, using less than 1/1000 of the memory of PyTorch and TensorFlow while matching the accuracy on tinyML application VWW. Our study enables IoT devices not only to perform inference but also to continuously adapt to new data for on-device lifelong learning. A video demo can be found here: https://youtu.be/0pUFZYdoMY8.

4/4/2024