Designing Extremely Memory-Efficient CNNs for On-device Vision Tasks

Read original: arXiv:2408.03663 - Published 8/9/2024 by Jaewook Lee, Yoel Park, Seulki Lee

Overview

This paper presents an approach to designing extremely memory-efficient convolutional neural networks (CNNs) for on-device vision tasks such as image classification and object detection.
The focus is on CNNs that can run on resource-constrained, low-end embedded and IoT devices whose memory is measured in kilobytes.
The proposed design cuts peak memory usage to 63 KB on ImageNet classification while keeping top-1 accuracy competitive (61.58%).

Plain English Explanation

The paper describes a new way to design very compact convolutional neural networks that can run on devices with very little memory, such as low-end embedded and IoT devices. The goal is to create CNN models that need only kilobytes of memory while still performing well on visual recognition tasks. This matters because many real-world applications, such as image classification or object detection on small devices, have to run AI models on hardware with little memory or computing power.

Technical Explanation

The paper builds on the bottleneck block of MobileNet and proposes three design principles that reduce a CNN's peak memory usage so that it fits in the KB-scale memory of low-end devices:

Input Segmentation: The input image is divided into a set of patches, including a central patch that overlaps the others, which reduces the size (and memory requirement) of the input.
Patch Tunneling: Each patch is processed by its own tunnel-like path of bottleneck blocks that runs from the input patch to the last layer of the network, keeping memory usage low throughout the model.
Bottleneck Reordering: The execution order of the convolution operations inside each bottleneck block is rearranged so that memory usage stays constant regardless of the number of convolution output channels.

The authors report that the resulting network classifies ImageNet using only 63 KB of memory with 61.58% top-1 accuracy. That memory usage is up to 89x smaller than MobileNet (5.6 MB) and 3.1x smaller than MCUNet (196 KB).
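Splitting the input into patches, which the paper's abstract calls "input segmentation", can be illustrated with a small sketch. The layout below (four quadrants plus one central patch of the same size) is an assumption made for illustration, as are the 224x224 input size, the 8-bit values, and the function names `segment_input` and `input_bytes`; the summary does not state the paper's actual patch count or sizes.

```python
def segment_input(height, width):
    """Return (top, left, bottom, right) boxes for an assumed layout:
    four quadrant patches plus one central patch that overlaps all four.
    Assumes even height and width for simplicity."""
    ph, pw = height // 2, width // 2
    boxes = [
        (0, 0, ph, pw),            # top-left quadrant
        (0, pw, ph, 2 * pw),       # top-right quadrant
        (ph, 0, 2 * ph, pw),       # bottom-left quadrant
        (ph, pw, 2 * ph, 2 * pw),  # bottom-right quadrant
    ]
    # Central patch of the same size, centred on the image.
    ct, cl = (height - ph) // 2, (width - pw) // 2
    boxes.append((ct, cl, ct + ph, cl + pw))
    return boxes


def input_bytes(height, width, channels=3, bytes_per_value=1):
    """Bytes needed to hold one input buffer (8-bit values assumed)."""
    return height * width * channels * bytes_per_value


boxes = segment_input(224, 224)      # hypothetical 224x224 RGB input
full = input_bytes(224, 224)         # buffer for the whole image
patch = input_bytes(112, 112)        # buffer for a single patch
print(boxes[-1], full, patch, full // patch)
```

Under these assumptions only one patch buffer has to be resident at a time, so the input buffer shrinks by the ratio of image area to patch area. The overlapping central patch costs extra computation rather than extra peak memory.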

Critical Analysis

The paper presents a compelling approach to designing memory-efficient CNNs for on-device applications. The techniques seem well-justified and the experimental results are promising.

However, the paper does not discuss some potential limitations or caveats:

The impact of the reduced model size on inference latency and power consumption is not explored. Smaller models may not necessarily translate to faster or more energy-efficient inference on resource-constrained devices.
The experiments are conducted on established benchmark datasets, but the performance on real-world, noisy data from mobile sensors is not evaluated.
There is no discussion of the generalization of these techniques to other neural network architectures beyond CNNs.

Overall, the paper makes an important contribution, but further research is needed to fully understand the practical implications and limitations of the proposed methods.

Conclusion

This paper introduces an approach to designing extremely memory-efficient convolutional neural networks for on-device vision tasks. By combining input segmentation, patch tunneling, and bottleneck reordering, the authors build a CNN that classifies ImageNet in 63 KB of memory with 61.58% top-1 accuracy, which they report as up to 89x smaller than MobileNet (5.6 MB) and 3.1x smaller than MCUNet (196 KB). This could enable computer vision on low-end embedded and IoT devices whose memory is limited to kilobytes, making on-device AI practical for a wider range of real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Designing Extremely Memory-Efficient CNNs for On-device Vision Tasks

Jaewook Lee, Yoel Park, Seulki Lee

In this paper, we introduce a memory-efficient CNN (convolutional neural network), which enables resource-constrained low-end embedded and IoT devices to perform on-device vision tasks, such as image classification and object detection, using extremely low memory, i.e., only 63 KB on ImageNet classification. Based on the bottleneck block of MobileNet, we propose three design principles that significantly curtail the peak memory usage of a CNN so that it can fit the limited KB memory of the low-end device. First, 'input segmentation' divides an input image into a set of patches, including the central patch overlapped with the others, reducing the size (and memory requirement) of a large input image. Second, 'patch tunneling' builds independent tunnel-like paths consisting of multiple bottleneck blocks per patch, penetrating through the entire model from an input patch to the last layer of the network, maintaining lightweight memory usage throughout the whole network. Lastly, 'bottleneck reordering' rearranges the execution order of convolution operations inside the bottleneck block such that the memory usage remains constant regardless of the size of the convolution output channels. The experiment result shows that the proposed network classifies ImageNet with extremely low memory (i.e., 63 KB) while achieving competitive top-1 accuracy (i.e., 61.58%). To the best of our knowledge, the memory usage of the proposed network is far smaller than state-of-the-art memory-efficient networks, i.e., up to 89x and 3.1x smaller than MobileNet (i.e., 5.6 MB) and MCUNet (i.e., 196 KB), respectively.

8/9/2024
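The size ratios quoted in the abstract above can be checked with a few lines of arithmetic. This is only a sanity check on the reported figures; it assumes 1 MB = 1000 KB, which the abstract does not state.

```python
# Memory figures as reported in the abstract.
proposed_kb = 63
mobilenet_kb = 5.6 * 1000   # 5.6 MB, assuming 1 MB = 1000 KB
mcunet_kb = 196

ratio_mobilenet = mobilenet_kb / proposed_kb
ratio_mcunet = mcunet_kb / proposed_kb

print(f"vs MobileNet: {ratio_mobilenet:.1f}x")
print(f"vs MCUNet:    {ratio_mcunet:.1f}x")
```

With the decimal convention the ratios round to the reported 89x and 3.1x. Using 1 MB = 1024 KB would give roughly 91x for MobileNet instead.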

MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, Song Han

Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose network redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult. We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2. Patch-based inference effectively reduces the peak memory usage of existing networks by 4-8x. Co-designed with neural networks, MCUNetV2 sets a record ImageNet accuracy on MCU (71.8%), and achieves >90% accuracy on the visual wake words dataset under only 32kB SRAM. MCUNetV2 also unblocks object detection on tiny devices, achieving 16.9% higher mAP on Pascal VOC compared to the state-of-the-art result. Our study largely addressed the memory bottleneck in tinyML and paved the way for various vision applications beyond image classification.

4/4/2024
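The overlap overhead that the MCUNetV2 abstract mentions follows from receptive-field arithmetic: to produce an output tile, each unpadded convolution needs `(n - 1) * stride + kernel` input rows. The sketch below applies that standard formula to a hypothetical three-layer stack; the layer list, the tile size, and the name `required_input` are illustrative assumptions, not values from the paper.

```python
def required_input(out_size, layers):
    """Input rows (or columns) needed to compute `out_size` output rows
    through a stack of unpadded convolutions.
    `layers` is a list of (kernel, stride) in forward order."""
    n = out_size
    for kernel, stride in reversed(layers):
        n = (n - 1) * stride + kernel
    return n


# Hypothetical early-stage stack: three 3x3 convs with strides 2, 1, 2.
layers = [(3, 2), (3, 1), (3, 2)]
tile = 7                                   # output tile is 7x7

with_halo = required_input(tile, layers)   # input side length incl. overlap
total_stride = 1
for _, stride in layers:
    total_stride *= stride
without_halo = tile * total_stride         # side length if patches did not overlap

overhead = (with_halo / without_halo) ** 2  # extra input area per patch
print(with_halo, without_halo, overhead)
```

In this toy configuration each patch reads about 1.56 times the input area that a non-overlapping tiling would, which is the kind of recomputation cost the abstract says network redistribution is meant to reduce.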

On-Device Training Under 256KB Memory

Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, Song Han

On-device training enables the model to adapt to new data collected from the sensors by fine-tuning a pre-trained model. Users can benefit from customized AI models without having to transfer the data to the cloud, protecting the privacy. However, the training memory consumption is prohibitive for IoT devices that have tiny memory resources. We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. On-device training faces two unique challenges: (1) the quantized graphs of neural networks are hard to optimize due to low bit-precision and the lack of normalization; (2) the limited hardware resource does not allow full back-propagation. To cope with the optimization difficulty, we propose Quantization-Aware Scaling to calibrate the gradient scales and stabilize 8-bit quantized training. To reduce the memory footprint, we propose Sparse Update to skip the gradient computation of less important layers and sub-tensors. The algorithm innovation is implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offload the runtime auto-differentiation to compile time. Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash without auxiliary memory, using less than 1/1000 of the memory of PyTorch and TensorFlow while matching the accuracy on tinyML application VWW. Our study enables IoT devices not only to perform inference but also to continuously adapt to new data for on-device lifelong learning. A video demo can be found here: https://youtu.be/0pUFZYdoMY8.

4/4/2024

On the Efficiency of Convolutional Neural Networks

Andrew Lavin

Since the breakthrough performance of AlexNet in 2012, convolutional neural networks (convnets) have grown into extremely powerful vision models. Deep learning researchers have used convnets to perform vision tasks with accuracy that was unachievable a decade ago. Confronted with the immense computation that convnets use, deep learning researchers also became interested in efficiency. However, the engineers who deployed efficient convnets soon realized that they were slower than the previous generation, despite using fewer operations. Many reverted to older models that ran faster. Hence researchers switched the objective of their search from arithmetic complexity to latency and produced a new wave of models that performed better. Paradoxically, these models also used more operations. Skepticism grew among researchers and engineers alike about the relevance of arithmetic complexity. Contrary to the prevailing view that latency and arithmetic complexity are irreconcilable, a simple formula relates both through computational efficiency. This insight enabled us to co-optimize the separate factors that determine latency. We observed that the degenerate conv2d layers that produce the best accuracy--complexity trade-off also use significant memory resources and have low computational efficiency. We devised block fusion algorithms to implement all the layers of a residual block in a single kernel, thereby creating temporal locality, avoiding communication, and reducing workspace size. Our ConvFirst model with block-fusion kernels has less arithmetic complexity and greater computational efficiency than baseline models and kernels, and ran approximately four times as fast as ConvNeXt. We also created novel tools, including efficiency gap plots and waterline analysis. Our unified approach to convnet efficiency envisions a new era of models and kernels that achieve greater accuracy at lower cost.

5/22/2024