vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

Read original: arXiv:2406.06542 - Published 6/12/2024 by Size Zheng, Renze Chen, Meng Li, Zihao Ye, Luis Ceze, Yun Liang

Overview

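Proposes vMCU, which coordinates memory management and kernel optimization for DNN inference on microcontrollers (MCUs), whose memory is usually 2-3 orders of magnitude smaller than that of mobile devices
Virtualizes the MCU's limited memory as a large memory pool that each kernel divides into kernel-specific segments, so the memory footprints of different tensors can overlap
Reports 12.0% to 49.5% lower RAM usage and 20.6% to 53.0% lower energy consumption for single layers on ARM Cortex-M4 and Cortex-M7 processors compared with state-of-the-art work, and a 61.5% reduction in the memory bottleneck for full DNNs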

Plain English Explanation

The paper focuses on running deep learning models on microcontrollers (MCUs). MCUs power many IoT devices at ultra-low power, but their memory is usually 2-3 orders of magnitude smaller than that of mobile devices, which makes it hard to fit a neural network's intermediate data into RAM.

The researchers propose a system called vMCU. Earlier approaches treat memory management and kernel implementation as separate problems and rely on coarse-grained techniques such as in-place update. vMCU instead treats the MCU's memory as one large pool. Each kernel (the routine that computes a layer) splits the pool into small segments and loads and stores those segments itself while it computes. Because memory is controlled segment by segment, space occupied by one tensor can be reused by another, so the tensors never all need to exist in full at the same time.

Other work attacks the same constraint from different angles. MCUNetV2 uses patch-based inference, operating on a small spatial region of the feature map at a time to cut peak memory. Related papers also cover Device Training under 256KB Memory, Memory-Efficient Energy-Adaptive Inference, and Optimizing Deployment of Tiny Transformers on Low-Power MCUs. These are separate works, not part of vMCU.

The overall goal is to make deep learning practical for IoT and embedded applications whose memory is severely constrained.

Technical Explanation

vMCU coordinates memory management with kernel optimization instead of handling them separately. Previous work relies on coarse-grained memory management, such as in-place update, to reduce memory consumption. vMCU virtualizes the MCU's limited memory as a large memory pool; each kernel divides the pool into kernel-specific segments and handles segment loads and stores while computing a DNN layer.

This segment-level control is what reduces memory consumption: the footprints of different tensors can overlap, because the tensors do not have to be materialized in full at the same time.

For single layers on ARM Cortex-M4 and Cortex-M7 processors, the paper reports RAM usage reduced by 12.0% to 49.5% and energy consumption reduced by 20.6% to 53.0% compared with state-of-the-art work. For full DNNs, vMCU reduces the memory bottleneck by 61.5%, which allows more models to be deployed on low-end MCUs.

Related papers address neighbouring problems: MCUNetV2 (patch-based inference), Device Training under 256KB Memory, Memory-Efficient Energy-Adaptive Inference, and Optimizing Deployment of Tiny Transformers on Low-Power MCUs. These are separate works and are not contributions of the vMCU paper.
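The vMCU abstract describes overlapping the memory footprint of different tensors through segment-level control. The following is a minimal Python sketch of that general idea, not the paper's implementation. It assumes a layer in which each output row depends only on the matching input row (as in a pointwise layer), and the names `min_offset` and `run_layer_overlapped` are hypothetical. The input is placed in one flat pool at an offset chosen so that writing an output row never overwrites an input row that has not yet been read.

```python
def min_offset(rows, c_in, c_out):
    """Smallest input start offset such that output writes never hit unread input.

    Assumption: output row i is written at [i*c_out, (i+1)*c_out) after input
    row i has been copied out of the pool.
    """
    return max(0, (rows - 1) * (c_out - c_in))


def run_layer_overlapped(x_rows, row_fn, c_out):
    """Run a row-wise layer with input and output sharing one memory pool.

    x_rows : list of equal-length input rows
    row_fn : maps one input row to one output row of length c_out
    Returns (flat output, pool size in elements).
    """
    rows, c_in = len(x_rows), len(x_rows[0])
    off = min_offset(rows, c_in, c_out)
    pool_size = max(off + rows * c_in, rows * c_out)
    pool = [0] * pool_size

    # Place the input tensor in the pool, shifted right by `off`.
    for i, row in enumerate(x_rows):
        pool[off + i * c_in: off + (i + 1) * c_in] = row

    for i in range(rows):
        # "Load" one input segment into a small local buffer (slicing copies).
        seg = pool[off + i * c_in: off + (i + 1) * c_in]
        out = row_fn(seg)
        assert len(out) == c_out
        # "Store" the output segment; it may reuse space of consumed input rows.
        pool[i * c_out: (i + 1) * c_out] = out

    return pool[: rows * c_out], pool_size


if __name__ == "__main__":
    x = [[1, 2], [3, 4], [5, 6]]            # 3 rows, 2 values each
    fn = lambda r: [r[0] + r[1], r[0] - r[1], r[0] * r[1]]  # 2 -> 3 values
    y, pool_size = run_layer_overlapped(x, fn, c_out=3)
    separate = len(x) * 2 + len(x) * 3       # input buffer + output buffer
    print(y, pool_size, separate)
```

In this toy case the shared pool needs 9 elements, whereas separate input and output buffers would need 15. Real kernels such as convolutions read overlapping input regions, so the safe offset and segment sizes would depend on the kernel, which is consistent with the abstract's description of kernel-specific segments.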

Critical Analysis

The paper addresses an important challenge in embedded AI: mapping DNNs onto microcontrollers whose memory is far smaller than that of mobile devices. Coordinating memory management with kernel optimization, as vMCU does, is a promising way to reduce RAM usage beyond what coarse-grained techniques such as in-place update achieve.

One potential limitation is that the reported evaluation targets ARM Cortex-M4 and Cortex-M7 processors, so these results alone do not show how well the approach transfers to other MCU architectures. The benchmark layers and models may also not represent the full range of real-world embedded AI applications.

Further research could explore how the technique generalizes to a wider variety of deep learning models, including larger and more complex architectures, and how it performs on a broader set of practical use cases. Investigating the trade-offs between memory usage, energy efficiency, and other relevant metrics would also be valuable for practitioners.

Conclusion

The paper is a step towards deploying DNNs on resource-constrained microcontrollers. By virtualizing MCU memory as a pool and letting each kernel manage its own segments, vMCU overlaps the memory footprints of different tensors and reduces both RAM usage and energy consumption relative to state-of-the-art work.

Reducing the memory bottleneck by 61.5% for full DNNs enables more models to be deployed on low-end MCUs. As demand for AI-powered IoT devices grows, fine-grained memory management of this kind could make on-device inference practical on a wider range of hardware.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

Size Zheng, Renze Chen, Meng Li, Zihao Ye, Luis Ceze, Yun Liang

IoT devices based on microcontroller units (MCU) provide ultra-low power consumption and ubiquitous computation for near-sensor deep learning models (DNN). However, the memory of MCU is usually 2-3 orders of magnitude smaller than mobile devices, which makes it challenging to map DNNs onto MCUs. Previous work separates memory management and kernel implementation for MCU and relies on coarse-grained memory management techniques such as in-place update to reduce memory consumption. In this paper, we propose to coordinate memory management and kernel optimization for DNN inference on MCUs to enable fine-grained memory management. The key idea is to virtualize the limited memory of MCU as a large memory pool. Each kernel divides the memory pool into kernel-specific segments and handles segment load and store while computing DNN layers. Memory consumption can be reduced because using the fine-grained segment-level memory control, we can overlap the memory footprint of different tensors without the need to materialize them at the same time. Following this idea, we implement vMCU for DNN inference on MCU. Evaluation for single layers on ARM Cortex-M4 and Cortex-M7 processors shows that vMCU can reduce from 12.0% to 49.5% RAM usage and from 20.6% to 53.0% energy consumption compared to state-of-the-art work. For full DNN evaluation, vMCU can reduce the memory bottleneck by 61.5%, enabling more models to be deployed on low-end MCUs.

6/12/2024

MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, Song Han

Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose network redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult. We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2. Patch-based inference effectively reduces the peak memory usage of existing networks by 4-8x. Co-designed with neural networks, MCUNetV2 sets a record ImageNet accuracy on MCU (71.8%), and achieves >90% accuracy on the visual wake words dataset under only 32kB SRAM. MCUNetV2 also unblocks object detection on tiny devices, achieving 16.9% higher mAP on Pascal VOC compared to the state-of-the-art result. Our study largely addressed the memory bottleneck in tinyML and paved the way for various vision applications beyond image classification.

4/4/2024

On-Device Training of Fully Quantized Deep Neural Networks on Cortex-M Microcontrollers

Mark Deutel, Frank Hannig, Christopher Mutschler, Jurgen Teich

On-device training of DNNs allows models to adapt and fine-tune to newly collected data or changing domains while deployed on microcontroller units (MCUs). However, DNN training is a resource-intensive task, making the implementation and execution of DNN training algorithms on MCUs challenging due to low processor speeds, constrained throughput, limited floating-point support, and memory constraints. In this work, we explore on-device training of DNNs for Cortex-M MCUs. We present a method that enables efficient training of DNNs completely in place on the MCU using fully quantized training (FQT) and dynamic partial gradient updates. We demonstrate the feasibility of our approach on multiple vision and time-series datasets and provide insights into the tradeoff between training accuracy, memory overhead, energy, and latency on real hardware.

8/29/2024

Designing Extremely Memory-Efficient CNNs for On-device Vision Tasks

Jaewook Lee, Yoel Park, Seulki Lee

In this paper, we introduce a memory-efficient CNN (convolutional neural network), which enables resource-constrained low-end embedded and IoT devices to perform on-device vision tasks, such as image classification and object detection, using extremely low memory, i.e., only 63 KB on ImageNet classification. Based on the bottleneck block of MobileNet, we propose three design principles that significantly curtail the peak memory usage of a CNN so that it can fit the limited KB memory of the low-end device. First, 'input segmentation' divides an input image into a set of patches, including the central patch overlapped with the others, reducing the size (and memory requirement) of a large input image. Second, 'patch tunneling' builds independent tunnel-like paths consisting of multiple bottleneck blocks per patch, penetrating through the entire model from an input patch to the last layer of the network, maintaining lightweight memory usage throughout the whole network. Lastly, 'bottleneck reordering' rearranges the execution order of convolution operations inside the bottleneck block such that the memory usage remains constant regardless of the size of the convolution output channels. The experiment result shows that the proposed network classifies ImageNet with extremely low memory (i.e., 63 KB) while achieving competitive top-1 accuracy (i.e., 61.58%). To the best of our knowledge, the memory usage of the proposed network is far smaller than state-of-the-art memory-efficient networks, i.e., up to 89x and 3.1x smaller than MobileNet (i.e., 5.6 MB) and MCUNet (i.e., 196 KB), respectively.

8/9/2024