On-Device Training of Fully Quantized Deep Neural Networks on Cortex-M Microcontrollers

Read original: arXiv:2407.10734 - Published 8/29/2024 by Mark Deutel, Frank Hannig, Christopher Mutschler, Jurgen Teich

Overview

• This paper presents a method for training deep neural networks (DNNs) on resource-constrained Cortex-M microcontrollers with as little as 256KB of memory.

• The researchers developed techniques to fully quantize DNNs, allowing them to fit within the tight memory constraints of these low-power embedded devices.

• They also introduce an on-device training approach that can fine-tune the quantized models directly on the target hardware, without the need for a separate training phase on a more powerful system.

Plain English Explanation

The researchers in this paper have found a way to train complex artificial intelligence (AI) models on small, low-power microcontroller chips. These chips, built around Arm's Cortex-M processor cores, are found in all kinds of everyday devices like fitness trackers, smart home gadgets, and industrial equipment.

Typically, these microcontrollers don't have enough memory to run large, powerful AI models. The researchers addressed this by "quantizing" the AI models - that is, storing the models' numbers with fewer bits (for example, 8 instead of 32) so that they take up less memory without losing too much performance.

But the researchers went one step further. They also created a way for the AI models to be trained directly on the microcontroller chips, without needing a more powerful computer for the training process. This "on-device training" approach means the models can be fine-tuned and updated right where they're being used, without having to send data back to a central server.

This is an important advancement because it allows AI to be deployed in a wide range of real-world applications, even on inexpensive, low-power hardware. It opens the door for smarter, more responsive devices that can adapt to their environment and user needs, all while preserving privacy and reducing the need for constant internet connectivity.

Technical Explanation

The paper introduces a method for training fully quantized deep neural networks (DNNs) directly on Cortex-M microcontrollers with as little as 256KB of memory.

The key innovations include:

• Quantization: The researchers fully quantize the weights, activations, and gradients of the DNN to 8 bits or even 4 bits, so that the models fit within the tight memory constraints of Cortex-M chips.
• On-Device Training: The quantized models are fine-tuned directly on the target hardware rather than on a more powerful system. The paper's abstract describes this as fully quantized training (FQT) combined with dynamic partial gradient updates, so the model is updated in place on the device itself.
• Hardware Acceleration: The researchers use the DSP and SIMD instructions available on Cortex-M cores such as the Cortex-M4 and Cortex-M7 to accelerate the quantized DNN computations, achieving high inference throughput despite the limited resources.
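To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is a generic textbook scheme, not the scheme from the paper, which this summary does not describe in detail. The function names (`quantize`, `dequantize`, `quantized_dot`) and the example values are hypothetical and chosen only for illustration.

```python
def quantize(values, num_bits=8):
    """Symmetric per-tensor quantization (illustrative, not the paper's scheme).

    Maps floats to signed integer codes in [-qmax, qmax] and returns
    the codes together with the scale needed to map them back.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def quantized_dot(a_codes, a_scale, b_codes, b_scale):
    """Dot product accumulated in integers, rescaled to float once at the end."""
    acc = sum(a * b for a, b in zip(a_codes, b_codes))  # integer accumulation
    return acc * a_scale * b_scale


# Hypothetical example values
weights = [0.5, -1.0, 0.25, 0.0]
activations = [1.0, 2.0, -0.5, 0.75]

w_codes, w_scale = quantize(weights)
a_codes, a_scale = quantize(activations)
approx = quantized_dot(w_codes, w_scale, a_codes, a_scale)
exact = sum(w * a for w, a in zip(weights, activations))
print("quantized:", approx, "float:", exact)
```

An 8-bit code occupies one byte, compared with four bytes for a 32-bit float, which is where the memory saving comes from. The integer accumulation in `quantized_dot` is also the kind of operation that SIMD instructions on a microcontroller can process several elements at a time.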

The paper evaluates the approach on multiple vision and time-series datasets and reports the tradeoff between training accuracy, memory overhead, energy, and latency on real hardware, demonstrating that training fully quantized DNNs in place on Cortex-M MCUs is feasible.
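The abstract of the paper also names dynamic partial gradient updates as part of the method. The sketch below illustrates the general idea of updating only a subset of a layer's parameters in each step. The selection rule (largest L1 gradient norm per row), the function names, and the `fraction` parameter are assumptions made for illustration; the criterion actually used in the paper is not described in this summary.

```python
def select_rows_to_update(grad_rows, fraction):
    """Return the indices of the rows with the largest L1 gradient norm.

    Assumption: magnitude-based selection; the paper's rule may differ.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(grad_rows) * fraction))
    norms = [sum(abs(g) for g in row) for row in grad_rows]
    ranked = sorted(range(len(grad_rows)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:k])


def partial_sgd_step(weight_rows, grad_rows, lr, fraction):
    """Apply a plain SGD step to the selected rows only; other rows stay as-is."""
    selected = select_rows_to_update(grad_rows, fraction)
    for i in selected:
        weight_rows[i] = [w - lr * g for w, g in zip(weight_rows[i], grad_rows[i])]
    return selected


# Hypothetical example: a layer with four rows, half of which are updated
weights = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
grads = [[0.1, 0.1], [2.0, -2.0], [0.0, 0.0], [0.5, 0.5]]
updated = partial_sgd_step(weights, grads, lr=0.5, fraction=0.5)
print("updated rows:", updated)
```

Note that this sketch computes all gradients first and only skips part of the weight update. A memory- and compute-saving implementation would avoid computing and storing the gradients of the skipped parts in the first place.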

Critical Analysis

The paper provides a comprehensive and technically sound solution for deploying advanced AI models on resource-constrained embedded devices. The authors have clearly addressed some of the key challenges in the field of TinyML, such as memory constraints, model size, and the need for on-device adaptation.

However, the paper does not delve into the potential security and privacy implications of running AI inference and training on end-user devices. There are ongoing concerns about the vulnerability of quantized neural networks to various attacks, and the authors do not discuss any mitigation strategies or the overall robustness of their approach.

Additionally, the paper focuses primarily on the technical aspects and does not explore the broader societal implications of enabling widespread deployment of adaptive, on-device AI systems. Further research may be needed to understand the ethical considerations and potential unintended consequences of this technology.

Conclusion

This paper presents a significant advancement in the field of embedded machine learning, enabling the deployment of powerful AI models on resource-constrained microcontrollers. The researchers' techniques for quantization and on-device training open up new possibilities for smart, adaptive devices that can learn and update themselves directly on the edge, without the need for constant connectivity to the cloud.

While the technical achievements are impressive, future work should also address the security, privacy, and ethical implications of this technology. As AI becomes more ubiquitous in our everyday devices, it will be crucial to ensure these systems are reliable, robust, and aligned with societal values.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

On-Device Training of Fully Quantized Deep Neural Networks on Cortex-M Microcontrollers

Mark Deutel, Frank Hannig, Christopher Mutschler, Jurgen Teich

On-device training of DNNs allows models to adapt and fine-tune to newly collected data or changing domains while deployed on microcontroller units (MCUs). However, DNN training is a resource-intensive task, making the implementation and execution of DNN training algorithms on MCUs challenging due to low processor speeds, constrained throughput, limited floating-point support, and memory constraints. In this work, we explore on-device training of DNNs for Cortex-M MCUs. We present a method that enables efficient training of DNNs completely in place on the MCU using fully quantized training (FQT) and dynamic partial gradient updates. We demonstrate the feasibility of our approach on multiple vision and time-series datasets and provide insights into the tradeoff between training accuracy, memory overhead, energy, and latency on real hardware.

8/29/2024

On-Device Training Under 256KB Memory

Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, Song Han

On-device training enables the model to adapt to new data collected from the sensors by fine-tuning a pre-trained model. Users can benefit from customized AI models without having to transfer the data to the cloud, protecting the privacy. However, the training memory consumption is prohibitive for IoT devices that have tiny memory resources. We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. On-device training faces two unique challenges: (1) the quantized graphs of neural networks are hard to optimize due to low bit-precision and the lack of normalization; (2) the limited hardware resource does not allow full back-propagation. To cope with the optimization difficulty, we propose Quantization-Aware Scaling to calibrate the gradient scales and stabilize 8-bit quantized training. To reduce the memory footprint, we propose Sparse Update to skip the gradient computation of less important layers and sub-tensors. The algorithm innovation is implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offload the runtime auto-differentiation to compile time. Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash without auxiliary memory, using less than 1/1000 of the memory of PyTorch and TensorFlow while matching the accuracy on tinyML application VWW. Our study enables IoT devices not only to perform inference but also to continuously adapt to new data for on-device lifelong learning. A video demo can be found here: https://youtu.be/0pUFZYdoMY8.

4/4/2024

vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

Size Zheng, Renze Chen, Meng Li, Zihao Ye, Luis Ceze, Yun Liang

IoT devices based on microcontroller units (MCU) provide ultra-low power consumption and ubiquitous computation for near-sensor deep learning models (DNN). However, the memory of MCU is usually 2-3 orders of magnitude smaller than mobile devices, which makes it challenging to map DNNs onto MCUs. Previous work separates memory management and kernel implementation for MCU and relies on coarse-grained memory management techniques such as inplace update to reduce memory consumption. In this paper, we propose to coordinate memory management and kernel optimization for DNN inference on MCUs to enable fine-grained memory management. The key idea is to virtualize the limited memory of MCU as a large memory pool. Each kernel divides the memory pool into kernel-specific segments and handles segment load and store while computing DNN layers. Memory consumption can be reduced because using the fine-grained segment-level memory control, we can overlap the memory footprint of different tensors without the need to materialize them at the same time. Following this idea, we implement vMCU for DNN inference on MCU. Evaluation for single layers on ARM Cortex-M4 and Cortex-M7 processors shows that vMCU can reduce from 12.0% to 49.5% RAM usage and from 20.6% to 53.0% energy consumption compared to state-of-the-art work. For full DNN evaluation, vMCU can reduce the memory bottleneck by 61.5%, enabling more models to be deployed on low-end MCUs.

6/12/2024

TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge

Young D. Kwon, Rui Li, Stylianos I. Venieris, Jagmohan Chauhan, Nicholas D. Lane, Cecilia Mascolo

On-device training is essential for user personalisation and privacy. With the pervasiveness of IoT devices and microcontroller units (MCUs), this task becomes more challenging due to the constrained memory and compute resources, and the limited availability of labelled user data. Nonetheless, prior works neglect the data scarcity issue, require excessively long training time (e.g. a few hours), or induce substantial accuracy loss (>10%). In this paper, we propose TinyTrain, an on-device training approach that drastically reduces training time by selectively updating parts of the model and explicitly coping with data scarcity. TinyTrain introduces a task-adaptive sparse-update method that dynamically selects the layer/channel to update based on a multi-objective criterion that jointly captures user data, the memory, and the compute capabilities of the target device, leading to high accuracy on unseen tasks with reduced computation and memory footprint. TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6-5.0% in accuracy, while reducing the backward-pass memory and computation cost by up to 1,098x and 7.68x, respectively. Targeting broadly used real-world edge devices, TinyTrain achieves 9.5x faster and 3.5x more energy-efficient training over status-quo approaches, and 2.23x smaller memory footprint than SOTA methods, while remaining within the 1 MB memory envelope of MCU-grade platforms.

6/12/2024