TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge

2307.09988

Published 6/12/2024 by Young D. Kwon, Rui Li, Stylianos I. Venieris, Jagmohan Chauhan, Nicholas D. Lane, Cecilia Mascolo

cs.LG cs.CV

Abstract

On-device training is essential for user personalisation and privacy. With the pervasiveness of IoT devices and microcontroller units (MCUs), this task becomes more challenging due to the constrained memory and compute resources, and the limited availability of labelled user data. Nonetheless, prior works neglect the data scarcity issue, require excessively long training time (e.g. a few hours), or induce substantial accuracy loss (>10%). In this paper, we propose TinyTrain, an on-device training approach that drastically reduces training time by selectively updating parts of the model and explicitly coping with data scarcity. TinyTrain introduces a task-adaptive sparse-update method that dynamically selects the layer/channel to update based on a multi-objective criterion that jointly captures user data, the memory, and the compute capabilities of the target device, leading to high accuracy on unseen tasks with reduced computation and memory footprint. TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6-5.0% in accuracy, while reducing the backward-pass memory and computation cost by up to 1,098x and 7.68x, respectively. Targeting broadly used real-world edge devices, TinyTrain achieves 9.5x faster and 3.5x more energy-efficient training over status-quo approaches, and 2.23x smaller memory footprint than SOTA methods, while remaining within the 1 MB memory envelope of MCU-grade platforms.

Overview

On-device training is crucial for user personalization and privacy in IoT devices and microcontroller units (MCUs)
Prior methods neglect data scarcity, require long training times (e.g. a few hours), or incur substantial accuracy loss (>10%)
This paper introduces TinyTrain, an on-device training approach that addresses these challenges

Plain English Explanation

TinyTrain is a new way to train AI models directly on tiny, resource-constrained devices like smart home gadgets and sensor-equipped microcontrollers. This is important because it allows these devices to personalize their behavior for each user, without sending sensitive data to the cloud.

Traditional training methods often require a lot of time and computing power, or they result in significant accuracy loss when used on these limited devices. TinyTrain solves this by selectively updating only the most important parts of the AI model, based on the available data, memory, and processing power of the device. This makes the training much faster and more efficient, while still maintaining high accuracy.

Compared to vanilla fine-tuning of the entire network, TinyTrain improves accuracy by 3.6-5.0% while reducing backward-pass memory by up to 1,098x and backward-pass computation by up to 7.68x. On widely used real-world edge devices, it also achieves 9.5x faster and 3.5x more energy-efficient training than status-quo approaches, and a 2.23x smaller memory footprint than state-of-the-art methods, all while fitting within the 1 MB memory envelope of MCU-grade platforms.

Technical Explanation

TinyTrain introduces a novel "task-adaptive sparse-update" method that dynamically selects which parts of the AI model to update based on the available user data, memory, and compute capabilities of the target device. This is a key innovation compared to prior work that either ignored data scarcity, required excessive training time, or suffered significant accuracy degradation.
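To make the idea of a sparse update concrete, here is a minimal sketch in plain Python. It is not TinyTrain's implementation: the function name `sparse_sgd_step`, the nested-list weight layout, and the `(layer, channel)` selection set are all assumptions made for illustration. It shows only the core mechanic, namely that a gradient step is applied to the selected channels while everything else stays frozen.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: layer name -> list of channels -> list of weights.
Weights = Dict[str, List[List[float]]]


def sparse_sgd_step(
    weights: Weights,
    grads: Weights,
    selected: Set[Tuple[str, int]],
    lr: float = 0.1,
) -> Weights:
    """Return new weights where only the selected (layer, channel) pairs are updated."""
    updated: Weights = {}
    for layer, channels in weights.items():
        new_channels = []
        for idx, channel in enumerate(channels):
            if (layer, idx) in selected:
                # Selected channel: apply a plain SGD step.
                grad = grads[layer][idx]
                new_channels.append([w - lr * g for w, g in zip(channel, grad)])
            else:
                # Frozen channel: copy unchanged. A real system would also skip
                # computing and storing this gradient, which is where the
                # backward-pass savings would come from.
                new_channels.append(list(channel))
        updated[layer] = new_channels
    return updated


weights = {"conv1": [[1.0, 2.0], [3.0, 4.0]], "conv2": [[5.0, 6.0]]}
grads = {"conv1": [[1.0, 1.0], [1.0, 1.0]], "conv2": [[2.0, 2.0]]}
selected = {("conv1", 1), ("conv2", 0)}

new_weights = sparse_sgd_step(weights, grads, selected, lr=0.5)
print(new_weights)
```

In this toy run, channel 0 of `conv1` is left untouched while channel 1 of `conv1` and channel 0 of `conv2` each move by `lr` times their gradient.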

The sparse-update approach in TinyTrain leverages a multi-objective criterion to determine which layers or channels of the model to fine-tune. This criterion jointly considers the user data, memory constraints, and compute resources, leading to high accuracy on new tasks while dramatically reducing the memory and computation required for the backpropagation step.
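One way to picture such a multi-objective selection is a greedy, budget-constrained ranking. The sketch below is an illustrative assumption, not the criterion defined in the paper: the `Candidate` fields, the scoring rule (importance divided by normalised memory and compute cost), and the budget values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    importance: float  # hypothetical data-driven usefulness score (higher is better)
    memory_kb: float   # assumed extra backward-pass memory if this layer is updated
    macs_m: float      # assumed extra backward-pass compute, in millions of MACs


def select_layers(
    cands: List[Candidate], memory_budget_kb: float, compute_budget_m: float
) -> List[str]:
    """Greedily pick layers with the best importance-per-cost that fit both budgets."""
    if not cands:
        return []
    # Normalise costs so memory and compute are on comparable scales.
    # Assumes every candidate has strictly positive costs.
    max_mem = max(c.memory_kb for c in cands)
    max_macs = max(c.macs_m for c in cands)

    def score(c: Candidate) -> float:
        return c.importance / ((c.memory_kb / max_mem) * (c.macs_m / max_macs))

    chosen: List[str] = []
    mem_used = 0.0
    macs_used = 0.0
    for c in sorted(cands, key=score, reverse=True):
        fits_memory = mem_used + c.memory_kb <= memory_budget_kb
        fits_compute = macs_used + c.macs_m <= compute_budget_m
        if fits_memory and fits_compute:
            chosen.append(c.name)
            mem_used += c.memory_kb
            macs_used += c.macs_m
    return chosen


candidates = [
    Candidate("block1", importance=0.2, memory_kb=400, macs_m=8),
    Candidate("block2", importance=0.5, memory_kb=200, macs_m=4),
    Candidate("block3", importance=0.9, memory_kb=100, macs_m=2),
    Candidate("head", importance=0.6, memory_kb=50, macs_m=1),
]

picked = select_layers(candidates, memory_budget_kb=300, compute_budget_m=6)
print(picked)
```

With these made-up numbers, the cheap and useful `head` and `block3` are selected, while `block2` and `block1` are skipped because adding either would exceed the 300 KB memory budget. A tighter or looser budget changes the selection without any change to the model itself, which is the sense in which the choice adapts to the device.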

Evaluated on real-world edge devices, TinyTrain achieves up to 9.5x faster and 3.5x more energy-efficient training compared to existing methods, while maintaining a memory footprint that is 2.23x smaller than state-of-the-art approaches. This allows TinyTrain to operate within the tight 1 MB memory envelope typical of microcontroller-based platforms, enabling on-device personalization and privacy-preserving AI for a wide range of IoT applications.

Critical Analysis

The TinyTrain paper makes a compelling case for its approach, but there are a few potential limitations worth considering:

The evaluation is limited to a relatively small set of benchmark tasks and datasets. It would be valuable to see how TinyTrain performs on a wider range of real-world applications, especially those with more complex data and model architectures.
The paper does not provide much insight into how the multi-objective criterion for sparse-update selection was developed and tuned. More details on this process would help readers understand the tradeoffs involved and how to apply it to new domains.
While TinyTrain achieves impressive efficiency gains, the overall training time is still on the order of minutes. For some IoT applications, even faster adaptation may be required, so further reductions in training time could be beneficial.
The paper does not discuss the potential impact of TinyTrain on model security and robustness. On-device training could introduce new vulnerabilities that need to be addressed.

Overall, TinyTrain represents a promising step forward in enabling efficient, privacy-preserving on-device learning for tiny machine learning applications. Further research to address the limitations noted above could help unlock even more powerful and versatile on-device training capabilities.

Conclusion

TinyTrain introduces a novel on-device training approach that addresses the key challenges of data scarcity, long training times, and accuracy loss on resource-constrained IoT devices and microcontrollers. By selectively updating the most relevant parts of the AI model, TinyTrain can achieve significant efficiency gains while maintaining high accuracy, enabling personalized and privacy-preserving edge AI applications. As the demand for on-device intelligence continues to grow, innovations like TinyTrain will play a crucial role in unlocking the full potential of tiny machine learning systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

On-Device Training Under 256KB Memory

Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, Song Han

On-device training enables the model to adapt to new data collected from the sensors by fine-tuning a pre-trained model. Users can benefit from customized AI models without having to transfer the data to the cloud, protecting the privacy. However, the training memory consumption is prohibitive for IoT devices that have tiny memory resources. We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. On-device training faces two unique challenges: (1) the quantized graphs of neural networks are hard to optimize due to low bit-precision and the lack of normalization; (2) the limited hardware resource does not allow full back-propagation. To cope with the optimization difficulty, we propose Quantization-Aware Scaling to calibrate the gradient scales and stabilize 8-bit quantized training. To reduce the memory footprint, we propose Sparse Update to skip the gradient computation of less important layers and sub-tensors. The algorithm innovation is implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offload the runtime auto-differentiation to compile time. Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash without auxiliary memory, using less than 1/1000 of the memory of PyTorch and TensorFlow while matching the accuracy on tinyML application VWW. Our study enables IoT devices not only to perform inference but also to continuously adapt to new data for on-device lifelong learning. A video demo can be found here: https://youtu.be/0pUFZYdoMY8.

4/4/2024

cs.CV

Lightweight Deep Learning for Resource-Constrained Environments: A Survey

Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Hui Li, Wen-Huang Cheng

Over the past decade, the dominance of deep learning has prevailed across various domains of artificial intelligence, including natural language processing, computer vision, and biomedical signal processing. While there have been remarkable improvements in model accuracy, deploying these models on lightweight devices, such as mobile phones and microcontrollers, is constrained by limited resources. In this survey, we provide comprehensive design guidance tailored for these devices, detailing the meticulous design of lightweight models, compression methods, and hardware acceleration strategies. The principal goal of this work is to explore methods and concepts for getting around hardware constraints without compromising the model's accuracy. Additionally, we explore two notable paths for lightweight deep learning in the future: deployment techniques for TinyML and Large Language Models. Although these paths undoubtedly have potential, they also present significant challenges, encouraging research into unexplored areas.

4/15/2024

cs.CV cs.LG

Optimizing the Deployment of Tiny Transformers on Low-Power MCUs

Victor J. B. Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini

Transformer networks are rapidly becoming SotA in many fields, such as NLP and CV. Similarly to CNN, there is a strong push for deploying Transformer models at the extreme edge, ultimately fitting the tiny power budget and memory footprint of MCUs. However, the early approaches in this direction are mostly ad-hoc, platform, and model-specific. This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs. We propose a complete framework to perform end-to-end deployment of Transformer models onto single and multi-core MCUs. Our framework provides an optimized library of kernels to maximize data reuse and avoid unnecessary data marshaling operations into the crucial attention block. A novel MHSA inference schedule, named Fused-Weight Self-Attention, is introduced, fusing the linear projection weights offline to further reduce the number of operations and parameters. Furthermore, to mitigate the memory peak reached by the computation of the attention map, we present a Depth-First Tiling scheme for MHSA. We evaluate our framework on three different MCU classes exploiting ARM and RISC-V ISA, namely the STM32H7, the STM32L4, and GAP9 (RV32IMC-XpulpV2). We reach an average of 4.79x and 2.0x lower latency compared to SotA libraries CMSIS-NN (ARM) and PULP-NN (RISC-V), respectively. Moreover, we show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while the fused-weight attention can reduce the runtime by 1.53x, and number of parameters by 25%. We report significant improvements across several Tiny Transformers: for instance, when executing a transformer block for the task of radar-based hand-gesture recognition on GAP9, we achieve a latency of 0.14ms and energy consumption of 4.92 micro-joules, 2.32x lower than the SotA PULP-NN library on the same platform.

4/5/2024

cs.LG cs.AI cs.DC cs.PF

On-device Online Learning and Semantic Management of TinyML Systems

Haoyu Ren, Xue Li, Darko Anicic, Thomas A. Runkler

Recent advances in Tiny Machine Learning (TinyML) empower low-footprint embedded devices for real-time on-device Machine Learning. While many acknowledge the potential benefits of TinyML, its practical implementation presents unique challenges. This study aims to bridge the gap between prototyping single TinyML models and developing reliable TinyML systems in production: (1) Embedded devices operate in dynamically changing conditions. Existing TinyML solutions primarily focus on inference, with models trained offline on powerful machines and deployed as static objects. However, static models may underperform in the real world due to evolving input data distributions. We propose online learning to enable training on constrained devices, adapting local models towards the latest field conditions. (2) Nevertheless, current on-device learning methods struggle with heterogeneous deployment conditions and the scarcity of labeled data when applied across numerous devices. We introduce federated meta-learning incorporating online learning to enhance model generalization, facilitating rapid learning. This approach ensures optimal performance among distributed devices by knowledge sharing. (3) Moreover, TinyML's pivotal advantage is widespread adoption. Embedded devices and TinyML models prioritize extreme efficiency, leading to diverse characteristics ranging from memory and sensors to model architectures. Given their diversity and non-standardized representations, managing these resources becomes challenging as TinyML systems scale up. We present semantic management for the joint management of models and devices at scale. We demonstrate our methods through a basic regression example and then assess them in three real-world TinyML applications: handwritten character image classification, keyword audio classification, and smart building presence detection, confirming our approaches' effectiveness.

5/17/2024

cs.LG cs.AI cs.DB cs.DC