Characterizing Soft-Error Resiliency in Arm's Ethos-U55 Embedded Machine Learning Accelerator

Read original: arXiv:2404.09317 - Published 4/16/2024 by Abhishek Tyagi, Reiley Jeyapaul, Chuteng Zhu, Paul Whatmough, Yuhao Zhu

Overview

Examines the soft-error resilience of Arm's Ethos-U55 embedded machine learning accelerator
Evaluates the impact of single-event upsets (SEUs) on the accelerator's performance and accuracy
Provides insights into the design choices and mitigation strategies that contribute to the accelerator's resilience

Plain English Explanation

The paper investigates how well Arm's Ethos-U55 embedded machine learning accelerator can withstand errors caused by random events, such as cosmic radiation or electrical interference. These types of errors, known as single-event upsets (SEUs), can corrupt the internal state of the accelerator and lead to inaccurate results or even system failures.

The researchers tested the Ethos-U55 by introducing controlled SEUs and measuring how they affected the accelerator's performance on various machine learning tasks. This allowed them to understand the accelerator's strengths and weaknesses in terms of resilience. They also explored design choices and mitigation strategies that contribute to the Ethos-U55's ability to maintain accurate results even when errors occur.

Technical Explanation

The paper evaluates the soft-error resilience of Arm's Ethos-U55 embedded machine learning accelerator. According to the abstract, the researchers performed large-scale RTL-level fault injections to induce single-event upsets (SEUs) in the accelerator's internal state, then characterized the impact across a variety of neural networks against the Automotive Safety Integrity Level D (ASIL-D) resiliency standard.

The experimental setup involved injecting faults into different components of the Ethos-U55, such as the registers, memory, and control logic, to simulate the effects of SEUs. The researchers then analyzed the resulting changes in the accelerator's outputs, including classification accuracy, inference time, and power consumption. This analysis provided insights into the Ethos-U55's resilience to soft errors and the effectiveness of the built-in mitigation strategies, such as parity error detection and error-correcting codes.
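The idea of a fault injection campaign can be illustrated with a small software sketch. This is not the paper's RTL-level framework: it assumes a toy single-layer int8 classifier, flips one randomly chosen weight bit per trial, and counts how often the predicted class changes. The names `flip_bit`, `classify`, and `run_campaign`, along with the weights and input, are hypothetical and chosen only for illustration.

```python
import random


def flip_bit(value: int, bit: int, width: int = 8) -> int:
    """Flip one bit of a two's-complement integer of the given width."""
    raw = value & ((1 << width) - 1)   # reinterpret as unsigned
    raw ^= 1 << bit                    # the simulated single-event upset
    # Convert back to a signed value.
    return raw - (1 << width) if raw & (1 << (width - 1)) else raw


def classify(weights, x):
    """Toy 'network': one integer linear layer followed by argmax."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def run_campaign(weights, x, trials, seed=0):
    """Return the fraction of single-bit flips that change the prediction."""
    rng = random.Random(seed)
    golden = classify(weights, x)      # fault-free reference output
    mismatches = 0
    for _ in range(trials):
        r = rng.randrange(len(weights))
        c = rng.randrange(len(weights[0]))
        b = rng.randrange(8)
        faulty = [row[:] for row in weights]   # copy so faults do not accumulate
        faulty[r][c] = flip_bit(faulty[r][c], b)
        if classify(faulty, x) != golden:
            mismatches += 1
    return mismatches / trials


# Hypothetical int8 weights and input, for illustration only.
demo_weights = [[10, -3, 4], [2, 8, -5], [-7, 1, 9]]
demo_input = [3, 1, 2]
print("fraction of flips that changed the prediction:",
      run_campaign(demo_weights, demo_input, trials=1000))
```

Flips that leave the prediction unchanged are masked by the computation. A real campaign would also have to cover flip-flops and control state, which a weight-only model like this one cannot represent.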

Critical Analysis

The paper provides a rigorous evaluation of the Ethos-U55's soft-error resilience, but it acknowledges several limitations and areas for further research. For example, the fault injection experiments were limited to single-bit errors, while real-world SEUs may involve multiple-bit errors or more complex fault patterns. Additionally, the paper only considers the impact of SEUs on the accelerator itself and does not explore the potential effects on the overall system in which it is deployed.

Future research could expand the fault injection experiments to more diverse error scenarios, investigate the interactions between the Ethos-U55 and other system components, and explore resource-efficient techniques for improving the resilience of neural networks in embedded systems. Additionally, the paper could have compared the Ethos-U55's resilience with other embedded machine learning accelerators to provide a more comprehensive understanding of the state of the art in this field.

Conclusion

The paper presents a detailed characterization of the soft-error resilience of Arm's Ethos-U55 embedded machine learning accelerator. The researchers' fault injection experiments and analysis provide valuable insights into the design choices and mitigation strategies that contribute to the accelerator's ability to maintain accurate results in the face of random errors. These findings have implications for the development of reliable and robust embedded systems for applications in safety-critical domains, such as autonomous vehicles and medical devices.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Characterizing Soft-Error Resiliency in Arm's Ethos-U55 Embedded Machine Learning Accelerator

Abhishek Tyagi, Reiley Jeyapaul, Chuteng Zhu, Paul Whatmough, Yuhao Zhu

As Neural Processing Units (NPUs) or accelerators are increasingly deployed in a variety of applications, including safety-critical applications such as autonomous vehicles and medical imaging, it is critical to understand the fault-tolerance nature of the NPUs. We present a reliability study of Arm's Ethos-U55, an important industrial-scale NPU being utilised in embedded and IoT applications. We perform large-scale RTL-level fault injections to characterize Ethos-U55 against the Automotive Safety Integrity Level D (ASIL-D) resiliency standard commonly used for safety-critical applications such as autonomous vehicles. We show that, under soft errors, all four configurations of the NPU fall short of the required level of resiliency for a variety of neural networks running on the NPU. We show that it is possible to meet the ASIL-D level resiliency without resorting to conventional strategies like Dual Core Lock Step (DCLS), which has an area overhead of 100%. We achieve this through selective protection, where hardware structures are selectively protected (e.g., duplicated, hardened) based on their sensitivity to soft errors and their silicon areas. To identify the optimal configuration that minimizes the area overhead while meeting the ASIL-D standard, the main challenge is the large search space associated with the time-consuming RTL simulation. To address this challenge, we present a statistical analysis tool that is validated against Arm silicon and that allows us to quickly navigate hundreds of billions of fault sites without exhaustive RTL fault injections. We show that carefully duplicating a small fraction of the functional blocks and hardening the flops in other blocks meets the ASIL-D safety standard while introducing an area overhead of only 38%.

4/16/2024
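The selective protection idea in the abstract above can be sketched as a small optimization: choose a protection option per block so that the residual failure rate meets a target at minimum area overhead. The sketch below is not the authors' statistical analysis tool. It swaps in a brute-force search over a handful of blocks, and every block name, area fraction, failure rate, option cost, and the target are hypothetical values made up for illustration.

```python
from itertools import product

# Hypothetical protection options:
# (area cost as a multiple of the block's area, fraction of failures remaining)
OPTIONS = {
    "none":      (0.0, 1.0),
    "harden":    (0.25, 0.1),   # assumed cost and benefit of hardened flops
    "duplicate": (1.0, 0.0),    # assumed: duplication catches every failure
}


def best_plan(blocks, target):
    """Exhaustively search the per-block options.

    blocks: {name: (area_fraction_of_design, unprotected_failure_rate)}
    Returns (area_overhead, {name: option}) for the cheapest plan whose
    residual failure rate is <= target, or None if no plan qualifies.
    """
    names = list(blocks)
    best = None
    for choice in product(OPTIONS, repeat=len(names)):
        overhead = 0.0
        residual = 0.0
        for name, opt in zip(names, choice):
            area, rate = blocks[name]
            cost, remaining = OPTIONS[opt]
            overhead += area * cost
            residual += rate * remaining
        if residual <= target and (best is None or overhead < best[0]):
            best = (overhead, dict(zip(names, choice)))
    return best


# Hypothetical design: a small but sensitive control block, larger datapaths.
demo_blocks = {
    "control":   (0.1, 70.0),
    "mac_array": (0.5, 20.0),
    "dma":       (0.4, 10.0),
}
demo_plan = best_plan(demo_blocks, target=5.0)
print(demo_plan)
```

With these made-up numbers the search duplicates only the small, sensitive block and hardens the rest, which mirrors the shape of the result the abstract reports. Brute force scales as 3^n in the number of blocks, so it is only workable for a toy example.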

Hardware-Assisted Virtualization of Neural Processing Units for Cloud Platforms

Yuqi Xue, Yiqi Liu, Lifeng Nai, Jian Huang

Cloud platforms today have been deploying hardware accelerators like neural processing units (NPUs) for powering machine learning (ML) inference services. To maximize the resource utilization while ensuring reasonable quality of service, a natural approach is to virtualize NPUs for efficient resource sharing for multi-tenant ML services. However, virtualizing NPUs for modern cloud platforms is not easy. This is not only due to the lack of system abstraction support for NPU hardware, but also due to the lack of architectural and ISA support for enabling fine-grained dynamic operator scheduling for virtualized NPUs. We present Neu10, a holistic NPU virtualization framework. We investigate virtualization techniques for NPUs across the entire software and hardware stack. Neu10 consists of (1) a flexible NPU abstraction called vNPU, which enables fine-grained virtualization of the heterogeneous compute units in a physical NPU (pNPU); (2) a vNPU resource allocator that enables pay-as-you-go computing model and flexible vNPU-to-pNPU mappings for improved resource utilization and cost-effectiveness; (3) an ISA extension of modern NPU architecture for facilitating fine-grained tensor operator scheduling for multiple vNPUs. We implement Neu10 based on a production-level NPU simulator. Our experiments show that Neu10 improves the throughput of ML inference services by up to 1.4× and reduces the tail latency by up to 4.6×, while improving the NPU utilization by 1.2× on average, compared to state-of-the-art NPU sharing approaches.

9/16/2024

Adaptive Soft Error Protection for Deep Learning

Xinghua Xue, Cheng Liu

The rising incidence of soft errors in hardware systems represents a considerable risk to the reliability of deep learning systems and can precipitate severe malfunctions. Although essential, soft error mitigation can impose substantial costs on deep learning systems that are inherently demanding in terms of computation and memory. Previous research has primarily explored variations in vulnerability among different components of computing engines or neural networks, aiming for selective protection to minimize protection overhead. Our approach diverges from these studies by recognizing that the susceptibility of deep learning tasks to soft errors is heavily input-dependent. Notably, some inputs are simpler for deep learning models and inherently exhibit greater tolerance to soft errors. Conversely, more complex inputs are prone to soft error impact. Based on these insights, we introduce an adaptive soft error protection strategy that tailors protection to the computational demands of individual inputs. To implement this strategy, we develop a metric for assessing the complexity of inputs and deploy a lightweight machine learning algorithm to gauge input difficulty. Subsequently, we employ robust protection for challenging inputs and minimal protection for simpler ones. Our experimental evaluation across diverse datasets and deep learning tasks reveals that our adaptive strategy reduces the soft error protection overhead by an average of 46.9%, without compromising system reliability.

7/30/2024

A Micro Architectural Events Aware Real-Time Embedded System Fault Injector

Enrico Magliano, Alessio Carpegna, Alessandro Savino, Stefano Di Carlo

In contemporary times, the increasing complexity of the system poses significant challenges to the reliability, trustworthiness, and security of the SACRES. Key issues include the susceptibility to phenomena such as instantaneous voltage spikes, electromagnetic interference, neutron strikes, and out-of-range temperatures. These factors can induce switch state changes in transistors, resulting in bit-flipping, soft errors, and transient corruption of stored data in memory. The occurrence of soft errors, in turn, may lead to system faults that can propel the system into a hazardous state. Particularly in critical sectors like automotive, avionics, or aerospace, such malfunctions can have real-world implications, potentially causing harm to individuals. This paper introduces a novel fault injector designed to facilitate the monitoring, aggregation, and examination of micro-architectural events. This is achieved by harnessing the microprocessor's PMU and the debugging interface, specifically focusing on ensuring the repeatability of fault injections. The fault injection methodology targets bit-flipping within the memory system, affecting CPU registers and RAM. The outcomes of these fault injections enable a thorough analysis of the impact of soft errors and establish a robust correlation between the identified faults and the essential timing predictability demanded by SACRES.

6/12/2024