A Micro Architectural Events Aware Real-Time Embedded System Fault Injector

Read original: arXiv:2401.08397 - Published 6/12/2024 by Enrico Magliano, Alessio Carpegna, Alessandro Savino, Stefano Di Carlo

Overview

The paper discusses the challenges posed by the increasing complexity of safety-critical real-time embedded systems (SACRES) in terms of reliability, trustworthiness, and security.
Key issues include susceptibility to phenomena like voltage spikes, electromagnetic interference, neutron strikes, and out-of-range temperatures, which can induce bit-flipping, soft errors, and transient data corruption.
Such malfunctions can have real-world implications, particularly in critical sectors like automotive, avionics, or aerospace, potentially causing harm to individuals.
The paper introduces a novel fault injector designed to facilitate the monitoring, aggregation, and examination of micro-architectural events, focusing on ensuring the repeatability of fault injections.

Plain English Explanation

The paper tackles the growing complexity of safety-critical systems, like those used in cars, planes, and spacecraft. As these systems become more advanced, they face new challenges that can compromise their reliability, trustworthiness, and security.

For example, sudden changes in voltage, electromagnetic interference, and extreme temperatures can cause tiny errors in the systems' electronic components, leading to data corruption and system malfunctions. These issues can be particularly dangerous in critical applications, where system failures could potentially harm people.

To address these concerns, the researchers developed a new tool that can intentionally introduce controlled errors into the system. This allows them to closely study how the system responds to different types of faults and identify ways to make the system more resilient.

By understanding how these safety-critical systems react to various disruptions, the researchers aim to help engineers design more robust and trustworthy systems that can withstand the challenges of the modern world.

Technical Explanation

The paper introduces a novel fault injector that enables the monitoring, aggregation, and examination of micro-architectural events within safety-critical real-time embedded systems (SACRES). This is achieved by leveraging the microprocessor's Performance Monitoring Unit (PMU) and the debugging interface, with a focus on ensuring the repeatability of fault injections.
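One way to picture how an event counter can make an injection point repeatable is the toy model below. It is only an illustrative sketch: `ToyCounter`, `run_with_trigger`, and the workload values are hypothetical stand-ins, not the paper's tooling and not a real PMU or debugger API. The assumption is that the same workload produces the same event sequence, so a trigger tied to a fixed event count fires at the same step on every run.

```python
from dataclasses import dataclass


@dataclass
class ToyCounter:
    """Hypothetical stand-in for a single PMU event counter."""
    count: int = 0

    def tick(self, events: int = 1) -> None:
        self.count += events


def run_with_trigger(workload, trigger_count):
    """Replay `workload` (events observed per step) and return the index of
    the first step at which the counter reaches `trigger_count`.

    Returns None if the threshold is never reached.
    """
    counter = ToyCounter()
    for step, events in enumerate(workload):
        counter.tick(events)
        if counter.count >= trigger_count:
            return step  # this is where a debugger would halt and inject
    return None


# Assumed workload: number of counted events at each execution step.
workload = [1, 3, 2, 1, 4, 2]

first_run = run_with_trigger(workload, trigger_count=7)
second_run = run_with_trigger(workload, trigger_count=7)
print(first_run, second_run)
```

Because the trigger depends on a counted event rather than on wall-clock time, both runs stop at the same step, which is the property a repeatable injection campaign needs.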

The fault injection methodology targets bit-flipping within the memory system, affecting both CPU registers and random access memory (RAM). By analyzing the outcomes of these intentional faults, the researchers can establish a robust correlation between the identified faults and the essential timing predictability required by SACRES.
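The bit-flip fault model described above can be sketched with a single XOR against a one-bit mask. This is a minimal illustration under stated assumptions: the `flip_bit` helper, the register names, and the memory contents are hypothetical, and a dictionary and a `bytearray` stand in for real CPU registers and RAM that the paper's injector reaches through the debugging interface.

```python
def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with the given bit inverted, truncated to `width` bits."""
    if not 0 <= bit < width:
        raise ValueError(f"bit {bit} is outside a {width}-bit word")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


# Hypothetical targets: a register file and a small block of RAM.
registers = {"r0": 0x000000FF, "r1": 0x00000000}
ram = bytearray(b"\x00\x10\x20\x30")

# Inject one fault into a register (bit 0 of r0).
registers["r0"] = flip_bit(registers["r0"], bit=0)

# Inject one fault into RAM (bit 5 of the byte at offset 2).
ram[2] = flip_bit(ram[2], bit=5, width=8)

print(hex(registers["r0"]), ram.hex())
```

Flipping the same bit twice restores the original value, which makes this model convenient for undoing an injection between experiments.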

This approach allows for a thorough analysis of the impact of soft errors, which can lead to system faults and potentially hazardous states, particularly in critical sectors like automotive, avionics, or aerospace.
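To make the link between injected faults and timing predictability concrete, the sketch below compares a faulty run against a fault-free reference run. It is an assumption-laden illustration, not the paper's analysis: the `RunResult` fields, the `classify` function, the outcome labels, and the cycle tolerance are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one run: its output and a cycle count."""
    output: int
    cycles: int
    crashed: bool = False


def classify(golden: RunResult, faulty: RunResult, cycle_tolerance: int = 0) -> str:
    """Label a faulty run by comparing it with the fault-free (golden) run."""
    if faulty.crashed:
        return "crash"
    if faulty.output != golden.output:
        return "silent data corruption"
    if abs(faulty.cycles - golden.cycles) > cycle_tolerance:
        # Output is correct, but execution time moved: relevant for real-time systems.
        return "timing deviation"
    return "masked"


golden = RunResult(output=42, cycles=1000)

print(classify(golden, RunResult(output=42, cycles=1000)))
print(classify(golden, RunResult(output=42, cycles=1012)))
print(classify(golden, RunResult(output=41, cycles=1000)))
```

The "timing deviation" branch is the point of the example: a fault can leave the result correct yet still change how long the task takes, which matters when deadlines are part of the system's correctness.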

Critical Analysis

The paper's emphasis on ensuring the repeatability of fault injections is a key strength, as it allows for more reliable and consistent analysis of the system's behavior under different fault conditions. This is important for developing effective mitigation strategies and validating the resilience of SACRES.

However, the paper does not address the potential limitations of the fault injection methodology, such as the representativeness of the targeted faults or the scalability of the approach to more complex systems. Additionally, the paper does not explore the potential for false positives or the impact of multiple concurrent faults, which could be valuable areas for further research.

It would also be beneficial to see a more comprehensive evaluation of the fault injector's performance and its ability to accurately simulate real-world fault scenarios, as this would strengthen the confidence in the research findings and their applicability to practical SACRES design challenges.

Conclusion

This paper presents a novel fault injector that enables a deeper understanding of the reliability and resilience of safety-critical real-time embedded systems. By systematically injecting bit-flipping faults and analyzing their impact, the researchers can identify vulnerabilities and develop strategies to improve the trustworthiness of these critical systems.

The insights gained from this research can inform the design of more robust and secure SACRES, which is crucial for ensuring the safety and reliability of applications in sectors like automotive, avionics, and aerospace. Further work in this area will be essential as these systems grow more complex.

Related Papers

A Micro Architectural Events Aware Real-Time Embedded System Fault Injector

Enrico Magliano, Alessio Carpegna, Alessandro Savino, Stefano Di Carlo

In contemporary times, the increasing complexity of the system poses significant challenges to the reliability, trustworthiness, and security of the SACRES. Key issues include the susceptibility to phenomena such as instantaneous voltage spikes, electromagnetic interference, neutron strikes, and out-of-range temperatures. These factors can induce switch state changes in transistors, resulting in bit-flipping, soft errors, and transient corruption of stored data in memory. The occurrence of soft errors, in turn, may lead to system faults that can propel the system into a hazardous state. Particularly in critical sectors like automotive, avionics, or aerospace, such malfunctions can have real-world implications, potentially causing harm to individuals. This paper introduces a novel fault injector designed to facilitate the monitoring, aggregation, and examination of micro-architectural events. This is achieved by harnessing the microprocessor's PMU and the debugging interface, specifically focusing on ensuring the repeatability of fault injections. The fault injection methodology targets bit-flipping within the memory system, affecting CPU registers and RAM. The outcomes of these fault injections enable a thorough analysis of the impact of soft errors and establish a robust correlation between the identified faults and the essential timing predictability demanded by SACRES.

6/12/2024

Characterizing Soft-Error Resiliency in Arm's Ethos-U55 Embedded Machine Learning Accelerator

Abhishek Tyagi, Reiley Jeyapaul, Chuteng Zhu, Paul Whatmough, Yuhao Zhu

As Neural Processing Units (NPU) or accelerators are increasingly deployed in a variety of applications including safety critical applications such as autonomous vehicle, and medical imaging, it is critical to understand the fault-tolerance nature of the NPUs. We present a reliability study of Arm's Ethos-U55, an important industrial-scale NPU being utilised in embedded and IoT applications. We perform large scale RTL-level fault injections to characterize Ethos-U55 against the Automotive Safety Integrity Level D (ASIL-D) resiliency standard commonly used for safety-critical applications such as autonomous vehicles. We show that, under soft errors, all four configurations of the NPU fall short of the required level of resiliency for a variety of neural networks running on the NPU. We show that it is possible to meet the ASIL-D level resiliency without resorting to conventional strategies like Dual Core Lock Step (DCLS) that has an area overhead of 100%. We achieve so through selective protection, where hardware structures are selectively protected (e.g., duplicated, hardened) based on their sensitivity to soft errors and their silicon areas. To identify the optimal configuration that minimizes the area overhead while meeting the ASIL-D standard, the main challenge is the large search space associated with the time-consuming RTL simulation. To address this challenge, we present a statistical analysis tool that is validated against Arm silicon and that allows us to quickly navigate hundreds of billions of fault sites without exhaustive RTL fault injections. We show that by carefully duplicating a small fraction of the functional blocks and hardening the Flops in other blocks meets the ASIL-D safety standard while introducing an area overhead of only 38%.

4/16/2024

A Survey on Failure Analysis and Fault Injection in AI Systems

Guangba Yu, Gou Tan, Haojia Huang, Zhenyu Zhang, Pengfei Chen, Roberto Natella, Zibin Zheng

The rapid advancement of Artificial Intelligence (AI) has led to its integration into various areas, especially with Large Language Models (LLMs) significantly enhancing capabilities in Artificial Intelligence Generated Content (AIGC). However, the complexity of AI systems has also exposed their vulnerabilities, necessitating robust methods for failure analysis (FA) and fault injection (FI) to ensure resilience and reliability. Despite the importance of these techniques, there lacks a comprehensive review of FA and FI methodologies in AI systems. This study fills this gap by presenting a detailed survey of existing FA and FI approaches across six layers of AI systems. We systematically analyze 160 papers and repositories to answer three research questions including (1) what are the prevalent failures in AI systems, (2) what types of faults can current FI tools simulate, (3) what gaps exist between the simulated faults and real-world failures. Our findings reveal a taxonomy of AI system failures, assess the capabilities of existing FI tools, and highlight discrepancies between real-world and simulated failures. Moreover, this survey contributes to the field by providing a framework for fault diagnosis, evaluating the state-of-the-art in FI, and identifying areas for improvement in FI techniques to enhance the resilience of AI systems.

7/2/2024

Averting multi-qubit burst errors in surface code magic state factories

Jason D. Chadwick, Christopher Kang, Joshua Viszlai, Sophia Fuhui Lin, Frederic T. Chong

Fault-tolerant quantum computation relies on the assumption of time-invariant, sufficiently low physical error rates. However, current superconducting quantum computers suffer from frequent disruptive noise events, including cosmic ray impacts and shifting two-level system defects. Several methods have been proposed to mitigate these issues in software, but they add large overheads in terms of physical qubit count, as it is difficult to preserve logical information through burst error events. We focus on mitigating multi-qubit burst errors in magic state factories, which are expected to comprise up to 95% of the space cost of future quantum programs. Our key insight is that magic state factories do not need to preserve logical information over time; once we detect an increase in local physical error rates, we can simply turn off parts of the factory that are affected, re-map the factory to the new chip geometry, and continue operating. This is much more efficient than previous more general methods, and is resilient even under many simultaneous impact events. Using precise physical noise models, we show an efficient ray detection method and evaluate our strategy in different noise regimes. Compared to existing baselines, we find reductions in ray-induced overheads by several orders of magnitude, reducing total qubitcycle cost by geomean 6.5x to 13.9x depending on the noise model. This work reduces the burden on hardware by providing low-overhead software mitigation of these errors.

5/2/2024