Towards a Fault-Injection Benchmarking Suite

Read original: arXiv:2403.20319 - Published 4/1/2024 by Tianhao Wang, Robin Thunig, Horst Schirmeier

Introduction

The paper discusses the lack of a dedicated benchmarking suite for fault-tolerance (FT) and fault-injection (FI) research in modern computer systems. Researchers typically use benchmarking suites from other domains, such as TACLeBench for worst-case execution time or MiBench for embedded systems, which leads to three major shortcomings:

Limited comparability across FT/FI papers due to different benchmarking programs and runtime environments.
Overlapping benchmarks, as researchers often select many benchmarks to ensure convincing demonstrations, regardless of potential overlap. This leads to inefficiency in computing resources and time.
Limited configurability, as out-of-the-box benchmarks from other domains are not optimal for FI campaigns and may not meet specific requirements such as minimal memory-access events or maximal execution time.

The authors suggest opening a discussion on the need for a dedicated benchmarking suite for the FT/FI domain and the requirements it should meet.

Preferable Benchmark Properties

The paper discusses desirable properties for a fault tolerance/fault injection (FT/FI) benchmarking suite. Existing benchmark suites like MiBench and TACLeBench focus on metrics less relevant to FT/FI research, such as system performance and worst-case execution time (WCET) analysis.

The authors propose several preferable properties for an FT/FI benchmarking suite:

Different granularities: The suite should include both isolated algorithm implementations for targeted analysis and integrated systems for realistic use cases.
Relevant benchmark selection: Benchmarks should be classified based on program characteristics (e.g., memory usage, runtime) and specific domains to enable representative and minimal experiment setups.
Resource-efficient fault injection: The suite should have a lightweight infrastructure and be configurable to meet execution time and memory usage requirements, reducing the necessary number of injections for large-scale experiments.
Self-contained runtime: The benchmarking suite should be self-contained and portable, with its own runtime, including a bare-bones operating system and standard library, to ensure comparability across different studies.

The paper cites TACLeBench as a good example of portability in this regard.
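To make the link between configurability and campaign cost concrete, here is a minimal sketch. It assumes a uniform single-bit-flip fault model in memory, with one candidate injection per pair of time step and memory bit, and one time step per dynamic instruction. The function name and the numbers are hypothetical and are not taken from the paper.

```python
def fault_space_size(dynamic_instructions: int, memory_bytes: int) -> int:
    """Raw fault-space size for single-bit flips in memory.

    Assumption: one candidate injection per (time step, memory bit) pair,
    with one time step per dynamic instruction.
    """
    if dynamic_instructions < 0 or memory_bytes < 0:
        raise ValueError("inputs must be non-negative")
    return dynamic_instructions * memory_bytes * 8


# Hypothetical benchmark configurations, not measurements from the paper.
small = fault_space_size(dynamic_instructions=10_000, memory_bytes=256)
large = fault_space_size(dynamic_instructions=1_000_000, memory_bytes=4096)

print(small)           # 20480000
print(large // small)  # 1600
```

Under this model, shrinking a benchmark's runtime and memory footprint reduces the raw fault space multiplicatively, which is why a configurable suite could cut the cost of large-scale campaigns.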

Evaluation of Selected Properties

The paper examines two of the proposed properties: classifying benchmarks by their program characteristics, and using a lightweight infrastructure to speed up fault injection (FI). The authors conducted a preliminary study on MiBench and TACLeBench benchmarks, measuring the number of dynamic instructions, the number of unique memory-access locations, and the Silent Data Corruption (SDC) count for each benchmark.

The plot in Figure 1 reveals that the benchmarks fill three of the four quadrants and exhibit clustering patterns. This suggests that:

Benchmarks with very high data throughput (e.g., using SIMD instructions) are missing.
Some benchmarks may overlap in terms of their fault-space characteristics.
Outliers may be interesting targets for further investigation.

The authors also mention other interesting program characteristics, such as stack and heap usage, branching behavior, and memory-access granularity, which could be explored beyond the properties shown in the study.
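The kind of classification described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the median split, the log-scaled distance, the threshold of 0.2, and all benchmark names and numbers are hypothetical.

```python
from math import hypot, log10
from statistics import median

# Hypothetical measurements:
# name -> (dynamic instructions, unique memory-access locations)
benchmarks = {
    "bench_a": (1_000, 50),
    "bench_b": (1_200, 55),
    "bench_c": (500_000, 60),
    "bench_d": (2_000, 9_000),
    "bench_e": (800_000, 12_000),
}


def quadrants(data):
    """Assign each benchmark to a quadrant by splitting both axes at the median."""
    instr_mid = median(v[0] for v in data.values())
    mem_mid = median(v[1] for v in data.values())
    return {
        name: (
            "long" if instr > instr_mid else "short",
            "large" if mem > mem_mid else "small",
        )
        for name, (instr, mem) in data.items()
    }


def overlapping_pairs(data, threshold=0.2):
    """Return pairs whose distance in log10-scaled space is below the threshold."""
    names = sorted(data)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distance = hypot(
                log10(data[a][0]) - log10(data[b][0]),
                log10(data[a][1]) - log10(data[b][1]),
            )
            if distance < threshold:
                pairs.append((a, b))
    return pairs


print(quadrants(benchmarks)["bench_e"])  # ('long', 'large')
print(overlapping_pairs(benchmarks))     # [('bench_a', 'bench_b')]
```

Pairs reported as overlapping are candidates for pruning from an experiment setup, while benchmarks far from every other point correspond to the outliers mentioned above. Further characteristics such as stack usage or branching behavior could be added as extra dimensions.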

Figure 1: Number of dynamic instructions, unique memory-access locations, and SDCs (circle sizes) of MiBench [2] benchmarks compiled with picolibc [5] and TACLeBench [1] as used by Borchert et al. [4], measured with the FAIL* fault-injection framework.

The study also compared the runtime characteristics and fault-injection (FI) results of MiBench benchmarks compiled in two settings: a full-system setting with the eCos operating system and a bare-metal setting using picolibc. The eCos variants executed 37% to 128% more dynamic instructions than the bare-metal variants, with a similar increase in dynamic memory accesses. The FI results also differed significantly between the two settings, with the full-system setting being more susceptible to failure modes such as timeouts. For example, the eCos variant of the crc benchmark experienced 97% more silent data corruptions (SDCs), 801% more timeouts, and 1491% more CPU exceptions than the picolibc variant. These findings highlight the importance of considering program granularity (isolated bare-metal program versus full system) when evaluating the performance and reliability of software systems.
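Because "N% more" figures are easy to misread, the short sketch below converts the reported crc percentages into multiplicative factors relative to the picolibc variant. The helper name is hypothetical; only the three percentages come from the summary above.

```python
def growth_factor(percent_more: float) -> float:
    """Convert an 'N% more' figure into a multiplier of the baseline count."""
    return 1.0 + percent_more / 100.0


# Percentages reported for the eCos variant of crc relative to picolibc.
reported = [("SDCs", 97), ("timeouts", 801), ("CPU exceptions", 1491)]

for label, pct in reported:
    print(f"{label}: {pct}% more = {growth_factor(pct):.2f}x the picolibc count")
# SDCs: 97% more = 1.97x the picolibc count
# timeouts: 801% more = 9.01x the picolibc count
# CPU exceptions: 1491% more = 15.91x the picolibc count
```

In other words, the eCos variant saw roughly twice as many SDCs, about nine times as many timeouts, and about sixteen times as many CPU exceptions.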

Conclusion

The authors propose the development of a dedicated benchmarking suite for fault tolerance and fault injection (FT/FI) research. They argue that such a suite would benefit the field by providing comprehensive coverage of practical applications and program characteristics, offering benchmarks at different granularities, and including a self-contained, lightweight runtime with extensive configurability. However, the most relevant program properties for FT/FI still need to be determined.

Related Papers

Towards a Fault-Injection Benchmarking Suite

Tianhao Wang, Robin Thunig, Horst Schirmeier

Soft errors in memories and logic circuits are known to disturb program execution. In this context, the research community has been proposing a plethora of fault-tolerance (FT) solutions over the last decades, as well as fault-injection (FI) approaches to test, measure and compare them. However, there is no agreed-upon benchmarking suite for demonstrating FT or FI approaches. As a replacement, authors pick benchmarks from other domains, e.g. embedded systems. This leads to little comparability across publications, and causes behavioral overlap within benchmarks that were not selected for orthogonality in the FT/FI domain. In this paper, we want to initiate a discussion on what a benchmarking suite for the FT/FI domain should look like, and propose criteria for benchmark selection.

4/1/2024

A Survey on Failure Analysis and Fault Injection in AI Systems

Guangba Yu, Gou Tan, Haojia Huang, Zhenyu Zhang, Pengfei Chen, Roberto Natella, Zibin Zheng

The rapid advancement of Artificial Intelligence (AI) has led to its integration into various areas, especially with Large Language Models (LLMs) significantly enhancing capabilities in Artificial Intelligence Generated Content (AIGC). However, the complexity of AI systems has also exposed their vulnerabilities, necessitating robust methods for failure analysis (FA) and fault injection (FI) to ensure resilience and reliability. Despite the importance of these techniques, there lacks a comprehensive review of FA and FI methodologies in AI systems. This study fills this gap by presenting a detailed survey of existing FA and FI approaches across six layers of AI systems. We systematically analyze 160 papers and repositories to answer three research questions including (1) what are the prevalent failures in AI systems, (2) what types of faults can current FI tools simulate, (3) what gaps exist between the simulated faults and real-world failures. Our findings reveal a taxonomy of AI system failures, assess the capabilities of existing FI tools, and highlight discrepancies between real-world and simulated failures. Moreover, this survey contributes to the field by providing a framework for fault diagnosis, evaluating the state-of-the-art in FI, and identifying areas for improvement in FI techniques to enhance the resilience of AI systems.

7/2/2024

Investigating Data Contamination in Modern Benchmarks for Large Language Models

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, Arman Cohan

Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. In this paper we study data contamination by proposing two methods tailored for both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We further present a novel investigation protocol named Testset Slot Guessing (TS-Guessing), applicable to both open and proprietary models. This approach entails masking a wrong answer in a multiple-choice question and prompting the model to fill in the gap. Additionally, it involves obscuring an unlikely word in an evaluation example and asking the model to produce it. We find that certain commercial LLMs could surprisingly guess the missing option in various test sets. Specifically, in the TruthfulQA benchmark, we find that LLMs exhibit notable performance improvement when provided with additional metadata in the benchmark. Further, in the MMLU benchmark, ChatGPT and GPT-4 demonstrated an exact match rate of 52% and 57%, respectively, in guessing the missing options in benchmark test data. We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field.

4/5/2024

A Benchmark Suite for Systematically Evaluating Reasoning Shortcuts

Samuele Bortolotti, Emanuele Marconato, Tommaso Carraro, Paolo Morettin, Emile van Krieken, Antonio Vergari, Stefano Teso, Andrea Passerini

The advent of powerful neural classifiers has increased interest in problems that require both learning and reasoning. These problems are critical for understanding important properties of models, such as trustworthiness, generalization, interpretability, and compliance to safety and structural constraints. However, recent research observed that tasks requiring both learning and reasoning on background knowledge often suffer from reasoning shortcuts (RSs): predictors can solve the downstream reasoning task without associating the correct concepts to the high-dimensional data. To address this issue, we introduce rsbench, a comprehensive benchmark suite designed to systematically evaluate the impact of RSs on models by providing easy access to highly customizable tasks affected by RSs. Furthermore, rsbench implements common metrics for evaluating concept quality and introduces novel formal verification procedures for assessing the presence of RSs in learning tasks. Using rsbench, we highlight that obtaining high quality concepts in both purely neural and neuro-symbolic models is a far-from-solved problem. rsbench is available at: https://unitn-sml.github.io/rsbench.

6/18/2024