The PetShop Dataset -- Finding Causes of Performance Issues across Microservices

Read original: arXiv:2311.04806 - Published 4/10/2024 by Michaela Hardt, William R. Orchard, Patrick Blobaum, Shiva Kasiviswanathan, Elke Kirschbaum


Overview

Identifying root causes for unexpected or undesirable behavior in complex systems is a significant challenge, especially in modern cloud applications with numerous microservices.
Existing research has proposed various techniques, but a lack of standardized datasets for quantitative benchmarking has led research groups to create their own datasets.
This paper introduces a dataset specifically designed for evaluating root cause analyses in microservice-based applications. It contains latency, request, and availability metrics, as well as 68 injected performance issues.

Plain English Explanation

When a complex system, such as a cloud application built from many small services (known as microservices), starts behaving unexpectedly, it can be very hard to work out the root cause. The research community has proposed various techniques to tackle this, but without a standardized dataset, each research group has had to build its own data for testing, which makes results hard to compare.

This paper presents a new dataset that is specifically designed to help evaluate methods for identifying the root causes of performance issues in microservice-based applications. The dataset includes metrics like latency (how long it takes for a request to be processed), the number of requests, and the availability (how often the system is able to respond) of the application, measured in 5-minute intervals. Importantly, the dataset also includes 68 deliberately introduced performance problems, which increase latency and reduce availability in different parts of the system.

By providing this dataset, the researchers hope to enable further development and testing of techniques for root cause analysis in these complex, microservice-based applications. Having a standardized dataset will make it easier for researchers to compare and improve their methods.

Technical Explanation

The paper presents a dataset designed to enable the evaluation of root cause analysis techniques in the context of microservice-based applications. The dataset includes time-series metrics such as latency, request counts, and availability, collected at 5-minute intervals from a distributed application. Crucially, the dataset also includes 68 injected performance issues that increase latency and reduce availability in various parts of the system.
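To make this kind of data concrete, here is a minimal sketch of a simple non-causal baseline: rank services by how far their latency during an incident window deviates from normal operation. The service names, the latency values, and the dictionary layout are all invented for illustration; they are assumptions and do not reflect the actual schema or contents of the PetShop dataset.

```python
from statistics import mean, stdev


def rank_by_zscore(normal, incident):
    """Rank services by the z-score of mean incident latency vs. normal latency.

    Both arguments map a service name to a list of latency samples,
    one sample per 5-minute interval (hypothetical layout).
    """
    scores = {}
    for service, baseline in normal.items():
        mu = mean(baseline)
        sigma = stdev(baseline) or 1e-9  # guard against zero variance
        scores[service] = abs(mean(incident[service]) - mu) / sigma
    # Most anomalous service first.
    return sorted(scores, key=scores.get, reverse=True)


# Invented latency samples in milliseconds, not taken from the dataset.
normal = {
    "frontend": [100, 102, 98, 101, 99],
    "payments": [50, 52, 49, 51, 48],
    "search": [80, 79, 81, 80, 82],
}
incident = {
    "frontend": [130, 128, 131],
    "payments": [120, 118, 125],
    "search": [80, 81, 79],
}

ranking = rank_by_zscore(normal, incident)
print(ranking)
```

In this invented example the hypothetical "frontend" service also looks anomalous. That is the sort of situation in which a purely statistical ranking can be ambiguous, and it is one reason causal approaches to the problem are of interest.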

The researchers showcase how this dataset can be used to evaluate the accuracy of a variety of root cause analysis methods, spanning different causal and non-causal approaches to the problem. By providing a standardized dataset, the authors aim to enable more systematic benchmarking and further development of techniques in this important area of cloud incident management and fault recovery.
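One common way to score such methods is top-k accuracy: the fraction of injected issues for which the true root cause appears among a method's k highest-ranked candidates. The sketch below assumes this metric and uses invented issue identifiers and service names; it is not confirmed to be the paper's exact evaluation procedure.

```python
def top_k_accuracy(predictions, ground_truth, k=1):
    """Fraction of issues whose true root cause is in the top-k ranked candidates.

    predictions:  issue id -> list of candidate services, most likely first
    ground_truth: issue id -> the service where the issue was injected
    """
    hits = sum(
        1
        for issue, ranked in predictions.items()
        if ground_truth[issue] in ranked[:k]
    )
    return hits / len(predictions)


# Hypothetical rankings from some RCA method for four injected issues.
predictions = {
    "issue-1": ["payments", "frontend", "search"],
    "issue-2": ["frontend", "search", "payments"],
    "issue-3": ["search", "payments", "frontend"],
    "issue-4": ["frontend", "payments", "search"],
}
# Hypothetical ground truth: where each issue was actually injected.
ground_truth = {
    "issue-1": "payments",
    "issue-2": "payments",
    "issue-3": "search",
    "issue-4": "database",
}

top1 = top_k_accuracy(predictions, ground_truth, k=1)
top3 = top_k_accuracy(predictions, ground_truth, k=3)
print(top1, top3)
```

Reporting the score at more than one k separates methods that name the right service first from methods that only place it somewhere near the top of the list.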

Critical Analysis

The dataset presented in this paper addresses an important gap in the research landscape by providing a standardized resource for evaluating root cause analysis techniques in microservice-based applications. This is a valuable contribution, as the lack of such datasets has historically hindered the systematic comparison and improvement of these methods.

However, it is worth noting that the dataset is limited to a specific distributed application scenario and may not capture the full complexity of real-world cloud environments. Additionally, the introduced performance issues, while diverse, may not fully reflect the range of unexpected behaviors that can occur in production systems. Further research may be needed to assess the representativeness of this dataset and its applicability to a broader set of microservice architectures and failure modes.

Conclusion

This paper presents a dataset for evaluating root cause analysis techniques in microservice-based applications. By providing a standardized resource with both normal and anomalous performance data, the researchers aim to enable more systematic benchmarking and development of methods for identifying the underlying causes of issues in these distributed systems. Reliable root cause analysis is crucial for operating modern cloud-native applications, and a shared benchmark is a step toward it.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


The PetShop Dataset -- Finding Causes of Performance Issues across Microservices

Michaela Hardt, William R. Orchard, Patrick Blobaum, Shiva Kasiviswanathan, Elke Kirschbaum

Identifying root causes for unexpected or undesirable behavior in complex systems is a prevalent challenge. This issue becomes especially crucial in modern cloud applications that employ numerous microservices. Although the machine learning and systems research communities have proposed various techniques to tackle this problem, there is currently a lack of standardized datasets for quantitative benchmarking. Consequently, research groups are compelled to create their own datasets for experimentation. This paper introduces a dataset specifically designed for evaluating root cause analyses in microservice-based applications. The dataset encompasses latency, requests, and availability metrics emitted in 5-minute intervals from a distributed application. In addition to normal operation metrics, the dataset includes 68 injected performance issues, which increase latency and reduce availability throughout the system. We showcase how this dataset can be used to evaluate the accuracy of a variety of methods spanning different causal and non-causal characterisations of the root cause analysis problem. We hope the new dataset, available at https://github.com/amazon-science/petshop-root-cause-analysis/ enables further development of techniques in this important area.

4/10/2024

Root Cause Localization for Microservice Systems in Cloud-edge Collaborative Environments

Yuhan Zhu, Jian Wang, Bing Li, Xuxian Tang, Hao Li, Neng Zhang, Yuqi Zhao

With the development of cloud-native technologies, microservice-based software systems face challenges in accurately localizing root causes when failures occur. Additionally, the cloud-edge collaborative environment introduces more difficulties, such as unstable networks and high latency across network segments. Accurately identifying the root cause of microservices in a cloud-edge collaborative environment has thus become an urgent problem. In this paper, we propose MicroCERCL, a novel approach that pinpoints root causes at the kernel and application level in the cloud-edge collaborative environment. Our key insight is that failures propagate through direct invocations and indirect resource-competition dependencies in a cloud-edge collaborative environment characterized by instability and high latency. This will become more complex in the hybrid deployment that simultaneously involves multiple microservice systems. Leveraging this insight, we extract valid contents from kernel-level logs to prioritize localizing the kernel-level root cause. Moreover, we construct a heterogeneous dynamic topology stack and train a graph neural network model to accurately localize the application-level root cause without relying on historical data. Notably, we released the first benchmark hybrid deployment microservice system in a cloud-edge collaborative environment (the largest and most complex within our knowledge). Experiments conducted on the dataset collected from the benchmark show that MicroCERCL can accurately localize the root cause of microservice systems in such environments, significantly outperforming state-of-the-art approaches with an increase of at least 24.1% in top-1 accuracy.

6/21/2024

A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends

Tingting Wang, Guilin Qi

The complex dependencies and propagative faults inherent in microservices, characterized by a dense network of interconnected services, pose significant challenges in identifying the underlying causes of issues. Prompt identification and resolution of disruptive problems are crucial to ensure rapid recovery and maintain system stability. Numerous methodologies have emerged to address this challenge, primarily focusing on diagnosing failures through symptomatic data. This survey aims to provide a comprehensive, structured review of root cause analysis (RCA) techniques within microservices, exploring methodologies that include metrics, traces, logs, and multi-model data. It delves deeper into the methodologies, challenges, and future trends within microservices architectures. Positioned at the forefront of AI and automation advancements, it offers guidance for future research directions.

8/6/2024


A Feature Dataset of Microservices-based Systems

Weipan Yang, Yongchao Xing, Yiming Lyu, Zhihao Liang, Zhiying Tu

Microservice architecture has become a dominant architectural style in the service-oriented software industry. Poor practices in the design and development of microservices are called microservice bad smells. In microservice bad smells research, the detection of these bad smells relies on feature data from microservices. However, there is a lack of an appropriate open-source microservice feature dataset. The availability of such datasets may contribute to the detection of microservice bad smells unexpectedly. To address this research gap, this paper collects a number of open-source microservice systems utilizing Spring Cloud. Additionally, feature metrics are established based on the architecture and interactions of Spring Boot style microservices, and an extraction program is developed. The program is then applied to the collected open-source microservice systems, extracting the necessary information, which is then manually verified to create an open-source feature dataset specific to microservice systems using Spring Cloud. The dataset is made available through a CSV file. We believe that both the extraction program and the dataset have the potential to contribute to the study of microservice bad smells.

4/3/2024