GraphFramEx: Towards Systematic Evaluation of Explainability Methods for Graph Neural Networks

2206.09677

Published 5/24/2024 by Kenza Amara, Rex Ying, Zitao Zhang, Zhihao Han, Yinan Shan, Ulrik Brandes, Sebastian Schemm, Ce Zhang

cs.LG cs.AI


Abstract

As one of the most popular machine learning models today, graph neural networks (GNNs) have attracted intense interest recently, and so does their explainability. Users are increasingly interested in a better understanding of GNN models and their outcomes. Unfortunately, today's evaluation frameworks for GNN explainability often rely on few inadequate synthetic datasets, leading to conclusions of limited scope due to a lack of complexity in the problem instances. As GNN models are deployed to more mission-critical applications, we are in dire need for a common evaluation protocol of explainability methods of GNNs. In this paper, we propose, to our best knowledge, the first systematic evaluation framework for GNN explainability, considering explainability on three different user needs. We propose a unique metric that combines the fidelity measures and classifies explanations based on their quality of being sufficient or necessary. We scope ourselves to node classification tasks and compare the most representative techniques in the field of input-level explainability for GNNs. For the inadequate but widely used synthetic benchmarks, surprisingly shallow techniques such as personalized PageRank have the best performance for a minimum computation time. But when the graph structure is more complex and nodes have meaningful features, gradient-based methods are the best according to our evaluation criteria. However, none dominates the others on all evaluation dimensions and there is always a trade-off. We further apply our evaluation protocol in a case study for frauds explanation on eBay transaction graphs to reflect the production environment.


Overview

This paper proposes a systematic evaluation framework for assessing the explainability of graph neural network (GNN) models.
The authors argue that current evaluation methods often rely on simplistic synthetic datasets, leading to limited conclusions about GNN explainability.
They introduce a new metric that combines fidelity measures to classify the quality of explanations as sufficient or necessary.
The paper compares several input-level explainability techniques for GNNs on both synthetic and real-world datasets.

Plain English Explanation

Graph neural networks (GNNs) are a popular type of machine learning model that can analyze data represented as graphs, such as social networks or transportation systems. As these models become more widely used, especially in sensitive applications, there is growing interest in understanding how they work and why they make the decisions they do. This area of study is known as explainability.

Unfortunately, the current ways of evaluating GNN explainability often use simplified, synthetic datasets that don't reflect the complexity of real-world problems. This makes it difficult to draw meaningful conclusions about how well these explainability techniques actually perform.

To address this, the researchers in this paper developed a new framework for systematically evaluating GNN explainability. At the heart of their approach is a metric that combines fidelity measures, which capture how much the model's prediction changes when the explanation is removed from the graph or kept on its own. The combined score is used to classify explanations as necessary (the prediction changes without them) or sufficient (they reproduce the prediction by themselves).

Using this framework, the researchers compared several different techniques for explaining the decisions of GNN models. They found that on the simple synthetic datasets commonly used today, even relatively basic methods like personalized PageRank performed well. However, when the graph structure and node features became more complex (as in real-world scenarios), more sophisticated gradient-based techniques tended to be more effective.

Importantly, the researchers didn't find any single "best" explainability technique that dominated across all evaluation criteria. Instead, there seems to be a tradeoff between different qualities of the explanations. This highlights the need for a more nuanced, multifaceted approach to assessing GNN explainability, which is what this new framework aims to provide.

To demonstrate the practical value of their work, the researchers also applied their evaluation protocol to a case study on detecting fraudulent transactions in eBay's graph-structured data. This shows how their framework can be used to guide the development and deployment of explainable GNN models in real-world, mission-critical applications.

Technical Explanation

The paper presents a new systematic evaluation framework for assessing the explainability of graph neural network (GNN) models. The authors argue that current evaluation approaches often rely on simplistic synthetic datasets, leading to conclusions of limited scope and applicability.

To address this gap, the researchers propose a unique evaluation metric that combines fidelity measures to classify the quality of explanations as either sufficient or necessary. They then use this framework to compare several prominent input-level explainability techniques for GNNs, including personalized PageRank, gradient-based methods, and others.
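The idea of combining fidelity measures can be illustrated with a minimal sketch. It assumes the fidelity definitions commonly used in the GNN explainability literature: fidelity+ is the drop in the predicted probability when the explanation is removed from the graph, and fidelity- is the drop when only the explanation is kept. It also assumes the combination is a weighted harmonic mean of fidelity+ and (1 - fidelity-). The summary above does not spell out these formulas, so treat them as assumptions; `toy_model`, the edge sets, and the weights are hypothetical stand-ins, not the authors' code.

```python
from typing import Callable, FrozenSet, Tuple

Edge = Tuple[int, int]
Model = Callable[[FrozenSet[Edge]], float]


def fidelity_scores(model: Model, graph: FrozenSet[Edge],
                    explanation: FrozenSet[Edge]) -> Tuple[float, float]:
    """Return (fid_plus, fid_minus) for one explanation subgraph."""
    full = model(graph)
    # fid_plus: prediction drop when the explanation edges are removed.
    fid_plus = full - model(graph - explanation)
    # fid_minus: prediction drop when only the explanation edges are kept.
    fid_minus = full - model(explanation)
    return fid_plus, fid_minus


def characterization(fid_plus: float, fid_minus: float,
                     w_plus: float = 0.5, w_minus: float = 0.5) -> float:
    """Weighted harmonic mean of fid_plus and (1 - fid_minus) (assumed form)."""
    if fid_plus <= 0 or fid_minus >= 1:
        return 0.0  # explanation is neither necessary nor sufficient
    return (w_plus + w_minus) / (w_plus / fid_plus + w_minus / (1 - fid_minus))


def toy_model(edges: FrozenSet[Edge]) -> float:
    """Hypothetical stand-in for a trained GNN: the predicted-class
    probability grows with the number of triangle-motif edges present."""
    motif = {(0, 1), (1, 2), (0, 2)}
    return 0.1 + 0.3 * len(motif & edges)


graph = frozenset({(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)})
for name, expl in [("full motif", frozenset({(0, 1), (1, 2), (0, 2)})),
                   ("one edge", frozenset({(0, 1)}))]:
    fp, fm = fidelity_scores(toy_model, graph, expl)
    print(name, round(fp, 3), round(fm, 3), round(characterization(fp, fm), 3))
```

Under these assumptions, a high fidelity+ suggests the explanation is necessary, while a low fidelity- suggests it is sufficient; the harmonic mean rewards explanations that are both.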

Experiments are conducted on both synthetic benchmarks and a real-world case study of fraud explanation on eBay's transaction graphs. The results show that on the simple synthetic datasets, even relatively shallow techniques like personalized PageRank achieve the best performance at minimal computation time.
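To give a sense of why personalized PageRank is cheap, here is a minimal sketch of it as a structure-only baseline: it ranks nodes by how often a random walk that keeps restarting at the target node visits them, without ever querying the GNN. The function names, the damping factor `alpha`, and the example path graph are illustrative assumptions, not the configuration used in the paper.

```python
from typing import Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]


def personalized_pagerank(adj: Graph, source: Hashable,
                          alpha: float = 0.85, iters: int = 100) -> Dict[Hashable, float]:
    """Power iteration for PageRank whose restarts always return to `source`.

    `adj` maps every node to its list of neighbors; every neighbor must
    also appear as a key.
    """
    rank = {v: 1.0 if v == source else 0.0 for v in adj}
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u, nbrs in adj.items():
            if not nbrs:
                # Dangling node: send its walking mass back to the source.
                nxt[source] += alpha * rank[u]
                continue
            share = alpha * rank[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        nxt[source] += 1.0 - alpha  # restart mass
        rank = nxt
    return rank


def top_k_nodes(rank: Dict[Hashable, float], source: Hashable, k: int) -> List[Hashable]:
    """Return the k highest-ranked nodes other than the source."""
    others = [v for v in rank if v != source]
    return sorted(others, key=lambda v: rank[v], reverse=True)[:k]


# Undirected path graph 0 - 1 - 2 - 3, explaining the prediction for node 0.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = personalized_pagerank(path, source=0)
print(top_k_nodes(scores, source=0, k=2))
```

The cost is a handful of sparse passes over the edges, which is consistent with the low computation time reported for this kind of baseline.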

However, when the graph structure is more complex and nodes carry meaningful features, gradient-based explainability methods perform best under the paper's evaluation criteria. Importantly, the authors find that no single technique dominates across all evaluation dimensions; there is always a tradeoff between different desirable properties of the explanations.
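As a rough illustration of what a gradient-based explainer computes, the sketch below scores each incoming edge of a target node by the absolute gradient of the prediction with respect to a soft edge mask (a saliency-style attribution). The one-layer model, its weights, and the feature values are hypothetical and far simpler than a real GNN; the gradient is written out analytically instead of using an autodiff library so the example stays self-contained.

```python
import math
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def predict(features: Dict[int, List[float]], neighbors: List[int],
            weights: List[float], mask: Dict[int, float]) -> float:
    """Toy one-layer model: sigmoid of the masked sum of neighbor messages."""
    z = sum(mask[j] * dot(weights, features[j]) for j in neighbors)
    return 1.0 / (1.0 + math.exp(-z))


def edge_saliency(features: Dict[int, List[float]], neighbors: List[int],
                  weights: List[float]) -> Dict[int, float]:
    """|d prediction / d mask_j| evaluated at mask = 1 for every neighbor j."""
    mask = {j: 1.0 for j in neighbors}
    p = predict(features, neighbors, weights, mask)
    # Chain rule: d sigmoid(z)/dz = p * (1 - p), and dz/dmask_j = w . x_j.
    return {j: abs(p * (1.0 - p) * dot(weights, features[j])) for j in neighbors}


# Hypothetical neighborhood of one target node with 2-dimensional features.
features = {1: [1.0, 0.0], 2: [0.0, 0.1], 3: [0.5, 0.5]}
weights = [2.0, -1.0]
saliency = edge_saliency(features, [1, 2, 3], weights)
print(sorted(saliency, key=saliency.get, reverse=True))
```

Unlike the structure-only baseline, the scores here depend on both the learned weights and the node features, which is one plausible reason such methods fare better when features are meaningful.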

The researchers also apply their evaluation protocol to the eBay fraud detection case study, demonstrating how the framework can be used to guide the development and deployment of explainable GNN models in production environments.

Critical Analysis

The authors make a compelling case for the need to move beyond the limitations of existing GNN explainability evaluation frameworks, which often rely on simplistic synthetic datasets. Their proposed evaluation protocol, built around a metric that combines fidelity measures to characterize explanations as sufficient or necessary, represents an important step forward in establishing more rigorous and holistic evaluation criteria for this critical area of research.

That said, the paper does not address certain potential shortcomings or areas for further work. For example, the evaluation is scoped only to node classification tasks, whereas many real-world applications of GNNs may involve different types of predictions or graph-level analyses. Expanding the framework to handle a broader range of GNN use cases could further enhance its practical utility.

Additionally, while the authors demonstrate the framework's applicability to a real-world case study, more extensive validation on a diverse set of production-scale datasets and tasks would help strengthen the generalizability of their findings. Exploring the performance of their approach on dynamic graph structures could also yield valuable insights.

Finally, the paper does not delve into the potential sociotechnical implications of GNN explainability, such as how these techniques may be applied (or misapplied) in high-stakes domains like criminal justice or healthcare. Considering the human-centered design requirements for GNN explainability systems could help bridge the gap between technical advances and real-world deployment.

Overall, this paper represents an important contribution to the field of GNN explainability, but there remains ample room for further research and refinement to ensure these techniques are deployed responsibly and effectively in practice.

Conclusion

This paper proposes a systematic evaluation framework for assessing the explainability of graph neural network (GNN) models, addressing the limitations of current approaches that rely on simplistic synthetic datasets. The authors introduce a metric that combines fidelity measures to classify explanations as sufficient or necessary, and use this framework to compare several prominent input-level explainability techniques for GNNs.

The results reveal that while simpler methods like personalized PageRank can perform well on basic synthetic benchmarks, more sophisticated gradient-based techniques tend to be more effective when dealing with complex real-world graph structures and node features. Importantly, the authors find that no single explainability technique dominates across all evaluation criteria, highlighting the need for a nuanced, multifaceted approach to assessing GNN explainability.

By applying their evaluation protocol to a case study on eBay's fraud detection graph, the researchers demonstrate the practical value of their framework in guiding the development and deployment of explainable GNN models in mission-critical applications. As GNNs continue to be adopted in high-stakes domains, this work represents an important step towards ensuring these powerful models can be understood and trusted by end-users.


Related Papers


Explaining the Explainers in Graph Neural Networks: a Comparative Study

Antonio Longa, Steve Azzolin, Gabriele Santin, Giulia Cencetti, Pietro Liò, Bruno Lepri, Andrea Passerini

Following a fast initial breakthrough in graph based learning, Graph Neural Networks (GNNs) have reached a widespread application in many science and engineering fields, prompting the need for methods to understand their decision process. GNN explainers have started to emerge in recent years, with a multitude of methods both novel or adapted from other domains. To sort out this plethora of alternative approaches, several studies have benchmarked the performance of different explainers in terms of various explainability metrics. However, these earlier works make no attempts at providing insights into why different GNN architectures are more or less explainable, or which explainer should be preferred in a given setting. In this survey, we fill these gaps by devising a systematic experimental study, which tests ten explainers on eight representative architectures trained on six carefully designed graph and node classification datasets. With our results we provide key insights on the choice and applicability of GNN explainers, we isolate key components that make them usable and successful and provide recommendations on how to avoid common interpretation pitfalls. We conclude by highlighting open questions and directions of possible future research.

7/2/2024

cs.LG cs.AI

GNNAnatomy: Systematic Generation and Evaluation of Multi-Level Explanations for Graph Neural Networks

Hsiao-Ying Lu, Yiran Li, Ujwal Pratap Krishna Kaluvakolanu Thyagarajan, Kwan-Liu Ma

Graph Neural Networks (GNNs) have proven highly effective in various machine learning (ML) tasks involving graphs, such as node/graph classification and link prediction. However, explaining the decisions made by GNNs poses challenges because of the aggregated relational information based on graph structure, leading to complex data transformations. Existing methods for explaining GNNs often face limitations in systematically exploring diverse substructures and evaluating results in the absence of ground truths. To address this gap, we introduce GNNAnatomy, a model- and dataset-agnostic visual analytics system designed to facilitate the generation and evaluation of multi-level explanations for GNNs. In GNNAnatomy, we employ graphlets to elucidate GNN behavior in graph-level classification tasks. By analyzing the associations between GNN classifications and graphlet frequencies, we formulate hypothesized factual and counterfactual explanations. To validate a hypothesized graphlet explanation, we introduce two metrics: (1) the correlation between its frequency and the classification confidence, and (2) the change in classification confidence after removing this substructure from the original graph. To demonstrate the effectiveness of GNNAnatomy, we conduct case studies on both real-world and synthetic graph datasets from various domains. Additionally, we qualitatively compare GNNAnatomy with a state-of-the-art GNN explainer, demonstrating the utility and versatility of our design.

6/10/2024

cs.LG cs.IR cs.SI

Graph Neural Network Explanations are Fragile

Jiate Li, Meng Pang, Yun Dong, Jinyuan Jia, Binghui Wang

Explainable Graph Neural Network (GNN) has emerged recently to foster the trust of using GNNs. Existing GNN explainers are developed from various perspectives to enhance the explanation performance. We take the first step to study GNN explainers under adversarial attack: we found that an adversary slightly perturbing graph structure can ensure the GNN model makes correct predictions, but the GNN explainer yields a drastically different explanation on the perturbed graph. Specifically, we first formulate the attack problem under a practical threat model (i.e., the adversary has limited knowledge about the GNN explainer and a restricted perturbation budget). We then design two methods (i.e., one is loss-based and the other is deduction-based) to realize the attack. We evaluate our attacks on various GNN explainers and the results show these explainers are fragile.

6/6/2024

cs.CR cs.LG

Explainable Graph Neural Networks Under Fire

Zhong Li, Simon Geisler, Yuhang Wang, Stephan Günnemann, Matthijs van Leeuwen

Predictions made by graph neural networks (GNNs) usually lack interpretability due to their complex computational behavior and the abstract nature of graphs. In an attempt to tackle this, many GNN explanation methods have emerged. Their goal is to explain a model's predictions and thereby obtain trust when GNN models are deployed in decision critical applications. Most GNN explanation methods work in a post-hoc manner and provide explanations in the form of a small subset of important edges and/or nodes. In this paper we demonstrate that these explanations can unfortunately not be trusted, as common GNN explanation methods turn out to be highly susceptible to adversarial perturbations. That is, even small perturbations of the original graph structure that preserve the model's predictions may yield drastically different explanations. This calls into question the trustworthiness and practical utility of post-hoc explanation methods for GNNs. To be able to attack GNN explanation models, we devise a novel attack method dubbed GXAttack, the first optimization-based adversarial attack method for post-hoc GNN explanations under such settings. Due to the devastating effectiveness of our attack, we call for an adversarial evaluation of future GNN explainers to demonstrate their robustness.

6/11/2024

cs.LG cs.AI