Know Your Neighborhood: General and Zero-Shot Capable Binary Function Search Powered by Call Graphlets

2406.02606

Published 6/6/2024 by Joshua Collyer, Tim Watson, Iain Phillips

Know Your Neighborhood: General and Zero-Shot Capable Binary Function Search Powered by Call Graphlets

Abstract

Binary code similarity detection is an important problem with applications in areas like malware analysis, vulnerability research and plagiarism detection. This paper proposes a novel graph neural network architecture combined with a novel graph data representation called call graphlets. A call graphlet encodes the neighborhood around each function in a binary executable, capturing the local and global context through a series of statistical features. A specialized graph neural network model is then designed to operate on this graph representation, learning to map it to a feature vector that encodes semantic code similarities using deep metric learning. The proposed approach is evaluated across four distinct datasets covering different architectures, compiler toolchains, and optimization levels. Experimental results demonstrate that the combination of call graphlets and the novel graph neural network architecture achieves state-of-the-art performance compared to baseline techniques across cross-architecture, mono-architecture and zero shot tasks. In addition, our proposed approach also performs well when evaluated against an out-of-domain function inlining task. Overall, the work provides a general and effective graph neural network-based solution for conducting binary code similarity detection.

Create account to get full access

Overview

This research paper introduces a novel approach to binary function search powered by call graphlets, which enables general and zero-shot capable binary function search.
The key contributions include a new representation of binary functions using call graphlets, a zero-shot capable binary function search approach, and comprehensive evaluations on real-world datasets.
The research leverages graph neural networks for binary programming and techniques for uncovering LLM-generated code to achieve these advancements.

Plain English Explanation

The paper presents a new way to search for and identify binary functions, which are the basic building blocks of computer programs. Traditionally, searching for binary functions has been a challenging task, as they can be obfuscated or modified in complex ways. The researchers have developed a novel approach that uses "call graphlets" - small, representative subgraphs extracted from the call graph of a binary function - to create a unique representation of each function.

This representation allows the researchers to perform general and zero-shot capable binary function search. "General" means the system can search for functions regardless of the programming language or compilation process used to create them. "Zero-shot" means the system can identify functions it has never seen before, without any additional training. This is a significant advancement, as it allows the system to be used in a wider range of scenarios, such as detecting source code clones or explaining the behavior of binary programs.

The researchers evaluate their approach on real-world datasets and demonstrate its effectiveness in accurately identifying binary functions, even in the face of obfuscation or other challenges. This work represents an important step forward in the field of binary analysis, with potential applications in cybersecurity, software engineering, and other domains.

Technical Explanation

The core of the researchers' approach is the use of call graphlets to represent binary functions. A call graph is a visual representation of the function calls within a program, and a graphlet is a small, representative subgraph extracted from this larger graph. By representing each binary function as a collection of call graphlets, the researchers are able to create a unique "fingerprint" for each function that captures its structure and behavior.

To perform binary function search, the researchers use a differentiable cluster graph neural network model to learn the representations of the call graphlets. This allows the model to generalize to new, unseen functions, enabling the zero-shot capability. The researchers also incorporate techniques from the field of uncovering LLM-generated code to further enhance the model's ability to identify novel functions.

Through comprehensive evaluations on real-world datasets, the researchers demonstrate that their approach outperforms existing binary function search methods, particularly in scenarios where the functions have been obfuscated or modified. They also discuss the potential limitations of their approach, such as the need for further research to address more advanced obfuscation techniques.

Critical Analysis

The researchers have made a significant contribution to the field of binary analysis with their novel approach to binary function search. The use of call graphlets as a representation of binary functions is a clever and effective idea, as it captures the structure and behavior of the functions in a way that is both unique and generalizable.

One potential limitation of the approach, as mentioned in the paper, is its ability to handle more advanced obfuscation techniques. While the researchers have demonstrated the effectiveness of their method on real-world datasets, it's possible that more sophisticated obfuscation techniques could still pose a challenge. Additionally, the researchers do not address the potential ethical implications of their work, such as the potential for misuse in malware analysis or reverse engineering.

That said, the researchers' work represents an important step forward in the field of binary analysis, with potential applications in cybersecurity, software engineering, and other domains. The use of graph neural networks and techniques from the field of uncovering LLM-generated code is particularly promising, and the researchers' focus on generalization and zero-shot capability is a valuable contribution.

Conclusion

The research paper introduces a novel approach to binary function search powered by call graphlets, which enables general and zero-shot capable binary function search. This work represents a significant advancement in the field of binary analysis, with potential applications in cybersecurity, software engineering, and other domains. The use of call graphlets as a representation of binary functions, combined with the researchers' innovative use of graph neural networks and techniques from the field of uncovering LLM-generated code, allows for highly effective and generalizable binary function search. While the approach has some limitations, particularly in its ability to handle advanced obfuscation techniques, the researchers have demonstrated the power and potential of their approach through comprehensive evaluations on real-world datasets.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🔎

Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures

Jorge Martinez-Gil

The capability of accurately determining code similarity is crucial in many tasks related to software development. For example, it might be essential to identify code duplicates for performing software maintenance. This research introduces a novel ensemble learning approach for code similarity assessment, combining the strengths of multiple unsupervised similarity measures. The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses, leading to improved performance. Preliminary results show that while Transformers-based CodeBERT and its variant GraphCodeBERT are undoubtedly the best option in the presence of abundant training data, in the case of specific small datasets (up to 500 samples), our ensemble achieves similar results, without prejudice to the interpretability of the resulting solution, and with a much lower associated carbon footprint due to training. The source code of this novel approach can be downloaded from https://github.com/jorge-martinez-gil/ensemble-codesim.

5/6/2024

cs.SE cs.AI

Graph Neural Networks for Binary Programming

Moshe Eliasof, Eldad Haber

This paper investigates a link between Graph Neural Networks (GNNs) and Binary Programming (BP) problems, laying the groundwork for GNNs to approximate solutions for these computationally challenging problems. By analyzing the sensitivity of BP problems, we are able to frame the solution of BP problems as a heterophilic node classification task. We then propose Binary-Programming GNN (BPGNN), an architecture that integrates graph representation learning techniques with BP-aware features to approximate BP solutions efficiently. Additionally, we introduce a self-supervised data generation mechanism, to enable efficient and tractable training data acquisition even for large-scale BP problems. Experimental evaluations of BPGNN across diverse BP problem sizes showcase its superior performance compared to exhaustive search and heuristic approaches. Finally, we discuss open challenges in the under-explored field of BP problems with GNNs.

4/9/2024

cs.LG cs.AI cs.ET

New!ZeroG: Investigating Cross-dataset Zero-shot Transferability in Graphs

Yuhan Li, Peisong Wang, Zhixun Li, Jeffrey Xu Yu, Jia Li

With the development of foundation models such as large language models, zero-shot transfer learning has become increasingly significant. This is highlighted by the generative capabilities of NLP models like GPT-4, and the retrieval-based approaches of CV models like CLIP, both of which effectively bridge the gap between seen and unseen data. In the realm of graph learning, the continuous emergence of new graphs and the challenges of human labeling also amplify the necessity for zero-shot transfer learning, driving the exploration of approaches that can generalize across diverse graph data without necessitating dataset-specific and label-specific fine-tuning. In this study, we extend such paradigms to zero-shot transferability in graphs by introducing ZeroG, a new framework tailored to enable cross-dataset generalization. Addressing the inherent challenges such as feature misalignment, mismatched label spaces, and negative transfer, we leverage a language model to encode both node attributes and class semantics, ensuring consistent feature dimensions across datasets. We also propose a prompt-based subgraph sampling module that enriches the semantic information and structure information of extracted subgraphs using prompting nodes and neighborhood aggregation, respectively. We further adopt a lightweight fine-tuning strategy that reduces the risk of overfitting and maintains the zero-shot learning efficacy of the language model. The results underscore the effectiveness of our model in achieving significant cross-dataset zero-shot transferability, opening pathways for the development of graph foundation models. Codes and data are available at https://github.com/NineAbyss/ZeroG.

6/26/2024

cs.LG

Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting

Tong Ye, Yangkai Du, Tengfei Ma, Lingfei Wu, Xuhong Zhang, Shouling Ji, Wenhai Wang

Large Language Models (LLMs) have exhibited remarkable proficiency in generating code. However, the misuse of LLM-generated (Synthetic) code has prompted concerns within both educational and industrial domains, highlighting the imperative need for the development of synthetic code detectors. Existing methods for detecting LLM-generated content are primarily tailored for general text and often struggle with code content due to the distinct grammatical structure of programming languages and massive low-entropy tokens. Building upon this, our work proposes a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants. Our method relies on the intuition that the differences between the LLM-rewritten and original codes tend to be smaller when the original code is synthetic. We utilize self-supervised contrastive learning to train a code similarity model and assess our approach on two synthetic code detection benchmarks. Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts, with an improvement of 20.5% in the APPS benchmark and 29.1% in the MBPP benchmark.

5/31/2024

cs.SE cs.AI