CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge

2402.07688

Published 6/4/2024 by Norbert Tihanyi, Mohamed Amine Ferrag, Ridhi Jain, Tamas Bisztray, Merouane Debbah

🛸

Abstract

Large Language Models (LLMs) are increasingly used across various domains, from software development to cyber threat intelligence. Understanding all the different fields of cybersecurity, which includes topics such as cryptography, reverse engineering, and risk assessment, poses a challenge even for human experts. To accurately test the general knowledge of LLMs in cybersecurity, the research community needs a diverse, accurate, and up-to-date dataset. To address this gap, we present CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000, which are multiple-choice Q&A benchmark datasets comprising 80, 500, 2000, and 10,000 questions respectively. By utilizing GPT-3.5 and Retrieval-Augmented Generation (RAG), we collected documents, including NIST standards, research papers, publicly accessible books, RFCs, and other publications in the cybersecurity domain, to generate questions, each with four possible answers. The results underwent several rounds of error checking and refinement. Human experts invested over 200 hours validating the questions and solutions to ensure their accuracy and relevance, and to filter out any questions unrelated to cybersecurity. We have evaluated and compared 25 state-of-the-art LLM models on the CyberMetric datasets. In addition to our primary goal of evaluating LLMs, we involved 30 human participants to solve CyberMetric-80 in a closed-book scenario. The results can serve as a reference for comparing the general cybersecurity knowledge of humans and LLMs. The findings revealed that GPT-4o, GPT-4-turbo, Mixtral-8x7B-Instruct, Falcon-180B-Chat, and GEMINI-pro 1.0 were the best-performing LLMs. Additionally, the top LLMs were more accurate than humans on CyberMetric-80, although highly experienced human experts still outperformed small models such as Llama-3-8B, Phi-2 or Gemma-7b.

Create account to get full access

Overview

This research presents a series of multiple-choice Q&A benchmark datasets called CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000 to evaluate the cybersecurity knowledge of large language models (LLMs).
The datasets were created by collecting relevant documents, including NIST standards, research papers, books, and RFCs, and using tools like GPT-3.5 and Retrieval-Augmented Generation (RAG) to generate questions with multiple-choice answers.
The questions were thoroughly validated by human experts to ensure accuracy and relevance to the cybersecurity domain.
The researchers evaluated and compared the performance of 25 state-of-the-art LLM models on the CyberMetric datasets and also involved 30 human participants to solve the CyberMetric-80 dataset in a closed-book scenario.

Plain English Explanation

The researchers wanted to test how well large language models (LLMs) understand the different areas of cybersecurity, which can be quite complex even for human experts. To do this, they created a series of multiple-choice question and answer datasets called CyberMetric. These datasets cover a wide range of cybersecurity topics, such as cryptography, reverse engineering, and risk assessment.

The researchers used tools like GPT-3.5 and Retrieval-Augmented Generation to collect relevant documents, such as government standards and research papers, and generate questions and answers based on the information in these documents. Then, they had human experts spend over 200 hours reviewing the questions and answers to make sure they were accurate and relevant to cybersecurity.

After creating the CyberMetric datasets, the researchers tested 25 different LLMs on them to see how well the models performed on cybersecurity knowledge. They also had 30 human participants solve the CyberMetric-80 dataset to compare the performance of the LLMs to humans. The results showed that some of the most advanced LLMs, like GPT-4o and GPT-4-turbo, were more accurate than the human participants on the CyberMetric-80 dataset. However, the most experienced human experts still outperformed some of the smaller LLM models.

Technical Explanation

The researchers created the CyberMetric datasets to address the challenge of evaluating the general cybersecurity knowledge of large language models (LLMs). They collected a diverse set of documents, including NIST standards, research papers, publicly accessible books, RFCs, and other publications in the cybersecurity domain, using tools like GPT-3.5 and Retrieval-Augmented Generation (RAG).

These documents were then used to generate multiple-choice questions, each with four possible answers, covering a wide range of cybersecurity topics. The questions and solutions underwent several rounds of error checking and refinement by human experts, who spent over 200 hours validating the content to ensure its accuracy and relevance to the cybersecurity field.

The researchers evaluated 25 state-of-the-art LLM models, including GPT-4o, GPT-4-turbo, Mixtral-8x7B-Instruct, Falcon-180B-Chat, and GEMINI-pro 1.0, on the CyberMetric datasets. In addition, they involved 30 human participants to solve the CyberMetric-80 dataset in a closed-book scenario to provide a baseline for comparing the cybersecurity knowledge of humans and LLMs.

Critical Analysis

The researchers acknowledge that the CyberMetric datasets are not exhaustive and may not cover all aspects of cybersecurity. They also note that the validation process, while extensive, may have missed some inaccuracies or irrelevant questions. Additionally, the performance of LLMs may be influenced by factors such as training data and fine-tuning, which were not explored in-depth in this study.

The comparison between LLMs and humans is limited to the CyberMetric-80 dataset, and it would be valuable to extend this comparison to the larger datasets to gain a more comprehensive understanding of their relative strengths and weaknesses in the cybersecurity domain.

Conclusion

The CyberMetric datasets provide a valuable resource for evaluating the cybersecurity knowledge of large language models, which can have important applications in areas like software development and cyber threat intelligence. The findings of this study suggest that some of the most advanced LLMs are capable of outperforming humans on general cybersecurity knowledge, although highly experienced experts still maintain an advantage in certain areas. This research highlights the potential of LLMs in the cybersecurity field, while also underscoring the need for continued evaluation and improvement of these models to ensure their reliability and effectiveness.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SECURE: Benchmarking Generative Large Language Models for Cybersecurity Advisory

Dipkamal Bhusal, Md Tanvirul Alam, Le Nguyen, Ashim Mahara, Zachary Lightcap, Rodney Frazier, Romy Fieblinger, Grace Long Torales, Nidhi Rastogi

Large Language Models (LLMs) have demonstrated potential in cybersecurity applications but have also caused lower confidence due to problems like hallucinations and a lack of truthfulness. Existing benchmarks provide general evaluations but do not sufficiently address the practical and applied aspects of LLM performance in cybersecurity-specific tasks. To address this gap, we introduce the SECURE (Security Extraction, Understanding & Reasoning Evaluation), a benchmark designed to assess LLMs performance in realistic cybersecurity scenarios. SECURE includes six datasets focussed on the Industrial Control System sector to evaluate knowledge extraction, understanding, and reasoning based on industry-standard sources. Our study evaluates seven state-of-the-art models on these tasks, providing insights into their strengths and weaknesses in cybersecurity contexts, and offer recommendations for improving LLMs reliability as cyber advisory tools.

6/3/2024

cs.CR cs.AI cs.HC

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, Joshua Saxe

Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks. We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities. We introduce two new areas for testing: prompt injection and code interpreter abuse. We evaluated multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama. Our results show that conditioning away risk of attack remains an unsolved problem; for example, all tested models showed between 26% and 41% successful prompt injection tests. We further introduce the safety-utility tradeoff: conditioning an LLM to reject unsafe prompts can cause the LLM to falsely reject answering benign prompts, which lowers utility. We propose quantifying this tradeoff using False Refusal Rate (FRR). As an illustration, we introduce a novel test set to quantify FRR for cyberattack helpfulness risk. We find many LLMs able to successfully comply with borderline benign requests while still rejecting most unsafe requests. Finally, we quantify the utility of LLMs for automating a core cybersecurity task, that of exploiting software vulnerabilities. This is important because the offensive capabilities of LLMs are of intense interest; we quantify this by creating novel test sets for four representative problems. We find that models with coding capabilities perform better than those without, but that further work is needed for LLMs to become proficient at exploit generation. Our code is open source and can be used to evaluate other LLMs.

4/23/2024

cs.CR cs.LG

🤖

Generative AI and Large Language Models for Cyber Security: All Insights You Need

Mohamed Amine Ferrag, Fatima Alwahedi, Ammar Battah, Bilel Cherif, Abdechakour Mechri, Norbert Tihanyi

This paper provides a comprehensive review of the future of cybersecurity through Generative AI and Large Language Models (LLMs). We explore LLM applications across various domains, including hardware design security, intrusion detection, software engineering, design verification, cyber threat intelligence, malware detection, and phishing detection. We present an overview of LLM evolution and its current state, focusing on advancements in models such as GPT-4, GPT-3.5, Mixtral-8x7B, BERT, Falcon2, and LLaMA. Our analysis extends to LLM vulnerabilities, such as prompt injection, insecure output handling, data poisoning, DDoS attacks, and adversarial instructions. We delve into mitigation strategies to protect these models, providing a comprehensive look at potential attack scenarios and prevention techniques. Furthermore, we evaluate the performance of 42 LLM models in cybersecurity knowledge and hardware security, highlighting their strengths and weaknesses. We thoroughly evaluate cybersecurity datasets for LLM training and testing, covering the lifecycle from data creation to usage and identifying gaps for future research. In addition, we review new strategies for leveraging LLMs, including techniques like Half-Quadratic Quantization (HQQ), Reinforcement Learning with Human Feedback (RLHF), Direct Preference Optimization (DPO), Quantized Low-Rank Adapters (QLoRA), and Retrieval-Augmented Generation (RAG). These insights aim to enhance real-time cybersecurity defenses and improve the sophistication of LLM applications in threat detection and response. Our paper provides a foundational understanding and strategic direction for integrating LLMs into future cybersecurity frameworks, emphasizing innovation and robust model deployment to safeguard against evolving cyber threats.

5/22/2024

cs.CR cs.AI

CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence

Md Tanvirul Alam, Dipkamal Bhusal, Le Nguyen, Nidhi Rastogi

Cyber threat intelligence (CTI) is crucial in today's cybersecurity landscape, providing essential insights to understand and mitigate the ever-evolving cyber threats. The recent rise of Large Language Models (LLMs) have shown potential in this domain, but concerns about their reliability, accuracy, and hallucinations persist. While existing benchmarks provide general evaluations of LLMs, there are no benchmarks that address the practical and applied aspects of CTI-specific tasks. To bridge this gap, we introduce CTIBench, a benchmark designed to assess LLMs' performance in CTI applications. CTIBench includes multiple datasets focused on evaluating knowledge acquired by LLMs in the cyber-threat landscape. Our evaluation of several state-of-the-art models on these tasks provides insights into their strengths and weaknesses in CTI contexts, contributing to a better understanding of LLM capabilities in CTI.

6/26/2024

cs.CR cs.AI