AnnoCTR: A Dataset for Detecting and Linking Entities, Tactics, and Techniques in Cyber Threat Reports

Read original: arXiv:2404.07765 - Published 4/12/2024 by Lukas Lange, Marc Muller, Ghazaleh Haratinezhad Torbati, Dragan Milchevski, Patrick Grau, Subhash Pujari, Annemarie Friedrich

Overview

• This paper introduces AnnoCTR, a new dataset for detecting and linking entities, tactics, and techniques in cyber threat reports.
• The dataset contains annotations for named entities, tactics, and techniques in real-world cyber threat reports, with entities and concepts linked to Wikipedia and the MITRE ATT&CK knowledge base.
• The authors argue that this dataset can help improve natural language processing (NLP) models for understanding and analyzing cyber threat intelligence.

Plain English Explanation

Cyber threat reports are documents that describe online attacks, security vulnerabilities, and other cybersecurity-related information. These reports are important for helping organizations identify and defend against potential threats. However, extracting useful information from these reports can be challenging, as they often contain complex technical language and jargon.

The researchers who created the AnnoCTR dataset wanted to make it easier to analyze the contents of cyber threat reports using natural language processing (NLP) techniques. They manually annotated a collection of cyber threat reports, identifying and labeling different types of information, such as:
• Named entities: The specific people, organizations, software, and other things mentioned in the reports.
• Tactics: The methods used by attackers to carry out their activities.
• Techniques: The specific actions or tools used as part of those tactics.

The annotated entities and concepts are also linked to external knowledge bases: according to the paper's abstract, to Wikipedia and to MITRE ATT&CK, a widely used taxonomy for classifying types of attacks. This ties each mention in a report to a standard reference entry.

By providing this annotated dataset, the researchers hope to enable the development of more advanced NLP models that can automatically extract and connect this type of information from cyber threat reports. This could help security analysts and researchers better understand and respond to emerging threats.

Technical Explanation

The AnnoCTR dataset contains annotations for 1,000 cyber threat reports from a variety of sources, including government agencies, security companies, and cybersecurity research organizations. The reports cover a wide range of topics, including malware, ransomware, advanced persistent threats, and other cybersecurity-related issues.

According to the paper's abstract, a domain expert manually identified and labeled the following elements in the reports:
• Named entities: This includes organizations, software, and other specific things mentioned in the text.
• Temporal expressions: Mentions of dates and times.
• Tactics: High-level methods used by attackers, such as "Reconnaissance" or "Lateral Movement."
• Techniques: Specific actions or tools used as part of those tactics, such as "Phishing" or "Credential Dumping," including techniques that are only mentioned implicitly.

Entities and concepts are additionally linked to knowledge bases: the abstract names Wikipedia and MITRE ATT&CK as the link targets. A mention of a tactic or technique is therefore tied to its entry in the taxonomy rather than left as free text.
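One way to picture an annotation of this kind is as a labeled text span with an optional knowledge-base identifier. The sketch below is illustrative only: the `Mention` class, its field names, the label strings, and the example sentence are assumptions, not the dataset's actual schema or file format. The identifiers T1566 (Phishing) and TA0008 (Lateral Movement) are real MITRE ATT&CK IDs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mention:
    """A labeled span in a report (hypothetical schema, not AnnoCTR's own)."""
    start: int                    # character offset, inclusive
    end: int                      # character offset, exclusive
    label: str                    # e.g. "TACTIC" or "TECHNIQUE" (assumed label names)
    kb_id: Optional[str] = None   # e.g. a MITRE ATT&CK ID, if the span is linked

    def text(self, document: str) -> str:
        # Recover the surface string the span covers.
        return document[self.start:self.end]


def make_mention(document: str, surface: str, label: str,
                 kb_id: Optional[str] = None) -> Mention:
    # Locate the first occurrence of `surface` and wrap it as a Mention.
    start = document.index(surface)
    return Mention(start, start + len(surface), label, kb_id)


def linked_ids(mentions: List[Mention], label: str) -> List[str]:
    # Collect the distinct knowledge-base IDs attached to spans of one label.
    return sorted({m.kb_id for m in mentions if m.label == label and m.kb_id})


report = "The group sent phishing emails and later performed lateral movement."
mentions = [
    make_mention(report, "phishing", "TECHNIQUE", "T1566"),
    make_mention(report, "lateral movement", "TACTIC", "TA0008"),
]

for m in mentions:
    print(m.label, repr(m.text(report)), "->", m.kb_id)
```

Storing character offsets rather than the surface string keeps each annotation tied to its position in the full document, which matters when the same phrase occurs more than once.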

The resulting dataset provides a rich source of information for training and evaluating natural language processing models for tasks like named entity recognition, entity linking, and the identification of MITRE ATT&CK concepts in text. In an experimental study, the authors model the annotations with state-of-the-art neural models. For a few-shot scenario, they report that concept descriptions taken from MITRE ATT&CK are an effective source of training data augmentation when identifying concepts that are mentioned explicitly or implicitly in a text.
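Sequence-labeling models for named entity recognition are commonly trained on token-level BIO tags rather than on raw character spans. The following is a minimal sketch of that conversion, assuming whitespace tokenization and non-overlapping spans; the function names, the "TECHNIQUE" label, and the example sentence are hypothetical and do not reflect the preprocessing used in the paper.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def span_of(text: str, surface: str, label: str) -> Span:
    # Helper: build a span from the first occurrence of `surface`.
    start = text.index(surface)
    return (start, start + len(surface), label)


def bio_tags(text: str, spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Convert character-level spans into whitespace tokens with BIO tags."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in spans:
            if tok_start < end and tok_end > start:  # token overlaps the span
                # The token covering the span's first character gets B-, the rest I-.
                tag = ("B-" if tok_start <= start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


text = "Attackers used credential dumping after phishing."
spans = [
    span_of(text, "credential dumping", "TECHNIQUE"),
    span_of(text, "phishing", "TECHNIQUE"),
]
tokens, tags = bio_tags(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because tokens are split on whitespace only, trailing punctuation stays attached to a token (here "phishing."); a real pipeline would more likely use the tokenizer of the model being trained.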

Critical Analysis

The AnnoCTR dataset is a useful resource for advancing natural language processing in the cybersecurity domain. By providing a corpus of manually annotated cyber threat reports, the researchers have created a benchmark for evaluating and improving NLP models that must handle this kind of specialized technical content.

However, the authors acknowledge several limitations of the dataset. First, the annotations rely on manual expert judgment (the abstract mentions a single domain expert), which could introduce some subjectivity and inconsistencies. Additionally, the dataset is limited to 1,000 reports, which may not be sufficient to capture the full diversity of language and concepts found in real-world cyber threat intelligence.

The authors also note that the dataset focuses primarily on textual content, and does not include other modalities (e.g., images, tables, code snippets) that are often present in cyber threat reports. Expanding the dataset to include these other data types could further enhance its usefulness for developing comprehensive NLP solutions for this domain.

Overall, the AnnoCTR dataset represents an important step forward in enabling the application of advanced NLP techniques to the challenge of automatically extracting and connecting key information from cyber threat reports. With further refinement and expansion, this resource could become a valuable tool for researchers and security practitioners working to improve threat intelligence capabilities.

Conclusion

The AnnoCTR dataset introduced in this paper provides a new resource for developing and evaluating natural language processing models in the cybersecurity domain. By annotating real-world cyber threat reports with entities, tactics, and techniques, and linking them to knowledge bases such as MITRE ATT&CK, the researchers have created a benchmark for training and testing NLP systems that need to understand and extract actionable information from this type of specialized technical content.

While the dataset has some limitations, it represents an important step forward in enabling the application of advanced language processing techniques to the challenge of automating the analysis of cyber threat intelligence. As researchers and practitioners continue to build upon this foundation, the insights and capabilities unlocked by AnnoCTR could have significant implications for improving the speed, accuracy, and scalability of threat detection and response efforts.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

AnnoCTR: A Dataset for Detecting and Linking Entities, Tactics, and Techniques in Cyber Threat Reports

Lukas Lange, Marc Muller, Ghazaleh Haratinezhad Torbati, Dragan Milchevski, Patrick Grau, Subhash Pujari, Annemarie Friedrich

Monitoring the threat landscape to be aware of actual or potential attacks is of utmost importance to cybersecurity professionals. Information about cyber threats is typically distributed using natural language reports. Natural language processing can help with managing this large amount of unstructured information, yet to date, the topic has received little attention. With this paper, we present AnnoCTR, a new CC-BY-SA-licensed dataset of cyber threat reports. The reports have been annotated by a domain expert with named entities, temporal expressions, and cybersecurity-specific concepts including implicitly mentioned techniques and tactics. Entities and concepts are linked to Wikipedia and the MITRE ATT&CK knowledge base, the most widely-used taxonomy for classifying types of attacks. Prior datasets linking to MITRE ATT&CK either provide a single label per document or annotate sentences out-of-context; our dataset annotates entire documents in a much finer-grained way. In an experimental study, we model the annotations of our dataset using state-of-the-art neural models. In our few-shot scenario, we find that for identifying the MITRE ATT&CK concepts that are mentioned explicitly or implicitly in a text, concept descriptions from MITRE ATT&CK are an effective source for training data augmentation.

4/12/2024

AttackER: Towards Enhancing Cyber-Attack Attribution with a Named Entity Recognition Dataset

Pritam Deka, Sampath Rajapaksha, Ruby Rani, Amirah Almutairi, Erisa Karafili

Cyber-attack attribution is an important process that allows experts to put in place attacker-oriented countermeasures and legal actions. The analysts mainly perform attribution manually, given the complex nature of this task. AI and, more specifically, Natural Language Processing (NLP) techniques can be leveraged to support cybersecurity analysts during the attribution process. However powerful these techniques are, they need to deal with the lack of datasets in the attack attribution domain. In this work, we will fill this gap and will provide, to the best of our knowledge, the first dataset on cyber-attack attribution. We designed our dataset with the primary goal of extracting attack attribution information from cybersecurity texts, utilizing named entity recognition (NER) methodologies from the field of NLP. Unlike other cybersecurity NER datasets, ours offers a rich set of annotations with contextual details, including some that span phrases and sentences. We conducted extensive experiments and applied NLP techniques to demonstrate the dataset's effectiveness for attack attribution. These experiments highlight the potential of Large Language Models (LLMs) capabilities to improve the NER tasks in cybersecurity datasets for cyber-attack attribution.

8/12/2024

LSTM Recurrent Neural Networks for Cybersecurity Named Entity Recognition

Houssem Gasmi (DISP), Jannik Laval (DISP), Abdelaziz Bouras (DISP)

The automated and timely conversion of cybersecurity information from unstructured online sources, such as blogs and articles to more formal representations has become a necessity for many applications in the domain nowadays. Named Entity Recognition (NER) is one of the early phases towards this goal. It involves the detection of the relevant domain entities, such as product, version, attack name, etc. in technical documents. Although generally considered a simple task in the information extraction field, it is quite challenging in some domains like cybersecurity because of the complex structure of its entities. The state of the art methods require time-consuming and labor intensive feature engineering that describes the properties of the entities, their context, domain knowledge, and linguistic characteristics. The model demonstrated in this paper is domain independent and does not rely on any features specific to the entities in the cybersecurity domain, hence does not require expert knowledge to perform feature engineering. The method used relies on a type of recurrent neural networks called Long Short-Term Memory (LSTM) and the Conditional Random Fields (CRFs) method. The results we obtained showed that this method outperforms the state of the art methods given an annotated corpus of a decent size.

9/18/2024

Unveiling Social Media Comments with a Novel Named Entity Recognition System for Identity Groups

Andrés Carvallo, Tamara Quiroga, Carlos Aspillaga, Marcelo Mendoza

While civilized users employ social media to stay informed and discuss daily occurrences, haters perceive these platforms as fertile ground for attacking groups and individuals. The prevailing approach to counter this phenomenon involves detecting such attacks by identifying toxic language. Effective platform measures aim to report haters and block their network access. In this context, employing hate speech detection methods aids in identifying these attacks amidst vast volumes of text, which are impossible for humans to analyze manually. In our study, we expand upon the usual hate speech detection methods, typically based on text classifiers, to develop a Named Entity Recognition (NER) System for Identity Groups. To achieve this, we created a dataset that allows extending a conventional NER to recognize identity groups. Consequently, our tool not only detects whether a sentence contains an attack but also tags the sentence tokens corresponding to the mentioned group. Results indicate that the model performs competitively in identifying groups with an average f1-score of 0.75, outperforming in identifying ethnicity attack spans with an f1-score of 0.80 compared to other identity groups. Moreover, the tool shows an outstanding generalization capability to minority classes concerning sexual orientation and gender, achieving an f1-score of 0.77 and 0.72, respectively. We tested the utility of our tool in a case study on social media, annotating and comparing comments from Facebook related to news mentioning identity groups. The case study reveals differences in the types of attacks recorded, effectively detecting named entities related to the categories of the analyzed news articles. Entities are accurately tagged within their categories, with a negligible error rate for inter-category tagging.

5/24/2024