Applying BioBERT to Extract Germline Gene-Disease Associations for Building a Knowledge Graph from the Biomedical Literature

Read original: arXiv:2309.13061 - Published 4/24/2024 by Armando D. Diaz Gonzalez, Kevin S. Hughes, Songhui Yue, Sean T. Hayes


Overview

This study analyzes germline abstracts to create a knowledge graph of genes and diseases.
The researchers employ BioBERT, a pre-trained language model, to extract genes and diseases.
They use an ontology-based and rule-based approach to standardize and disambiguate medical terms.
The study connects each entity to its data source using a part-whole relation approach and visualizes the knowledge graph.
The knowledge graph contains 297 genes, 130 diseases, and 46,747 triples.

Plain English Explanation

As the amount of biomedical information available rapidly increases, researchers have become interested in automating the process of extracting, normalizing, and representing this knowledge, particularly about genes and diseases. This study aims to create a knowledge graph that connects germline genes and diseases.

The researchers used a powerful language model called BioBERT, which has been pre-trained on biomedical data, to identify genes and diseases in the germline abstracts. They then developed an algorithm to standardize and disambiguate the medical terms, so that each entity is represented consistently.

To show the relationships between the different elements, the researchers used a "part-whole" approach, which connects each gene and disease to the data source it came from. This allows them to visualize the entire knowledge graph in a clear and informative way.

The resulting knowledge graph contains 297 genes, 130 diseases, and 46,747 connections (or "triples") among articles, genes, and diseases. Because each gene and disease is linked back to the articles that mention it, the graph could help researchers and clinicians in genetics and genomics trace an entity to its supporting literature.

Technical Explanation

The researchers employed BioBERT, a BERT model pre-trained on biomedical corpora, to extract gene and disease mentions from the germline abstracts. They then applied an ontology-based and rule-based algorithm to standardize and disambiguate the medical terms, so that each entity is represented consistently.
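The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the tagger emits BIO labels per token (a common convention for BERT-style entity recognizers), and the label names, the `decode_bio` and `normalize` helpers, and the tiny `SYNONYMS` lexicon are all hypothetical stand-ins for the model output and ontology used in the paper.

```python
from typing import Dict, List, Optional, Tuple

# Toy lexicon written by hand for illustration only.
# It is NOT the ontology used in the paper.
SYNONYMS: Dict[str, Dict[str, str]] = {
    "GENE": {"brca1": "BRCA1", "brca-1": "BRCA1", "breast cancer 1": "BRCA1"},
    "DISEASE": {"breast cancer": "breast cancer", "breast carcinoma": "breast cancer"},
}


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (mention, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that was still open.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # Continuation of the open entity.
            current.append(token)
        else:
            # "O" tag (or an I- tag with no matching open entity): close and reset.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


def normalize(mention: str, entity_type: str) -> Optional[str]:
    """Rule step (lowercase, collapse whitespace), then a lexicon lookup.

    Returns None when the mention is not covered by the lexicon.
    """
    key = " ".join(mention.lower().split())
    return SYNONYMS.get(entity_type, {}).get(key)


# Hypothetical tagger output for one sentence.
tokens = ["Germline", "BRCA-1", "variants", "raise", "breast", "carcinoma", "risk"]
tags = ["O", "B-GENE", "O", "O", "B-DISEASE", "I-DISEASE", "O"]

mentions = decode_bio(tokens, tags)
normalized = [(normalize(m, t), t) for m, t in mentions]
```

Here "BRCA-1" and "breast carcinoma" collapse to the canonical names "BRCA1" and "breast cancer", while a mention outside the lexicon returns None, which is one way a dictionary-driven normalizer can fall short of full coverage.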

To capture the semantic relationships between the articles, genes, and diseases, the researchers implemented a part-whole relation approach. This allowed them to connect each entity to its data source and visualize the entire knowledge graph.
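As a rough sketch of what a part-whole representation might look like, the Python below turns a mapping from articles to their normalized entities into subject-predicate-object triples. The predicate names (`hasPart`, `partOf`, `type`), the `article:N` identifiers, and the `part_whole_triples` helper are assumptions made for illustration; the paper's actual schema may differ.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical input: normalized entities found in each article abstract.
ARTICLE_ENTITIES: Dict[str, Dict[str, List[str]]] = {
    "article:1": {"genes": ["BRCA1", "BRCA2"], "diseases": ["breast cancer"]},
    "article:2": {"genes": ["BRCA1"], "diseases": ["ovarian cancer"]},
}


def part_whole_triples(article_entities: Dict[str, Dict[str, List[str]]]) -> Set[Triple]:
    """Link every entity to the article it was extracted from.

    The article is the "whole"; each gene or disease mention is a "part".
    A set is used so that repeated facts are stored only once.
    """
    triples: Set[Triple] = set()
    for article_id, entities in article_entities.items():
        triples.add((article_id, "type", "Article"))
        for key, entity_type in (("genes", "Gene"), ("diseases", "Disease")):
            for entity in entities.get(key, []):
                triples.add((entity, "type", entity_type))
                triples.add((article_id, "hasPart", entity))
                triples.add((entity, "partOf", article_id))
    return triples


triples = part_whole_triples(ARTICLE_ENTITIES)
```

Storing both directions (`hasPart` and `partOf`) is a design choice in this sketch that makes lookups from either side simple; a real triple store could instead declare the two predicates as inverses.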

The resulting knowledge graph contains 297 genes, 130 diseases, and 46,747 triples. The researchers used graph-based visualizations to present the data, which can help researchers and clinicians better understand the complex relationships between genes and diseases.
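Counts like these can be derived directly from a triple list. The Python sketch below, using a small made-up set of triples and hypothetical predicate names (`type`, `hasPart`), shows one way to summarize a graph and to list gene-disease pairs that appear in the same article. Co-occurrence in an article is only a candidate link in this sketch, not a confirmed association, and nothing here reproduces the paper's data.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Made-up triples for illustration; not taken from the paper's graph.
TRIPLES: List[Triple] = [
    ("BRCA1", "type", "Gene"),
    ("BRCA2", "type", "Gene"),
    ("breast cancer", "type", "Disease"),
    ("ovarian cancer", "type", "Disease"),
    ("article:1", "hasPart", "BRCA1"),
    ("article:1", "hasPart", "BRCA2"),
    ("article:1", "hasPart", "breast cancer"),
    ("article:2", "hasPart", "BRCA1"),
    ("article:2", "hasPart", "ovarian cancer"),
]


def entities_of_type(triples: List[Triple], entity_type: str) -> Set[str]:
    """Collect every subject declared with the given type."""
    return {s for s, p, o in triples if p == "type" and o == entity_type}


def summarize(triples: List[Triple]) -> Dict[str, int]:
    """Count distinct genes, distinct diseases, and total triples."""
    return {
        "genes": len(entities_of_type(triples, "Gene")),
        "diseases": len(entities_of_type(triples, "Disease")),
        "triples": len(triples),
    }


def cooccurring_pairs(triples: List[Triple]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each (gene, disease) pair to the articles that contain both."""
    genes = entities_of_type(triples, "Gene")
    diseases = entities_of_type(triples, "Disease")
    parts: Dict[str, Set[str]] = defaultdict(set)
    for s, p, o in triples:
        if p == "hasPart":
            parts[s].add(o)
    pairs: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for article, members in parts.items():
        for gene, disease in product(sorted(members & genes), sorted(members & diseases)):
            pairs[(gene, disease)].add(article)
    return dict(pairs)


summary = summarize(TRIPLES)
pairs = cooccurring_pairs(TRIPLES)
```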

Critical Analysis

The researchers acknowledge several limitations and challenges in their work. For example, they note that the ontology-based and rule-based approach to standardizing and disambiguating medical terms may not be fully comprehensive, and that more advanced natural language processing techniques could be explored in the future.

Additionally, the researchers mention that the knowledge graph is limited to the information contained in the germline abstracts, and that expanding the data sources could lead to a more comprehensive understanding of the relationships between genes and diseases.

Further research could also investigate the potential applications of this knowledge graph, such as automated text mining or integrating heterogeneous gene expression data, and explore ways to improve the overall quality and usefulness of the knowledge representation.

Conclusion

This study presents an automatic approach to constructing a knowledge graph that connects germline genes and diseases. By combining BioBERT-based entity extraction, ontology-based and rule-based term normalization, and a part-whole relation approach, the researchers built a graph-based representation of 297 genes, 130 diseases, and 46,747 triples.

While the knowledge graph has some limitations, it represents an important step forward in the automated extraction and normalization of biomedical information. The researchers' work could serve as a foundation for future studies aimed at improving the construction of knowledge graphs and exploring their potential applications in the field of genetics and genomics.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Applying BioBERT to Extract Germline Gene-Disease Associations for Building a Knowledge Graph from the Biomedical Literature

Armando D. Diaz Gonzalez, Kevin S. Hughes, Songhui Yue, Sean T. Hayes

Published biomedical information has increased and continues to increase rapidly. The recent advancements in Natural Language Processing (NLP) have generated considerable interest in automating the extraction, normalization, and representation of biomedical knowledge about entities such as genes and diseases. Our study analyzes germline abstracts in the construction of knowledge graphs of the immense work that has been done in this area for genes and diseases. This paper presents SimpleGermKG, an automatic knowledge graph construction approach that connects germline genes and diseases. For the extraction of genes and diseases, we employ BioBERT, a BERT model pre-trained on biomedical corpora. We propose an ontology-based and rule-based algorithm to standardize and disambiguate medical terms. For semantic relationships between articles, genes, and diseases, we implemented a part-whole relation approach to connect each entity with its data source and visualize them in a graph-based knowledge representation. Lastly, we discuss the knowledge graph applications, limitations, and challenges to inspire future research on germline corpora. Our knowledge graph contains 297 genes, 130 diseases, and 46,747 triples. Graph-based visualizations are used to show the results.

4/24/2024


BioBERT-based Deep Learning and Merged ChemProt-DrugProt for Enhanced Biomedical Relation Extraction

Bridget T. McInnes, Jiawei Tang, Darshini Mahendran, Mai H. Nguyen

This paper presents a methodology for enhancing relation extraction from biomedical texts, focusing specifically on chemical-gene interactions. Leveraging the BioBERT model and a multi-layer fully connected network architecture, our approach integrates the ChemProt and DrugProt datasets using a novel merging strategy. Through extensive experimentation, we demonstrate significant performance improvements, particularly in CPR groups shared between the datasets. The findings underscore the importance of dataset merging in augmenting sample counts and improving model accuracy. Moreover, the study highlights the potential of automated information extraction in biomedical research and clinical practice.

5/30/2024

Enhancing Biomedical Knowledge Discovery for Diseases: An End-To-End Open-Source Framework

Christos Theodoropoulos, Andrei Catalin Coman, James Henderson, Marie-Francine Moens

The ever-growing volume of biomedical publications creates a critical need for efficient knowledge discovery. In this context, we introduce an open-source end-to-end framework designed to construct knowledge around specific diseases directly from raw text. To facilitate research in disease-related knowledge discovery, we create two annotated datasets focused on Rett syndrome and Alzheimer's disease, enabling the identification of semantic relations between biomedical entities. Extensive benchmarking explores various ways to represent relations and entity representations, offering insights into optimal modeling strategies for semantic relation detection and highlighting language models' competence in knowledge discovery. We also conduct probing experiments using different layer representations and attention scores to explore transformers' ability to capture semantic relations.

9/9/2024

Generalized knowledge-enhanced framework for biomedical entity and relation extraction

Minh Nguyen, Phuong Le

In recent years, there has been an increasing number of frameworks developed for biomedical entity and relation extraction. This research effort aims to address the accelerating growth in biomedical publications and the intricate nature of biomedical texts, which are written mainly for domain experts. To handle these challenges, we develop a novel framework that utilizes external knowledge to construct a task-independent and reusable background knowledge graph for biomedical entity and relation extraction. The design of our model is inspired by how humans learn domain-specific topics. In particular, humans often first acquire the most basic and common knowledge regarding a field to build the foundational knowledge and then use that as a basis for extending to various specialized topics. Our framework employs such a common-knowledge-sharing mechanism to build a general neural-network knowledge graph that transfers effectively to different domain-specific biomedical texts. Experimental evaluations demonstrate that our model, equipped with this generalized and cross-transferable knowledge base, achieves competitive performance on benchmarks including BioRelEx for binding interaction detection and ADE for Adverse Drug Effect identification.

8/14/2024