Biomedical Entity Linking for Dutch: Fine-tuning a Self-alignment BERT Model on an Automatically Generated Wikipedia Corpus

Read original: arXiv:2405.11941 - Published 5/21/2024 by Fons Hartendorp, Tom Seinen, Erik van Mulligen, Suzan Verberne

Overview

This paper explores the task of biomedical entity linking for the Dutch language, which involves associating mentions of biomedical concepts in text with their corresponding entries in a knowledge base.
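The researchers start from MedRoBERTa.nl, apply second-phase pretraining through self-alignment on a Dutch biomedical ontology extracted from the UMLS and Dutch SNOMED, and then fine-tune the model on an automatically derived Wikipedia corpus of ontology-linked Dutch biomedical entities in context.
The model was evaluated on the Dutch portion of the Mantra GSC corpus, where it reached 54.7% classification accuracy and 69.8% 1-distance accuracy. The authors present it as the first evaluated biomedical entity linking model for Dutch.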

Plain English Explanation

Biomedical entity linking is the process of identifying and linking mentions of medical concepts (like diseases, drugs, or genes) in text to their corresponding entries in a database or knowledge base. This allows systems to better understand the meaning and context of biomedical information. However, most existing research has focused on English, while there is less work on other languages like Dutch.

To address this, the researchers started from MedRoBERTa.nl, a Dutch medical language model, and adapted it in two steps. First, they continued pretraining it with self-alignment on a Dutch biomedical ontology drawn from the UMLS and Dutch SNOMED, so that different names for the same concept end up with similar representations. Second, they fine-tuned it on an automatically generated corpus of Dutch Wikipedia text in which biomedical entity mentions are linked to ontology concepts.

The researchers evaluated their model on the Dutch portion of the Mantra GSC corpus, an existing annotated dataset, where it reached 54.7% classification accuracy and 69.8% under the more lenient 1-distance accuracy measure. They also ran a case study on unlabeled posts from a patient-support forum. There, the limited quality of the preceding entity recognition step held the system back; in a small manually checked sample, about 65% of the correctly extracted entities were linked to the correct concept.

Overall, this work provides a first evaluated model for biomedical entity linking in Dutch, with applications in areas like medical information retrieval and high-level analysis of patient-generated text.

Technical Explanation

The researchers in this paper developed a Dutch biomedical entity linking model by fine-tuning a self-alignment BERT architecture on an automatically generated Dutch Wikipedia corpus. Their key contributions are:

Corpus Creation: They derived a corpus from Dutch Wikipedia in which biomedical entity mentions appear in context and are linked to concepts in a biomedical ontology.
Model Training: Starting from MedRoBERTa.nl, they performed second-phase pretraining through self-alignment on a Dutch biomedical ontology extracted from the UMLS and Dutch SNOMED, then fine-tuned the resulting model on the Wikipedia-derived corpus.
Evaluation: They evaluated the model on the Dutch portion of the Mantra GSC corpus, reporting 54.7% classification accuracy and 69.8% 1-distance accuracy, and ran a case study on unlabeled patient-support forum data.
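To make the corpus creation step more concrete, here is a minimal sketch of one way linked mentions could be harvested from wiki markup. It assumes, purely for illustration, that Wikipedia page titles have already been mapped to ontology concepts; the mapping `TITLE_TO_CONCEPT`, the placeholder concept IDs, and the function `extract_linked_mentions` are hypothetical and are not taken from the paper, whose actual pipeline may differ.

```python
import re

# Hypothetical mapping from Wikipedia page titles to ontology concept IDs.
# The IDs are placeholders for illustration, not real UMLS or SNOMED codes.
TITLE_TO_CONCEPT = {
    "Diabetes mellitus": "CONCEPT_0001",
    "Insuline": "CONCEPT_0002",
}

# Matches wiki links of the form [[Target]] or [[Target|surface text]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_linked_mentions(wikitext, title_to_concept):
    """Return (mention, concept_id, plain_sentence) for links to known concepts."""
    examples = []
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", wikitext.strip()):
        # Replace every link by its surface form to get the plain sentence.
        plain = WIKILINK.sub(lambda m: m.group(2) or m.group(1), sentence)
        for match in WIKILINK.finditer(sentence):
            target = match.group(1).strip()
            surface = (match.group(2) or match.group(1)).strip()
            concept_id = title_to_concept.get(target)
            if concept_id is not None:  # keep only links to mapped concepts
                examples.append((surface, concept_id, plain))
    return examples


text = (
    "Bij [[Diabetes mellitus|suikerziekte]] maakt het lichaam te weinig "
    "[[Insuline|insuline]] aan. "
    "De [[Alvleesklier|alvleesklier]] speelt hierbij een rol."
)
examples = extract_linked_mentions(text, TITLE_TO_CONCEPT)
for mention, concept_id, context in examples:
    print(mention, concept_id, context, sep=" | ")
```

In this toy input, the link to "Alvleesklier" is skipped because it has no entry in the hypothetical mapping, which mirrors how an automatically generated corpus only covers concepts that the ontology mapping reaches.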

Self-alignment, the technique introduced with SapBERT, trains the encoder so that different names for the same ontology concept receive nearby embeddings while names of different concepts are pushed apart. Entity linking with such an encoder typically amounts to embedding a mention and retrieving the closest concept name in the ontology. The paper's abstract does not detail the loss function or the candidate-ranking procedure used during fine-tuning.
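As a rough illustration of how embeddings from a self-aligned encoder can be used for linking, the sketch below performs nearest-neighbour search by cosine similarity. The three-dimensional vectors stand in for real encoder outputs, and the term list, concept IDs, and the `link` function are assumptions made for this example rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy "embeddings" of ontology term names (hypothetical values).
# After self-alignment, synonyms of one concept should lie close together.
ONTOLOGY = [
    ("suikerziekte", "CONCEPT_0001", [0.9, 0.1, 0.0]),
    ("diabetes mellitus", "CONCEPT_0001", [0.8, 0.2, 0.1]),
    ("hoofdpijn", "CONCEPT_0003", [0.0, 0.9, 0.3]),
]


def link(mention_vec, ontology):
    """Return (concept_id, matched_name, score) of the most similar term."""
    name, concept_id, vec = max(ontology, key=lambda e: cosine(mention_vec, e[2]))
    return concept_id, name, cosine(mention_vec, vec)


# A mention vector that a real encoder might produce for a diabetes-related mention.
concept_id, matched_name, score = link([0.85, 0.15, 0.05], ONTOLOGY)
print(concept_id, matched_name, round(score, 3))
```

With a real model, the vectors would come from the encoder, and the ontology side would usually be embedded once and indexed, since it can contain a very large number of term names.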

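On the Dutch portion of the Mantra GSC corpus, the model reaches 54.7% classification accuracy and 69.8% 1-distance accuracy. In a case study on unlabeled patient-support forum data, linking quality was hampered by the limited quality of the preceding entity recognition step; a manual evaluation of a small sample found that around 65% of the correctly extracted entities were linked to the correct concept. The authors conclude that biomedical entity linking in a language other than English remains challenging, but that the Dutch model can be used for high-level analysis of patient-generated text.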
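The paper's abstract reports two metrics on the Mantra GSC corpus: classification accuracy and 1-distance accuracy. The sketch below shows how such metrics might be computed, under the assumption that 1-distance accuracy counts a prediction as correct when it is either the gold concept or one step away from it (a direct parent or child) in the ontology hierarchy. That definition, the tiny `PARENT` hierarchy, and the concept names are illustrative assumptions; the paper should be consulted for the exact definition.

```python
# Hypothetical child -> parent links in a tiny ontology.
PARENT = {
    "diabetes_type_1": "diabetes",
    "diabetes_type_2": "diabetes",
    "diabetes": "endocrine_disorder",
    "migraine": "headache",
}


def within_one_step(predicted, gold, parent):
    """True if predicted equals gold or is its direct parent or child."""
    return (
        predicted == gold
        or parent.get(predicted) == gold
        or parent.get(gold) == predicted
    )


def evaluate(predictions, golds, parent):
    """Return (exact accuracy, assumed 1-distance accuracy)."""
    if not golds or len(predictions) != len(golds):
        raise ValueError("predictions and golds must be non-empty and aligned")
    pairs = list(zip(predictions, golds))
    exact = sum(p == g for p, g in pairs)
    near = sum(within_one_step(p, g, parent) for p, g in pairs)
    return exact / len(pairs), near / len(pairs)


golds = ["diabetes_type_1", "migraine", "diabetes", "headache"]
predictions = ["diabetes", "migraine", "endocrine_disorder", "diabetes_type_2"]
accuracy, one_distance_accuracy = evaluate(predictions, golds, PARENT)
print(accuracy, one_distance_accuracy)
```

Under this reading, the gap between the two reported numbers would reflect predictions that land on a closely related concept rather than the exact gold one.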

Critical Analysis

Several limitations and areas for future work stand out:

The automatically generated Wikipedia corpus may contain noisy or inaccurate annotations that could affect model performance.
The evaluation relies on the Dutch portion of the Mantra GSC corpus, which may not fully capture the diversity of Dutch biomedical text.
Entity linking depends on a preceding entity recognition step, and the case study on patient-support forum data shows that the limited quality of that step hampers the linking results.

Additionally, with a classification accuracy of 54.7% there is clear room to improve performance, for example through ensemble modeling or the incorporation of additional domain-specific knowledge.

It would also be valuable to see how this Dutch biomedical entity linking approach compares to multilingual models that could potentially be applied to Dutch without the need for language-specific fine-tuning.

Overall, this paper makes an important contribution to biomedical text mining for the Dutch language, but there is still room for further research and refinement of these methods.

Conclusion

This paper presents the first evaluated biomedical entity linking model for Dutch, built by self-alignment pretraining of MedRoBERTa.nl on a Dutch biomedical ontology followed by fine-tuning on an automatically generated Dutch Wikipedia corpus. The resulting model achieves 54.7% classification accuracy and 69.8% 1-distance accuracy on the Dutch portion of the Mantra GSC corpus.

This work helps advance natural language processing capabilities for the Dutch language, with important implications for applications like medical information retrieval, clinical decision support, and automated analysis of Dutch biomedical literature. The fine-tuning approach demonstrated here could also potentially be applied to other low-resource languages to develop specialized models for biomedical entity linking and related tasks.

While the results are promising, the researchers identify areas for future work to further improve the robustness and generalization of these techniques. Continued research in this direction has the potential to unlock new opportunities for leveraging biomedical text data to drive insights and innovations in healthcare and life sciences, even in languages that have traditionally been underserved by such technologies.

Related Papers

Biomedical Entity Linking for Dutch: Fine-tuning a Self-alignment BERT Model on an Automatically Generated Wikipedia Corpus

Fons Hartendorp, Tom Seinen, Erik van Mulligen, Suzan Verberne

Biomedical entity linking, a main component in automatic information extraction from health-related texts, plays a pivotal role in connecting textual entities (such as diseases, drugs and body parts mentioned by patients) to their corresponding concepts in a structured biomedical knowledge base. The task remains challenging despite recent developments in natural language processing. This paper presents the first evaluated biomedical entity linking model for the Dutch language. We use MedRoBERTa.nl as base model and perform second-phase pretraining through self-alignment on a Dutch biomedical ontology extracted from the UMLS and Dutch SNOMED. We derive a corpus from Wikipedia of ontology-linked Dutch biomedical entities in context and fine-tune our model on this dataset. We evaluate our model on the Dutch portion of the Mantra GSC-corpus and achieve 54.7% classification accuracy and 69.8% 1-distance accuracy. We then perform a case study on a collection of unlabeled, patient-support forum data and show that our model is hampered by the limited quality of the preceding entity recognition step. Manual evaluation of a small sample indicates that of the correctly extracted entities, around 65% are linked to the correct concept in the ontology. Our results indicate that biomedical entity linking in a language other than English remains challenging, but our Dutch model can be used for high-level analysis of patient-generated text.

5/21/2024

Efficient Biomedical Entity Linking: Clinical Text Standardization with Low-Resource Techniques

Akshit Achara, Sanand Sasidharan, Gagan N

Clinical text is rich in information, with mentions of treatment, medication and anatomy among many other clinical terms. Multiple terms can refer to the same core concept, which can be referred to as a clinical entity. Ontologies like the Unified Medical Language System (UMLS) are developed and maintained to store millions of clinical entities including the definitions, relations and other corresponding information. These ontologies are used for standardization of clinical text by normalizing varying surface forms of a clinical term through biomedical entity linking. With the introduction of transformer-based language models, there has been significant progress in biomedical entity linking. In this work, we focus on learning through synonym pairs associated with the entities. Compared to existing approaches, our approach significantly reduces the training data and resource consumption. Moreover, we propose a suite of context-based and context-less reranking techniques for performing the entity disambiguation. Overall, we achieve similar performance to the state-of-the-art zero-shot and distant supervised entity linking techniques on the MedMentions dataset, the largest annotated dataset on UMLS, without any domain-based training. Finally, we show that retrieval performance alone might not be sufficient as an evaluation metric and introduce an article-level quantitative and qualitative analysis to reveal further insights on the performance of entity linking methods.

5/28/2024

Hybrid X-Linker: Automated Data Generation and Extreme Multi-label Ranking for Biomedical Entity Linking

Pedro Ruas, Fernando Gallego, Francisco J. Veredas, Francisco M. Couto

State-of-the-art deep learning entity linking methods rely on extensive human-labelled data, which is costly to acquire. Current datasets are limited in size, leading to inadequate coverage of biomedical concepts and diminished performance when applied to new data. In this work, we propose to automatically generate data to create large-scale training datasets, which allows the exploration of approaches originally developed for the task of extreme multi-label ranking in the biomedical entity linking task. We propose the hybrid X-Linker pipeline that includes different modules to link disease and chemical entity mentions to concepts in the MEDIC and the CTD-Chemical vocabularies, respectively. X-Linker was evaluated on several biomedical datasets: BC5CDR-Disease, BioRED-Disease, NCBI-Disease, BC5CDR-Chemical, BioRED-Chemical, and NLM-Chem, achieving top-1 accuracies of 0.8307, 0.7969, 0.8271, 0.9511, 0.9248, and 0.7895, respectively. X-Linker demonstrated superior performance in three datasets: BC5CDR-Disease, NCBI-Disease, and BioRED-Chemical. In contrast, SapBERT outperformed X-Linker in the remaining three datasets. Both models rely only on the mention string for their operations. The source code of X-Linker and its associated data are publicly available for performing biomedical entity linking without requiring pre-labelled entities with identifiers from specific knowledge organization systems.

7/10/2024

Biomedical Entity Linking as Multiple Choice Question Answering

Zhenxi Lin, Ziheng Zhang, Xian Wu, Yefeng Zheng

Although biomedical entity linking (BioEL) has made significant progress with pre-trained language models, challenges still exist for fine-grained and long-tailed entities. To address these challenges, we present BioELQA, a novel model that treats Biomedical Entity Linking as Multiple Choice Question Answering. BioELQA first obtains candidate entities with a fast retriever, jointly presents the mention and candidate entities to a generator, and then outputs the predicted symbol associated with its chosen entity. This formulation enables explicit comparison of different candidate entities, thus capturing fine-grained interactions between mentions and entities, as well as among entities themselves. To improve generalization for long-tailed entities, we retrieve similar labeled training instances as clues and concatenate the input with retrieved instances for the generator. Extensive experimental results show that BioELQA outperforms state-of-the-art baselines on several datasets.

5/20/2024