Augmenting Biomedical Named Entity Recognition with General-domain Resources

Read original: arXiv:2406.10671 - Published 6/21/2024 by Yu Yin, Hyunjae Kim, Xiao Xiao, Chih Hsuan Wei, Jaewoo Kang, Zhiyong Lu, Hua Xu, Meng Fang, Qingyu Chen

👁️

Overview

Biomedical named entity recognition (BioNER) models are used to identify and classify important entities like genes, diseases, and drugs in biomedical text.
Training these models typically requires extensive and costly human annotations.
Previous studies have used multi-task learning with multiple BioNER datasets, but this approach does not consistently improve performance and can introduce label ambiguity.
This paper proposes a method called GERBERA that leverages a general-domain named entity recognition (NER) dataset to improve BioNER model performance.

Plain English Explanation

The paper describes a new way to train BioNER models more efficiently. BioNER models are used to automatically find and categorize important biomedical terms like gene names and disease names in scientific literature. Training these models usually requires a lot of time and effort from human experts to manually annotate large datasets.

Previous research has tried using multiple BioNER datasets together to train the models, in a technique called multi-task learning. However, this approach doesn't always work well and can sometimes lead to confusion because the different datasets use different labeling schemes.

The researchers in this paper came up with a simpler but more effective method called GERBERA. Instead of using multiple biomedical datasets, GERBERA uses a general-purpose NER dataset that covers everyday entities like people, locations, and organizations. The researchers first train the model on this general dataset, and then fine-tune it specifically for the target BioNER task.

Despite using fewer specialized biomedical resources, the GERBERA models outperformed other state-of-the-art BioNER models in most of the experiments. The improvement was especially significant for BioNER datasets with limited training data, where GERBERA gave a 4.7% boost in performance.

The key idea is that learning general language patterns from the more abundant non-biomedical dataset can help the model better recognize biomedical entities, even with less direct biomedical training data. This transfer learning approach is more efficient and effective than relying solely on expensive custom-annotated biomedical datasets.

Technical Explanation

The paper proposes a transfer learning approach called GERBERA (General-domain named entity Recognition for BiomedicalText) to improve the performance of BioNER models. Rather than using multiple BioNER datasets in a multi-task learning setup, which can lead to label ambiguity, GERBERA leverages a general-domain NER dataset as an auxiliary task.

Specifically, the researchers first pre-train a biomedical language model using both the target BioNER dataset and the general-domain NER dataset in a multi-task learning framework. This allows the model to learn transferable language representations from the more abundant general-domain data. Then, they fine-tune the model solely on the target BioNER dataset.

The researchers systematically evaluated GERBERA on five BioNER datasets covering eight entity types, totaling 81,410 instances. Despite using fewer biomedical resources, the GERBERA models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. The GERBERA models outperformed the baselines in six out of eight entity types, with an average 0.9% improvement in F1 score.

The method was particularly effective for BioNER datasets with limited training data. On the JNLPBA-RNA dataset, GERBERA achieved a 4.7% higher F1 score compared to the best baseline. This suggests that the general-domain pre-training helps the model generalize better, especially when direct biomedical training data is scarce.

Critical Analysis

The paper provides a compelling approach to improve BioNER model performance while reducing the need for expensive, custom-annotated biomedical datasets. The key insight of leveraging general-domain data through transfer learning is an elegant solution to the data scarcity challenge in the biomedical domain.

However, the paper does not extensively explore the limitations of the GERBERA method. For example, it is unclear how the approach would perform on BioNER tasks that are very different from the general-domain NER dataset used in the experiments. The researchers also do not discuss potential biases or mistakes that could arise from applying a model trained on non-biomedical data to a specialized biomedical domain.

Additionally, the paper does not compare GERBERA to other transfer learning approaches, such as DeviceBERT or DisTALANER, which also aim to reduce the need for extensive biomedical annotations. Situating GERBERA in the context of related transfer learning techniques for BioNER would provide a more comprehensive evaluation.

Despite these limitations, the GERBERA method represents a promising direction for improving the efficiency and performance of BioNER models, particularly for datasets with limited training data. The researchers should be commended for their innovative approach to leveraging general-domain resources to tackle a challenging problem in the biomedical domain.

Conclusion

This paper introduces GERBERA, a transfer learning method that uses a general-domain named entity recognition dataset to improve the performance of biomedical named entity recognition (BioNER) models. By pre-training on the general-domain dataset and then fine-tuning on the target BioNER dataset, GERBERA achieves superior results compared to models trained solely on multiple BioNER datasets.

The key innovation of GERBERA is its ability to leverage more abundant and accessible general-domain data to learn transferable language representations, which can then be effectively applied to the specialized biomedical domain. This approach is especially beneficial for BioNER datasets with limited training data, where GERBERA demonstrated significant performance gains.

The GERBERA method represents an important step towards making BioNER models more efficient and effective, reducing the need for costly human annotations. As biomedical research continues to produce vast amounts of textual data, tools like GERBERA will become increasingly valuable for automatically extracting and organizing critical information from scientific literature.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

👁️

Augmenting Biomedical Named Entity Recognition with General-domain Resources

Yu Yin, Hyunjae Kim, Xiao Xiao, Chih Hsuan Wei, Jaewoo Kang, Zhiyong Lu, Hua Xu, Meng Fang, Qingyu Chen

Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets. In this paper, we proposed GERBERA, a simple-yet-effective method that utilized a general-domain NER dataset for training. Specifically, we performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset. We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with multiple additional BioNER datasets. Specifically, our models consistently outperformed the baselines in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight biomedical entity types sourced from five different corpora. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset.

6/21/2024

BioMNER: A Dataset for Biomedical Method Entity Recognition

Chen Tang, Bohao Yang, Kun Zhao, Bo Lv, Chenghao Xiao, Frank Guerin, Chenghua Lin

Named entity recognition (NER) stands as a fundamental and pivotal task within the realm of Natural Language Processing. Particularly within the domain of Biomedical Method NER, this task presents notable challenges, stemming from the continual influx of domain-specific terminologies in scholarly literature. Current research in Biomedical Method (BioMethod) NER suffers from a scarcity of resources, primarily attributed to the intricate nature of methodological concepts, which necessitate a profound understanding for precise delineation. In this study, we propose a novel dataset for biomedical method entity recognition, employing an automated BioMethod entity recognition and information retrieval system to assist human annotation. Furthermore, we comprehensively explore a range of conventional and contemporary open-domain NER methodologies, including the utilization of cutting-edge large-scale language models (LLMs) customised to our dataset. Our empirical findings reveal that the large parameter counts of language models surprisingly inhibit the effective assimilation of entity extraction patterns pertaining to biomedical methods. Remarkably, the approach, leveraging the modestly sized ALBERT model (only 11MB), in conjunction with conditional random fields (CRF), achieves state-of-the-art (SOTA) performance.

7/1/2024

👁️

From Zero to Hero: Harnessing Transformers for Biomedical Named Entity Recognition in Zero- and Few-shot Contexts

Milov{s} Kov{s}prdi'c, Nikola Prodanovi'c, Adela Ljaji'c, Bojana Bav{s}aragin, Nikola Milov{s}evi'c

Supervised named entity recognition (NER) in the biomedical domain depends on large sets of annotated texts with the given named entities. The creation of such datasets can be time-consuming and expensive, while extraction of new entities requires additional annotation tasks and retraining the model. To address these challenges, this paper proposes a method for zero- and few-shot NER in the biomedical domain. The method is based on transforming the task of multi-class token classification into binary token classification and pre-training on a large amount of datasets and biomedical entities, which allow the model to learn semantic relations between the given and potentially novel named entity labels. We have achieved average F1 scores of 35.44% for zero-shot NER, 50.10% for one-shot NER, 69.94% for 10-shot NER, and 79.51% for 100-shot NER on 9 diverse evaluated biomedical entities with fine-tuned PubMedBERT-based model. The results demonstrate the effectiveness of the proposed method for recognizing new biomedical entities with no or limited number of examples, outperforming previous transformer-based methods, and being comparable to GPT3-based models using models with over 1000 times fewer parameters. We make models and developed code publicly available.

8/27/2024

Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition

Yuming Yang, Wantong Zhao, Caishuang Huang, Junjie Ye, Xiao Wang, Huiyuan Zheng, Yang Nan, Yuran Wang, Xueying Xu, Kaixin Huang, Yunke Zhang, Tao Gui, Qi Zhang, Xuanjing Huang

Open Named Entity Recognition (NER), which involves identifying arbitrary types of entities from arbitrary domains, remains challenging for Large Language Models (LLMs). Recent studies suggest that fine-tuning LLMs on extensive NER data can boost their performance. However, training directly on existing datasets faces issues due to inconsistent entity definitions and redundant data, limiting LLMs to dataset-specific learning and hindering out-of-domain generalization. To address this, we present B2NERD, a cohesive and efficient dataset for Open NER, normalized from 54 existing English or Chinese datasets using a two-step approach. First, we detect inconsistent entity definitions across datasets and clarify them by distinguishable label names to construct a universal taxonomy of 400+ entity types. Second, we address redundancy using a data pruning strategy that selects fewer samples with greater category and semantic diversity. Comprehensive evaluation shows that B2NERD significantly improves LLMs' generalization on Open NER. Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods in 3 out-of-domain benchmarks across 15 datasets and 6 languages.

6/18/2024