BioMNER: A Dataset for Biomedical Method Entity Recognition

Read original: arXiv:2406.20038 - Published 7/1/2024 by Chen Tang, Bohao Yang, Kun Zhao, Bo Lv, Chenghao Xiao, Frank Guerin, Chenghua Lin

Overview

This paper introduces a new dataset called BioMNER for biomedical method entity recognition.
The dataset contains annotations for various types of method entities found in biomedical literature, with the goal of improving natural language processing (NLP) tasks in this domain.
The authors also present baseline results for several state-of-the-art NLP models on the BioMNER dataset.

Plain English Explanation

The paper describes the creation of a new dataset called BioMNER, which is designed to help machines better understand the language used to describe research methods in biomedical literature. Biomedical research papers often contain specialized terminology and descriptions of experimental procedures, and the BioMNER dataset provides annotated examples of these "method entities" to train natural language processing (NLP) models.

By focusing the dataset specifically on method-related language, the authors hope to improve the performance of NLP models on entity extraction in the biomedical domain. This could ultimately give researchers better tools for navigating the large volume of published biomedical literature and extracting insights from it.

Technical Explanation

The paper introduces the BioMNER dataset, which contains annotations for various types of method entities found in biomedical literature, including experimental procedures, data sources, and analysis techniques. The dataset was constructed from a large corpus of PubMed abstracts by human annotators who, according to the paper's abstract, were assisted by an automated BioMethod entity recognition and information retrieval system, with multiple rounds of quality assurance.
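To make the idea of an annotated method entity concrete, here is a minimal Python sketch of how a span annotation can be turned into per-token BIO tags, a common encoding for NER training data. This page does not describe BioMNER's actual label scheme or file format, so the `METHOD` label, the `spans_to_bio` helper, and the example sentence are all assumptions made for illustration.

```python
from typing import List, Tuple


def spans_to_bio(
    tokens: List[str],
    spans: List[Tuple[int, int]],
    label: str = "METHOD",  # hypothetical label name, not taken from the paper
) -> List[str]:
    """Convert token-index spans [start, end) into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span ({start}, {end}) is out of range")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Illustrative sentence; "polymerase chain reaction" is the method entity.
tokens = "We applied polymerase chain reaction to amplify the samples".split()
tags = spans_to_bio(tokens, [(2, 5)])

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each token outside an entity receives `O`, the first token of an entity receives a `B-` tag, and the following tokens of the same entity receive `I-` tags, which lets a sequence model recover multi-word method names.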

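The authors also provide baseline results for several NLP models, including BERT, BioBERT, and VANER, on the BioMNER dataset. According to the paper's abstract, the strongest result came not from the largest models but from a compact ALBERT model (about 11MB) combined with conditional random fields (CRF), while the large parameter counts of LLMs appeared to hinder learning of method-entity extraction patterns. These results show how difficult it is to identify method entities accurately in biomedical text and point to the need for further research in this area.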
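Baseline comparisons for NER are usually reported as entity-level precision, recall, and F1, where a predicted entity counts as correct only if its boundaries and label both match the gold annotation exactly. This page does not say which metric the authors report, so the following Python sketch assumes exact-match entity-level F1 over BIO tags; the helper names and the toy tag sequences are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Extract entity spans from a BIO tag sequence (stray I- tags are ignored)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = start is not None and tag == f"I-{label}"
        if not continues:
            if start is not None:
                spans.add((start, i, label))
                start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall, and F1."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the model finds one of the two gold method entities.
gold = ["O", "B-METHOD", "I-METHOD", "O", "B-METHOD"]
pred = ["O", "B-METHOD", "I-METHOD", "O", "O"]
precision, recall, f1 = entity_f1(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a prediction that covers only part of a multi-word method name earns no credit, which is one reason long, descriptive method entities tend to be hard to score well on.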

Critical Analysis

The paper provides a valuable contribution to the field of biomedical NLP by introducing a dataset specifically focused on method entities. This is an important step, as previous datasets have often been more general in scope or focused on other types of entities, such as diseases or drugs.

However, the paper does not address some limitations of the dataset, such as possible bias in the annotation process or how well the underlying PubMed corpus covers the biomedical literature. Additionally, the baseline results suggest that there is still significant room for improvement in identifying method entities, and the paper offers limited insight into which specific challenges remain or where further research should focus.

Conclusion

Overall, the BioMNER dataset is a useful step forward for biomedical NLP, as it provides a focused resource for training and evaluating models on method entity recognition. The baseline results highlight the difficulty of the task and suggest that further research is needed to improve model performance. As biomedical NLP continues to evolve, datasets like BioMNER can help drive progress toward more effective tools for navigating the biomedical literature and extracting insights from it.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

BioMNER: A Dataset for Biomedical Method Entity Recognition

Chen Tang, Bohao Yang, Kun Zhao, Bo Lv, Chenghao Xiao, Frank Guerin, Chenghua Lin

Named entity recognition (NER) stands as a fundamental and pivotal task within the realm of Natural Language Processing. Particularly within the domain of Biomedical Method NER, this task presents notable challenges, stemming from the continual influx of domain-specific terminologies in scholarly literature. Current research in Biomedical Method (BioMethod) NER suffers from a scarcity of resources, primarily attributed to the intricate nature of methodological concepts, which necessitate a profound understanding for precise delineation. In this study, we propose a novel dataset for biomedical method entity recognition, employing an automated BioMethod entity recognition and information retrieval system to assist human annotation. Furthermore, we comprehensively explore a range of conventional and contemporary open-domain NER methodologies, including the utilization of cutting-edge large-scale language models (LLMs) customised to our dataset. Our empirical findings reveal that the large parameter counts of language models surprisingly inhibit the effective assimilation of entity extraction patterns pertaining to biomedical methods. Remarkably, the approach, leveraging the modestly sized ALBERT model (only 11MB), in conjunction with conditional random fields (CRF), achieves state-of-the-art (SOTA) performance.

7/1/2024

Intent Detection and Entity Extraction from BioMedical Literature

Ankan Mullick, Mukur Gupta, Pawan Goyal

Biomedical queries have become increasingly prevalent in web searches, reflecting the growing interest in accessing biomedical literature. Despite recent research on large-language models (LLMs) motivated by endeavours to attain generalized intelligence, their efficacy in replacing task and domain-specific natural language understanding approaches remains questionable. In this paper, we address this question by conducting a comprehensive empirical evaluation of intent detection and named entity recognition (NER) tasks from biomedical text. We show that Supervised Fine Tuned approaches are still relevant and more effective than general-purpose LLMs. Biomedical transformer models such as PubMedBERT can surpass ChatGPT on NER task with only 5 supervised examples.

4/5/2024

LLMs in Biomedicine: A study on clinical Named Entity Recognition

Masoud Monajatipoor, Jiaxin Yang, Joel Stremmel, Melika Emami, Fazlolah Mohaghegh, Mozhdeh Rouhsedaghat, Kai-Wei Chang

Large Language Models (LLMs) demonstrate remarkable versatility in various NLP tasks but encounter distinct challenges in the biomedical domain due to the complexities of language and data scarcity. This paper investigates the application of LLMs in the biomedical domain by exploring strategies to enhance their performance for the NER task. Our study reveals the importance of meticulously designed prompts in the biomedical domain. Strategic selection of in-context examples yields a marked improvement, offering ~15-20% increase in F1 score across all benchmark datasets for biomedical few-shot NER. Additionally, our results indicate that integrating external biomedical knowledge via prompting strategies can enhance the proficiency of general-purpose LLMs to meet the specialized needs of biomedical NER. Leveraging a medical knowledge base, our proposed method, DiRAG, inspired by Retrieval-Augmented Generation (RAG), can boost the zero-shot F1 score of LLMs for biomedical NER. Code is released at https://github.com/masoud-monajati/LLM_Bio_NER

7/12/2024

Augmenting Biomedical Named Entity Recognition with General-domain Resources

Yu Yin, Hyunjae Kim, Xiao Xiao, Chih Hsuan Wei, Jaewoo Kang, Zhiyong Lu, Hua Xu, Meng Fang, Qingyu Chen

Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets. In this paper, we proposed GERBERA, a simple-yet-effective method that utilized a general-domain NER dataset for training. Specifically, we performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset. We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with multiple additional BioNER datasets. Specifically, our models consistently outperformed the baselines in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight biomedical entity types sourced from five different corpora. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset.

6/21/2024