SNOBERT: A Benchmark for clinical notes entity linking in the SNOMED CT clinical terminology

Read original: arXiv:2405.16115 - Published 5/28/2024 by Mikhail Kulyabin, Gleb Sokolov, Aleksandr Galaida, Andreas Maier, Tomas Arias-Vergara

Overview

Presents SNOBERT, a BERT-based method for linking text spans in clinical notes to concepts in the SNOMED CT terminology
Describes a two-stage pipeline (candidate selection, then candidate matching) trained on one of the largest publicly available datasets of labeled clinical notes, and reports that it outperforms other classical deep-learning-based methods in a challenge setting
Finds that models fine-tuned on biomedical data perform better than generic language models, which underlines the importance of domain-specific training for clinical applications

Plain English Explanation

The paper presents SNOBERT, a method that finds medical terms in clinical notes and links each one to the matching concept in SNOMED CT, a comprehensive clinical terminology used widely in healthcare. Clinical notes are mostly free text, so the same condition can be written in many different ways; linking every mention to a single standard concept makes the notes easier to analyze and code. As its title indicates, the work is also positioned as a benchmark for this task.

SNOBERT works in two steps: it first selects candidate concepts for a piece of text and then matches the text to the best candidate. The authors report that it outperformed other classical deep-learning-based methods in a challenge in which it was applied. The broader takeaway is that models adapted to biomedical text perform better than more generic language models, which suggests that domain-specific training data is important for good results on clinical text processing tasks.

Technical Explanation

The paper proposes SNOBERT, a method that uses BERT-based models to link text spans in clinical notes to specific SNOMED CT concepts. The models were trained on one of the largest publicly available datasets of clinical notes annotated with SNOMED CT concept mentions.
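To make the idea of a note annotated with concept mentions concrete, here is a minimal sketch of one way such an annotation could be represented. The character-offset layout, the `Annotation` class, the `annotate` helper, and the concept identifiers are all assumptions made for illustration; they are not the dataset's actual format or real SNOMED CT codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One linked mention: a character span in a note plus a concept identifier."""
    start: int       # offset of the first character of the mention (inclusive)
    end: int         # offset just past the last character (exclusive)
    concept_id: str  # hypothetical placeholder, not a real SNOMED CT code


def annotate(note: str, mention: str, concept_id: str) -> Annotation:
    """Build an annotation for the first occurrence of `mention` in `note`."""
    start = note.index(mention)  # raises ValueError if the mention is absent
    return Annotation(start, start + len(mention), concept_id)


def mention_text(note: str, ann: Annotation) -> str:
    """Recover the surface text covered by an annotation."""
    return note[ann.start:ann.end]


note = "Patient has a history of hypertension and type 2 diabetes."
annotations = [
    annotate(note, "hypertension", "HYPOTHETICAL-001"),
    annotate(note, "type 2 diabetes", "HYPOTHETICAL-002"),
]
```

Storing offsets rather than copied strings keeps each annotation tied to one exact location in the note, which matters when the same term occurs more than once.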

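The method consists of two stages: candidate selection, which narrows the terminology down to a short list of plausible concepts for a span, and candidate matching, which picks the best concept from that list. Related work on this problem includes ClinLinker, a two-phase bi-encoder and cross-encoder pipeline for Spanish clinical text; Efficient Biomedical Entity Linking, a low-resource approach that learns from synonym pairs; and Towards Efficient Patient Recruitment, which uses SNOMED CT concepts to select sentences for cohort selection.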

The authors report that SNOBERT outperforms other classical methods based on deep learning, as confirmed by the results of a challenge in which it was applied. The results are also presented as evidence that models fine-tuned on biomedical data outperform generic language models, highlighting the importance of domain-specific training for clinical applications.
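The paper's abstract describes the method as two stages, candidate selection followed by candidate matching. The sketch below illustrates only that general shape and is not the authors' implementation: the toy terminology and its identifiers are hypothetical, token overlap stands in for candidate selection, and character-trigram cosine similarity stands in for the BERT-based matching model.

```python
import math
import re
from collections import Counter

# Toy terminology with placeholder identifiers (NOT real SNOMED CT codes).
TERMINOLOGY = {
    "T1": "myocardial infarction",
    "T2": "hypertensive disorder",
    "T3": "diabetes mellitus type 2",
    "T4": "fracture of femur",
}


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_candidates(mention, terminology, k=3):
    """Stage 1 (assumed): keep up to k concepts sharing the most tokens with the mention."""
    mention_tokens = set(tokens(mention))
    scored = []
    for concept_id, name in terminology.items():
        overlap = len(mention_tokens & set(tokens(name)))
        if overlap > 0:
            scored.append((overlap, concept_id))
    # Highest overlap first; ties broken by identifier for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [concept_id for _, concept_id in scored[:k]]


def char_ngrams(text, n=3):
    """Bag of character n-grams, padded with spaces at both ends."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def match_candidate(mention, candidates, terminology):
    """Stage 2 (assumed): pick the candidate most similar to the mention."""
    if not candidates:
        return None
    mention_vec = char_ngrams(mention)
    return max(candidates, key=lambda cid: cosine(mention_vec, char_ngrams(terminology[cid])))


def link(mention, terminology=TERMINOLOGY):
    """Run both stages and return a concept identifier, or None if nothing matches."""
    return match_candidate(mention, select_candidates(mention, terminology), terminology)
```

Splitting the work this way lets a cheap first stage prune a very large terminology so that a more expensive scorer only has to compare the mention against a handful of candidates. In a real system both stand-in scoring functions would be replaced by trained models.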

Critical Analysis

The paper provides a valuable contribution to the field of clinical text processing by introducing a new benchmark for evaluating entity linking models on the SNOMED CT terminology. The SNOBERT benchmark fills an important gap, as previous benchmarks have often focused on more general biomedical terminologies or lacked a direct connection to the clinical domain.

One open question is how far the findings generalize beyond the training data. The abstract describes that data as one of the largest publicly available datasets of labeled clinical notes, but restricted access to medical texts still limits what any public dataset can cover. Additionally, the summary gives no breakdown of performance by clinical entity type, so the impact of different entity types on the overall results is unclear.

Further research could investigate the robustness of the models to different types of clinical text, such as discharge summaries or progress notes, and explore techniques for improving the performance of entity linking systems in the clinical domain.

Conclusion

The SNOBERT benchmark introduced in this paper provides a valuable tool for evaluating the performance of clinical entity linking models on the SNOMED CT terminology. The results demonstrate the importance of using domain-specific training data for clinical text processing tasks, as models fine-tuned on biomedical data outperformed more generic language models.

These findings have important implications for the development of effective clinical decision support systems and other healthcare applications that rely on accurate entity linking. By providing a standardized benchmark, the paper lays the groundwork for continued progress in this critical area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SNOBERT: A Benchmark for clinical notes entity linking in the SNOMED CT clinical terminology

Mikhail Kulyabin, Gleb Sokolov, Aleksandr Galaida, Andreas Maier, Tomas Arias-Vergara

The extraction and analysis of insights from medical data, primarily stored in free-text formats by healthcare workers, present significant challenges due to its unstructured nature. Medical coding, a crucial process in healthcare, remains minimally automated due to the complexity of medical ontologies and restricted access to medical texts for training Natural Language Processing models. In this paper, we propose SNOBERT, a method for linking text spans in clinical notes to specific concepts in SNOMED CT using BERT-based models. The method consists of two stages: candidate selection and candidate matching. The models were trained on one of the largest publicly available datasets of labeled clinical notes. SNOBERT outperforms other classical methods based on deep learning, as confirmed by the results of a challenge in which it was applied.

5/28/2024

Efficient Biomedical Entity Linking: Clinical Text Standardization with Low-Resource Techniques

Akshit Achara, Sanand Sasidharan, Gagan N

Clinical text is rich in information, with mentions of treatment, medication and anatomy among many other clinical terms. Multiple terms can refer to the same core concept, which can be referred to as a clinical entity. Ontologies like the Unified Medical Language System (UMLS) are developed and maintained to store millions of clinical entities, including their definitions, relations and other corresponding information. These ontologies are used to standardize clinical text by normalizing the varying surface forms of a clinical term through biomedical entity linking. With the introduction of transformer-based language models, there has been significant progress in biomedical entity linking. In this work, we focus on learning through synonym pairs associated with the entities. Compared with existing approaches, our approach significantly reduces the training data and resource consumption. Moreover, we propose a suite of context-based and context-less reranking techniques for performing entity disambiguation. Overall, we achieve similar performance to the state-of-the-art zero-shot and distantly supervised entity linking techniques on the MedMentions dataset, the largest annotated dataset on UMLS, without any domain-based training. Finally, we show that retrieval performance alone might not be sufficient as an evaluation metric and introduce an article-level quantitative and qualitative analysis to reveal further insights into the performance of entity linking methods.

5/28/2024

ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish

Fernando Gallego, Guillermo López-García, Luis Gasco-Sánchez, Martin Krallinger, Francisco J. Veredas

Advances in natural language processing techniques, such as named entity recognition and normalization to widely used standardized terminologies like UMLS or SNOMED-CT, along with the digitalization of electronic health records, have significantly advanced clinical text analysis. This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking that leverages the potential of in-domain adapted language models for biomedical text mining: initial candidate retrieval using a SapBERT-based bi-encoder and subsequent re-ranking with a cross-encoder, trained by following a contrastive-learning strategy to be tailored to medical concepts in Spanish. This methodology, focused initially on content in Spanish, substantially outperforms multilingual language models designed for the same purpose. This is true even for complex scenarios involving heterogeneous medical terminologies and when the models are trained on a subset of the original data. Our results, evaluated using top-k accuracy at 25 and other top-k metrics, demonstrate our approach's performance on two distinct clinical entity linking Gold Standard corpora, DisTEMIST (diseases) and MedProcNER (clinical procedures), outperforming previous benchmarks by 40 points in DisTEMIST and 43 points in MedProcNER, both normalized to SNOMED-CT codes. These findings highlight our approach's ability to address language-specific nuances and set a new benchmark in entity linking, offering a potent tool for enhancing the utility of digital medical records. The resulting system is of practical value, both for large-scale automatic generation of structured data derived from clinical records and for exhaustive extraction and harmonization of predefined clinical variables of interest.

4/10/2024

Towards Efficient Patient Recruitment for Clinical Trials: Application of a Prompt-Based Learning Model

Mojdeh Rahmanian, Seyed Mostafa Fakhrahmad, Seyedeh Zahra Mousavi

Objective: Clinical trials are essential for advancing pharmaceutical interventions, but they face a bottleneck in selecting eligible participants. Although leveraging electronic health records (EHR) for recruitment has gained popularity, the complex nature of unstructured medical texts presents challenges in efficiently identifying participants. Natural Language Processing (NLP) techniques have emerged as a solution with a recent focus on transformer models. In this study, we aimed to evaluate the performance of a prompt-based large language model for the cohort selection task from unstructured medical notes collected in the EHR. Methods: To process the medical records, we selected the sentences in each record most related to the eligibility criteria needed for the trial. The SNOMED CT concepts related to each eligibility criterion were collected. Medical records were also annotated with MedCAT based on the SNOMED CT ontology. Annotated sentences including concepts matched with the criteria-relevant terms were extracted. A prompt-based large language model (Generative Pre-trained Transformer (GPT) in this study) was then used with the extracted sentences as the training set. To assess its effectiveness, we evaluated the model's performance using the dataset from the 2018 n2c2 challenge, which aimed to classify medical records of 311 patients based on 13 eligibility criteria through NLP techniques. Results: Our proposed model showed overall micro and macro F measures of 0.9061 and 0.8060, which were among the highest scores achieved by the experiments performed with this dataset. Conclusion: The application of a prompt-based large language model in this study to classify patients based on eligibility criteria received promising scores. In addition, we proposed a method of extractive summarization with the aid of the SNOMED CT ontology that can also be applied to other medical texts.

4/26/2024