EnzChemRED, a rich enzyme chemistry relation extraction dataset

Read original: arXiv:2404.14209 - Published 4/23/2024 by Po-Ting Lai, Elisabeth Coudert, Lucila Aimo, Kristian Axelsen, Lionel Breuza, Edouard de Castro, Marc Feuermann, Anne Morgat, Lucille Pourcel, Ivo Pedruzzi and 9 others

⛏️

Overview

Expert curation is essential for capturing enzyme function knowledge from scientific literature, but it can't keep up with the rapid pace of new discoveries and publications.
This paper presents a new dataset called EnzChemRED to support the development of Natural Language Processing (NLP) methods that can assist enzyme curation.
EnzChemRED consists of 1,210 expert-curated PubMed abstracts with annotations for enzymes and the chemical reactions they catalyze.
The authors show that fine-tuning pre-trained language models on EnzChemRED can significantly improve their ability to identify proteins, chemicals, and the relationships between them.
They combine the best-performing methods to create an end-to-end pipeline for extracting enzyme function knowledge from text at scale, which can guide curation efforts in databases like UniProtKB and Rhea.

Plain English Explanation

Researchers have a lot of information about enzymes and the chemical reactions they enable, but it's a huge challenge to keep up with all the new discoveries being published in scientific papers. Expert curators - people who manually review and organize this information - simply can't work fast enough to capture everything.

To help address this, the researchers created a new dataset called EnzChemRED. This dataset contains 1,210 scientific abstracts (short summaries) that have been carefully annotated by experts to identify the enzymes and chemical reactions mentioned in the text. The researchers used this dataset to train and fine-tune natural language processing (NLP) models, which are computer programs that can analyze and extract information from text.

The results show that these NLP models, when trained on EnzChemRED, got much better at identifying enzymes, chemicals, and the relationships between them. This means they can now assist human curators by quickly scanning large volumes of scientific literature and flagging the most relevant information about enzyme functions.

The researchers combined the best-performing NLP methods into an end-to-end pipeline that can be applied to all the abstracts in the PubMed database (a huge collection of biomedical literature). This allows them to create a comprehensive map of enzyme functions found in the scientific literature, which can then guide the curation efforts of databases like UniProtKB and Rhea.

Technical Explanation

The EnzChemRED dataset was created by expert curators who annotated 1,210 PubMed abstracts with information about the enzymes and chemical reactions mentioned in the text. Specifically, they used identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI) to label the relevant entities.

The researchers then used this dataset to fine-tune pre-trained language models, which are powerful machine learning models that have been trained on large corpora of text data. This fine-tuning process significantly improved the models' ability to perform two key tasks:

Named Entity Recognition (NER): Identifying mentions of proteins (enzymes) and chemicals in the text.
Relation Extraction (RE): Extracting the chemical conversions in which the identified enzymes and chemicals participate.

The best-performing fine-tuned models achieved average F1 scores of 86.30% for NER, 86.66% for RE of chemical conversion pairs, and 83.79% for RE of chemical conversion pairs and linked enzymes.

The researchers then combined these NLP methods into an end-to-end pipeline and applied it to abstracts at the PubMed scale to create a comprehensive map of enzyme functions in the scientific literature. This map can be used to guide curation efforts in databases like UniProtKB and the reaction knowledgebase Rhea.

Critical Analysis

The researchers acknowledge several limitations of their work. First, the EnzChemRED dataset, while a valuable resource, is still relatively small compared to the vast amount of scientific literature available. Expanding the dataset with more annotated abstracts could further improve the performance of the NLP models.

Additionally, the researchers focused on extracting information from abstracts, but many important details about enzyme functions may be buried in the full-text of the articles. Developing methods to extract information from the full-text could lead to even more comprehensive knowledge extraction.

Finally, the researchers did not explore the use of additional external knowledge (e.g., from other databases or ontologies) to further boost the performance of their NLP models. Incorporating such knowledge could be a fruitful area for future research.

Conclusion

This paper presents a valuable new dataset and NLP-based approach to assist expert curation of enzyme function knowledge from the scientific literature. By fine-tuning language models on the EnzChemRED dataset, the researchers were able to significantly improve the ability to identify enzymes, chemicals, and the relationships between them.

The end-to-end pipeline developed in this work can be used to create a comprehensive map of enzyme functions across the scientific literature, which can then guide the curation efforts of databases like UniProtKB and Rhea. This represents an important step towards more efficient and scalable knowledge capture in this domain, ultimately helping to advance our understanding of enzyme-mediated biochemical processes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

⛏️

EnzChemRED, a rich enzyme chemistry relation extraction dataset

Po-Ting Lai, Elisabeth Coudert, Lucila Aimo, Kristian Axelsen, Lionel Breuza, Edouard de Castro, Marc Feuermann, Anne Morgat, Lucille Pourcel, Ivo Pedruzzi, Sylvain Poux, Nicole Redaschi, Catherine Rivoire, Anastasia Sveshnikova, Chih-Hsuan Wei, Robert Leaman, Ling Luo, Zhiyong Lu, Alan Bridge

Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 score of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at https://ftp.expasy.org/databases/rhea/nlp/.

4/23/2024

Reactzyme: A Benchmark for Enzyme-Reaction Prediction

Chenqing Hua, Bozitao Zhong, Sitao Luan, Liang Hong, Guy Wolf, Doina Precup, Shuangjia Zheng

Enzymes, with their specific catalyzed reactions, are necessary for all aspects of life, enabling diverse biological processes and adaptations. Predicting enzyme functions is essential for understanding biological pathways, guiding drug development, enhancing bioproduct yields, and facilitating evolutionary studies. Addressing the inherent complexities, we introduce a new approach to annotating enzymes based on their catalyzed reactions. This method provides detailed insights into specific reactions and is adaptable to newly discovered reactions, diverging from traditional classifications by protein family or expert-derived reaction classes. We employ machine learning algorithms to analyze enzyme reaction datasets, delivering a much more refined view on the functionality of enzymes. Our evaluation leverages the largest enzyme-reaction dataset to date, derived from the SwissProt and Rhea databases with entries up to January 8, 2024. We frame the enzyme-reaction prediction as a retrieval problem, aiming to rank enzymes by their catalytic ability for specific reactions. With our model, we can recruit proteins for novel reactions and predict reactions in novel proteins, facilitating enzyme discovery and function annotation.

8/27/2024

🤿

BioBERT-based Deep Learning and Merged ChemProt-DrugProt for Enhanced Biomedical Relation Extraction

Bridget T. McInnes, Jiawei Tang, Darshini Mahendran, Mai H. Nguyen

This paper presents a methodology for enhancing relation extraction from biomedical texts, focusing specifically on chemical-gene interactions. Leveraging the BioBERT model and a multi-layer fully connected network architecture, our approach integrates the ChemProt and DrugProt datasets using a novel merging strategy. Through extensive experimentation, we demonstrate significant performance improvements, particularly in CPR groups shared between the datasets. The findings underscore the importance of dataset merging in augmenting sample counts and improving model accuracy. Moreover, the study highlights the potential of automated information extraction in biomedical research and clinical practice.

5/30/2024

OpenChemIE: An Information Extraction Toolkit For Chemistry Literature

Vincent Fan, Yujie Qian, Alex Wang, Amber Wang, Connor W. Coley, Regina Barzilay

Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of ours attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface.

4/3/2024