Gene Set Summarization using Large Language Models

Read original: arXiv:2305.13338 - Published 7/8/2024 by Marcin P. Joachimiak, J. Harry Caufield, Nomi L. Harris, Hyeongsik Kim, Christopher J. Mungall

Overview

Molecular biologists often analyze lists of genes to understand their biological functions.
This is typically done using statistical enrichment analysis, which measures whether certain biological function terms are over- or under-represented in the gene list.
The researchers developed a method called SPINDOCTOR that uses large language models (LLMs) like GPT to summarize gene set function as a complement to standard enrichment analysis.
SPINDOCTOR can use different sources of gene functional information, including curated ontological annotations, narrative gene summaries, or direct model retrieval.

Plain English Explanation

Molecular biologists frequently work with lists of genes that come from high-throughput experiments or computational analysis. To understand the biological functions of these gene sets, they often perform a statistical technique called enrichment analysis. This looks at whether certain biological function terms (like "cell growth" or "immune response") are over- or under-represented in the gene list, based on a curated knowledge base like the Gene Ontology.
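
To make the idea of over-representation concrete, here is a minimal sketch of a one-sided hypergeometric test, one common way to score enrichment of a single term. The function name and all of the counts are hypothetical and chosen only for illustration. They are not taken from the paper, and real enrichment tools also correct for testing many terms at once.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, hits: int) -> float:
    """Return P(X >= hits) for a hypergeometric draw.

    population: number of genes in the background
    annotated:  background genes annotated to the term
    sample:     size of the gene list being analyzed
    hits:       genes in the list that are annotated to the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, 200 annotated to a term,
# and a 50-gene list in which 5 genes carry that term.
p = hypergeom_enrichment_p(population=20_000, annotated=200, sample=50, hits=5)
print(f"over-representation p-value: {p:.3g}")
```

With these made-up counts only about 0.5 annotated genes would be expected in the list by chance, so observing 5 yields a small p-value. This is the kind of score that the LLM-based approach does not provide.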

The researchers developed a new method called SPINDOCTOR that uses large language models (LLMs) like GPT to summarize the functions of gene sets. Rather than running statistical tests against a curated knowledge base, SPINDOCTOR prompts the LLM with one of several sources of gene function information: structured text derived from curated ontology annotations, free-text gene summaries, or no supplied functional information at all, in which case the LLM is queried directly.

The researchers found that SPINDOCTOR can generate plausible and biologically valid summaries of gene set functions. However, the LLM-based approach has some key limitations. It cannot provide reliable statistical scores or p-values, and it often returns terms that are not actually statistically significant. Crucially, SPINDOCTOR struggled to identify the most precise and informative terms that standard enrichment analysis would find, likely because it has difficulty generalizing and reasoning using a structured ontology.

Overall, the results suggest that while LLM-based methods can provide useful complementary information, they are not yet ready to replace traditional enrichment analysis for interpreting gene lists. Manual curation of ontological knowledge remains an important part of this process.

Technical Explanation

The researchers developed SPINDOCTOR (Structured Prompt Interpolation of Natural Language Descriptions of Controlled Terms for Ontology Reporting), a method that uses GPT models to perform gene set function summarization as a complement to standard enrichment analysis. SPINDOCTOR can utilize different sources of gene functional information:

Structured text from curated ontological knowledge base (KB) annotations: SPINDOCTOR can take the terms that curators have annotated to each gene in a KB such as the Gene Ontology (GO) and supply them to the LLM as structured text in the prompt.
Ontology-free narrative gene summaries: Instead of a structured ontology, SPINDOCTOR can use free-text gene function descriptions as input to the LLM.
Direct model retrieval: SPINDOCTOR can directly query the LLM for relevant function terms, without any explicit gene function knowledge.
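
The three input modes above differ mainly in what text accompanies each gene in the prompt. The sketch below illustrates that difference only. The function `build_prompt`, the prompt wording, and the example gene descriptions are assumptions made for this illustration and are not SPINDOCTOR's actual templates or code. No model is called.

```python
from typing import Mapping, Optional, Sequence


def build_prompt(genes: Sequence[str],
                 descriptions: Optional[Mapping[str, str]] = None) -> str:
    """Assemble a summarization prompt for a gene set (hypothetical template).

    With `descriptions`, each gene is paired with supporting text, either
    ontology-derived term labels or a narrative summary. Without it, only
    the gene symbols are listed and the model must rely on what it knows.
    """
    lines = [
        "Summarize the shared biological functions of these genes",
        "as a semicolon-separated list of terms.",
        "",
    ]
    for gene in genes:
        if descriptions is not None and gene in descriptions:
            lines.append(f"{gene}: {descriptions[gene]}")
        else:
            lines.append(gene)
    return "\n".join(lines)


genes = ["BRCA1", "RAD51"]

# Mode 1 (illustrative): term labels derived from ontology annotations.
ontology_text = {
    "BRCA1": "DNA repair",
    "RAD51": "DNA repair; homologous recombination",
}
# Mode 2 (illustrative): free-text narrative summaries.
narrative_text = {
    "BRCA1": "Encodes a protein involved in repairing damaged DNA.",
    "RAD51": "Encodes a recombinase that acts in homologous recombination.",
}

annotated_prompt = build_prompt(genes, ontology_text)
narrative_prompt = build_prompt(genes, narrative_text)
bare_prompt = build_prompt(genes)  # Mode 3: gene symbols only.

print(annotated_prompt)
```

Keeping the template fixed and swapping only the per-gene text makes it easier to attribute differences in output to the information source rather than to the prompt wording.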

The researchers found that these SPINDOCTOR methods were able to generate plausible and biologically valid summary GO term lists for gene sets. However, the GPT-based approaches had several key limitations:

They could not deliver reliable statistical scores or p-values, often returning terms that were not statistically significant.
They struggled to recapitulate the most precise and informative terms that standard enrichment analysis would identify, likely due to an inability to effectively generalize and reason using the structure of an ontology.
The results were highly nondeterministic, with minor variations in the prompts resulting in radically different term lists.
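
One simple way to put a number on this kind of run-to-run variation is the Jaccard overlap between the term lists from two runs. The sketch below assumes that approach, which is not necessarily the metric used in the paper, and the two term lists are invented for illustration.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Size of the intersection divided by size of the union of two term sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # Two empty lists are treated as identical.
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical outputs from two runs with slightly different prompts.
run_1 = ["DNA repair", "cell cycle", "apoptotic process"]
run_2 = ["DNA repair", "response to stress", "chromatin organization"]

overlap = jaccard(run_1, run_2)
print(f"Jaccard overlap between runs: {overlap:.2f}")
# prints: Jaccard overlap between runs: 0.20
```

A value near 1 would indicate stable output. Values near 0, as in this invented example, correspond to the "radically different term lists" described above.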

Critical Analysis

The researchers acknowledge several important caveats and limitations of their SPINDOCTOR approach:

The LLM-based methods are unable to provide the reliable statistical measures (like p-values) that are a crucial part of standard enrichment analysis. This makes it difficult to distinguish truly significant biological function terms from those that are just plausible-sounding.
SPINDOCTOR struggled to identify the most informative and precise terms that enrichment analysis can uncover, likely because the language models have difficulty reasoning about the structured relationships and hierarchies in ontologies like GO.
The high degree of nondeterminism in the results, where small changes to the prompts lead to very different output, suggests that these methods are not yet robust or reliable enough to replace standard enrichment analysis.
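
The point about precision can be illustrated with a small sketch that checks whether a returned term is the same as, more general than, or more specific than a reference term from enrichment analysis. The hierarchy below is a simplified toy with a single parent per term, not the actual GO graph, and the helper names are hypothetical.

```python
from typing import Dict, List, Set


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect every term reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def compare_to_reference(candidate: str, reference: str,
                         parents: Dict[str, List[str]]) -> str:
    """Classify a candidate term relative to a reference term."""
    if candidate == reference:
        return "exact"
    if candidate in ancestors(reference, parents):
        return "more general"
    if reference in ancestors(candidate, parents):
        return "more specific"
    return "unrelated"


# Toy hierarchy (child -> parents), simplified for illustration only.
toy_parents = {
    "double-strand break repair": ["DNA repair"],
    "DNA repair": ["DNA metabolic process"],
    "DNA metabolic process": ["metabolic process"],
}

reference = "double-strand break repair"
for candidate in ["double-strand break repair", "DNA repair", "metabolic process"]:
    print(candidate, "->", compare_to_reference(candidate, reference, toy_parents))
```

A method that returns "metabolic process" when enrichment points to "double-strand break repair" is not wrong, but it is far less informative. That gap between a plausible term and the most precise term is what this limitation describes.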

While the researchers demonstrate some promising capabilities of using LLMs for gene set summarization, the results indicate that manual curation of ontological knowledge remains an essential part of interpreting gene lists. Further research is needed to develop LLM-based methods that can better leverage structured knowledge and provide the statistical rigor required for this type of analysis.

Conclusion

This paper explores the use of large language models like GPT to summarize the biological functions of gene sets as a complement to traditional statistical enrichment analysis. The researchers developed SPINDOCTOR, a method that can utilize different sources of gene function information to generate summaries.

While SPINDOCTOR was able to produce plausible and biologically valid results, the language model-based approach had several key limitations. It could not provide reliable statistical measures, struggled to identify the most informative terms, and produced highly nondeterministic output. These findings suggest that, at least for now, LLM-based methods are not ready to replace standard enrichment analysis, and that expert curation of ontological knowledge remains necessary for interpreting gene lists.

The research highlights the ongoing challenges in applying large language models to the domain of molecular biology and computational biology more broadly. Further advancements will be needed to develop LLM-based techniques that can meaningfully complement or replace traditional knowledge-based approaches in this field.

Related Papers

Gene Set Summarization using Large Language Models

Marcin P. Joachimiak, J. Harry Caufield, Nomi L. Harris, Hyeongsik Kim, Christopher J. Mungall

Molecular biologists frequently interpret gene lists derived from high-throughput experiments and computational analysis. This is typically done as a statistical enrichment analysis that measures the over- or under-representation of biological function terms associated with genes or their properties, based on curated assertions from a knowledge base (KB) such as the Gene Ontology (GO). Interpreting gene lists can also be framed as a textual summarization task, enabling the use of Large Language Models (LLMs), potentially utilizing scientific texts directly and avoiding reliance on a KB. We developed SPINDOCTOR (Structured Prompt Interpolation of Natural Language Descriptions of Controlled Terms for Ontology Reporting), a method that uses GPT models to perform gene set function summarization as a complement to standard enrichment analysis. This method can use different sources of gene functional information: (1) structured text derived from curated ontological KB annotations, (2) ontology-free narrative gene summaries, or (3) direct model retrieval. We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for gene sets. However, GPT-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant. Crucially, these methods were rarely able to recapitulate the most precise and informative term from standard enrichment, likely due to an inability to generalize and reason using an ontology. Results are highly nondeterministic, with minor variations in prompt resulting in radically different term lists. Our results show that at this point, LLM-based methods are unsuitable as a replacement for standard term enrichment analysis and that manual curation of ontological assertions remains necessary.

7/8/2024

Evaluating Large Language Models for Structured Science Summarization in the Open Research Knowledge Graph

Vladyslav Nechakhin, Jennifer D'Souza, Steffen Eger

Structured science summaries or research contributions using properties or dimensions beyond traditional keywords enhances science findability. Current methods, such as those used by the Open Research Knowledge Graph (ORKG), involve manually curating properties to describe research papers' contributions in a structured manner, but this is labor-intensive and inconsistent between the domain expert human curators. We propose using Large Language Models (LLMs) to automatically suggest these properties. However, it's essential to assess the readiness of LLMs like GPT-3.5, Llama 2, and Mistral for this task before application. Our study performs a comprehensive comparative analysis between ORKG's manually curated properties and those generated by the aforementioned state-of-the-art LLMs. We evaluate LLM performance through four unique perspectives: semantic alignment and deviation with ORKG properties, fine-grained properties mapping accuracy, SciNCL embeddings-based cosine similarity, and expert surveys comparing manual annotations with LLM outputs. These evaluations occur within a multidisciplinary science setting. Overall, LLMs show potential as recommendation systems for structuring science, but further finetuning is recommended to improve their alignment with scientific tasks and mimicry of human expertise.

5/6/2024

High-Throughput Phenotyping of Clinical Text Using Large Language Models

Daniel B. Hier, S. Ilyas Munzir, Anne Stahlfeld, Tayo Obafemi-Ajayi, Michael D. Carrithers

High-throughput phenotyping automates the mapping of patient signs to standardized ontology concepts and is essential for precision medicine. This study evaluates the automation of phenotyping of clinical summaries from the Online Mendelian Inheritance in Man (OMIM) database using large language models. Due to their rich phenotype data, these summaries can be surrogates for physician notes. We conduct a performance comparison of GPT-4 and GPT-3.5-Turbo. Our results indicate that GPT-4 surpasses GPT-3.5-Turbo in identifying, categorizing, and normalizing signs, achieving concordance with manual annotators comparable to inter-rater agreement. Despite some limitations in sign normalization, the extensive pre-training of GPT-4 results in high performance and generalizability across several phenotyping tasks while obviating the need for manually annotated training data. Large language models are expected to be the dominant method for automating high-throughput phenotyping of clinical text.

8/6/2024

GP-GPT: Large Language Model for Gene-Phenotype Mapping

Yanjun Lyu, Zihao Wu, Lu Zhang, Jing Zhang, Yiwei Li, Wei Ruan, Zhengliang Liu, Xiaowei Yu, Chao Cao, Tong Chen, Minheng Chen, Yan Zhuang, Xiang Li, Rongjie Liu, Chao Huang, Wentao Li, Tianming Liu, Dajiang Zhu

Pre-trained large language models(LLMs) have attracted increasing attention in biomedical domains due to their success in natural language processing. However, the complex traits and heterogeneity of multi-sources genomics data pose significant challenges when adapting these models to the bioinformatics and biomedical field. To address these challenges, we present GP-GPT, the first specialized large language model for genetic-phenotype knowledge representation and genomics relation analysis. Our model is fine-tuned in two stages on a comprehensive corpus composed of over 3,000,000 terms in genomics, proteomics, and medical genetics, derived from multiple large-scale validated datasets and scientific publications. GP-GPT demonstrates proficiency in accurately retrieving medical genetics information and performing common genomics analysis tasks, such as genomics information retrieval and relationship determination. Comparative experiments across domain-specific tasks reveal that GP-GPT outperforms state-of-the-art LLMs, including Llama2, Llama3 and GPT-4. These results highlight GP-GPT's potential to enhance genetic disease relation research and facilitate accurate and efficient analysis in the fields of genomics and medical genetics. Our investigation demonstrated the subtle changes of bio-factor entities' representations in the GP-GPT, which suggested the opportunities for the application of LLMs to advancing gene-phenotype research.

9/17/2024