Evaluating Large Language Models for Structured Science Summarization in the Open Research Knowledge Graph

2405.02105

Published 5/6/2024 by Vladyslav Nechakhin, Jennifer D'Souza, Steffen Eger

Abstract

Structured science summaries or research contributions using properties or dimensions beyond traditional keywords enhance science findability. Current methods, such as those used by the Open Research Knowledge Graph (ORKG), involve manually curating properties to describe research papers' contributions in a structured manner, but this is labor-intensive and inconsistent between the domain expert human curators. We propose using Large Language Models (LLMs) to automatically suggest these properties. However, it's essential to assess the readiness of LLMs like GPT-3.5, Llama 2, and Mistral for this task before application. Our study performs a comprehensive comparative analysis between ORKG's manually curated properties and those generated by the aforementioned state-of-the-art LLMs. We evaluate LLM performance through four unique perspectives: semantic alignment and deviation with ORKG properties, fine-grained properties mapping accuracy, SciNCL embeddings-based cosine similarity, and expert surveys comparing manual annotations with LLM outputs. These evaluations occur within a multidisciplinary science setting. Overall, LLMs show potential as recommendation systems for structuring science, but further finetuning is recommended to improve their alignment with scientific tasks and mimicry of human expertise.

Overview

This paper evaluates the use of large language models (LLMs) for structured summarization of scientific papers in the Open Research Knowledge Graph (ORKG).
The ORKG is a platform that aims to capture the key insights and contributions of research papers in a structured format.
The authors assess how well different LLMs, namely GPT-3.5, Llama 2, and Mistral, can suggest the properties that structure these summaries, compared to properties curated manually by human experts.

Plain English Explanation

The paper explores how well large language models can automatically summarize scientific papers in a structured format. The Open Research Knowledge Graph (ORKG) is a platform that aims to capture the key takeaways from research papers in a structured way, making it easier to search and compare insights across the literature.

The researchers tested different language models (GPT-3.5, Llama 2, and Mistral) to see how well they could suggest the properties used in these structured summaries compared to properties chosen by human curators. The goal is to make the process of curating the ORKG more efficient by automating parts of the summarization task.

Technical Explanation

The paper evaluates the performance of several large language models (LLMs), namely GPT-3.5, Llama 2, and Mistral, on the task of structured summarization for scientific papers in the Open Research Knowledge Graph (ORKG).

The ORKG is a platform that aims to capture the key contributions and insights from research papers in a structured format, making it easier to search, compare, and build on the existing knowledge in a field. Traditionally, this structuring process has been done manually by domain experts.

The authors assess how well the LLMs can suggest the properties that structure these summaries. They compare the LLM-generated properties with ORKG's manually curated ones from four perspectives: semantic alignment and deviation with ORKG properties, fine-grained property mapping accuracy, SciNCL embeddings-based cosine similarity, and expert surveys comparing manual annotations with LLM outputs.
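The abstract names fine-grained property mapping accuracy as one of the evaluation perspectives. As a rough illustration only, the sketch below compares a manually curated property list with an LLM-suggested one by exact match after simple label normalization. The function names, the normalization rule, and the example property labels are all assumptions made for this illustration; this is not the paper's mapping procedure.

```python
def normalize(label):
    """Lowercase a label, treat hyphens and underscores as spaces, collapse whitespace."""
    return " ".join(label.lower().replace("-", " ").replace("_", " ").split())


def mapping_overlap(curated, suggested):
    """Return (matched labels, precision, recall) for two property lists.

    Precision: share of distinct suggested properties that match a curated one.
    Recall: share of distinct curated properties that were suggested.
    Matching is exact string equality after normalize(); this is a simplification.
    """
    curated_set = {normalize(p) for p in curated}
    suggested_set = {normalize(p) for p in suggested}
    matched = curated_set & suggested_set
    precision = len(matched) / len(suggested_set) if suggested_set else 0.0
    recall = len(matched) / len(curated_set) if curated_set else 0.0
    return matched, precision, recall


# Hypothetical property labels, invented for this example.
curated_props = ["Research problem", "Method", "Evaluation metric"]
llm_props = ["research_problem", "Dataset", "method", "Model size"]

matched, precision, recall = mapping_overlap(curated_props, llm_props)
print(sorted(matched), precision, recall)
```

Exact matching will miss paraphrases such as "approach" versus "method", which is one reason an embedding-based comparison is a useful complement.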

The results show that the LLMs have potential as recommendation systems for structuring science, though their output still falls short of human expert curation, and further fine-tuning is recommended to improve their alignment with scientific tasks. The authors discuss the potential of using LLMs to augment and accelerate the curation process for the ORKG.
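The abstract also lists SciNCL embeddings-based cosine similarity among the evaluation perspectives. The sketch below shows only the cosine similarity step in plain Python. The three-dimensional vectors are made-up stand-ins for real SciNCL embeddings, and how the paper builds its embeddings from property lists is not described here, so treat the setup as an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real SciNCL vectors have many more dimensions.
curated_embedding = [0.9, 0.1, 0.3]
llm_embedding = [0.8, 0.2, 0.4]

score = cosine_similarity(curated_embedding, llm_embedding)
print(f"cosine similarity: {score:.3f}")
```

A score near 1 means the two property sets point in nearly the same direction in embedding space, while a score near 0 means they are largely unrelated.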

Critical Analysis

The paper presents a thorough evaluation of LLMs for structured scientific summarization, an important task for enhancing knowledge discovery and synthesis. However, several key limitations remain:

The dataset used for evaluation is still relatively small, which may limit the generalizability of the findings. Expanding the evaluation to a wider range of scientific domains would strengthen the analysis.
The summaries generated by the LLMs, while reasonably faithful, still fall short of human-level coherence and completeness. Further research is needed to close this gap, potentially by incorporating more structured knowledge or reasoning capabilities into the models.
The paper does not explore the potential biases or errors that may arise from using LLMs for this task, which is an important consideration for real-world deployment.

Overall, this research represents an important step towards automating key aspects of the scientific curation process, but continued work is needed to fully realize the potential of LLMs in this domain.

Conclusion

This paper investigates the use of large language models (LLMs) for the task of structured summarization of scientific papers in the context of the Open Research Knowledge Graph (ORKG). The ORKG aims to capture the key insights and contributions of research in a structured format to facilitate knowledge discovery and synthesis.

The authors evaluate several prominent LLMs, namely GPT-3.5, Llama 2, and Mistral, on their ability to generate these structured summaries and compare them to human-curated ones. While the LLMs show promising performance, they still fall short of human-level coherence, completeness, and faithfulness.

This work represents an important step towards automating parts of the scientific curation process, which could significantly improve the discoverability and interconnectedness of research insights. However, further research is needed to address the current limitations and ensure the reliable and responsible use of LLMs for this task.

Related Papers

Comparative Analysis of Open-Source Language Models in Summarizing Medical Text Data

Yuhao Chen, Zhimu Wang, Bo Wen, Farhana Zulkernine

Unstructured text in medical notes and dialogues contains rich information. Recent advancements in Large Language Models (LLMs) have demonstrated superior performance in question answering and summarization tasks on unstructured text data, outperforming traditional text analysis approaches. However, there is a lack of scientific studies in the literature that methodically evaluate and report on the performance of different LLMs, specifically for domain-specific data such as medical chart notes. We propose an evaluation approach to analyze the performance of open-source LLMs such as Llama2 and Mistral for medical summarization tasks, using GPT-4 as an assessor. Our innovative approach to quantitative evaluation of LLMs can enable quality control, support the selection of effective LLMs for specific tasks, and advance knowledge discovery in digital health.

5/31/2024

cs.CL cs.LG

Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language Models

Sowmya S. Sundaram, Benjamin Solomon, Avani Khatri, Anisha Laumas, Purvesh Khatri, Mark A. Musen

Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets. This paper investigates the potential of large language models (LLMs), specifically GPT-4, to improve adherence to metadata standards. We conducted experiments on 200 random data records describing human samples relating to lung cancer from the NCBI BioSample repository, evaluating GPT-4's ability to suggest edits for adherence to metadata standards. We computed the adherence accuracy of field name-field value pairs through a peer review process, and we observed a marginal average improvement in adherence to the standard data dictionary from 79% to 80% (p<0.01). We then prompted GPT-4 with domain information in the form of the textual descriptions of CEDAR templates and recorded a significant improvement to 97% from 79% (p<0.01). These results indicate that, while LLMs may not be able to correct legacy metadata to ensure satisfactory adherence to standards when unaided, they do show promise for use in automated metadata curation when integrated with a structured knowledge base.

4/10/2024

cs.AI cs.CL cs.IR

Can Large Language Model Summarizers Adapt to Diverse Scientific Communication Goals?

Marcio Fonseca, Shay B. Cohen

In this work, we investigate the controllability of large language models (LLMs) on scientific summarization tasks. We identify key stylistic and content coverage factors that characterize different types of summaries such as paper reviews, abstracts, and lay summaries. By controlling stylistic features, we find that non-fine-tuned LLMs outperform humans in the MuP review generation task, both in terms of similarity to reference summaries and human preferences. Also, we show that we can improve the controllability of LLMs with keyword-based classifier-free guidance (CFG) while achieving lexical overlap comparable to strong fine-tuned baselines on arXiv and PubMed. However, our results also indicate that LLMs cannot consistently generate long summaries with more than 8 sentences. Furthermore, these models exhibit limited capacity to produce highly abstractive lay summaries. Although LLMs demonstrate strong generic summarization competency, sophisticated content control without costly fine-tuning remains an open problem for domain-specific applications.

6/28/2024

cs.CL cs.AI

Mining experimental data from Materials Science literature with Large Language Models: an evaluation study

Luca Foppiano, Guillaume Lambard, Toshiyuki Amagasa, Masashi Ishii

This study is dedicated to assessing the capabilities of large language models (LLMs) such as GPT-3.5-Turbo, GPT-4, and GPT-4-Turbo in extracting structured information from scientific documents in materials science. To this end, we primarily focus on two critical tasks of information extraction: (i) a named entity recognition (NER) of studied materials and physical properties and (ii) a relation extraction (RE) between these entities. Due to the evident lack of datasets within Materials Informatics (MI), we evaluated using SuperMat, based on superconductor research, and MeasEval, a generic measurement evaluation corpus. The performance of LLMs in executing these tasks is benchmarked against traditional models based on the BERT architecture and rule-based approaches (baseline). We introduce a novel methodology for the comparative analysis of intricate material expressions, emphasising the standardisation of chemical formulas to tackle the complexities inherent in materials science information assessment. For NER, LLMs fail to outperform the baseline with zero-shot prompting and exhibit only limited improvement with few-shot prompting. However, a GPT-3.5-Turbo fine-tuned with the appropriate strategy for RE outperforms all models, including the baseline. Without any fine-tuning, GPT-4 and GPT-4-Turbo display remarkable reasoning and relationship extraction capabilities after being provided with merely a couple of examples, surpassing the baseline. Overall, the results suggest that although LLMs demonstrate relevant reasoning skills in connecting concepts, specialised models are currently a better choice for tasks requiring extracting complex domain-specific entities like materials. These insights provide initial guidance applicable to other materials science sub-domains in future work.

6/3/2024

cs.CL