Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs

Read original: arXiv:2408.06752 - Published 8/14/2024 by Mike Thelwall

Overview

Evaluating the quality of academic journal articles is a crucial but time-consuming task for research evaluation exercises, appointments, and promotions.
This study investigates whether Large Language Models (LLMs) like ChatGPT can assist in this process.
The study assesses which ChatGPT inputs (full text, title and abstract, or title only) produce better quality score estimates, and how the scores are affected by different ChatGPT models and system prompts.

Plain English Explanation

The paper explores whether large language models like ChatGPT can be used to assess the quality of academic journal articles. Evaluating the quality of these articles is an important but time-consuming task, often done as part of research evaluation exercises, hiring decisions, and promotion processes.

The researchers tested three ways of providing information to ChatGPT: the full text of the articles, the title and abstract, or the title alone. They also looked at how the quality scores generated by different versions of ChatGPT (4o, 3.5-turbo, and 4o-mini) compared to the scores given by human experts.

The results showed that the best approach was to provide ChatGPT with just the title and abstract of the article. With this input, the averaged ChatGPT scores correlated at 0.67 with the scores given by the human experts, which the paper describes as the highest correlation ever reported. ChatGPT 4o performed slightly better than the 3.5-turbo and 4o-mini versions (both 0.66).

Interestingly, the researchers found that providing the full text of the articles to ChatGPT seemed to confuse the model and led to less accurate quality assessments. They also discovered that more complex instructions for the task were more effective than simpler ones.

Overall, the findings suggest that while the full text of an article may be needed for a thorough evaluation of its rigor, the title and abstract alone can provide strong indications of its originality and significance. The researchers also showed that linear regression can convert the ChatGPT scores to the human scale, and that the converted scores are 31% more accurate than guessing.

Technical Explanation

The study evaluated the performance of large language models like ChatGPT in assessing the quality of academic journal articles. Three different input types were tested: the full text of the article (without tables, figures, and references), the title and abstract, and the title only.
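The three input conditions can be pictured as three ways of assembling the text sent to the model. The following is a minimal sketch, not the paper's code: the `Article` class, the `build_input` function, and the mode names are hypothetical, and it assumes the full-text condition also includes the title and abstract and that tables, figures, and references have already been stripped from `body`.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    abstract: str
    body: str  # assumed to be already stripped of tables, figures, and references


def build_input(article: Article, mode: str) -> str:
    """Assemble the text for one of the three input conditions (hypothetical names)."""
    if mode == "title":
        return article.title
    if mode == "title_abstract":
        return f"{article.title}\n\n{article.abstract}"
    if mode == "full_text":
        # Assumption: the full-text condition keeps the title and abstract up front.
        return f"{article.title}\n\n{article.abstract}\n\n{article.body}"
    raise ValueError(f"unknown input mode: {mode}")


example = Article(title="A Title", abstract="An abstract.", body="Main text.")
for mode in ("title", "title_abstract", "full_text"):
    print(mode, len(build_input(example, mode)))
```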

The researchers measured how well the quality scores generated by ChatGPT correlated with the scores provided by human experts, averaging the ChatGPT scores over 30 iterations for each paper in a dataset of 51 papers. They also compared the performance of different ChatGPT models: 4o, 3.5-turbo, and 4o-mini.
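The evaluation step described here, averaging repeated model scores per paper and then correlating the averages with human scores, can be sketched as follows. This is an illustration under assumptions: the summary does not say which correlation coefficient was used, so Pearson is assumed, the function names are hypothetical, and the numbers are made up rather than taken from the paper.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def average_scores(runs_per_paper):
    """Collapse the repeated model scores for each paper into one mean score."""
    return [mean(runs) for runs in runs_per_paper]


# Made-up illustration data: 4 papers with 3 repeated model scores each
# (the study used 30 iterations on 51 papers).
model_runs = [[3, 3, 4], [2, 3, 2], [4, 4, 4], [1, 2, 2]]
human_scores = [3, 2, 4, 1]

model_means = average_scores(model_runs)
r = pearson(model_means, human_scores)
print(f"correlation between averaged model scores and human scores: {r:.2f}")
```

Averaging before correlating matters because a single model response is noisy; the mean over many iterations gives a steadier per-paper estimate.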

The results showed that the optimal input for ChatGPT was the article title and abstract, which produced the highest correlation (0.67) with human scores. The paper describes this as the highest correlation ever reported for this task. ChatGPT 4o performed slightly better than the 3.5-turbo and 4o-mini versions (both 0.66).

Interestingly, providing the full text of the articles to ChatGPT seemed to confuse the model and led to less accurate quality assessments. The researchers also found that more complex system prompts and instructions for the task were more effective than simpler ones.

These findings suggest that while the full text of an article may be needed for a thorough evaluation of its rigor, the title and abstract alone can provide strong indications of its originality and significance. The researchers also developed a linear regression model to convert the ChatGPT scores to the human scale, which was 31% more accurate than guessing.
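The conversion step can be sketched as an ordinary least squares fit from averaged model scores to human scores, followed by a comparison against a guessing baseline. The details here are assumptions rather than the paper's procedure: the baseline is taken to be always guessing the mean human score, accuracy is measured by mean absolute error, the fit is evaluated on the same made-up data it was trained on, and all function names are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line to constant x values")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return slope, intercept


def mean_abs_error(predicted, actual):
    """Mean absolute difference between predictions and actual scores."""
    return mean(abs(p - a) for p, a in zip(predicted, actual))


# Made-up illustration data, not values from the paper.
model_means = [3.4, 2.9, 3.8, 2.5, 3.1]
human_scores = [3.0, 2.0, 4.0, 1.0, 3.0]

slope, intercept = fit_line(model_means, human_scores)
predicted = [slope * x + intercept for x in model_means]

# Assumed baseline: guess the mean human score for every paper.
baseline = [mean(human_scores)] * len(human_scores)

mae_model = mean_abs_error(predicted, human_scores)
mae_baseline = mean_abs_error(baseline, human_scores)
improvement = 1 - mae_model / mae_baseline
print(f"error reduction relative to the baseline: {improvement:.0%}")
```

The regression is needed because raw model scores tend to sit on a narrower or shifted range than human scores, so a high correlation alone does not make them usable as direct predictions.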

Critical Analysis

The study provides promising evidence that large language models like ChatGPT can assist in the assessment of academic journal article quality, which is a critical but time-consuming task.

One potential limitation is the relatively small dataset of 51 papers used in the study. Expanding the dataset and validating the findings on a larger scale would help strengthen the conclusions.

Additionally, the study focused on the overall quality assessment, but did not delve into how well ChatGPT can evaluate specific aspects of research quality, such as the rigor of the methodology, the significance of the findings, or the clarity of the writing. Exploring these more granular assessments could provide further insights.

It would also be valuable to investigate how well ChatGPT performs on articles from different academic disciplines, as the writing styles and quality criteria may vary across fields.

Finally, the researchers note that while the title and abstract may provide strong indications of originality and significance, a thorough evaluation of an article's rigor likely requires access to the full text. Exploring ways to combine the strengths of human experts and language models could lead to more efficient and comprehensive research quality assessments.

Conclusion

This study suggests that large language models like ChatGPT can play a valuable role in assessing the quality of academic journal articles, which is a critical but time-consuming task. The best performance was achieved when ChatGPT was provided with just the title and abstract of the articles, with the scores generated by the model correlating quite well (0.67) with those of human experts.

The findings highlight the potential for language models to assist in research evaluation exercises, appointments, and promotions, potentially making these processes more efficient and scalable. However, the study also underscores the need for further research to explore the model's performance on larger datasets, across different academic disciplines, and in evaluating specific aspects of research quality.

By continuing to explore the capabilities and limitations of language models in this domain, researchers and practitioners can work towards developing more effective and comprehensive approaches to assessing the quality of scientific research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs

Mike Thelwall

Evaluating the quality of academic journal articles is a time consuming but critical task for national research evaluation exercises, appointments and promotion. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process. This article assesses which ChatGPT inputs (full text without tables, figures and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts. The results show that the optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66), and 4o-mini (0.66). The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones. Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into the human scale scores, which is 31% more accurate than guessing.

8/14/2024

From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management

Ning Li, Huaikang Zhou, Mingze Xu

This study explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance objectivity in organizational task performance evaluations. Through comparative analyses across two studies, including various task performance outputs, we demonstrate that LLMs can serve as a reliable and even superior alternative to human raters in evaluating knowledge-based performance outputs, which are a key contribution of knowledge workers. Our results suggest that GPT ratings are comparable to human ratings but exhibit higher consistency and reliability. Additionally, combined multiple GPT ratings on the same performance output show strong correlations with aggregated human performance ratings, akin to the consensus principle observed in performance evaluation literature. However, we also find that LLMs are prone to contextual biases, such as the halo effect, mirroring human evaluative biases. Our research suggests that while LLMs are capable of extracting meaningful constructs from text-based data, their scope is currently limited to specific forms of performance evaluation. By highlighting both the potential and limitations of LLMs, our study contributes to the discourse on AI's role in management studies and sets a foundation for future research to refine AI's theoretical and practical applications in management.

8/13/2024

Evaluation of the Programming Skills of Large Language Models

Luc Bryan Heitz, Joun Chamas, Christopher Scherb

The advent of Large Language Models (LLM) has revolutionized the efficiency and speed with which tasks are completed, marking a significant leap in productivity through technological innovation. As these chatbots tackle increasingly complex tasks, the challenge of assessing the quality of their outputs has become paramount. This paper critically examines the output quality of two leading LLMs, OpenAI's ChatGPT and Google's Gemini AI, by comparing the quality of programming code generated in both their free versions. Through the lens of a real-world example coupled with a systematic dataset, we investigate the code quality produced by these LLMs. Given their notable proficiency in code generation, this aspect of chatbot capability presents a particularly compelling area for analysis. Furthermore, the complexity of programming code often escalates to levels where its verification becomes a formidable task, underscoring the importance of our study. This research aims to shed light on the efficacy and reliability of LLMs in generating high-quality programming code, an endeavor that has significant implications for the field of software development and beyond.

5/24/2024

Using ChatGPT to Score Essays and Short-Form Constructed Responses

Mark D. Shermis

This study aimed to determine if ChatGPT's large language models could match the scoring accuracy of human and machine scores from the ASAP competition. The investigation focused on various prediction models, including linear regression, random forest, gradient boost, and boost. ChatGPT's performance was evaluated against human raters using quadratic weighted kappa (QWK) metrics. Results indicated that while ChatGPT's gradient boost model achieved QWKs close to human raters for some data sets, its overall performance was inconsistent and often lower than human scores. The study highlighted the need for further refinement, particularly in handling biases and ensuring scoring fairness. Despite these challenges, ChatGPT demonstrated potential for scoring efficiency, especially with domain-specific fine-tuning. The study concludes that ChatGPT can complement human scoring but requires additional development to be reliable for high-stakes assessments. Future research should improve model accuracy, address ethical considerations, and explore hybrid models combining ChatGPT with empirical methods.

8/20/2024