Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition

Read original: arXiv:2407.05733 - Published 7/9/2024 by Seungju Kim, Meounggun Jo


Overview

This paper examines whether the powerful GPT-4 language model alone is sufficient for automated essay scoring, or if additional approaches are needed.
The researchers use a comparative judgment method based on how human raters evaluate essays to assess GPT-4's performance.
The findings suggest that while GPT-4 can provide useful insights, it may not be enough on its own for high-quality automated essay scoring, and a hybrid human-AI system could be more effective.

Plain English Explanation

This research paper asks whether GPT-4, on its own, is good enough to score student essays automatically. The researchers used a method called "comparative judgment" to see how GPT-4's evaluations compare with the way human raters judge essays.

The key idea is that humans don't just look at an essay in isolation - they compare it to other essays and make relative judgments. The researchers wanted to see if GPT-4 could mimic this kind of comparative thinking that human raters use.

The results suggest that while GPT-4 can provide some useful insights, it may not be sufficient on its own for high-quality automated essay scoring. The researchers think a combined approach, where GPT-4 works together with human raters, could be more effective than just relying on the AI model alone. This hybrid human-AI system could leverage the strengths of both to produce better essay evaluations.

Technical Explanation

The paper asks whether GPT-4 alone is sufficient for automated essay scoring, or whether additional methods are needed. To answer this, the researchers pair the model with comparative judgment (CJ): instead of assigning each essay a rubric score in isolation, GPT-4 is shown two essays in a zero-shot prompt and asked to choose the better one, mirroring the relative judgments human raters make.
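To make the comparative judgment idea concrete, here is a minimal sketch of how a set of pairwise "which essay is better" decisions can be turned into a per-essay score. It assumes a Bradley-Terry model, a common choice in comparative judgment work; this summary does not say which aggregation the authors used, so treat the model, the function name `bradley_terry`, the `smoothing` parameter, and the toy essay labels as illustrative assumptions. In the setting described above, the (winner, loser) pairs would come from GPT-4's pairwise choices.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def bradley_terry(
    comparisons: Iterable[Tuple[str, str]],
    iterations: int = 200,
    smoothing: float = 0.1,
) -> Dict[str, float]:
    """Estimate a strength per essay from (winner, loser) pairs.

    Uses the standard minorization-maximization update for the
    Bradley-Terry model. `smoothing` gives both essays in every
    comparison a small fractional win so that an essay that never wins
    still gets a positive strength. It is an illustrative choice (not
    taken from the paper) and should be kept greater than zero.
    """
    wins: Dict[str, float] = defaultdict(float)
    games: Dict[Tuple[str, str], float] = defaultdict(float)
    essays = set()

    for winner, loser in comparisons:
        if winner == loser:
            continue  # a self-comparison carries no information
        essays.update((winner, loser))
        wins[winner] += 1.0 + smoothing
        wins[loser] += smoothing
        games[(winner, loser)] += 1.0 + 2.0 * smoothing
        games[(loser, winner)] += 1.0 + 2.0 * smoothing

    if not essays:
        return {}

    strength = {essay: 1.0 for essay in essays}
    for _ in range(iterations):
        updated = {}
        for essay in essays:
            # Sum over every opponent this essay was compared with.
            denom = sum(
                count / (strength[essay] + strength[other])
                for (first, other), count in games.items()
                if first == essay
            )
            updated[essay] = wins[essay] / denom
        # Rescale so the strengths average to 1.0 (the scale is arbitrary).
        scale = len(essays) / sum(updated.values())
        strength = {essay: value * scale for essay, value in updated.items()}
    return strength


# Toy pairwise outcomes, written as (winner, loser). Hypothetical data.
comparisons = [
    ("strong", "medium"), ("strong", "medium"),
    ("strong", "weak"), ("strong", "weak"),
    ("medium", "weak"), ("medium", "weak"),
    ("weak", "medium"),
]
strengths = bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

The resulting strengths only define a relative ordering. Mapping them onto a rubric scale (for example, by anchoring against essays with known human scores) is a separate step that this sketch does not attempt.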

Specifically, the study looks at how well GPT-4 can replicate the "dual process" of human raters, who both make holistic assessments and consider specific analytic traits. The researchers also examine how closely the rationale behind GPT-4's scoring aligns with human raters' reasoning.

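The findings suggest that while GPT-4 can provide useful insights, it may not be enough on its own for high-quality automated essay scoring. According to the authors' abstract, the comparative judgment method surpasses traditional rubric-based scoring when LLMs are used for essay scoring. The researchers believe a hybrid human-AI system could be more effective by combining the strengths of both approaches.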

Critical Analysis

The paper provides a thoughtful exploration of GPT-4's capabilities for automated essay scoring, acknowledging both its potential and limitations. The comparative judgment methodology is a strength, as it aims to capture the nuanced, context-dependent nature of how humans evaluate essays.

However, the study is limited to a relatively small dataset and specific essay prompts. More research would be needed to generalize the findings and understand how GPT-4 might perform on a wider range of essay types and topics.

Additionally, the paper does not delve deeply into potential biases or fairness concerns that could arise from over-reliance on an AI system for high-stakes essay scoring. Further research on these issues would be valuable.

Overall, the paper makes a compelling case that a hybrid human-AI approach may be more effective than GPT-4 alone for automated essay scoring. Encouraging critical thinking about the limitations of language models in this domain is an important contribution.

Conclusion

This research paper examines the capabilities and limitations of the GPT-4 language model for automated essay scoring. While GPT-4 shows promise, the findings suggest it may not be sufficient on its own for high-quality essay evaluation.

The researchers propose that a hybrid system, combining GPT-4's strengths with human rater cognition, could be a more effective approach. This would leverage the complementary abilities of AI and humans to provide more robust and reliable automated essay scoring.

The paper's comparative judgment methodology and focus on the nuances of human rater decision-making offer valuable insights. Further research is needed to explore these issues more broadly, but this study highlights important considerations for the development of automated essay scoring systems.



Related Papers

Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition

Seungju Kim, Meounggun Jo

Large Language Models (LLMs) have shown promise in Automated Essay Scoring (AES), but their zero-shot and few-shot performance often falls short compared to state-of-the-art models and human raters. However, fine-tuning LLMs for each specific task is impractical due to the variety of essay prompts and rubrics used in real-world educational contexts. This study proposes a novel approach combining LLMs and Comparative Judgment (CJ) for AES, using zero-shot prompting to choose between two essays. We demonstrate that a CJ method surpasses traditional rubric-based scoring in essay scoring using LLMs.

7/9/2024

Can Large Language Models Automatically Score Proficiency of Written Essays?

Watheq Mansour, Salam Albatarni, Sohaila Eltanbouly, Tamer Elsayed

Although several methods were proposed to address the problem of automated essay scoring (AES) in the last 50 years, there is still much to desire in terms of effectiveness. Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks. In this paper, we test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays. We experimented with two popular LLMs, namely ChatGPT and Llama. We aim to check if these models can do this task and, if so, how their performance is positioned among the state-of-the-art (SOTA) models across two levels, holistically and per individual writing trait. We utilized prompt-engineering tactics in designing four different prompts to bring their maximum potential to this task. Our experiments conducted on the ASAP dataset revealed several interesting observations. First, choosing the right prompt depends highly on the model and nature of the task. Second, the two LLMs exhibited comparable average performance in AES, with a slight advantage for ChatGPT. Finally, despite the performance gap between the two LLMs and SOTA models in terms of predictions, they provide feedback to enhance the quality of the essays, which can potentially help both teachers and students.

4/17/2024

Are Large Language Models Good Essay Graders?

Anindita Kundu, Denilson Barbosa

We evaluate the effectiveness of Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading. More precisely, we evaluate ChatGPT and Llama in the Automated Essay Scoring (AES) task, a crucial natural language processing (NLP) application in Education. We consider both zero-shot and few-shot learning and different prompting approaches. We compare the numeric grade provided by the LLMs to human rater-provided scores utilizing the ASAP dataset, a well-known benchmark for the AES task. Our research reveals that both LLMs generally assign lower scores compared to those provided by the human raters; moreover, those scores do not correlate well with those provided by the humans. In particular, ChatGPT tends to be harsher and further misaligned with human evaluations than Llama. We also experiment with a number of essay features commonly used by previous AES methods, related to length, usage of connectives and transition words, and readability metrics, including the number of spelling and grammar mistakes. We find that, generally, none of these features correlates strongly with human or LLM scores. Finally, we report results on Llama 3, which are generally better across the board, as expected. Overall, while LLMs do not seem an adequate replacement for human grading, our results are somewhat encouraging for their use as a tool to assist humans in the grading of written essays in the future.

9/23/2024

Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs

Changrong Xiao, Wenxing Ma, Qingping Song, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, Qi Fu

Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable consistency, generalizability, and explainability. We propose an open-source LLM-based AES system, inspired by the dual-process theory. Our system offers accurate grading and high-quality feedback, at least comparable to that of fine-tuned proprietary LLMs, in addition to its ability to alleviate misgrading. Furthermore, we conduct human-AI co-grading experiments with both novice and expert graders. We find that our system not only automates the grading process but also enhances the performance and efficiency of human graders, particularly for essays where the model has lower confidence. These results highlight the potential of LLMs to facilitate effective human-AI collaboration in the educational context, potentially transforming learning experiences through AI-generated feedback.

6/18/2024