Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring

Read original: arXiv:2407.18328 - Published 7/29/2024 by Xuansheng Wu, Padmaja Pravin Saraf, Gyeong-Geon Lee, Ehsan Latif, Ninghao Liu, Xiaoming Zhai


Overview

This paper explores the differences between how large language models (LLMs) and human graders assess and score written responses.
The researchers conducted experiments to understand the scoring processes and decision-making of LLMs compared to expert human graders.
Key findings include a notable alignment gap between LLM and human graders, a tendency for LLMs to take scoring shortcuts, and the potential for human-AI collaboration in automated assessment.

Plain English Explanation

The paper looks at how AI systems that use large language models (LLMs) score or grade written responses, and how this compares to how human experts grade the same responses. The researchers did experiments to understand the different ways LLMs and human graders assess and make decisions about the quality of written work.

The main findings provide insights into the strengths and weaknesses of using LLMs for automated scoring. For example, LLMs can adapt quickly to a scoring task, but they often take shortcuts and skip the deeper logical reasoning that human graders are expected to apply. This suggests there could be value in having humans and AI systems work together on grading, combining their respective strengths.

Overall, the research aims to shed light on the "black box" of automated scoring systems, to better understand how they differ from human evaluation. This could lead to improvements in AI-based grading and assessment tools, and open up new ways for humans and machines to collaborate on evaluating written work.

Technical Explanation

The paper presents an empirical study that compares the scoring processes and decision-making of large language models (LLMs) and human expert graders when evaluating written responses. The researchers conducted experiments on students' written responses to science tasks, prompting LLMs to generate the analytic rubrics they use to assign scores and comparing those rubrics, and the resulting scores, with human grading.

To understand the underlying scoring mechanisms, the researchers analyzed various aspects of the grading process, including:

The features and criteria used by LLMs and humans to assess response quality
The degree of alignment between LLM scores and human scores
The types of errors or biases exhibited by the LLM-based system compared to human graders
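The second item in this list, alignment between LLM scores and human scores, can be made concrete with a small sketch. The summary does not say which agreement metric the paper reports, so the use of exact agreement and unweighted Cohen's kappa here is an assumption, and the score lists are made-up illustration data rather than results from the study.

```python
from collections import Counter


def exact_agreement(human, llm):
    """Fraction of responses where both graders gave the same score."""
    if len(human) != len(llm) or not human:
        raise ValueError("score lists must be non-empty and equal in length")
    return sum(h == m for h, m in zip(human, llm)) / len(human)


def cohen_kappa(human, llm):
    """Chance-corrected agreement between two graders (unweighted Cohen's kappa)."""
    n = len(human)
    observed = exact_agreement(human, llm)
    human_counts = Counter(human)
    llm_counts = Counter(llm)
    # Agreement expected if the two graders assigned scores independently.
    expected = sum(human_counts[c] * llm_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores on a 0-2 scale for eight responses (illustration only).
human_scores = [0, 1, 2, 2, 1, 0, 2, 1]
llm_scores = [0, 1, 2, 1, 1, 0, 2, 2]

print(exact_agreement(human_scores, llm_scores))
print(round(cohen_kappa(human_scores, llm_scores), 3))
```

Kappa is lower than raw agreement because it discounts the matches two graders would reach by chance, which matters when most responses fall into one or two score levels.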

The findings indicate that while LLMs can achieve overall scoring performance comparable to that of humans, the specific factors and decision-making processes they apply differ significantly. For example, LLMs may rely on shortcuts such as surface-level linguistic features, while human graders take a more holistic, content-focused approach.
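One way to picture a comparison of grading criteria is to treat each grader's rubric as a set of criteria and measure the overlap. The sketch below is an illustration only: the function name `rubric_overlap`, the example criteria, and the exact-string matching are all assumptions, and a real comparison would likely need semantic matching or expert judgment rather than string equality.

```python
def rubric_overlap(human_rubric, llm_rubric):
    """Compare two rubrics given as lists of criterion strings.

    Matching is by normalized exact string, a deliberate simplification.
    """
    human = {c.strip().lower() for c in human_rubric}
    llm = {c.strip().lower() for c in llm_rubric}
    shared = human & llm
    precision = len(shared) / len(llm) if llm else 0.0
    recall = len(shared) / len(human) if human else 0.0
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0
    return {
        "precision": precision,   # share of LLM criteria that humans also use
        "recall": recall,         # share of human criteria the LLM recovered
        "f1": f1,
        "missing": sorted(human - llm),  # human criteria the LLM skipped
        "extra": sorted(llm - human),    # criteria only the LLM used
    }


# Hypothetical rubrics for a science task (not taken from the paper).
human_rubric = [
    "Mentions energy transfer",
    "Links temperature to particle motion",
    "Uses evidence from the data",
]
llm_rubric = [
    "mentions energy transfer",
    "Response is long enough",
    "uses evidence from the data",
    "Uses scientific vocabulary",
]

result = rubric_overlap(human_rubric, llm_rubric)
print(result["missing"])
print(result["extra"])
```

In this toy example, the "extra" criteria are surface-level checks and the "missing" criterion is a reasoning step, which is the kind of difference the paragraph above describes.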

These insights suggest that LLM-based scoring systems and human graders offer complementary strengths, and that a hybrid human-AI approach could leverage the best of both to improve automated assessment. The research also highlights the importance of understanding and addressing potential biases or limitations in AI-powered scoring systems.
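A hybrid workflow of this kind could take many forms, and the summary does not describe a specific one. The sketch below shows one simple possibility under stated assumptions: each LLM score comes with a confidence value, and anything below a threshold is sent to a human grader. The `LLMScore` type, the `route` function, the confidence field, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LLMScore:
    response_id: str
    score: int
    confidence: float  # hypothetical calibrated value in [0, 1]


def route(scored, threshold=0.8):
    """Split LLM-scored responses into auto-accepted and human-review queues."""
    accepted, needs_review = [], []
    for item in scored:
        if item.confidence >= threshold:
            accepted.append(item)
        else:
            needs_review.append(item)
    return accepted, needs_review


# Made-up batch of three scored responses.
batch = [
    LLMScore("r1", 2, 0.95),
    LLMScore("r2", 1, 0.55),
    LLMScore("r3", 0, 0.80),
]

accepted, needs_review = route(batch)
print([s.response_id for s in accepted])
print([s.response_id for s in needs_review])
```

The threshold trades human workload against the risk of accepting a wrong score, so in practice it would need to be tuned against human-graded samples.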

Critical Analysis

The paper provides a thorough and thoughtful analysis of the differences between LLM-based scoring and human grading. The experimental design and methodological approach seem sound, and the findings offer valuable insights into the inner workings of automated assessment systems.

One potential limitation is that the study focuses on a specific dataset and task, so the generalizability of the results to other types of written responses or assessment contexts may be limited. Additional research is needed to explore the broader applicability of these findings and whether the patterns hold true across a wider range of scenarios.

The paper also does not delve deeply into the implications of the observed biases and errors in LLM-based scoring. Further investigation is warranted to understand the potential impact on student learning and fairness in assessment, and to develop strategies for mitigating such issues.

Overall, this study represents an important step forward in unveiling the "black box" of automated scoring, and in exploring the possibilities for fruitful collaboration between humans and AI systems in the context of written assessment.

Conclusion

This paper offers valuable insights into the differences between how large language models (LLMs) and human experts approach the task of scoring written responses. The key findings suggest that while LLMs can achieve comparable overall scoring performance, there are significant differences in the specific factors and decision-making processes they apply compared to human graders.

These insights have important implications for the development and deployment of automated assessment systems, as they highlight both the strengths and limitations of LLM-based approaches. The research suggests that a hybrid human-AI approach, combining the complementary strengths of LLMs and human graders, could be a promising direction for improving the quality and fairness of automated assessment.

Overall, this study contributes to a better understanding of the "black box" of automated scoring, and paves the way for further research and innovation in the field of educational technology and assessment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring

Xuansheng Wu, Padmaja Pravin Saraf, Gyeong-Geon Lee, Ehsan Latif, Ninghao Liu, Xiaoming Zhai

Large language models (LLMs) have demonstrated strong potential in performing automatic scoring for constructed response assessments. While constructed responses graded by humans are usually based on given grading rubrics, the methods by which LLMs assign scores remain largely unclear. It is also uncertain how closely AI's scoring process mirrors that of humans, or if it adheres to the same grading criteria. To address this gap, this paper uncovers the grading rubrics that LLMs used to score students' written responses to science tasks and their alignment with human scores. We also examine whether enhancing the alignments can improve scoring accuracy. Specifically, we prompt LLMs to generate analytic rubrics that they use to assign scores and study the alignment gap with human grading rubrics. Based on a series of experiments with various configurations of LLM settings, we reveal a notable alignment gap between human and LLM graders. While LLMs can adapt quickly to scoring tasks, they often resort to shortcuts, bypassing deeper logical reasoning expected in human grading. We found that incorporating high-quality analytical rubrics designed to reflect human grading logic can mitigate this gap and enhance LLMs' scoring accuracy. These results caution against the simplistic application of LLMs in science education and highlight the importance of aligning LLM outputs with human expectations to ensure efficient and accurate automatic scoring.

7/29/2024

Grade Like a Human: Rethinking Automated Assessment with Large Language Models

Wenjing Xie, Juxin Niu, Chun Jason Xue, Nan Guan

While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a particular step in the grading procedure: grading using predefined rubrics. However, grading is a multifaceted procedure that encompasses other crucial steps, such as grading rubrics design and post-grading review. There has been a lack of systematic research exploring the potential of LLMs to enhance the entire grading process. In this paper, we propose an LLM-based grading system that addresses the entire grading procedure, including the following key components: 1) Developing grading rubrics that not only consider the questions but also the student answers, which can more accurately reflect students' performance. 2) Under the guidance of grading rubrics, providing accurate and consistent scores for each student, along with customized feedback. 3) Conducting post-grading review to better ensure accuracy and fairness. Additionally, we collected a new dataset named OS from a university operating system course and conducted extensive experiments on both our new dataset and the widely used Mohler dataset. Experiments demonstrate the effectiveness of our proposed approach, providing some new insights for developing automated grading systems based on LLMs.

5/31/2024

Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs

Changrong Xiao, Wenxing Ma, Qingping Song, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, Qi Fu

Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable consistency, generalizability, and explainability. We propose an open-source LLM-based AES system, inspired by the dual-process theory. Our system offers accurate grading and high-quality feedback, at least comparable to that of fine-tuned proprietary LLMs, in addition to its ability to alleviate misgrading. Furthermore, we conduct human-AI co-grading experiments with both novice and expert graders. We find that our system not only automates the grading process but also enhances the performance and efficiency of human graders, particularly for essays where the model has lower confidence. These results highlight the potential of LLMs to facilitate effective human-AI collaboration in the educational context, potentially transforming learning experiences through AI-generated feedback.

6/18/2024


Towards LLM-based Autograding for Short Textual Answers

Johannes Schneider, Bernd Schenk, Christina Niklaus

Grading exams is an important, labor-intensive, subjective, repetitive, and frequently challenging task. The feasibility of autograding textual responses has greatly increased thanks to the availability of large language models (LLMs) such as ChatGPT and the substantial influx of data brought about by digitalization. However, entrusting AI models with decision-making roles raises ethical considerations, mainly stemming from potential biases and issues related to generating false information. Thus, in this manuscript, we provide an evaluation of a large language model for the purpose of autograding, while also highlighting how LLMs can support educators in validating their grading procedures. Our evaluation is targeted towards automatic short textual answers grading (ASAG), spanning various languages and examinations from two distinct courses. Our findings suggest that while out-of-the-box LLMs provide a valuable tool to provide a complementary perspective, their readiness for independent automated grading remains a work in progress, necessitating human oversight.

7/9/2024