Grade Like a Human: Rethinking Automated Assessment with Large Language Models

2405.19694

Published 5/31/2024 by Wenjing Xie, Juxin Niu, Chun Jason Xue, Nan Guan

Abstract

While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a particular step in the grading procedure: grading using predefined rubrics. However, grading is a multifaceted procedure that encompasses other crucial steps, such as grading rubrics design and post-grading review. There has been a lack of systematic research exploring the potential of LLMs to enhance the entire grading process. In this paper, we propose an LLM-based grading system that addresses the entire grading procedure, including the following key components: 1) Developing grading rubrics that not only consider the questions but also the student answers, which can more accurately reflect students' performance. 2) Under the guidance of grading rubrics, providing accurate and consistent scores for each student, along with customized feedback. 3) Conducting post-grading review to better ensure accuracy and fairness. Additionally, we collected a new dataset named OS from a university operating system course and conducted extensive experiments on both our new dataset and the widely used Mohler dataset. Experiments demonstrate the effectiveness of our proposed approach, providing some new insights for developing automated grading systems based on LLMs.

Overview

This paper explores how large language models (LLMs) can support the entire grading procedure, not only the single step of scoring answers against predefined rubrics.
The researchers propose an LLM-based system that designs rubrics from both the questions and the student answers, scores each answer with customized feedback, and then reviews the assigned grades.
They evaluate the system on OS, a new dataset collected from a university operating system course, and on the widely used Mohler dataset.

Plain English Explanation

The paper looks at using powerful AI language models, known as large language models (LLMs), to automatically grade student answers and provide feedback. According to the abstract, these systems have not yet reached human-level performance at grading, especially on complex questions. The paper tries to narrow that gap by having the LLM work through grading the way a human grader would: design a rubric, score answers against it, and review the results afterwards.

Automating the time-consuming task of grading could free up teachers to focus more on instruction and personalized support. The proposed system also returns customized feedback with each score, which could give students more specific guidance on where their answers fall short.

Overall, the researchers want to understand how far LLMs can go when they are applied to the whole grading procedure rather than to rubric-based scoring alone. The abstract reports that experiments on two datasets demonstrate the effectiveness of this approach.

Technical Explanation

The paper proposes an LLM-based grading system that covers the full grading procedure instead of the single step most prior work studies, which is grading with predefined rubrics. The system has three components.

First, it develops grading rubrics that take into account the student answers as well as the questions, so that the rubric more accurately reflects how students actually performed. Second, guided by those rubrics, it assigns an accurate and consistent score to each student and produces customized feedback.

Third, it conducts a post-grading review to better ensure accuracy and fairness. The researchers evaluate the system on a new dataset named OS, collected from a university operating system course, and on the widely used Mohler dataset, and report that the experiments demonstrate the effectiveness of the approach.
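
The three components listed in the abstract (rubric design informed by student answers, rubric-guided scoring with feedback, and post-grading review) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the `SCORE:`/`FEEDBACK:` reply format, the re-grade-and-compare review rule, the `tolerance` parameter, and every function name here are hypothetical. `fake_llm` is a deterministic stand-in so the sketch runs without any model or API.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: any function mapping a prompt string to a reply string.
LLM = Callable[[str], str]


@dataclass
class GradedAnswer:
    student: str
    score: float
    feedback: str


def design_rubric(llm: LLM, question: str, sample_answers: List[str]) -> str:
    """Step 1: build a rubric from the question AND a sample of student answers."""
    joined = "\n".join(f"- {a}" for a in sample_answers)
    prompt = (
        "TASK: RUBRIC\n"
        f"Question: {question}\n"
        f"Sample student answers:\n{joined}\n"
        "Write a grading rubric that reflects the question and these answers."
    )
    return llm(prompt).strip()


def parse_grade(reply: str, max_score: float) -> Tuple[float, str]:
    """Extract a score (clamped to [0, max_score]) and feedback from a reply."""
    match = re.search(r"SCORE:\s*(-?\d+(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError("reply does not contain a 'SCORE:' line")
    score = min(max(float(match.group(1)), 0.0), max_score)
    fb = re.search(r"FEEDBACK:\s*(.*)", reply, re.DOTALL)
    return score, fb.group(1).strip() if fb else ""


def grade_answer(llm: LLM, question: str, rubric: str, student: str,
                 answer: str, max_score: float) -> GradedAnswer:
    """Step 2: score one answer under the rubric and collect feedback."""
    prompt = (
        "TASK: GRADE\n"
        f"Question: {question}\n"
        f"Rubric: {rubric}\n"
        f"Max score: {max_score}\n"
        "Reply as 'SCORE: <number>' then 'FEEDBACK: <text>'.\n"
        f"Student answer: {answer}"
    )
    score, feedback = parse_grade(llm(prompt), max_score)
    return GradedAnswer(student, score, feedback)


def review_grades(llm: LLM, question: str, rubric: str, answers: Dict[str, str],
                  graded: List[GradedAnswer], max_score: float,
                  tolerance: float = 1.0) -> List[str]:
    """Step 3 (assumed rule): re-grade each answer and flag large disagreements."""
    flagged = []
    for g in graded:
        second = grade_answer(llm, question, rubric, g.student,
                              answers[g.student], max_score)
        if abs(second.score - g.score) > tolerance:
            flagged.append(g.student)
    return flagged


def run_pipeline(llm: LLM, question: str, answers: Dict[str, str],
                 max_score: float):
    """Run rubric design, grading, and review in order."""
    rubric = design_rubric(llm, question, list(answers.values()))
    graded = [grade_answer(llm, question, rubric, s, a, max_score)
              for s, a in answers.items()]
    flagged = review_grades(llm, question, rubric, answers, graded, max_score)
    return rubric, graded, flagged


def fake_llm(prompt: str) -> str:
    """Deterministic stub: counts two keywords in the student answer."""
    if prompt.startswith("TASK: RUBRIC"):
        return "Award 1 point for each of: 'process', 'scheduler'."
    answer = prompt.split("Student answer:", 1)[1].lower()
    score = sum(word in answer for word in ("process", "scheduler"))
    return f"SCORE: {score}\nFEEDBACK: matched {score} of 2 rubric items."
```

To try it, call `run_pipeline(fake_llm, "What does a CPU scheduler do?", {"s1": "...", "s2": "..."}, 2.0)` and replace `fake_llm` with a function that calls a real model. Keeping the model behind a plain callable is a design choice for this sketch: it makes each stage testable on its own and keeps the example independent of any particular provider.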

Critical Analysis

The research presented in this paper is promising, but its scope should be kept in mind. The experiments summarized in the abstract cover two datasets: OS, collected from a single university operating system course, and the Mohler dataset. Results on a wider range of subjects and question types would help establish how well the approach generalizes.

Additionally, the abstract itself notes that LLMs have not yet matched human performance at grading, especially on complex questions. The proposed system is presented as effective, not as a replacement for human judgment, and the quality of its rubrics, feedback, and post-grading review deserves the same scrutiny as its scores.

Overall, this research represents an important step in exploring the potential of LLMs to revolutionize automated assessment and feedback in education. However, continued research and careful consideration of the ethical implications will be crucial as these technologies continue to develop.

Conclusion

This paper presents a case for using large language models (LLMs) across the whole grading procedure rather than only for scoring against predefined rubrics. The researchers propose a system that designs rubrics informed by student answers, scores each answer consistently under those rubrics, and reviews the grades afterwards, and they report that experiments on the OS and Mohler datasets demonstrate its effectiveness.

The system also produces customized feedback alongside each score. If that feedback proves reliable in practice, it could give students tailored guidance on how to improve their answers.

While this research is promising, further refinement and a careful look at limitations and ethical considerations are still needed. As these technologies continue to advance, it will be crucial to ensure they are developed and deployed in a responsible manner that prioritizes student learning and growth.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Large Language Models as Partners in Student Essay Evaluation

Toru Ishida, Tongxi Liu, Hailong Wang, William K. Cheung

As the importance of comprehensive evaluation in workshop courses increases, there is a growing demand for efficient and fair assessment methods that reduce the workload for faculty members. This paper presents an evaluation conducted with Large Language Models (LLMs) using actual student essays in three scenarios: 1) without providing guidance such as rubrics, 2) with pre-specified rubrics, and 3) through pairwise comparison of essays. Quantitative analysis of the results revealed a strong correlation between LLM and faculty member assessments in the pairwise comparison scenario with pre-specified rubrics, although concerns about the quality and stability of evaluations remained. Therefore, we conducted a qualitative analysis of LLM assessment comments, showing that: 1) LLMs can match the assessment capabilities of faculty members, 2) variations in LLM assessments should be interpreted as diversity rather than confusion, and 3) assessments by humans and LLMs can differ and complement each other. In conclusion, this paper suggests that LLMs should not be seen merely as assistants to faculty members but as partners in evaluation committees and outlines directions for further research.

5/30/2024

cs.CY cs.AI

Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs

Changrong Xiao, Wenxing Ma, Qingping Song, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, Qi Fu

Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable consistency, generalizability, and explainability. We propose an open-source LLM-based AES system, inspired by the dual-process theory. Our system offers accurate grading and high-quality feedback, at least comparable to that of fine-tuned proprietary LLMs, in addition to its ability to alleviate misgrading. Furthermore, we conduct human-AI co-grading experiments with both novice and expert graders. We find that our system not only automates the grading process but also enhances the performance and efficiency of human graders, particularly for essays where the model has lower confidence. These results highlight the potential of LLMs to facilitate effective human-AI collaboration in the educational context, potentially transforming learning experiences through AI-generated feedback.

6/18/2024

cs.CL cs.AI

Grading Massive Open Online Courses Using Large Language Models

Shahriar Golchin, Nikhil Garuda, Christopher Impey, Matthew Wenger

Massive open online courses (MOOCs) offer free education globally to anyone with a computer and internet access. Despite this democratization of learning, the massive enrollment in these courses makes it impractical for one instructor to assess every student's writing assignment. As a result, peer grading, often guided by a straightforward rubric, is the method of choice. While convenient, peer grading often falls short in terms of reliability and validity. In this study, we explore the feasibility of using large language models (LLMs) to replace peer grading in MOOCs. Specifically, we use two LLMs, GPT-4 and GPT-3.5, across three MOOCs: Introductory Astronomy, Astrobiology, and the History and Philosophy of Astronomy. To instruct LLMs, we use three different prompts based on the zero-shot chain-of-thought (ZCoT) prompting technique: (1) ZCoT with instructor-provided correct answers, (2) ZCoT with both instructor-provided correct answers and rubrics, and (3) ZCoT with instructor-provided correct answers and LLM-generated rubrics. Tested on 18 settings, our results show that ZCoT, when augmented with instructor-provided correct answers and rubrics, produces grades that are more aligned with those assigned by instructors compared to peer grading. Finally, our findings indicate a promising potential for automated grading systems in MOOCs, especially in subjects with well-defined rubrics, to improve the learning experience for millions of online learners worldwide.

6/18/2024

cs.CL cs.AI

Investigating Automatic Scoring and Feedback using Large Language Models

Gloria Ashiya Katuka, Alexander Gain, Yen-Yun Yu

Automatic grading and feedback have been long studied using traditional machine learning and deep learning techniques using language models. With the recent accessibility to high performing large language models (LLMs) like LLaMA-2, there is an opportunity to investigate the use of these LLMs for automatic grading and feedback generation. Despite the increase in performance, LLMs require significant computational resources for fine-tuning and additional specific adjustments to enhance their performance for such tasks. To address these issues, Parameter Efficient Fine-tuning (PEFT) methods, such as LoRA and QLoRA, have been adopted to decrease memory and computational requirements in model fine-tuning. This paper explores the efficacy of PEFT-based quantized models, employing classification or regression head, to fine-tune LLMs for automatically assigning continuous numerical grades to short answers and essays, as well as generating corresponding feedback. We conducted experiments on both proprietary and open-source datasets for our tasks. The results show that prediction of grade scores via finetuned LLMs are highly accurate, achieving less than 3% error in grade percentage on average. For providing graded feedback fine-tuned 4-bit quantized LLaMA-2 13B models outperform competitive base models and achieve high similarity with subject matter expert feedback in terms of high BLEU and ROUGE scores and qualitatively in terms of feedback. The findings from this study provide important insights into the impacts of the emerging capabilities of using quantization approaches to fine-tune LLMs for various downstream tasks, such as automatic short answer scoring and feedback generation at comparatively lower costs and latency.

5/2/2024

cs.CL cs.LG