Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs

2401.06431

Published 6/18/2024 by Changrong Xiao, Wenxing Ma, Qingping Song, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, Qi Fu

Abstract

Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable consistency, generalizability, and explainability. We propose an open-source LLM-based AES system, inspired by the dual-process theory. Our system offers accurate grading and high-quality feedback, at least comparable to that of fine-tuned proprietary LLMs, in addition to its ability to alleviate misgrading. Furthermore, we conduct human-AI co-grading experiments with both novice and expert graders. We find that our system not only automates the grading process but also enhances the performance and efficiency of human graders, particularly for essays where the model has lower confidence. These results highlight the potential of LLMs to facilitate effective human-AI collaboration in the educational context, potentially transforming learning experiences through AI-generated feedback.

Overview

This paper explores how large language models (LLMs) can be leveraged to enhance automated essay scoring (AES) systems, transitioning from pure automation to human-AI augmentation.
The authors investigate the potential of LLMs to score essays more accurately and provide detailed feedback, going beyond traditional AES approaches.
The research delves into the technical aspects of using LLMs for essay scoring, as well as the practical implications and challenges of integrating these models into real-world educational settings.

Plain English Explanation

The paper discusses how advanced AI language models can be used to improve the way essays are automatically scored and assessed. Automated essay scoring (AES) systems have been around for a while, but they have limitations in terms of the depth and nuance of their feedback. The authors explore how large language models (LLMs) can be leveraged to take AES to the next level, moving from simple automation to a more collaborative human-AI approach.

The key idea is that LLMs, with their advanced natural language understanding capabilities, can provide more detailed and insightful feedback on student essays. Instead of just generating a score, these models can identify specific strengths and weaknesses in the writing, offer suggestions for improvement, and even engage in interactive dialogues with students to help them refine their work.

The paper delves into the technical details of how LLMs can be fine-tuned and integrated into AES systems. It also explores the potential challenges and limitations of this approach, such as ensuring the rationale and decisions of the models are aligned with human assessors and addressing potential biases in the models.

Overall, the research suggests that the integration of LLMs into automated essay scoring systems could lead to significant advancements in the field, providing students with more valuable and personalized feedback to improve their writing skills.

Technical Explanation

The paper investigates the potential of large language models (LLMs) to enhance the capabilities of automated essay scoring (AES) systems. The authors propose a shift from pure automation to a human-AI augmentation approach, where LLMs are leveraged to provide more nuanced and insightful feedback on student essays.
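The abstract describes a dual-process design and notes that human graders benefit most on essays where the model has lower confidence. One way to picture that hand-off is a confidence threshold that sends uncertain essays to a slower review path. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `FastResult`, `route_essays`, `toy_scorer`, and the `0.7` threshold are all hypothetical names and values, and how the paper actually derives confidence is not stated in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class FastResult:
    score: int         # predicted essay score
    confidence: float  # value in [0, 1]; how it is derived is an assumption


def route_essays(
    essays: List[str],
    fast_scorer: Callable[[str], FastResult],
    threshold: float = 0.7,  # hypothetical cut-off, not taken from the paper
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Split essays into auto-accepted scores and a review queue."""
    accepted: List[Tuple[str, int]] = []
    needs_review: List[str] = []
    for essay in essays:
        result = fast_scorer(essay)
        if result.confidence >= threshold:
            # Confident fast-path prediction: keep the score as-is.
            accepted.append((essay, result.score))
        else:
            # Low confidence: escalate to the slower path, for example
            # a detailed rationale step and/or a human grader.
            needs_review.append(essay)
    return accepted, needs_review


def toy_scorer(essay: str) -> FastResult:
    """Stand-in scorer for demonstration only: longer text is more 'confident'."""
    words = len(essay.split())
    return FastResult(score=min(6, 1 + words // 2), confidence=min(1.0, words / 10))
```

In practice `fast_scorer` would wrap a real scoring model; the toy version exists only so the routing logic can be run end to end. The threshold trades automation against review workload, and would need to be tuned on held-out data.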

The researchers explore the technical aspects of integrating LLMs into AES systems. This includes fine-tuning the language models on domain-specific essay datasets and developing strategies to align the models' decision-making with human assessor rationales. The paper also addresses potential challenges, such as mitigating biases in the models and ensuring transparency in the scoring and feedback process.

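Through experiments on public and private datasets, the authors find that LLMs do not surpass conventional state-of-the-art (SOTA) grading models in scoring performance, but that they exhibit notable consistency, generalizability, and explainability. The proposed open-source system is reported to grade accurately, to produce feedback at least comparable to that of fine-tuned proprietary LLMs, and to alleviate misgrading. In co-grading experiments with novice and expert graders, it also improved human performance and efficiency, particularly on essays where the model had lower confidence.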
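Comparisons of scoring accuracy in AES are usually expressed as agreement between model scores and human scores. This summary does not name the metric the paper uses; quadratic weighted kappa (QWK) is a common choice in AES work and is assumed here purely for illustration. The function name and signature below are hypothetical.

```python
from typing import List


def quadratic_weighted_kappa(
    human: List[int], model: List[int], min_score: int, max_score: int
) -> float:
    """Agreement between two integer score lists; 1.0 means perfect agreement."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")

    n = max_score - min_score + 1
    total = len(human)

    # Confusion matrix of observed (human, model) score pairs.
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    # Marginal histograms give the agreement expected by chance.
    hist_human = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty: larger score gaps cost more.
            weight = (i - j) ** 2 / (n - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_human[i] * hist_model[j] / total

    if weighted_expected == 0.0:
        # Both raters used a single identical score; treat as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A value near 1 indicates close agreement with human raters, while a value near 0 indicates agreement no better than chance. Because the penalty grows with the square of the score gap, a model that is off by one point is treated far more leniently than one that is off by three.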

The paper also explores the practical implications of integrating LLM-based AES into real-world educational settings. It discusses considerations around model deployment, user experience, and the potential impact on teaching and learning practices.

Critical Analysis

The paper presents a well-designed and thoughtful exploration of the potential for large language models to revolutionize the field of automated essay scoring. The authors acknowledge the limitations of current AES systems and make a compelling case for the value of transitioning to a human-AI augmentation approach.

One key strength of the research is the thorough investigation of the technical challenges involved in integrating LLMs into AES. The authors address important issues such as model fine-tuning, bias mitigation, and transparency in decision-making. This level of detail helps to build confidence in the feasibility and robustness of the proposed approach.

However, the paper could be strengthened by a more comprehensive discussion of the potential drawbacks and ethical considerations of LLM-powered AES. While the authors touch on the importance of aligning model decisions with human assessors, there may be additional concerns around privacy, fairness, and the impact on teaching practices that warrant further exploration.

Additionally, the paper would benefit from a more critical analysis of the limitations of the current research. The authors present promising results, but the generalizability and long-term viability of the approach may be influenced by factors such as the size and diversity of the training datasets, the evolution of language models, and the changing needs of educational institutions.

Overall, the research presented in this paper represents an important step forward in the field of automated essay scoring. The integration of large language models has the potential to transform the landscape, but continued scrutiny and responsible development will be essential to ensure the successful deployment of these technologies in real-world educational settings.

Conclusion

This paper explores the potential of large language models to extend the capabilities of automated essay scoring systems. By moving from pure automation to human-AI augmentation, the authors show that LLMs, although they do not outperform conventional SOTA grading models on scoring accuracy, can provide consistent and explainable scores, high-quality feedback, and measurable support for human graders.

The technical details and practical considerations outlined in the research suggest that the integration of LLMs into AES could lead to significant advancements in the field. However, the authors also recognize the need to address important challenges, such as ensuring model transparency, mitigating biases, and aligning the decision-making of the AI systems with human assessors.

As the use of large language models continues to evolve, this paper serves as a valuable contribution to the ongoing discussion around the role of AI in educational assessment and feedback. By striking a balance between the advantages of automation and the nuance of human expertise, the proposed approach holds the promise of enhancing the writing skills and learning experiences of students across a wide range of educational contexts.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Can Large Language Models Automatically Score Proficiency of Written Essays?

Watheq Mansour, Salam Albatarni, Sohaila Eltanbouly, Tamer Elsayed

Although several methods were proposed to address the problem of automated essay scoring (AES) in the last 50 years, there is still much to desire in terms of effectiveness. Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks. In this paper, we test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays. We experimented with two popular LLMs, namely ChatGPT and Llama. We aim to check if these models can do this task and, if so, how their performance is positioned among the state-of-the-art (SOTA) models across two levels, holistically and per individual writing trait. We utilized prompt-engineering tactics in designing four different prompts to bring their maximum potential to this task. Our experiments conducted on the ASAP dataset revealed several interesting observations. First, choosing the right prompt depends highly on the model and nature of the task. Second, the two LLMs exhibited comparable average performance in AES, with a slight advantage for ChatGPT. Finally, despite the performance gap between the two LLMs and SOTA models in terms of predictions, they provide feedback to enhance the quality of the essays, which can potentially help both teachers and students.

4/17/2024

cs.CL cs.AI

Grade Like a Human: Rethinking Automated Assessment with Large Language Models

Wenjing Xie, Juxin Niu, Chun Jason Xue, Nan Guan

While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a particular step in the grading procedure: grading using predefined rubrics. However, grading is a multifaceted procedure that encompasses other crucial steps, such as grading rubrics design and post-grading review. There has been a lack of systematic research exploring the potential of LLMs to enhance the entire grading process. In this paper, we propose an LLM-based grading system that addresses the entire grading procedure, including the following key components: 1) Developing grading rubrics that not only consider the questions but also the student answers, which can more accurately reflect students' performance. 2) Under the guidance of grading rubrics, providing accurate and consistent scores for each student, along with customized feedback. 3) Conducting post-grading review to better ensure accuracy and fairness. Additionally, we collected a new dataset named OS from a university operating system course and conducted extensive experiments on both our new dataset and the widely used Mohler dataset. Experiments demonstrate the effectiveness of our proposed approach, providing some new insights for developing automated grading systems based on LLMs.

5/31/2024

cs.AI

Large Language Models as Partners in Student Essay Evaluation

Toru Ishida, Tongxi Liu, Hailong Wang, William K. Cheung

As the importance of comprehensive evaluation in workshop courses increases, there is a growing demand for efficient and fair assessment methods that reduce the workload for faculty members. This paper presents an evaluation conducted with Large Language Models (LLMs) using actual student essays in three scenarios: 1) without providing guidance such as rubrics, 2) with pre-specified rubrics, and 3) through pairwise comparison of essays. Quantitative analysis of the results revealed a strong correlation between LLM and faculty member assessments in the pairwise comparison scenario with pre-specified rubrics, although concerns about the quality and stability of evaluations remained. Therefore, we conducted a qualitative analysis of LLM assessment comments, showing that: 1) LLMs can match the assessment capabilities of faculty members, 2) variations in LLM assessments should be interpreted as diversity rather than confusion, and 3) assessments by humans and LLMs can differ and complement each other. In conclusion, this paper suggests that LLMs should not be seen merely as assistants to faculty members but as partners in evaluation committees and outlines directions for further research.

5/30/2024

cs.CY cs.AI

Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals

Yupei Wang, Renfen Hu, Zhe Zhao

While current automated essay scoring (AES) methods show high agreement with human raters, their scoring mechanisms are not fully explored. Our proposed method, using counterfactual intervention assisted by Large Language Models (LLMs), reveals that when scoring essays, BERT-like models primarily focus on sentence-level features, while LLMs are attuned to conventions, language complexity, as well as organization, indicating a more comprehensive alignment with scoring rubrics. Moreover, LLMs can discern counterfactual interventions during feedback. Our approach improves understanding of neural AES methods and can also apply to other domains seeking transparency in model-driven decisions. The codes and data will be released at GitHub.

5/31/2024

cs.CL