Large Language Models as Partners in Student Essay Evaluation

2405.18632

Published 5/30/2024 by Toru Ishida, Tongxi Liu, Hailong Wang, William K. Cheung


Abstract

As the importance of comprehensive evaluation in workshop courses increases, there is a growing demand for efficient and fair assessment methods that reduce the workload for faculty members. This paper presents an evaluation conducted with Large Language Models (LLMs) using actual student essays in three scenarios: 1) without providing guidance such as rubrics, 2) with pre-specified rubrics, and 3) through pairwise comparison of essays. Quantitative analysis of the results revealed a strong correlation between LLM and faculty member assessments in the pairwise comparison scenario with pre-specified rubrics, although concerns about the quality and stability of evaluations remained. Therefore, we conducted a qualitative analysis of LLM assessment comments, showing that: 1) LLMs can match the assessment capabilities of faculty members, 2) variations in LLM assessments should be interpreted as diversity rather than confusion, and 3) assessments by humans and LLMs can differ and complement each other. In conclusion, this paper suggests that LLMs should not be seen merely as assistants to faculty members but as partners in evaluation committees and outlines directions for further research.


Overview

This paper explores the use of Large Language Models (LLMs) for efficient and fair assessment in workshop courses, which have a growing need for comprehensive evaluation methods that reduce faculty workload.
The researchers conducted an evaluation of LLM assessments in three scenarios: without guidance, with pre-specified rubrics, and through pairwise comparison of essays.
The results showed a strong correlation between LLM and faculty assessments in the pairwise comparison scenario with pre-specified rubrics, but also raised concerns about the quality and stability of LLM evaluations.

Plain English Explanation

The paper looks at how Large Language Models (LLMs) could be used to help assess student essays in workshop courses. Workshop courses often require a lot of grading and feedback from professors, which can be time-consuming. The researchers wanted to see if LLMs could help with this process and provide fair, efficient assessments.

They tested the LLMs in three different ways:

1) Without giving the LLMs any guidance or rubrics to follow
2) Providing the LLMs with pre-set rubrics to use
3) Asking the LLMs to compare pairs of essays and assess them relative to each other

The results showed that when the LLMs compared pairs of essays using pre-set rubrics, their assessments correlated strongly with those of human professors. This suggests the LLMs can match the assessment capabilities of the professors in certain scenarios.

However, concerns remained about the quality and consistency of the LLM evaluations, so the researchers analyzed the LLMs' assessment comments more closely. They concluded that variation between LLM assessments is better read as a diversity of perspectives than as confusion, and that LLM and human assessments can differ in ways that complement each other.

Overall, the paper suggests that LLMs shouldn't just be seen as assistants to professors, but as partners in the evaluation process. The researchers outline areas for further research to continue exploring how LLMs can be effectively used for assessment in education.

Technical Explanation

The paper presents an evaluation of using Large Language Models (LLMs) to assess student essays in workshop courses. The researchers conducted the evaluation in three scenarios:

1) Without Guidance: LLMs were asked to assess essays without any provided rubrics or guidance.
2) With Pre-Specified Rubrics: LLMs were given pre-determined rubrics to use in their assessments.
3) Pairwise Comparison: LLMs were asked to compare pairs of essays and assess them relative to each other.
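The three scenarios differ mainly in what the model is shown. As a rough illustration only, the sketch below builds one prompt per scenario. The prompt wording, the 1-to-5 scale, and the function names `build_single_prompt` and `build_pairwise_prompt` are assumptions made for this example; the summary does not give the paper's actual prompts.

```python
from typing import Optional


def build_single_prompt(essay: str, rubric: Optional[str] = None) -> str:
    """Prompt for scoring one essay.

    rubric=None corresponds to scenario 1 (no guidance);
    passing a rubric corresponds to scenario 2.
    The wording and the 1-5 scale are illustrative assumptions.
    """
    parts = ["You are evaluating a student essay from a workshop course."]
    if rubric is not None:
        parts.append("Use the following rubric:\n" + rubric)
    parts.append("Essay:\n" + essay)
    parts.append("Give a score from 1 to 5 and a short comment.")
    return "\n\n".join(parts)


def build_pairwise_prompt(
    essay_a: str, essay_b: str, rubric: Optional[str] = None
) -> str:
    """Prompt for scenario 3: compare two essays, optionally with a rubric."""
    parts = ["You are comparing two student essays from a workshop course."]
    if rubric is not None:
        parts.append("Use the following rubric:\n" + rubric)
    parts.append("Essay A:\n" + essay_a)
    parts.append("Essay B:\n" + essay_b)
    parts.append("Answer with 'A' or 'B' and a short justification.")
    return "\n\n".join(parts)
```

Passing a rubric to `build_pairwise_prompt` corresponds to the combination of pairwise comparison and pre-specified rubrics.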

Quantitative analysis revealed a strong correlation between LLM and faculty assessments in the pairwise comparison scenario with pre-specified rubrics. However, concerns remained about the quality and stability of the LLM evaluations.
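One way such a comparison could be computed is sketched below: pairwise preferences are turned into win counts per essay, and the win counts are then rank-correlated with faculty scores. This is an assumption-laden illustration, not the paper's procedure. The summary does not say which correlation statistic the authors used (Spearman's rank correlation is chosen here), and `prefer` is a hypothetical stand-in for an LLM call that returns the preferred essay's id.

```python
from itertools import combinations
from math import sqrt


def win_counts(essay_ids, prefer):
    """Count how often each essay is preferred across all pairs.

    `prefer(a, b)` is a hypothetical callable returning the id of the
    preferred essay; in practice it would wrap an LLM call.
    """
    wins = {e: 0 for e in essay_ids}
    for a, b in combinations(essay_ids, 2):
        wins[prefer(a, b)] += 1
    return wins


def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Given a list `ids` of essay ids and a dict `faculty` of faculty scores, `wins = win_counts(ids, prefer)` followed by `spearman([wins[e] for e in ids], [faculty[e] for e in ids])` yields a single agreement figure between -1 and 1.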

The researchers then conducted a qualitative analysis of the LLM assessment comments, which showed that:

1) LLMs can match the assessment capabilities of faculty members when provided with appropriate guidance.
2) Variations in LLM assessments should be interpreted as diversity rather than confusion.
3) Assessments by humans and LLMs can differ and complement each other.

Critical Analysis

The paper presents a promising approach to leveraging LLMs for automated essay scoring, but also highlights important limitations and areas for further research.

One key limitation is the concern about the quality and stability of the LLM evaluations, even in the pairwise comparison scenario with rubrics. This suggests that more work is needed to improve the consistency and reliability of LLM-based assessment.
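As a minimal sketch of how stability might be examined, assume each essay is scored several times by the same LLM and the scores are collected per essay. The function names, the data layout, and the `max_std` threshold below are all assumptions for illustration and do not come from the paper.

```python
from statistics import mean, pstdev


def score_spread(runs_by_essay):
    """Map essay id -> (mean score, population std dev) over repeated runs."""
    return {e: (mean(s), pstdev(s)) for e, s in runs_by_essay.items()}


def unstable_essays(runs_by_essay, max_std=0.5):
    """Return ids of essays whose repeated scores vary more than max_std.

    max_std is an arbitrary illustrative threshold, not a recommended value.
    """
    spread = score_spread(runs_by_essay)
    return sorted(e for e, (_, sd) in spread.items() if sd > max_std)
```

In line with the paper's reading of variation as diversity rather than confusion, essays flagged this way are not necessarily scoring errors; they may simply be the ones whose LLM comments are most worth reading side by side.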

Additionally, the paper does not address potential biases or fairness issues that may arise from using LLMs for assessment. As previous research has shown, LLMs can exhibit biases that could unfairly impact student evaluations.

Further research is needed to explore how LLMs can be effectively integrated into the feedback and assessment process, rather than just replacing human evaluators. The complementary nature of human and LLM assessments highlighted in the paper suggests that a hybrid approach may be most beneficial.

Conclusion

This paper explores the potential of Large Language Models (LLMs) for efficient and fair assessment in workshop courses, where comprehensive evaluation is increasingly important. The results suggest that LLMs can match the assessment capabilities of faculty members in certain scenarios, but they also highlight concerns about the quality and stability of the LLM evaluations.

The qualitative analysis provides valuable insights, suggesting that LLMs should be viewed as partners in the evaluation process rather than just assistants. Further research is needed to address issues of consistency, reliability, and fairness, as well as to explore how LLMs can be effectively integrated into the broader assessment and feedback ecosystem in education.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Analyzing Large Language Models for Classroom Discussion Assessment

Nhat Tran, Benjamin Pierce, Diane Litman, Richard Correnti, Lindsay Clare Matsumura

Automatically assessing classroom discussion quality is becoming increasingly feasible with the help of new NLP advancements such as large language models (LLMs). In this work, we examine how the assessment performance of 2 LLMs interacts with 3 factors that may affect performance: task formulation, context length, and few-shot examples. We also explore the computational efficiency and predictive consistency of the 2 LLMs. Our results suggest that the 3 aforementioned factors do affect the performance of the tested LLMs and there is a relation between consistency and performance. We recommend an LLM-based assessment approach that has a good balance in terms of predictive performance, computational efficiency, and consistency.

6/14/2024

cs.CL


Apprentices to Research Assistants: Advancing Research with Large Language Models

M. Namvarpour, A. Razi

Large Language Models (LLMs) have emerged as powerful tools in various research domains. This article examines their potential through a literature review and firsthand experimentation. While LLMs offer benefits like cost-effectiveness and efficiency, challenges such as prompt tuning, biases, and subjectivity must be addressed. The study presents insights from experiments utilizing LLMs for qualitative analysis, highlighting successes and limitations. Additionally, it discusses strategies for mitigating challenges, such as prompt optimization techniques and leveraging human expertise. This study aligns with the 'LLMs as Research Tools' workshop's focus on integrating LLMs into HCI data work critically and ethically. By addressing both opportunities and challenges, our work contributes to the ongoing dialogue on their responsible application in research.

4/10/2024

cs.HC cs.AI cs.LG

Grade Like a Human: Rethinking Automated Assessment with Large Language Models

Wenjing Xie, Juxin Niu, Chun Jason Xue, Nan Guan

While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a particular step in the grading procedure: grading using predefined rubrics. However, grading is a multifaceted procedure that encompasses other crucial steps, such as grading rubrics design and post-grading review. There has been a lack of systematic research exploring the potential of LLMs to enhance the entire grading process. In this paper, we propose an LLM-based grading system that addresses the entire grading procedure, including the following key components: 1) Developing grading rubrics that not only consider the questions but also the student answers, which can more accurately reflect students' performance. 2) Under the guidance of grading rubrics, providing accurate and consistent scores for each student, along with customized feedback. 3) Conducting post-grading review to better ensure accuracy and fairness. Additionally, we collected a new dataset named OS from a university operating system course and conducted extensive experiments on both our new dataset and the widely used Mohler dataset. Experiments demonstrate the effectiveness of our proposed approach, providing some new insights for developing automated grading systems based on LLMs.

5/31/2024

cs.AI

Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs

Changrong Xiao, Wenxing Ma, Qingping Song, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, Qi Fu

Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable consistency, generalizability, and explainability. We propose an open-source LLM-based AES system, inspired by the dual-process theory. Our system offers accurate grading and high-quality feedback, at least comparable to that of fine-tuned proprietary LLMs, in addition to its ability to alleviate misgrading. Furthermore, we conduct human-AI co-grading experiments with both novice and expert graders. We find that our system not only automates the grading process but also enhances the performance and efficiency of human graders, particularly for essays where the model has lower confidence. These results highlight the potential of LLMs to facilitate effective human-AI collaboration in the educational context, potentially transforming learning experiences through AI-generated feedback.

6/18/2024

cs.CL cs.AI