Towards LLM-based Autograding for Short Textual Answers

Read original: arXiv:2309.11508 - Published 7/9/2024 by Johannes Schneider, Bernd Schenk, Christina Niklaus

Overview

Grading exams is a challenging task that involves subjective evaluation of textual responses
The increasing availability of large language models (LLMs) like ChatGPT has made automating this process more feasible
However, using AI for automated grading raises ethical concerns around potential biases and the generation of false information

Plain English Explanation

Grading exams is a crucial but labor-intensive job for educators. It involves carefully reviewing and scoring students' written responses, which can be subjective and repetitive. The recent advancements in large language models have made it possible to automate this process to some degree. These AI models can analyze and evaluate textual answers, potentially saving teachers time and effort.

However, relying on AI for such high-stakes decisions also raises ethical questions. There are concerns that these models may exhibit biases or generate incorrect information that could unfairly impact students' grades. Educators need to carefully consider the reliability and limitations of these AI tools before entrusting them with independent grading.

The researchers in this study aim to evaluate the use of LLMs for automated short answer grading across different languages and courses. Their findings suggest that while these models can provide a helpful complementary perspective, human oversight is still crucial to ensure fair and accurate grading.

Technical Explanation

The researchers evaluated a large language model (LLM) on automated short answer grading (ASAG), that is, the automatic grading of short textual responses. They tested the out-of-the-box model on exam responses spanning various languages and examinations from two distinct courses.
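
To make the setup concrete, here is a minimal sketch of how a short answer could be packaged for an out-of-the-box LLM and how a numeric grade could be read back from its reply. The prompt wording, the `Score: <number>` reply format, and the function names `build_grading_prompt` and `parse_score` are all assumptions made for illustration; the summary does not describe the prompt the researchers used, and no model API is called here.

```python
import re
from typing import Optional


def build_grading_prompt(question: str, reference_answer: str,
                         student_answer: str, max_points: int) -> str:
    """Assemble a plain-text grading prompt (hypothetical wording, not the paper's)."""
    return (
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Student answer: {student_answer}\n"
        f"Grade the student answer from 0 to {max_points} points.\n"
        "Reply in the form 'Score: <number>' followed by a short justification."
    )


def parse_score(reply: str, max_points: int) -> Optional[float]:
    """Extract the score from a model reply; None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        return None  # The model did not follow the assumed reply format.
    score = float(match.group(1))
    return score if 0 <= score <= max_points else None
```

Returning `None` for replies that cannot be parsed or fall outside the point range, rather than guessing a grade, keeps such cases visible so that a human grader can handle them.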

The study found that while LLMs can provide a valuable supplementary tool to support educators in validating their grading procedures, the models are not yet ready for fully independent automated grading. There are still limitations and concerns around potential biases and the generation of false information that require human oversight.

The researchers highlight the need for carefully evaluating the capabilities and shortcomings of these AI models before deploying them in high-stakes educational settings. Collaborative human-AI approaches may be a more viable solution, where the models assist and augment the human grading process rather than replacing it entirely.
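
One way such a collaborative workflow might look in practice is to let the model produce a second opinion and route only the disagreements back to the human grader. The sketch below assumes that both scores are already available; the `GradedAnswer` type, the `flag_for_review` function, and the tolerance of one point are hypothetical choices for illustration, not details from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradedAnswer:
    answer_id: str
    human_score: float
    llm_score: float


def flag_for_review(answers: List[GradedAnswer],
                    tolerance: float = 1.0) -> List[str]:
    """Return IDs where human and LLM scores differ by more than `tolerance`."""
    return [a.answer_id for a in answers
            if abs(a.human_score - a.llm_score) > tolerance]


answers = [
    GradedAnswer("a1", human_score=4.0, llm_score=4.0),
    GradedAnswer("a2", human_score=2.0, llm_score=4.5),
    GradedAnswer("a3", human_score=5.0, llm_score=4.0),
]
print(flag_for_review(answers))  # ['a2']
```

In this arrangement the human score remains the grade of record; the model's score only decides which answers get a second look.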

Critical Analysis

The researchers acknowledge the significant limitations of the current state of LLMs for automated grading and the need for continued human oversight. They highlight the potential for these models to exhibit biases and generate false information, which could have serious consequences in educational settings.

While the study provides a valuable evaluation of LLM performance in ASAG, it is limited to a specific set of languages and to examinations from two courses. Further research is needed to assess whether the findings generalize to a broader range of educational contexts.

Additionally, the paper does not delve deeply into the ethical implications of using AI for high-stakes decision-making in education. It would be beneficial for the research community to further explore the moral and social considerations surrounding the deployment of such technologies, including issues of transparency, accountability, and the potential for exacerbating existing inequities.

Conclusion

This study offers a cautious assessment of the current capabilities of large language models for automated short answer grading. While the models show promise as a complementary tool to assist educators, they are not yet ready to replace human graders entirely due to concerns around biases and the generation of false information.

The researchers emphasize the need for continued human oversight and the importance of carefully evaluating the strengths and limitations of these AI systems before deploying them in high-stakes educational settings. Collaborative human-AI approaches may be a more viable solution, leveraging the strengths of both human and machine intelligence to enhance the grading process.

As large language models continue to evolve, the research community will need to navigate the complex ethical and practical considerations surrounding their use in education and other domains. Ongoing evaluation and thoughtful deployment of these technologies will be crucial to ensure they are used in a responsible and beneficial manner.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Towards LLM-based Autograding for Short Textual Answers

Johannes Schneider, Bernd Schenk, Christina Niklaus

Grading exams is an important, labor-intensive, subjective, repetitive, and frequently challenging task. The feasibility of autograding textual responses has greatly increased thanks to the availability of large language models (LLMs) such as ChatGPT and the substantial influx of data brought about by digitalization. However, entrusting AI models with decision-making roles raises ethical considerations, mainly stemming from potential biases and issues related to generating false information. Thus, in this manuscript, we provide an evaluation of a large language model for the purpose of autograding, while also highlighting how LLMs can support educators in validating their grading procedures. Our evaluation is targeted towards automatic short textual answers grading (ASAG), spanning various languages and examinations from two distinct courses. Our findings suggest that while out-of-the-box LLMs provide a valuable tool to provide a complementary perspective, their readiness for independent automated grading remains a work in progress, necessitating human oversight.

7/9/2024

Can LLMs Grade Short-Answer Reading Comprehension Questions: An Empirical Study with a Novel Dataset

Owen Henkel, Libby Hills, Bill Roberts, Joshua McGrane

Open-ended questions, which require students to produce multi-word, nontrivial responses, are a popular tool for formative assessment as they provide more specific insights into what students do and don't know. However, grading open-ended questions can be time-consuming, leading teachers to resort to simpler question formats or conduct fewer formative assessments. While there has been longstanding interest in automated short-answer grading (ASAG), previous approaches have been technically complex, limiting their use in formative assessment contexts. The newest generation of Large Language Models (LLMs) potentially makes grading short answer questions more feasible. This paper investigates the potential for the newest version of LLMs to be used in ASAG, specifically in the grading of short answer questions for formative assessments, in two ways. First, it introduces a novel dataset of short answer reading comprehension questions, drawn from a set of reading assessments conducted with over 150 students in Ghana. This dataset allows for the evaluation of LLMs in a new context, as they are predominantly designed and trained on data from high-income North American countries. Second, the paper empirically evaluates how well various configurations of generative LLMs grade student short answer responses compared to expert human raters. The findings show that GPT-4, with minimal prompt engineering, performed extremely well on grading the novel dataset (QWK 0.92, F1 0.89), reaching near parity with expert human raters. To our knowledge this work is the first to empirically evaluate the performance of generative LLMs on short answer reading comprehension questions using real student data, with low technical hurdles to attaining this performance. These findings suggest that generative LLMs could be used to grade formative literacy assessment tasks.

5/7/2024

Grade Like a Human: Rethinking Automated Assessment with Large Language Models

Wenjing Xie, Juxin Niu, Chun Jason Xue, Nan Guan

While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a particular step in the grading procedure: grading using predefined rubrics. However, grading is a multifaceted procedure that encompasses other crucial steps, such as grading rubrics design and post-grading review. There has been a lack of systematic research exploring the potential of LLMs to enhance the entire grading process. In this paper, we propose an LLM-based grading system that addresses the entire grading procedure, including the following key components: 1) Developing grading rubrics that not only consider the questions but also the student answers, which can more accurately reflect students' performance. 2) Under the guidance of grading rubrics, providing accurate and consistent scores for each student, along with customized feedback. 3) Conducting post-grading review to better ensure accuracy and fairness. Additionally, we collected a new dataset named OS from a university operating system course and conducted extensive experiments on both our new dataset and the widely used Mohler dataset. Experiments demonstrate the effectiveness of our proposed approach, providing some new insights for developing automated grading systems based on LLMs.

5/31/2024

Can Large Language Models Automatically Score Proficiency of Written Essays?

Watheq Mansour, Salam Albatarni, Sohaila Eltanbouly, Tamer Elsayed

Although several methods were proposed to address the problem of automated essay scoring (AES) in the last 50 years, there is still much to desire in terms of effectiveness. Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks. In this paper, we test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays. We experimented with two popular LLMs, namely ChatGPT and Llama. We aim to check if these models can do this task and, if so, how their performance is positioned among the state-of-the-art (SOTA) models across two levels, holistically and per individual writing trait. We utilized prompt-engineering tactics in designing four different prompts to bring their maximum potential to this task. Our experiments conducted on the ASAP dataset revealed several interesting observations. First, choosing the right prompt depends highly on the model and nature of the task. Second, the two LLMs exhibited comparable average performance in AES, with a slight advantage for ChatGPT. Finally, despite the performance gap between the two LLMs and SOTA models in terms of predictions, they provide feedback to enhance the quality of the essays, which can potentially help both teachers and students.

4/17/2024