Can LLMs Grade Short-Answer Reading Comprehension Questions : An Empirical Study with a Novel Dataset

Read original: arXiv:2310.18373 - Published 5/7/2024 by Owen Henkel, Libby Hills, Bill Roberts, Joshua McGrane

Overview

Open-ended questions that require students to provide detailed responses are a valuable tool for formative assessment, as they give teachers a better understanding of what students know and don't know.
However, grading these open-ended questions can be time-consuming, leading teachers to use simpler question formats or conduct fewer formative assessments.
Researchers have long been interested in automating the grading of short-answer responses (see also "Can Large Language Models Make the Grade?"), but previous approaches have been technically complex, limiting their use in formative assessment contexts.
The newest generation of Large Language Models (LLMs) may make grading short-answer questions more feasible.

Plain English Explanation

This research paper explores the potential of using the latest LLMs, such as GPT-4, to automatically grade short-answer questions for formative assessments. Formative assessments are tests or quizzes that teachers use throughout the learning process to gauge what students have understood and where they may be struggling.

Open-ended questions, where students have to provide multi-word answers, are a common type of formative assessment. These questions give teachers more detailed insights into students' knowledge compared to simple multiple-choice or true/false questions. However, grading these open-ended responses can be very time-consuming for teachers.

Researchers have tried to solve this problem by developing systems to automatically grade short-answer questions, but these systems have often been technically complex, making them difficult for teachers to use.

The researchers in this paper believed that the newest generation of powerful language models, like GPT-4, might be able to grade short-answer questions more effectively and with less complexity. To test this, they:

Created a new dataset of short-answer reading comprehension questions from students in Ghana, providing a new context to evaluate LLMs beyond the typical data from high-income North American countries.
Tested how well various LLM configurations could grade the short-answer responses compared to expert human raters.

The results showed that GPT-4, with minimal prompting, was able to grade the short-answer responses extremely well, nearly matching the performance of expert human raters. This suggests that advanced LLMs could potentially be used by teachers to efficiently grade formative assessments, saving time and allowing them to gather more detailed insights into student learning.

Technical Explanation

The researchers first created a novel dataset of short-answer reading comprehension questions. The dataset was drawn from reading assessments conducted with over 150 students in Ghana, a context that differs from the high-income North American data on which LLMs are predominantly trained.

Next, the researchers empirically evaluated how well various configurations of generative LLMs, such as GPT-4, could grade the student short-answer responses compared to expert human raters. They used metrics like Quadratic Weighted Kappa (QWK) and F1 score to measure the performance of the LLMs.
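To illustrate what grading with minimal prompting might look like, here is a rough sketch. It is not the prompt or code used in the paper: the prompt wording, the binary 0/1 scale, and the names `build_grading_prompt` and `grade_answer` are all assumptions made for illustration. The model call is passed in as a plain function, so no particular LLM API is implied.

```python
from typing import Callable, Optional


def build_grading_prompt(passage: str, question: str, answer: str) -> str:
    """Assemble a zero-shot grading prompt (hypothetical wording, not the paper's)."""
    return (
        "You are grading a short-answer reading comprehension question.\n"
        f"Passage: {passage}\n"
        f"Question: {question}\n"
        f"Student answer: {answer}\n"
        "Reply with a single digit: 1 if the answer is correct, 0 if it is not."
    )


def grade_answer(
    passage: str,
    question: str,
    answer: str,
    llm: Callable[[str], str],
) -> Optional[int]:
    """Send the prompt to `llm` and parse its reply.

    `llm` is any function that maps a prompt string to a reply string.
    Returns 0 or 1, or None when the reply is not a clean 0/1 label,
    so that unparseable replies can be routed to a human rater.
    """
    reply = llm(build_grading_prompt(passage, question, answer)).strip()
    if reply in ("0", "1"):
        return int(reply)
    return None
```

Returning `None` instead of guessing keeps ambiguous model replies out of the agreement statistics, which seems a reasonable choice when the grades are later compared with those of human raters.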

The findings showed that GPT-4, with minimal prompt engineering, performed extremely well on grading the novel dataset, reaching a QWK of 0.92 and an F1 score of 0.89. This performance was near-parity with expert human raters.
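To make the two reported metrics concrete, the sketch below computes Quadratic Weighted Kappa and an F1 score in plain Python. It follows the standard definitions of these metrics. The integer label encoding, the binary form of F1, and the toy label lists are assumptions for illustration, since this summary does not say which F1 variant or grading scale the paper used.

```python
from collections import Counter
from typing import Sequence


def quadratic_weighted_kappa(
    rater_a: Sequence[int], rater_b: Sequence[int], num_classes: int
) -> float:
    """Agreement between two raters on ordinal labels 0..num_classes-1.

    Disagreements are penalised by the squared distance between labels,
    then compared with the disagreement expected by chance.
    """
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint label counts
    hist_a = Counter(rater_a)                  # marginal counts, rater A
    hist_b = Counter(rater_b)                  # marginal counts, rater B

    disagreement = 0.0
    chance_disagreement = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            disagreement += weight * observed[(i, j)] / n
            chance_disagreement += weight * hist_a[i] * hist_b[j] / (n * n)

    if chance_disagreement == 0:
        # Both raters used one and the same label throughout; kappa is undefined.
        raise ValueError("kappa is undefined when no disagreement is possible")
    return 1.0 - disagreement / chance_disagreement


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 for the `positive` class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up labels (not data from the paper).
human = [0, 1, 2, 2, 1, 0]
model = [0, 1, 2, 1, 1, 0]
print(quadratic_weighted_kappa(human, model, num_classes=3))
print(binary_f1([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))
```

A QWK of 1.0 means perfect agreement and 0 means agreement no better than chance, so the reported 0.92 indicates that GPT-4's grades almost always matched the expert grades or differed from them only slightly.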

Critical Analysis

The researchers acknowledge that their dataset, while novel, is relatively small, and further evaluation on larger and more diverse datasets would be beneficial. Additionally, they note that the performance of the LLMs may depend on the specific prompting and fine-tuning approaches used, and more research is needed to explore the optimal configurations for automated short-answer grading (ASAG) tasks.

One potential concern not addressed in the paper is the risk of LLMs perpetuating biases present in the training data, which could lead to unfair or inaccurate grading for certain student populations. Further research is needed to explore the effectiveness of LLMs as annotators and ensure their fairness and reliability in educational assessment contexts.

Conclusion

This research provides promising evidence that the newest generation of LLMs, such as GPT-4, could be effectively used to grade short-answer questions for formative assessments, potentially saving teachers time and allowing them to gather more detailed insights into student learning. While further research is needed to address potential limitations and concerns, this work represents an important step forward in the automated grading of proficiency-based assessments using advanced language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Can LLMs Grade Short-Answer Reading Comprehension Questions : An Empirical Study with a Novel Dataset

Owen Henkel, Libby Hills, Bill Roberts, Joshua McGrane

Open-ended questions, which require students to produce multi-word, nontrivial responses, are a popular tool for formative assessment as they provide more specific insights into what students do and don't know. However, grading open-ended questions can be time-consuming, leading teachers to resort to simpler question formats or conduct fewer formative assessments. While there has been longstanding interest in automated short-answer grading (ASAG), previous approaches have been technically complex, limiting their use in formative assessment contexts. The newest generation of Large Language Models (LLMs) potentially makes grading short answer questions more feasible. This paper investigates the potential for the newest version of LLMs to be used in ASAG, specifically in the grading of short answer questions for formative assessments, in two ways. First, it introduces a novel dataset of short answer reading comprehension questions, drawn from a set of reading assessments conducted with over 150 students in Ghana. This dataset allows for the evaluation of LLMs in a new context, as they are predominantly designed and trained on data from high-income North American countries. Second, the paper empirically evaluates how well various configurations of generative LLMs grade student short answer responses compared to expert human raters. The findings show that GPT-4, with minimal prompt engineering, performed extremely well on grading the novel dataset (QWK 0.92, F1 0.89), reaching near parity with expert human raters. To our knowledge, this work is the first to empirically evaluate the performance of generative LLMs on short answer reading comprehension questions using real student data, with low technical hurdles to attaining this performance. These findings suggest that generative LLMs could be used to grade formative literacy assessment tasks.

5/7/2024

Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education

Owen Henkel, Adam Boxer, Libby Hills, Bill Roberts

This paper reports on a series of experiments with a novel dataset evaluating how well Large Language Models (LLMs) can mark (i.e. grade) open text responses to short answer questions. Specifically, we explore how well different combinations of GPT version and prompt engineering strategies performed at marking real student answers to short answer questions across different domain areas (Science and History) and grade levels (spanning ages 5-16) using a new, never-used-before dataset from Carousel, a quizzing platform. We found that GPT-4, with basic few-shot prompting, performed well (Kappa, 0.70) and, importantly, very close to human-level performance (0.75). This research builds on prior findings that GPT-4 could reliably score short answer reading comprehension questions at a performance level very close to that of expert human raters. The proximity to human-level performance across a variety of subjects and grade levels suggests that LLMs could be a valuable tool for supporting low-stakes formative assessment tasks in K-12 education and has important implications for real-world education delivery.

5/7/2024

Towards LLM-based Autograding for Short Textual Answers

Johannes Schneider, Bernd Schenk, Christina Niklaus

Grading exams is an important, labor-intensive, subjective, repetitive, and frequently challenging task. The feasibility of autograding textual responses has greatly increased thanks to the availability of large language models (LLMs) such as ChatGPT and the substantial influx of data brought about by digitalization. However, entrusting AI models with decision-making roles raises ethical considerations, mainly stemming from potential biases and issues related to generating false information. Thus, in this manuscript, we provide an evaluation of a large language model for the purpose of autograding, while also highlighting how LLMs can support educators in validating their grading procedures. Our evaluation is targeted towards automatic short textual answers grading (ASAG), spanning various languages and examinations from two distinct courses. Our findings suggest that while out-of-the-box LLMs provide a valuable tool to provide a complementary perspective, their readiness for independent automated grading remains a work in progress, necessitating human oversight.

7/9/2024

Evaluating Students' Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large

Jussi S. Jauhiainen, Agustín Garagorry Guerra

Evaluating open-ended written examination responses from students is an essential yet time-intensive task for educators, requiring a high degree of effort, consistency, and precision. Recent developments in Large Language Models (LLMs) present a promising opportunity to balance the need for thorough evaluation with efficient use of educators' time. In our study, we explore the effectiveness of the LLMs ChatGPT-3.5, ChatGPT-4, Claude-3, and Mistral-Large in assessing university students' open-ended answers to questions about reference material they have studied. Each model was instructed to evaluate 54 answers repeatedly under two conditions: 10 times (10-shot) with a temperature setting of 0.0 and 10 times with a temperature of 0.5, for an expected total of 1,080 evaluations per model and 4,320 evaluations across all models. The RAG (Retrieval Augmented Generation) framework was used to have the LLMs evaluate the answers. As of spring 2024, our analysis revealed notable variations in consistency and in the grading outcomes provided by the studied LLMs. There is a need to comprehend the strengths and weaknesses of LLMs in educational settings for evaluating open-ended written responses. Further comparative research is essential to determine the accuracy and cost-effectiveness of using LLMs for educational assessments.

5/10/2024