Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education

2405.02985

Published 5/7/2024 by Owen Henkel, Adam Boxer, Libby Hills, Bill Roberts

Abstract

This paper reports on a series of experiments with a novel dataset evaluating how well Large Language Models (LLMs) can mark (i.e. grade) open text responses to short answer questions. Specifically, we explore how well different combinations of GPT version and prompt engineering strategies performed at marking real student answers to short answer questions across different domain areas (Science and History) and grade-levels (spanning ages 5-16) using a new, never-used-before dataset from Carousel, a quizzing platform. We found that GPT-4, with basic few-shot prompting, performed well (Kappa, 0.70) and, importantly, very close to human-level performance (0.75). This research builds on prior findings that GPT-4 could reliably score short answer reading comprehension questions at a performance-level very close to that of expert human raters. The proximity to human-level performance, across a variety of subjects and grade levels, suggests that LLMs could be a valuable tool for supporting low-stakes formative assessment tasks in K-12 education and has important implications for real-world education delivery.

Overview

This paper explores how well Large Language Models (LLMs) can automatically grade open-ended short answer questions across different subject areas and grade levels.
The researchers used a novel dataset from the Carousel quizzing platform to evaluate the performance of various GPT model configurations and prompt engineering strategies.
They found that GPT-4 with basic few-shot prompting performed very close to human-level performance in grading the short answer responses.

Plain English Explanation

The researchers in this study wanted to see how well large language models like GPT-4 could automatically grade or "mark" short answer questions. They used a new dataset of real student responses to short answer questions across science and history topics for different grade levels, from elementary to high school.

The key finding was that GPT-4, with just a simple prompt, was able to grade the short answers almost as well as human experts. This suggests that large language models could be a valuable tool for supporting low-stakes formative assessment in K-12 classrooms, where teachers need to frequently evaluate student understanding but can be overburdened. The proximity to human-level performance across many subjects and grade levels is an important result that builds on prior research showing the potential of these models for this type of task.

Technical Explanation

The researchers conducted a series of experiments using a novel dataset of real student responses to short answer questions from the Carousel quizzing platform. They evaluated how well different configurations of GPT language models, along with various prompt engineering strategies, could grade the open-ended student answers across science and history topics and grade levels spanning ages 5-16.
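As an illustration of what "basic few-shot prompting" can look like for this task, the sketch below assembles a marking prompt from a handful of already-marked examples followed by the answer to be marked. This is not the prompt used in the paper: the `GradedExample` class, the `build_few_shot_prompt` function, the instruction wording, and the binary correct/incorrect marking scheme are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GradedExample:
    """A previously marked answer used as a few-shot example (hypothetical structure)."""
    question: str
    correct_answer: str
    student_answer: str
    mark: str  # assumed binary scheme: "correct" or "incorrect"


def _format_item(question: str, correct_answer: str,
                 student_answer: str, mark: Optional[str]) -> List[str]:
    """Render one question/answer block; leave the mark blank for the item to be graded."""
    mark_line = "Mark:" if mark is None else f"Mark: {mark}"
    return [
        f"Question: {question}",
        f"Reference answer: {correct_answer}",
        f"Student answer: {student_answer}",
        mark_line,
    ]


def build_few_shot_prompt(examples: List[GradedExample], question: str,
                          correct_answer: str, student_answer: str) -> str:
    """Build a prompt: instruction, marked examples, then the unmarked target answer."""
    parts = [
        "You are marking short answer questions. "
        "Reply with 'correct' or 'incorrect'.",
        "",
    ]
    for ex in examples:
        parts.extend(_format_item(ex.question, ex.correct_answer,
                                  ex.student_answer, ex.mark))
        parts.append("")  # blank line between examples
    # The final block ends with an empty "Mark:" for the model to complete.
    parts.extend(_format_item(question, correct_answer, student_answer, None))
    return "\n".join(parts)


if __name__ == "__main__":
    shots = [
        GradedExample("What gas do plants absorb from the air?",
                      "Carbon dioxide", "CO2", "correct"),
        GradedExample("What gas do plants absorb from the air?",
                      "Carbon dioxide", "Oxygen", "incorrect"),
    ]
    print(build_few_shot_prompt(shots, "What is the boiling point of water?",
                                "100 degrees Celsius", "100C"))
```

The resulting string would be sent to the model as the prompt, and the model's one-word completion taken as the mark; how the paper actually selected and formatted its examples is not described in this summary.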

The key finding was that GPT-4, the most advanced model tested, achieved a Kappa score of 0.70 with basic few-shot prompting when grading the short answers, close to the 0.75 achieved by human raters. This level of agreement, across a range of subject areas and grade levels, suggests that LLMs could be a useful tool for supporting low-stakes formative assessment in K-12 education.
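To make the Kappa figures concrete, here is a minimal sketch of how an agreement score between two sets of marks can be computed. It assumes the reported Kappa is Cohen's kappa, which this summary does not state explicitly, and the function name `cohens_kappa` and the example marks are hypothetical. Kappa compares the observed agreement between two raters with the agreement expected by chance given each rater's label frequencies.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters marking the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must mark the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same mark by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability both raters pick the same label independently,
    # based on how often each rater used each label.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical marks (1 = correct, 0 = incorrect), not data from the paper.
    human_marks = [1, 1, 0, 0]
    model_marks = [1, 0, 0, 0]
    print(cohens_kappa(human_marks, model_marks))
```

In this toy example the raters agree on three of four items (0.75 observed) while chance agreement is 0.5, giving a kappa of 0.5. A score of 0.70 for the model against 0.75 for humans therefore describes agreement well above chance in both cases.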

Critical Analysis

The paper acknowledges several limitations of the research, such as the use of a single dataset from one quizzing platform, which may not generalize to all types of short answer questions. Additionally, the researchers note that the human rater performance may not reflect typical grading practices in real-world classrooms.

One potential concern is the risk of LLMs introducing biases or inconsistencies when grading student responses, which could have negative consequences for equity in assessment. Further research is needed to explore the effectiveness and fairness of using LLMs as graders across a wider range of educational contexts.

Conclusion

This research provides promising evidence that large language models can perform at a level close to human experts when grading open-ended short answer questions, across diverse subject areas and grade levels. If these findings hold true in broader applications, it could have significant implications for supporting more frequent, low-stakes formative assessment in K-12 education, potentially freeing up teacher time and resources. However, further research is needed to address potential concerns around bias and fairness in the use of these models for educational assessment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Can LLMs Grade Short-Answer Reading Comprehension Questions : An Empirical Study with a Novel Dataset

Owen Henkel, Libby Hills, Bill Roberts, Joshua McGrane

Open-ended questions, which require students to produce multi-word, nontrivial responses, are a popular tool for formative assessment as they provide more specific insights into what students do and don't know. However, grading open-ended questions can be time-consuming, leading teachers to resort to simpler question formats or conduct fewer formative assessments. While there has been a longstanding interest in automated short-answer grading (ASAG), previous approaches have been technically complex, limiting their use in formative assessment contexts. The newest generation of Large Language Models (LLMs) potentially makes grading short answer questions more feasible. This paper investigates the potential for the newest version of LLMs to be used in ASAG, specifically in the grading of short answer questions for formative assessments, in two ways. First, it introduces a novel dataset of short answer reading comprehension questions, drawn from a set of reading assessments conducted with over 150 students in Ghana. This dataset allows for the evaluation of LLMs in a new context, as they are predominantly designed and trained on data from high-income North American countries. Second, the paper empirically evaluates how well various configurations of generative LLMs grade student short answer responses compared to expert human raters. The findings show that GPT-4, with minimal prompt engineering, performed extremely well on grading the novel dataset (QWK 0.92, F1 0.89), reaching near parity with expert human raters. To our knowledge this work is the first to empirically evaluate the performance of generative LLMs on short answer reading comprehension questions using real student data, with low technical hurdles to attaining this performance. These findings suggest that generative LLMs could be used to grade formative literacy assessment tasks.

5/7/2024

cs.CL cs.AI

Can Large Language Models Automatically Score Proficiency of Written Essays?

Watheq Mansour, Salam Albatarni, Sohaila Eltanbouly, Tamer Elsayed

Although several methods were proposed to address the problem of automated essay scoring (AES) in the last 50 years, there is still much to desire in terms of effectiveness. Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks. In this paper, we test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays. We experimented with two popular LLMs, namely ChatGPT and Llama. We aim to check if these models can do this task and, if so, how their performance is positioned among the state-of-the-art (SOTA) models across two levels, holistically and per individual writing trait. We utilized prompt-engineering tactics in designing four different prompts to bring their maximum potential to this task. Our experiments conducted on the ASAP dataset revealed several interesting observations. First, choosing the right prompt depends highly on the model and nature of the task. Second, the two LLMs exhibited comparable average performance in AES, with a slight advantage for ChatGPT. Finally, despite the performance gap between the two LLMs and SOTA models in terms of predictions, they provide feedback to enhance the quality of the essays, which can potentially help both teachers and students.

4/17/2024

cs.CL cs.AI

Evaluating Students' Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large

Jussi S. Jauhiainen, Agustín Garagorry Guerra

Evaluating open-ended written examination responses from students is an essential yet time-intensive task for educators, requiring a high degree of effort, consistency, and precision. Recent developments in Large Language Models (LLMs) present a promising opportunity to balance the need for thorough evaluation with efficient use of educators' time. In our study, we explore the effectiveness of LLMs ChatGPT-3.5, ChatGPT-4, Claude-3, and Mistral-Large in assessing university students' open-ended answers to questions made about reference material they have studied. Each model was instructed to evaluate 54 answers repeatedly under two conditions: 10 times (10-shot) with a temperature setting of 0.0 and 10 times with a temperature of 0.5, expecting a total of 1,080 evaluations per model and 4,320 evaluations across all models. The RAG (Retrieval Augmented Generation) framework was used to have the LLMs process the evaluation of the answers. As of spring 2024, our analysis revealed notable variations in consistency and the grading outcomes provided by the studied LLMs. There is a need to comprehend the strengths and weaknesses of LLMs in educational settings for evaluating open-ended written responses. Further comparative research is essential to determine the accuracy and cost-effectiveness of using LLMs for educational assessments.

5/10/2024

cs.CL cs.AI

The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation

Maja Pavlovic, Massimo Poesio

Large Language Models (LLMs) have emerged as powerful support tools across various natural language tasks and a range of application domains. Recent studies focus on exploring their capabilities for data annotation. This paper provides a comparative overview of twelve studies investigating the potential of LLMs in labelling data. While the models demonstrate promising cost and time-saving benefits, there exist considerable limitations, such as representativeness, bias, sensitivity to prompt variations and English language preference. Leveraging insights from these studies, our empirical analysis further examines the alignment between human and GPT-generated opinion distributions across four subjective datasets. In contrast to the studies examining representation, our methodology directly obtains the opinion distribution from GPT. Our analysis thereby supports the minority of studies that are considering diverse perspectives when evaluating data annotation tasks and highlights the need for further research in this direction.

5/3/2024

cs.CL cs.AI cs.LG