Can GPT-4 do L2 analytic assessment?

Read original: arXiv:2404.18557 - Published 4/30/2024 by Stefano Bannò, Hari Krishna Vydana, Kate M. Knill, Mark J. F. Gales

Overview

Automated essay scoring (AES) has been used for decades to evaluate second language (L2) proficiency in educational contexts.
While holistic scoring in AES has advanced to match or exceed human performance, analytic scoring still faces issues inherited from the human scoring process.
The recent introduction of large language models presents new opportunities for automating the evaluation of specific aspects of L2 writing proficiency.

Plain English Explanation

Automated essay scoring (AES) has long been used in education to assess how well people write in a second language (L2). Its holistic scores, a single overall grade per essay, now match or exceed those given by human experts. The more detailed "analytic" scores, which rate individual aspects of writing proficiency separately, remain problematic because they inherit limitations from the human scoring process.

The development of large language models, such as GPT-4, has opened up new possibilities for automatically evaluating specific components of L2 writing ability. In this study, the researchers wanted to see if they could use GPT-4 in a "zero-shot" way (without any additional training) to extract detailed information about the underlying analytic components of L2 writing proficiency, based on a publicly available dataset that already had holistic scores.

Technical Explanation

The researchers performed a series of experiments using the GPT-4 large language model in a zero-shot fashion on a publicly available dataset. This dataset contained essays annotated with holistic scores based on the Common European Framework of Reference for Languages (CEFR).

The goal was to see if GPT-4 could automatically predict analytic scores for different aspects of L2 writing proficiency, without any additional training or fine-tuning. The researchers looked at how the GPT-4-generated analytic scores correlated with various features associated with the individual proficiency components.
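
The zero-shot setup described above can be sketched as a prompt builder plus a reply parser. This is a minimal illustration, not the authors' actual prompt: the aspect names in `ASPECTS`, the prompt wording, and the helper names `build_prompt` and `parse_level` are all assumptions made for this example, and the call to the model itself is deliberately left out.

```python
import re
from typing import Optional

# CEFR defines six levels; mapping them to 1-6 is a convention assumed here.
CEFR_LEVELS = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}

# Hypothetical analytic aspects, for illustration only (not taken from the paper).
ASPECTS = ("grammatical accuracy", "vocabulary range", "coherence and cohesion")


def build_prompt(essay: str, aspect: str) -> str:
    """Return a zero-shot prompt: instructions only, no scored example essays."""
    return (
        "You are an examiner of L2 English writing.\n"
        f"Rate the essay below for {aspect} only, using the CEFR scale "
        "(A1, A2, B1, B2, C1, C2).\n"
        "Answer with the level alone.\n\n"
        f"Essay:\n{essay}"
    )


def parse_level(reply: str) -> Optional[int]:
    """Extract the first CEFR level found in a model reply as a number 1-6."""
    match = re.search(r"\b([ABC][12])\b", reply.upper())
    return CEFR_LEVELS[match.group(1)] if match else None


if __name__ == "__main__":
    # One prompt per aspect; each would be sent to the model separately.
    for aspect in ASPECTS:
        print(build_prompt("My holiday was very good because...", aspect))
        print("-" * 40)
```

Asking for one aspect per prompt keeps each reply easy to parse; whether the study queried aspects separately or together is not stated in this summary.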

The results showed significant correlations between the automatically predicted analytic scores and multiple features linked to the individual writing proficiency components. This suggests that large language models like GPT-4 have the potential to be used for a more detailed, automated evaluation of L2 writing skills, going beyond just holistic scoring.
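
To make the correlation step concrete, here is a small sketch of Spearman rank correlation between predicted analytic scores and one text feature. The choice of Spearman's coefficient is an assumption (the summary does not say which measure was used), and the numbers in the usage section are invented toy values, not results from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Toy data: hypothetical predicted scores (1-6) and a hypothetical
    # per-essay feature such as grammatical errors per 100 words.
    predicted_scores = [2, 3, 3, 4, 5, 6]
    errors_per_100_words = [9.1, 7.4, 6.8, 5.0, 3.2, 1.5]
    print(spearman(predicted_scores, errors_per_100_words))
```

Rank correlation suits ordinal scores such as CEFR levels because it depends only on ordering, not on the spacing between levels. In practice a library routine that also reports a p-value would be needed to judge significance.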

Critical Analysis

The paper presents an interesting exploration of using large language models, such as GPT-4, to automate the assessment of specific aspects of second language writing proficiency. This could be a valuable tool for educational applications, as it could provide more detailed and nuanced feedback to students and teachers.

However, the research is still at a preliminary stage, and the authors acknowledge several limitations. For example, the study was conducted on a single dataset, and because GPT-4 was used zero-shot, with no task-specific training data, results may vary with the characteristics of the essays being assessed. Further research is needed to evaluate the approach across a wider range of datasets and assessment frameworks.

Additionally, the zero-shot approach used in this study may not capture the full potential of large language models for this task. Fine-tuning or other adaptation techniques could improve the model's ability to assess specific proficiency components accurately.

Conclusion

This paper demonstrates the potential of large language models, such as GPT-4, to automate the evaluation of second language writing proficiency in a more detailed, analytic way, going beyond just holistic scoring. The significant correlations between the model's predicted analytic scores and various features of writing proficiency are a promising result.

However, further research is needed to fully understand the capabilities and limitations of this approach, as well as to explore ways to optimize the model's performance. If successful, this technology could lead to more personalized and effective feedback for language learners, ultimately supporting their development of writing skills in a second language.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Can GPT-4 do L2 analytic assessment?

Stefano Bannò, Hari Krishna Vydana, Kate M. Knill, Mark J. F. Gales

Automated essay scoring (AES) to evaluate second language (L2) proficiency has been a firmly established technology used in educational contexts for decades. Although holistic scoring has seen advancements in AES that match or even exceed human performance, analytic scoring still encounters issues as it inherits flaws and shortcomings from the human scoring process. The recent introduction of large language models presents new opportunities for automating the evaluation of specific aspects of L2 writing proficiency. In this paper, we perform a series of experiments using GPT-4 in a zero-shot fashion on a publicly available dataset annotated with holistic scores based on the Common European Framework of Reference and aim to extract detailed information about their underlying analytic components. We observe significant correlations between the automatically predicted analytic scores and multiple features associated with the individual proficiency components.

4/30/2024

Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition

Seungju Kim, Meounggun Jo

Large Language Models (LLMs) have shown promise in Automated Essay Scoring (AES), but their zero-shot and few-shot performance often falls short compared to state-of-the-art models and human raters. However, fine-tuning LLMs for each specific task is impractical due to the variety of essay prompts and rubrics used in real-world educational contexts. This study proposes a novel approach combining LLMs and Comparative Judgment (CJ) for AES, using zero-shot prompting to choose between two essays. We demonstrate that a CJ method surpasses traditional rubric-based scoring in essay scoring using LLMs.

7/9/2024

Can Large Language Models Automatically Score Proficiency of Written Essays?

Watheq Mansour, Salam Albatarni, Sohaila Eltanbouly, Tamer Elsayed

Although several methods were proposed to address the problem of automated essay scoring (AES) in the last 50 years, there is still much to desire in terms of effectiveness. Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks. In this paper, we test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays. We experimented with two popular LLMs, namely ChatGPT and Llama. We aim to check if these models can do this task and, if so, how their performance is positioned among the state-of-the-art (SOTA) models across two levels, holistically and per individual writing trait. We utilized prompt-engineering tactics in designing four different prompts to bring their maximum potential to this task. Our experiments conducted on the ASAP dataset revealed several interesting observations. First, choosing the right prompt depends highly on the model and nature of the task. Second, the two LLMs exhibited comparable average performance in AES, with a slight advantage for ChatGPT. Finally, despite the performance gap between the two LLMs and SOTA models in terms of predictions, they provide feedback to enhance the quality of the essays, which can potentially help both teachers and students.

4/17/2024

Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs

Changrong Xiao, Wenxing Ma, Qingping Song, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, Qi Fu

Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable consistency, generalizability, and explainability. We propose an open-source LLM-based AES system, inspired by the dual-process theory. Our system offers accurate grading and high-quality feedback, at least comparable to that of fine-tuned proprietary LLMs, in addition to its ability to alleviate misgrading. Furthermore, we conduct human-AI co-grading experiments with both novice and expert graders. We find that our system not only automates the grading process but also enhances the performance and efficiency of human graders, particularly for essays where the model has lower confidence. These results highlight the potential of LLMs to facilitate effective human-AI collaboration in the educational context, potentially transforming learning experiences through AI-generated feedback.

6/18/2024