Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals

Read original: arXiv:2405.19433 - Published 5/31/2024 by Yupei Wang, Renfen Hu, Zhe Zhao

Overview

Current automated essay scoring (AES) methods show high agreement with human raters, but their scoring mechanisms are not fully explored.
The proposed method uses counterfactual intervention assisted by Large Language Models (LLMs) to reveal that BERT-like models primarily focus on sentence-level features, while LLMs are attuned to conventions, language complexity, and organization.
This approach improves understanding of neural AES methods and can also apply to other domains seeking transparency in model-driven decisions.
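
Agreement between an AES system and human raters is often summarized with quadratic weighted kappa (QWK), a chance-corrected statistic that penalizes large score disagreements more than small ones. The summary does not say which agreement metric the paper reports, so the use of QWK here is an assumption, and the function name and signature are illustrative only. A minimal sketch in plain Python:

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Cohen's kappa with quadratic weights for two lists of integer scores."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n_levels = max_score - min_score + 1
    if n_levels < 2:
        raise ValueError("need at least two score levels")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    human_hist = Counter(human)            # marginal counts per rater
    model_hist = Counter(model)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: 0 on the diagonal, 1 at maximum disagreement.
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * human_hist[i] * model_hist[j] / (n * n)

    if weighted_expected == 0:
        # Both raters gave one identical score to every essay.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. A high value says that the scores match, not that the model scored the essay for the same reasons a human would, which is the gap the paper sets out to examine.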

Plain English Explanation

Automated essay scoring (AES) systems are computer programs that can grade essays, often with results that closely match human graders. However, the inner workings of these systems are not always clear. Researchers have developed a new method that uses large language models (LLMs) and "counterfactual interventions" to better understand how AES systems make their decisions.

By making small, deliberate changes to essays and seeing how the scores change, the researchers found that BERT-like models tend to focus mainly on sentence-level features. In contrast, LLMs are sensitive to a broader set of properties: writing conventions, language complexity, and the overall organization of the essay.

This new approach provides a more comprehensive view of how these AES systems work, which could lead to improvements in their transparency and performance. The researchers believe it could also be applied to other areas where machine learning models make important decisions and need to be better understood.

Technical Explanation

The researchers proposed a method using counterfactual intervention assisted by Large Language Models (LLMs) to investigate the scoring mechanisms of automated essay scoring (AES) systems.

They found that BERT-like models, which are commonly used in AES, primarily focus on sentence-level features when scoring essays. In contrast, LLMs are attuned to conventions, language complexity, and organization.

By making small, targeted changes to essays and observing the resulting score changes, the researchers were able to gain insights into how these AES models arrive at their decisions. This approach revealed a more comprehensive alignment between the LLMs' scoring and typical essay scoring rubrics, suggesting that LLMs may be better equipped to provide meaningful feedback to students.
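
The intervene-and-rescore loop described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the paper generates its counterfactuals with LLM assistance, whereas the two interventions here (sentence shuffling for organization, letter swaps for conventions) are simple rule-based stand-ins, and every function name, including the toy scorer built by make_vocab_scorer, is hypothetical.

```python
import random
import re
from typing import Callable, Dict


def split_sentences(text: str) -> list:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def shuffle_sentences(essay: str, seed: int = 0) -> str:
    """Organization intervention: reorder sentences, leave wording intact."""
    sentences = split_sentences(essay)
    random.Random(seed).shuffle(sentences)
    return " ".join(sentences)


def inject_spelling_errors(essay: str, rate: float = 0.3, seed: int = 0) -> str:
    """Conventions intervention: swap two inner letters in some longer words."""
    rng = random.Random(seed)

    def corrupt(match):
        word = match.group(0)
        if len(word) < 5 or rng.random() > rate:
            return word
        i = rng.randrange(1, len(word) - 2)  # keep first and last letters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]

    return re.sub(r"[A-Za-z]+", corrupt, essay)


def score_shifts(
    essay: str,
    scorer: Callable[[str], float],
    interventions: Dict[str, Callable[[str], str]],
) -> Dict[str, float]:
    """Score change caused by each intervention, relative to the original."""
    base = scorer(essay)
    return {name: scorer(edit(essay)) - base for name, edit in interventions.items()}


def make_vocab_scorer(reference_text: str) -> Callable[[str], float]:
    """Toy stand-in for an AES model: fraction of words found in a vocabulary."""
    vocab = set(re.findall(r"[a-z]+", reference_text.lower()))

    def scorer(essay: str) -> float:
        words = re.findall(r"[a-z]+", essay.lower())
        return sum(w in vocab for w in words) / len(words) if words else 0.0

    return scorer


if __name__ == "__main__":
    essay = ("The argument is clear. Evidence supports every claim. "
             "The conclusion follows naturally.")
    shifts = score_shifts(
        essay,
        make_vocab_scorer(essay),
        {
            "organization": shuffle_sentences,
            "conventions": lambda e: inject_spelling_errors(e, rate=1.0),
        },
    )
    print(shifts)
```

The toy scorer only counts known words, so it reacts to misspellings but cannot register a change in sentence order. A scorer whose output does not move under an intervention is, by this diagnostic, not using the feature that the intervention targets. In practice the scorer argument would wrap a real AES model.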

Additionally, the researchers found that LLMs can discern when counterfactual interventions have been made during the feedback process, indicating their potential for providing more transparent and informative scoring.

Critical Analysis

The researchers acknowledge that their study is limited to a specific set of essays and AES systems, and that further research is needed to validate their findings across a wider range of writing samples and model architectures.

Additionally, while the use of counterfactual interventions provides valuable insights, the researchers note that the process of generating and evaluating these interventions can be time-consuming and complex. There may be a need for more efficient or automated methods to make this approach more practical for real-world applications.

It is also worth considering the potential biases or limitations that may be inherent in the training data or model architectures used by the AES systems, and how these factors may influence the scoring decisions. The researchers do not address these issues in depth, but investigating model biases could be an important area for future research.

Conclusion

This study presents a novel approach to understanding the scoring mechanisms of automated essay scoring systems, using counterfactual intervention and Large Language Models. The findings suggest that LLM scoring aligns more comprehensively with typical essay scoring rubrics than the scoring of BERT-like models, and that LLMs may be better placed to give transparent feedback.

The researchers state that their code and data will be released on GitHub, which could encourage further exploration and refinement of these techniques. As machine learning models play an increasingly important role in educational and other decision-making domains, understanding their inner workings will be crucial for ensuring fairness, transparency, and accountability.

Related Papers

Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals

Yupei Wang, Renfen Hu, Zhe Zhao

While current automated essay scoring (AES) methods show high agreement with human raters, their scoring mechanisms are not fully explored. Our proposed method, using counterfactual intervention assisted by Large Language Models (LLMs), reveals that when scoring essays, BERT-like models primarily focus on sentence-level features, while LLMs are attuned to conventions, language complexity, as well as organization, indicating a more comprehensive alignment with scoring rubrics. Moreover, LLMs can discern counterfactual interventions during feedback. Our approach improves understanding of neural AES methods and can also apply to other domains seeking transparency in model-driven decisions. The codes and data will be released at GitHub.

5/31/2024

Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs

Changrong Xiao, Wenxing Ma, Qingping Song, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, Qi Fu

Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable consistency, generalizability, and explainability. We propose an open-source LLM-based AES system, inspired by the dual-process theory. Our system offers accurate grading and high-quality feedback, at least comparable to that of fine-tuned proprietary LLMs, in addition to its ability to alleviate misgrading. Furthermore, we conduct human-AI co-grading experiments with both novice and expert graders. We find that our system not only automates the grading process but also enhances the performance and efficiency of human graders, particularly for essays where the model has lower confidence. These results highlight the potential of LLMs to facilitate effective human-AI collaboration in the educational context, potentially transforming learning experiences through AI-generated feedback.

6/18/2024

Can Large Language Models Automatically Score Proficiency of Written Essays?

Watheq Mansour, Salam Albatarni, Sohaila Eltanbouly, Tamer Elsayed

Although several methods were proposed to address the problem of automated essay scoring (AES) in the last 50 years, there is still much to desire in terms of effectiveness. Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks. In this paper, we test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays. We experimented with two popular LLMs, namely ChatGPT and Llama. We aim to check if these models can do this task and, if so, how their performance is positioned among the state-of-the-art (SOTA) models across two levels, holistically and per individual writing trait. We utilized prompt-engineering tactics in designing four different prompts to bring their maximum potential to this task. Our experiments conducted on the ASAP dataset revealed several interesting observations. First, choosing the right prompt depends highly on the model and nature of the task. Second, the two LLMs exhibited comparable average performance in AES, with a slight advantage for ChatGPT. Finally, despite the performance gap between the two LLMs and SOTA models in terms of predictions, they provide feedback to enhance the quality of the essays, which can potentially help both teachers and students.

4/17/2024

Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition

Seungju Kim, Meounggun Jo

Large Language Models (LLMs) have shown promise in Automated Essay Scoring (AES), but their zero-shot and few-shot performance often falls short compared to state-of-the-art models and human raters. However, fine-tuning LLMs for each specific task is impractical due to the variety of essay prompts and rubrics used in real-world educational contexts. This study proposes a novel approach combining LLMs and Comparative Judgment (CJ) for AES, using zero-shot prompting to choose between two essays. We demonstrate that a CJ method surpasses traditional rubric-based scoring in essay scoring using LLMs.

7/9/2024