Automated essay scoring in Arabic: a dataset and analysis of a BERT-based system

Read original: arXiv:2407.11212 - Published 7/17/2024 by Rayed Ghazawi, Edwin Simpson


Overview

This study introduces a new Arabic Automated Essay Scoring (AES) benchmark dataset called AR-AES, which contains 2,046 undergraduate essays from four diverse courses.
The dataset includes gender information, scores, and transparent rubric-based evaluation guidelines, providing insights into the scoring process.
The researchers explore the performance of AraBERT, a pre-trained language model, on different question types for AES.
The study examines the scale of errors made by the BERT-based AES system and compares it to the consistency of human markers.

Plain English Explanation

This research paper focuses on Automated Essay Scoring (AES), which is a technology that can help teachers grade essays more efficiently. The researchers created a new dataset of Arabic essays, called AR-AES, that includes important information like the gender of the writers, the scores the essays received, and the guidelines used to grade them. This provides a comprehensive look at the essay scoring process.

The researchers also tested a pre-trained language model, called AraBERT, to see how well it could grade the essays. They found that the model performed particularly well on essays related to Environmental Chemistry and essays that relied on source materials.

Interestingly, the study looked at the mistakes made by the AI system and compared them to the differences between human graders. They found that the AI system's errors were mostly within one point of the first human grader's score, while the additional human graders often had scores that differed by more than one point from the first grader. This suggests that essay grading can be quite subjective, even for human experts, and that the AI system may be able to grade essays more consistently than people in some cases.

Technical Explanation

The researchers introduced the AR-AES dataset, which contains 2,046 undergraduate essays from four diverse courses, including both traditional and online exams. The dataset includes gender information, scores, and transparent rubric-based evaluation guidelines, providing comprehensive insights into the Arabic essay scoring process.

To explore the performance of AES systems on this dataset, the researchers pioneered the use of AraBERT, a pre-trained Arabic language model. They evaluated AraBERT's performance on different question types, finding encouraging results, particularly for Environmental Chemistry and source-dependent essay questions.

The study also examined the scale of errors made by the BERT-based AES system. On a scale of 1 to 5, 96.15% of its errors were within one point of the first human marker's score, and 79.49% of its predictions matched the first marker exactly. In contrast, additional human markers matched the first marker exactly in no more than 30% of cases, and 62.9% of their scores were within one mark.
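To make these agreement figures concrete, here is a minimal sketch of how exact-match and within-one-point rates can be computed from two lists of scores. The function name `agreement_rates` and the sample scores are hypothetical and not taken from the paper or the AR-AES dataset. Because this summary does not give the paper's exact formulas, the sketch reports both the share of all predictions within one point and the share of errors (non-matching predictions) that are off by exactly one point.

```python
from typing import Dict, Optional, Sequence


def agreement_rates(
    reference: Sequence[int], other: Sequence[int]
) -> Dict[str, Optional[float]]:
    """Compare two sets of integer scores given to the same essays.

    `reference` plays the role of the first human marker; `other` is
    either the model or an additional human marker.
    """
    if len(reference) != len(other):
        raise ValueError("score lists must have the same length")
    if not reference:
        raise ValueError("score lists must not be empty")

    # Absolute score difference for each essay.
    diffs = [abs(r - o) for r, o in zip(reference, other)]
    n = len(diffs)

    # Errors are the essays where the two scores disagree.
    errors = [d for d in diffs if d > 0]

    return {
        # Share of essays where both scores are identical.
        "exact": sum(d == 0 for d in diffs) / n,
        # Share of essays where the scores differ by at most one point.
        "within_one": sum(d <= 1 for d in diffs) / n,
        # Share of disagreements that are off by exactly one point
        # (None when there are no disagreements at all).
        "errors_within_one": (
            sum(d == 1 for d in errors) / len(errors) if errors else None
        ),
    }


# Made-up scores on a 1-5 scale, for illustration only (not AR-AES data).
first_marker = [3, 4, 2, 5, 1, 3, 4, 2]
model_scores = [3, 4, 3, 5, 1, 1, 4, 2]

rates = agreement_rates(first_marker, model_scores)
print(rates)
```

Running the same function on the first marker versus an additional human marker would give the human-to-human figures that the model's agreement is compared against.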

These findings highlight the subjectivity inherent in essay grading and underscore the potential for current AES technology to assist human markers in grading consistently across large classes, as discussed in research on human-AI collaborative essay scoring.

Critical Analysis

The study provides a valuable contribution to the field of Arabic Automated Essay Scoring by introducing a comprehensive dataset and exploring the performance of a pre-trained language model. However, the researchers acknowledge that their dataset is limited to undergraduate essays and may not be representative of all types of Arabic writing.

Additionally, while the BERT-based AES system showed promising results, the researchers did not investigate the system's ability to provide meaningful feedback or diagnose specific writing issues, which are important capabilities for real-world educational applications. Further research is needed to address these limitations and explore the potential of AI-powered essay grading to enhance the learning experience.

Conclusion

This study contributes to Arabic Automated Essay Scoring by introducing a new benchmark dataset, AR-AES, and by evaluating a pre-trained language model, AraBERT, on different question types. The findings suggest that current AES technology has the potential to help human graders provide consistent and timely feedback to students, though more research is needed to realize these benefits in educational settings.



Related Papers


Automated essay scoring in Arabic: a dataset and analysis of a BERT-based system

Rayed Ghazawi, Edwin Simpson

Automated Essay Scoring (AES) holds significant promise in the field of education, helping educators to mark larger volumes of essays and provide timely feedback. However, Arabic AES research has been limited by the lack of publicly available essay data. This study introduces AR-AES, an Arabic AES benchmark dataset comprising 2046 undergraduate essays, including gender information, scores, and transparent rubric-based evaluation guidelines, providing comprehensive insights into the scoring process. These essays come from four diverse courses, covering both traditional and online exams. Additionally, we pioneer the use of AraBERT for AES, exploring its performance on different question types. We find encouraging results, particularly for Environmental Chemistry and source-dependent essay questions. For the first time, we examine the scale of errors made by a BERT-based AES system, observing that 96.15 percent of the errors are within one point of the first human marker's prediction, on a scale of one to five, with 79.49 percent of predictions matching exactly. In contrast, additional human markers did not exceed 30 percent exact matches with the first marker, with 62.9 percent within one mark. These findings highlight the subjectivity inherent in essay grading, and underscore the potential for current AES technology to assist human markers to grade consistently across large classes.

7/17/2024

Automatic Essay Multi-dimensional Scoring with Fine-tuning and Multiple Regression

Kun Sun, Rong Wang

Automated essay scoring (AES) involves predicting a score that reflects the writing quality of an essay. Most existing AES systems produce only a single overall score. However, users and L2 learners expect scores across different dimensions (e.g., vocabulary, grammar, coherence) for English essays in real-world applications. To address this need, we have developed two models that automatically score English essays across multiple dimensions by employing fine-tuning and other strategies on two large datasets. The results demonstrate that our systems achieve impressive performance in evaluation using three criteria: precision, F1 score, and Quadratic Weighted Kappa. Furthermore, our system outperforms existing methods in overall scoring.

6/4/2024

Phrase-Level Adversarial Training for Mitigating Bias in Neural Network-based Automatic Essay Scoring

Haddad Philip, Tsegaye Misikir Tashu

Automatic Essay Scoring (AES) is widely used to evaluate candidates for educational purposes. However, due to the lack of representative data, most existing AES systems are not robust, and their scoring predictions are biased towards the most represented data samples. In this study, we propose a model-agnostic phrase-level method to generate an adversarial essay set to address the biases and robustness of AES models. Specifically, we construct an attack test set comprising samples from the original test set and adversarially generated samples using our proposed method. To evaluate the effectiveness of the attack strategy and data augmentation, we conducted a comprehensive analysis utilizing various neural network scoring models. Experimental results show that the proposed approach significantly improves AES model performance in the presence of adversarial examples and scenarios without such attacks.

9/10/2024

Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs

Changrong Xiao, Wenxing Ma, Qingping Song, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, Qi Fu

Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable consistency, generalizability, and explainability. We propose an open-source LLM-based AES system, inspired by the dual-process theory. Our system offers accurate grading and high-quality feedback, at least comparable to that of fine-tuned proprietary LLMs, in addition to its ability to alleviate misgrading. Furthermore, we conduct human-AI co-grading experiments with both novice and expert graders. We find that our system not only automates the grading process but also enhances the performance and efficiency of human graders, particularly for essays where the model has lower confidence. These results highlight the potential of LLMs to facilitate effective human-AI collaboration in the educational context, potentially transforming learning experiences through AI-generated feedback.

6/18/2024