Transformer-based Joint Modelling for Automatic Essay Scoring and Off-Topic Detection

arXiv:2404.08655, published 4/16/2024 by Sourya Dipta Das, Yash Vadi, Kuldeep Yadav

Overview

This research paper presents a transformer-based joint modeling approach for automatically scoring essays and detecting off-topic responses.
Off-topic detection matters because automated essay scoring (AES) systems often fail to assign low grades to irrelevant responses; both tasks are crucial in educational assessment and language learning.
The authors explore the benefits of using a single transformer-based model to tackle both tasks simultaneously, leveraging the shared context and dependencies between them.

Plain English Explanation

The research paper discusses a new way to automatically evaluate and score written essays, as well as detect if the essay is off-topic or not relevant to the assigned prompt. This is an important task in education, as it can help assess a student's writing skills and ensure they are responding to the correct prompt.

The researchers developed a model based on the transformer, a neural network architecture that underlies most modern AI systems for understanding and generating human language. This transformer-based model is designed to handle both essay scoring and off-topic detection at the same time, rather than treating them as separate problems.

The key idea is that by jointly modeling these two tasks, the model can learn the relationships and dependencies between them, and potentially improve the performance on each task. For example, if an essay is off-topic, it's likely to receive a lower score, so the model can use this connection to better identify off-topic essays and provide more accurate scores.

The researchers test their joint approach on two essay datasets and compare it with a baseline they built and with earlier conventional methods. Their results suggest that the unified approach outperforms these alternatives on both tasks and holds up against deliberately altered essays designed to fool it.

Technical Explanation

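The paper proposes a transformer-based joint model for automatic essay scoring and off-topic detection, called Automated Open Essay Scoring (AOES). The authors argue that the two tasks are closely related and benefit from joint modeling because they share contextual information and dependencies. According to the abstract, AOES attaches a novel topic regularization module (TRM) on top of a transformer model and trains the combination with a hybrid loss function.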
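
The paper's abstract states that the model is trained with a hybrid loss function, but this summary does not give its terms. The Python sketch below is a hypothetical illustration of what a hybrid objective can look like: a scoring term (mean squared error on essay scores) plus a topic term that pulls the representations of essays written for the same prompt toward a shared centroid, in the style of a center loss. The compactness term, the weight `lam`, and the name `hybrid_loss` are assumptions for illustration, not the paper's TRM or its actual loss.

```python
import numpy as np


def hybrid_loss(pred_scores, true_scores, embeddings, prompt_ids, lam=0.1):
    """Hypothetical hybrid loss: scoring MSE plus a topic-compactness penalty.

    Illustrative stand-in only, NOT the loss used by AOES.
    """
    pred = np.asarray(pred_scores, dtype=float)
    true = np.asarray(true_scores, dtype=float)
    emb = np.asarray(embeddings, dtype=float)
    ids = np.asarray(prompt_ids)

    # Scoring term: mean squared error between predicted and gold scores.
    scoring = np.mean((pred - true) ** 2)

    # Topic term (assumed): squared distance of each essay embedding to the
    # centroid of all essays written for the same prompt, averaged over essays.
    compact = 0.0
    for pid in np.unique(ids):
        group = emb[ids == pid]
        compact += np.sum((group - group.mean(axis=0)) ** 2)
    compact /= len(emb)

    # Weighted sum; lam is a hypothetical trade-off hyperparameter.
    return float(scoring + lam * compact)


loss = hybrid_loss(
    pred_scores=[3.0, 2.0, 4.0, 1.0],
    true_scores=[3.0, 3.0, 4.0, 1.0],
    embeddings=[[0, 0], [2, 0], [10, 10], [10, 12]],
    prompt_ids=[0, 0, 1, 1],
)
print(round(loss, 4))  # 0.35
```

Combining objectives with a single scalar weight is the simplest option; a real implementation would compute such a loss inside a deep-learning framework so that gradients flow back through the transformer.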

The model is trained to score essays, and the same trained model is then reused for off-topic detection: the authors use it to compute a Mahalanobis distance score that flags essays not addressing the prompt, which makes the off-topic component unsupervised. This joint strategy lets the model learn representations shared by the two tasks, potentially improving performance compared to treating them as separate problems.
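
According to the paper's abstract, off-topic essays are flagged with a Mahalanobis distance score computed from the trained model. The scoring step itself is standard: fit a mean vector and covariance matrix to representations of on-topic essays, then measure how far a new essay's representation lies from that distribution, scaled by the covariance. The sketch below assumes a single Gaussian per prompt, a small ridge term to keep the covariance invertible, and a hand-picked threshold of 3.0; the 2-D vectors are toy placeholders for the representations a trained AOES model would produce. How the paper extracts features and sets its threshold is not described in this summary.

```python
import numpy as np


def fit_on_topic_gaussian(on_topic_embeddings, ridge=1e-3):
    """Fit a mean and inverse covariance to embeddings of known on-topic essays."""
    X = np.asarray(on_topic_embeddings, dtype=float)
    mean = X.mean(axis=0)
    # Sample covariance (rows are essays); the ridge keeps it invertible.
    cov = np.atleast_2d(np.cov(X, rowvar=False)) + ridge * np.eye(X.shape[1])
    return mean, np.linalg.inv(cov)


def mahalanobis_score(embedding, mean, precision):
    """Mahalanobis distance sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    d = np.asarray(embedding, dtype=float) - mean
    return float(np.sqrt(d @ precision @ d))


def is_off_topic(embedding, mean, precision, threshold=3.0):
    """Flag an essay whose distance exceeds a hypothetical threshold."""
    return mahalanobis_score(embedding, mean, precision) > threshold


# Toy 2-D "embeddings" standing in for representations from a trained model.
on_topic = [[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]]
mean, precision = fit_on_topic_gaussian(on_topic)
print(is_off_topic([0.6, 0.4], mean, precision))  # False
print(is_off_topic([5.0, 5.0], mean, precision))  # True
```

Unlike plain Euclidean distance, the Mahalanobis distance accounts for how much on-topic essays naturally vary along each direction, so a deviation along a low-variance direction counts as more suspicious.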

The authors evaluate the approach on two essay-scoring datasets and compare it against a baseline they built as well as earlier conventional methods for essay scoring and off-topic detection. They also test the model against several adversarial strategies intended to mimic human-level perturbations.
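
This summary does not name the evaluation metrics. Quadratic weighted kappa (QWK) is a widely used agreement measure in automated essay scoring, so the following sketch shows how it is typically computed between human and model scores. The function name and the toy score lists are illustrative assumptions, not data from the paper.

```python
import numpy as np


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two integer score lists on [min_score, max_score]."""
    a = np.asarray(rater_a, dtype=int) - min_score
    b = np.asarray(rater_b, dtype=int) - min_score
    k = max_score - min_score + 1

    # Observed matrix: counts of (score from a, score from b) pairs.
    observed = np.zeros((k, k))
    np.add.at(observed, (a, b), 1)

    # Expected matrix under independence: outer product of the marginals,
    # scaled so it sums to the same number of essays as `observed`.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(a)

    # Quadratic disagreement weights: 0 on the diagonal, growing with distance.
    idx = np.arange(k)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


human = [1, 2, 3, 4, 4, 2]
model = [1, 2, 3, 3, 4, 2]
print(round(quadratic_weighted_kappa(human, model, 1, 4), 3))  # 0.923
```

QWK penalizes a disagreement by the squared distance between the two scores, so being off by two points costs four times as much as being off by one. The sketch assumes at least two score levels and that the ratings are not all identical; otherwise the denominator is zero.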

According to the abstract, the proposed method outperforms both the baseline and the earlier conventional methods, in off-topic detection as well as in scoring on-topic essays, and the adversarial experiments indicate that it is robust in detecting plausible human-level perturbations.

Critical Analysis

The paper presents a well-designed and thorough investigation of the joint modeling approach for essay scoring and off-topic detection. The authors make a compelling case for the potential benefits of this unified framework, as the two tasks are inherently related and can potentially benefit from shared representations and knowledge.

However, the paper does not extensively discuss the limitations or potential drawbacks of the joint modeling approach. For instance, it would be valuable to understand how the model performs on edge cases or on specific types of off-topic responses that may be challenging to detect. Additionally, the paper could have explored the tradeoffs between joint modeling and separate task-specific models, as there may be scenarios where the latter approach could be more suitable.

Further research could also investigate the interpretability and explainability of the joint model, as understanding the model's decision-making process could lead to insights about the relationships between essay scoring and off-topic detection.

Overall, the paper presents a promising approach and makes valuable contributions to the field of automated essay evaluation. However, a more comprehensive discussion of its limitations and of directions for future research would strengthen the paper and help readers understand the broader implications and potential applications of this work.

Conclusion

This research paper introduces a transformer-based joint modeling approach for automatically scoring essays and detecting off-topic responses. The key innovation is the use of a single model that can handle both tasks simultaneously, leveraging the shared contextual information and dependencies between them.

The reported results support this unified framework: the joint model outperforms the authors' baseline and earlier conventional methods on both on-topic essay scoring and off-topic detection. This suggests that there are synergies between the two tasks that a single transformer-based architecture can exploit.

The potential impact of this work lies in enhancing educational assessment and language learning applications, where accurate and reliable essay evaluation is crucial. By combining essay scoring and off-topic detection in a single model, the approach can streamline the assessment process and give students and educators both a score and a warning when an essay does not address the prompt.

As the field of natural language processing continues to advance, this research highlights the value of exploring joint modeling strategies that can leverage the interdependencies between related tasks. The findings presented in this paper pave the way for further advancements in automated essay evaluation and other areas of educational technology.

This summary was produced with help from an AI and may contain inaccuracies; read the original paper (arXiv:2404.08655) to verify details.

Related Papers

Transformer-based Joint Modelling for Automatic Essay Scoring and Off-Topic Detection

Sourya Dipta Das, Yash Vadi, Kuldeep Yadav

Automated Essay Scoring (AES) systems are widely popular in the market as they constitute a cost-effective and time-effective option for grading systems. Nevertheless, many studies have demonstrated that AES systems fail to assign lower grades to irrelevant responses. Thus, detecting off-topic responses in automated essay scoring is crucial in practical tasks where candidates write text unrelated to the task given in the question. In this paper, we propose an unsupervised technique that jointly scores essays and detects off-topic essays. The proposed Automated Open Essay Scoring (AOES) model uses a novel topic regularization module (TRM), which can be attached on top of a transformer model, and is trained using a proposed hybrid loss function. After training, the AOES model is further used to calculate the Mahalanobis distance score for off-topic essay detection. Our proposed method outperforms the baseline we created and earlier conventional methods on two essay-scoring datasets in off-topic detection as well as on-topic scoring. Experimental evaluation results on different adversarial strategies also show that the suggested method is robust in detecting possible human-level perturbations.

4/16/2024

Automatic Essay Multi-dimensional Scoring with Fine-tuning and Multiple Regression

Kun Sun, Rong Wang

Automated essay scoring (AES) involves predicting a score that reflects the writing quality of an essay. Most existing AES systems produce only a single overall score. However, users and L2 learners expect scores across different dimensions (e.g., vocabulary, grammar, coherence) for English essays in real-world applications. To address this need, we have developed two models that automatically score English essays across multiple dimensions by employing fine-tuning and other strategies on two large datasets. The results demonstrate that our systems achieve impressive performance in evaluation using three criteria: precision, F1 score, and Quadratic Weighted Kappa. Furthermore, our system outperforms existing methods in overall scoring.

6/4/2024

Can Large Language Models Automatically Score Proficiency of Written Essays?

Watheq Mansour, Salam Albatarni, Sohaila Eltanbouly, Tamer Elsayed

Although several methods have been proposed to address the problem of automated essay scoring (AES) over the last 50 years, there is still much to be desired in terms of effectiveness. Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks. In this paper, we test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays. We experimented with two popular LLMs, namely ChatGPT and Llama. We aim to check if these models can do this task and, if so, how their performance is positioned among the state-of-the-art (SOTA) models across two levels, holistically and per individual writing trait. We utilized prompt-engineering tactics in designing four different prompts to bring out their maximum potential on this task. Our experiments conducted on the ASAP dataset revealed several interesting observations. First, choosing the right prompt depends highly on the model and nature of the task. Second, the two LLMs exhibited comparable average performance in AES, with a slight advantage for ChatGPT. Finally, despite the performance gap between the two LLMs and SOTA models in terms of predictions, they provide feedback to enhance the quality of the essays, which can potentially help both teachers and students.

4/17/2024

Automated essay scoring in Arabic: a dataset and analysis of a BERT-based system

Rayed Ghazawi, Edwin Simpson

Automated Essay Scoring (AES) holds significant promise in the field of education, helping educators to mark larger volumes of essays and provide timely feedback. However, Arabic AES research has been limited by the lack of publicly available essay data. This study introduces AR-AES, an Arabic AES benchmark dataset comprising 2046 undergraduate essays, including gender information, scores, and transparent rubric-based evaluation guidelines, providing comprehensive insights into the scoring process. These essays come from four diverse courses, covering both traditional and online exams. Additionally, we pioneer the use of AraBERT for AES, exploring its performance on different question types. We find encouraging results, particularly for Environmental Chemistry and source-dependent essay questions. For the first time, we examine the scale of errors made by a BERT-based AES system, observing that 96.15 percent of the errors are within one point of the first human marker's prediction, on a scale of one to five, with 79.49 percent of predictions matching exactly. In contrast, additional human markers did not exceed 30 percent exact matches with the first marker, with 62.9 percent within one mark. These findings highlight the subjectivity inherent in essay grading, and underscore the potential for current AES technology to assist human markers to grade consistently across large classes.

7/17/2024