Graded Relevance Scoring of Written Essays with Dense Retrieval

Read original: arXiv:2405.05200 - Published 5/9/2024 by Salam Albatarni, Sohaila Eltanbouly, Tamer Elsayed

Overview

This paper proposes a novel approach for automatically scoring the relevance of student essays, a key aspect of essay quality.
The method uses dense retrieval encoders to create embeddings that cluster essays by relevance level, allowing a simple 1-nearest neighbor classifier to determine the relevance score.
The researchers leverage the Contriever unsupervised dense encoder and test their approach on the ASAP++ dataset.
Their method achieves state-of-the-art performance on task-specific scenarios and performs on par with the best cross-task models.
The approach also shows promise in reducing labeling costs for practical few-shot scenarios.

Plain English Explanation

Grading student essays can be a time-consuming process for teachers. Automated Essay Scoring aims to automate this task, helping improve student writing skills. While previous research has focused on holistic essay scoring, this paper zeroes in on a specific aspect of essay quality: relevance.

Relevance refers to how well a student stays on topic throughout their essay. The researchers propose a new way to automatically score this relevance. Their approach uses dense retrieval encoders to create embeddings, or numerical representations, of essays at different relevance levels. These embeddings form distinct clusters, with the centroids (average positions) of the clusters representing different relevance levels.

To score a new essay, the researchers apply a 1-nearest-neighbor rule: they find the cluster centroid closest to the essay's embedding and assign that centroid's relevance level. The Contriever encoder they use is pre-trained in an unsupervised way, so the encoder itself does not depend on manually labeled data. Labeled essays are still needed to compute the centroid for each relevance level.

When tested on the ASAP++ dataset, this method achieved new state-of-the-art performance for scoring relevance in task-specific scenarios. It also performed on par with the best cross-task models, meaning it can be applied to essays on new topics. Additionally, the approach showed promise in reducing the amount of labeled data needed, which is valuable for practical applications.

Technical Explanation

The paper focuses on automatically scoring the relevance of student essays, a key aspect of essay quality that has received less attention than holistic essay scoring. The researchers propose a novel approach that uses a dense retrieval encoder, pre-trained without supervision, to create essay embeddings.

The core idea is that essays with different relevance levels will form distinct clusters in the embedding space, with the centroids of those clusters representing the different relevance levels. The researchers then use a simple 1-Nearest Neighbor classifier over those centroids to determine the relevance score of a new, unseen essay.
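The centroid idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2-D vectors, and the choice of cosine similarity as the closeness measure are all assumptions made for illustration, and real embeddings would come from the encoder rather than being written by hand.

```python
import math
from collections import defaultdict

def centroids_by_level(embeddings, levels):
    """Average the embeddings of the labeled essays at each relevance level."""
    groups = defaultdict(list)
    for vec, level in zip(embeddings, levels):
        groups[level].append(vec)
    return {
        level: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for level, vecs in groups.items()
    }

def cosine(a, b):
    """Cosine similarity (an assumed closeness measure, not taken from the paper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def predict_level(embedding, centroids):
    """1-nearest-neighbor over centroids: return the level of the most similar one."""
    return max(centroids, key=lambda level: cosine(embedding, centroids[level]))

# Toy 2-D stand-ins for essay embeddings, labeled with relevance levels 0 and 3.
train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_levels = [0, 0, 3, 3]
centroids = centroids_by_level(train_vecs, train_levels)
print(predict_level([0.8, 0.2], centroids))
```

Because only one centroid per relevance level is stored, scoring a new essay costs one similarity computation per level, regardless of how many labeled essays were used to build the centroids.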

As the unsupervised dense encoder, the researchers use Contriever, which is pre-trained with contrastive learning and has demonstrated performance comparable to supervised dense retrieval models. They test their approach in both task-specific (training and testing on the same task) and cross-task (testing on an unseen task) scenarios using the ASAP++ dataset.
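The difference between the two evaluation scenarios can be sketched as two ways of splitting the data. This is a hedged illustration only: the `prompt_id` field, the helper names, the 20% test fraction, and the leave-one-prompt-out reading of "cross-task" are assumptions, not details taken from the paper.

```python
def task_specific_split(essays, prompt_id, test_fraction=0.2):
    """Train and test on essays written for the same prompt (task)."""
    same = [e for e in essays if e["prompt_id"] == prompt_id]
    n_test = max(1, round(len(same) * test_fraction))
    cut = len(same) - n_test
    # No shuffling here, to keep the sketch deterministic.
    return same[:cut], same[cut:]

def cross_task_split(essays, held_out_prompt):
    """Train on all other prompts, test on a prompt never seen in training."""
    train = [e for e in essays if e["prompt_id"] != held_out_prompt]
    test = [e for e in essays if e["prompt_id"] == held_out_prompt]
    return train, test

# Hypothetical toy corpus: five essays each for prompts 1 and 2.
essays = [{"prompt_id": p, "text": f"essay {i} for prompt {p}"}
          for p in (1, 2) for i in range(5)]

ts_train, ts_test = task_specific_split(essays, prompt_id=1)
ct_train, ct_test = cross_task_split(essays, held_out_prompt=2)
print(len(ts_train), len(ts_test), len(ct_train), len(ct_test))
```

In the cross-task split no labeled essay from the target prompt is available, which is why that setting is the harder of the two.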

In the task-specific scenario, the proposed method establishes a new state-of-the-art performance for relevance scoring. For the cross-task scenario, the extended version of the approach exhibits performance on par with the best existing model.

The researchers also analyze the performance of their method in a few-shot learning setting, where only a small amount of labeled data is available. They find that their approach can significantly reduce the labeling cost while sacrificing only about 10% of its effectiveness.
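One way to simulate such a few-shot setting from an already-labeled pool is to keep only a handful of essays per relevance level and treat the number kept as the labeling cost. The sketch below assumes this per-level sampling scheme; the function name, the value of `k`, and the seed are hypothetical, and the paper's actual sampling protocol may differ.

```python
import random
from collections import defaultdict

def sample_few_shot(levels, k, seed=0):
    """Pick up to k essay indices per relevance level (simulated labeling budget)."""
    by_level = defaultdict(list)
    for idx, level in enumerate(levels):
        by_level[level].append(idx)
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    chosen = []
    for level in sorted(by_level):
        pool = by_level[level]
        chosen.extend(rng.sample(pool, min(k, len(pool))))
    return sorted(chosen)

# Toy relevance labels for eight essays; level 2 has only one example.
levels = [0, 0, 0, 0, 1, 1, 1, 2]
few_shot_ids = sample_few_shot(levels, k=2)
print(len(few_shot_ids), "of", len(levels), "essays would need labels")
```

Only the selected essays would then be embedded and averaged into centroids, so the labeling cost scales with `k` times the number of relevance levels rather than with the size of the essay pool.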

Critical Analysis

The paper presents a compelling approach to automated essay scoring, focusing on the important but understudied aspect of essay relevance. The use of an unsupervised dense encoder like Contriever is an interesting innovation: the encoder's pre-training requires no manually labeled data, and the few-shot results suggest that relatively few labeled essays are needed to apply it.

However, the paper does not delve into potential limitations or caveats of the proposed method. For example, it would be valuable to understand how the approach performs on essays with varying levels of quality or complexity, or how sensitive it is to differences in writing styles or essay prompts.

Additionally, while the researchers test their method on the widely used ASAP++ dataset, it would be helpful to see evaluations on other essay datasets to better understand the generalizability of the approach. Transformer-based joint modelling and generalized contrastive learning techniques could also be interesting avenues for future research in this area.

Overall, the paper presents a novel and promising approach to automated essay scoring, but further research and analysis would be valuable to fully understand its capabilities and limitations.

Conclusion

This paper introduces a novel approach for automatically scoring the relevance of student essays, a key aspect of essay quality. By leveraging dense retrieval encoders to create essay embeddings that cluster by relevance level, the researchers demonstrate state-of-the-art performance in the task-specific scenario and results on par with the best existing model in the cross-task scenario.

The method's ability to perform well in few-shot learning scenarios, where labeled data is scarce, is particularly promising for practical applications of automated essay scoring. While the paper could benefit from further analysis of limitations and generalizability, it represents an interesting contribution to the field of automated essay scoring.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Graded Relevance Scoring of Written Essays with Dense Retrieval

Salam Albatarni, Sohaila Eltanbouly, Tamer Elsayed

Automated Essay Scoring automates the grading process of essays, providing a great advantage for improving the writing proficiency of students. While holistic essay scoring research is prevalent, a noticeable gap exists in scoring essays for specific quality traits. In this work, we focus on the relevance trait, which measures the ability of the student to stay on-topic throughout the entire essay. We propose a novel approach for graded relevance scoring of written essays that employs dense retrieval encoders. Dense representations of essays at different relevance levels then form clusters in the embeddings space, such that their centroids are potentially separate enough to effectively represent their relevance levels. We hence use the simple 1-Nearest-Neighbor classification over those centroids to determine the relevance level of an unseen essay. As an effective unsupervised dense encoder, we leverage Contriever, which is pre-trained with contrastive learning and demonstrated comparable performance to supervised dense retrieval models. We tested our approach on both task-specific (i.e., training and testing on same task) and cross-task (i.e., testing on unseen task) scenarios using the widely used ASAP++ dataset. Our method establishes a new state-of-the-art performance in the task-specific scenario, while its extension for the cross-task scenario exhibited a performance that is on par with the state-of-the-art model for that scenario. We also analyzed the performance of our approach in a more practical few-shot scenario, showing that it can significantly reduce the labeling cost while sacrificing only 10% of its effectiveness.

5/9/2024

Automatic Essay Multi-dimensional Scoring with Fine-tuning and Multiple Regression

Kun Sun, Rong Wang

Automated essay scoring (AES) involves predicting a score that reflects the writing quality of an essay. Most existing AES systems produce only a single overall score. However, users and L2 learners expect scores across different dimensions (e.g., vocabulary, grammar, coherence) for English essays in real-world applications. To address this need, we have developed two models that automatically score English essays across multiple dimensions by employing fine-tuning and other strategies on two large datasets. The results demonstrate that our systems achieve impressive performance in evaluation using three criteria: precision, F1 score, and Quadratic Weighted Kappa. Furthermore, our system outperforms existing methods in overall scoring.

6/4/2024

Generative Retrieval Meets Multi-Graded Relevance

Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, Xueqi Cheng

Generative retrieval represents a novel approach to information retrieval. It uses an encoder-decoder architecture to directly produce relevant document identifiers (docids) for queries. While this method offers benefits, current approaches are limited to scenarios with binary relevance data, overlooking the potential for documents to have multi-graded relevance. Extending generative retrieval to accommodate multi-graded relevance poses challenges, including the need to reconcile likelihood probabilities for docid pairs and the possibility of multiple relevant documents sharing the same identifier. To address these challenges, we introduce a framework called GRaded Generative Retrieval (GR$^2$). GR$^2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training. First, we create identifiers that are both semantically relevant and sufficiently distinct to represent individual documents effectively. This is achieved by jointly optimizing the relevance and distinctness of docids through a combination of docid generation and autoencoder models. Second, we incorporate information about the relationship between relevance grades to guide the training process. We use a constrained contrastive training strategy to bring the representations of queries and the identifiers of their relevant documents closer together, based on their respective relevance grades. Extensive experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$^2$.

9/30/2024

Dense Retrieval with Continuous Explicit Feedback for Systematic Review Screening Prioritisation

Xinyu Mao, Shengyao Zhuang, Bevan Koopman, Guido Zuccon

The goal of screening prioritisation in systematic reviews is to identify relevant documents with high recall and rank them in early positions for review. This saves reviewing effort if paired with a stopping criterion, and speeds up review completion if performed alongside downstream tasks. Recent studies have shown that neural models have good potential on this task, but their time-consuming fine-tuning and inference discourage their widespread use for screening prioritisation. In this paper, we propose an alternative approach that still relies on neural models, but leverages dense representations and relevance feedback to enhance screening prioritisation, without the need for costly model fine-tuning and inference. This method exploits continuous relevance feedback from reviewers during document screening to efficiently update the dense query representation, which is then applied to rank the remaining documents to be screened. We evaluate this approach across the CLEF TAR datasets for this task. Results suggest that the investigated dense query-driven approach is more efficient than directly using neural models and shows promising effectiveness compared to previous methods developed on the considered datasets. Our code is available at https://github.com/ielab/dense-screening-feedback.

7/18/2024