Auditing an Automatic Grading Model with deep Reinforcement Learning

Read original: arXiv:2405.07087 - Published 5/14/2024 by Aubrey Condor, Zachary Pardos

Overview

The researchers explore using deep reinforcement learning to audit an automatic short answer grading (ASAG) model.
ASAG models are designed to match human ratings, but the researchers show this high level of agreement does not guarantee the model is infallible.
They train a reinforcement learning agent to revise student responses to achieve high ratings from the ASAG model, exposing flaws in the grading model.

Plain English Explanation

The paper looks at using a special type of artificial intelligence called deep reinforcement learning to investigate the weaknesses of a system that automatically grades short student answers. Automatic grading could save teachers time, but there are concerns about how reliable these grading systems really are.

Current ASAG models are trained to match the scores that human graders give. Researchers usually judge the quality of these models by measuring how closely the model's scores agree with the human scores. But the paper shows that high agreement is not sufficient evidence that the automatic grader is reliable.
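To make "agreement with human scores" concrete, here is a minimal sketch of two common agreement measures: exact-match accuracy and Cohen's kappa. The paper only refers to accuracy metrics in general, so the choice of kappa, the function names, and the sample scores below are assumptions made for illustration.

```python
from collections import Counter

def accuracy(human, model):
    """Fraction of responses where the model score equals the human score."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    return sum(h == m for h, m in zip(human, model)) / len(human)

def cohen_kappa(human, model):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(human)
    p_observed = accuracy(human, model)
    human_counts, model_counts = Counter(human), Counter(model)
    # Chance agreement: probability both raters pick the same label independently.
    p_expected = sum(human_counts[k] * model_counts[k] for k in human_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (p_observed - p_expected) / (1 - p_expected)

# Made-up scores on a 0-2 scale, for illustration only.
human_scores = [0, 1, 2, 2, 1, 0, 2, 1]
model_scores = [0, 1, 2, 2, 1, 0, 1, 1]

print(f"accuracy = {accuracy(human_scores, model_scores):.3f}")
print(f"kappa    = {cohen_kappa(human_scores, model_scores):.3f}")
```

Both numbers summarize behavior on responses that students actually wrote. Neither says anything about how the grader behaves on responses it has never seen, which is the gap the audit targets.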

The researchers trained a reinforcement learning agent, which is a type of AI that learns by trying different actions and getting feedback. They had the agent try to revise student responses in a way that would get a high score from the ASAG model, even if the revised responses wouldn't actually be considered good answers. By analyzing the agent's revised responses that got high scores, the researchers found ways the ASAG model could be tricked, revealing flaws in how it works.

Technical Explanation

The researchers trained a deep reinforcement learning agent with the goal of revising student responses to achieve high ratings from an ASAG model, while minimizing the number of revisions. By analyzing the agent's revised responses that scored highly according to the ASAG model but would not be considered good answers based on a scoring rubric, the researchers were able to identify weaknesses in the ASAG model.

The ASAG model was trained to match human ratings on a dataset of student responses. The reinforcement learning agent was trained using proximal policy optimization (PPO), a common deep RL algorithm. Each revision the agent makes is scored by the ASAG model, so the agent is rewarded for reaching a high score and discouraged from using more revisions than necessary.
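The objective described above (a high grade in as few revisions as possible) can be sketched as a small episodic environment. This is not the authors' implementation: `toy_grader` is a hypothetical keyword-counting stand-in for the ASAG model, and the reward shape, `step_penalty`, and `max_steps` values are assumptions chosen for illustration. A PPO agent would be trained against an environment of this kind; the training loop itself is omitted.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the ASAG model: it only checks for keywords.
KEYWORDS = ("energy", "transfer", "temperature")

def toy_grader(response: str) -> int:
    """Return a score in 0..3 by counting distinct keywords (illustrative only)."""
    words = set(response.lower().split())
    return sum(k in words for k in KEYWORDS)

@dataclass
class RevisionEnv:
    """One episode: revise a single response until it reaches the target score."""
    response: str
    target_score: int = 3      # assumed "high rating" threshold
    step_penalty: float = 0.1  # assumed cost per revision
    max_steps: int = 5         # assumed revision budget
    steps: int = field(default=0, init=False)

    def step(self, phrase: str):
        """Append a phrase to the response; return (response, reward, done)."""
        self.steps += 1
        self.response = f"{self.response} {phrase}".strip()
        score = toy_grader(self.response)
        reached = score >= self.target_score
        done = reached or self.steps >= self.max_steps
        # Every revision costs a little; reaching the target pays a bonus.
        reward = -self.step_penalty + (1.0 if reached else 0.0)
        return self.response, reward, done

env = RevisionEnv("the metal gets hot")
revised, reward, done = env.step("energy transfer temperature")
print(revised, reward, done)
```

In this toy run a single appended phrase of keywords reaches the top score even though the revised response is not a coherent answer. That is the kind of behavior the audit is designed to surface, here produced by a deliberately weak stand-in grader.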

The researchers found that the agent was able to generate revised responses that received high scores from the ASAG model but would not be considered correct answers. This suggests that the ASAG model may be relying on superficial patterns in the text rather than truly understanding the content and quality of the responses. The paper highlights the importance of going beyond simple accuracy metrics when evaluating ASAG models, and the value of using techniques like adversarial testing to uncover potential flaws.
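The analysis step, finding revised responses that the ASAG model rates highly but a rubric would not, amounts to a filter over scored responses. The sketch below assumes each revised response has been given both a model score and a rubric-based score; the record layout, the thresholds, and the sample data are hypothetical and do not come from the paper.

```python
from typing import NamedTuple

class AuditRecord(NamedTuple):
    revised_response: str
    model_score: int   # score assigned by the ASAG model
    rubric_score: int  # score a rater would assign using the rubric

def find_exploits(records, high=3, gap=2):
    """Keep records the model rates highly but the rubric rates much lower.

    `high` and `gap` are assumed thresholds, not values from the paper.
    """
    return [
        r for r in records
        if r.model_score >= high and r.model_score - r.rubric_score >= gap
    ]

# Made-up records for illustration only.
records = [
    AuditRecord("heat moves from the hot metal to the cooler water", 3, 3),
    AuditRecord("the metal gets hot energy transfer temperature", 3, 0),
    AuditRecord("it is hot", 1, 1),
]

for record in find_exploits(records):
    print(record.revised_response)
```

The flagged responses are the ones worth reading by hand, since each one points to a pattern the grader accepts without the underlying idea being present.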

Critical Analysis

The paper raises important points about the limitations of evaluating ASAG models solely based on agreement with human ratings. As the researchers demonstrate, a high level of agreement does not necessarily mean the model is robust or infallible. The use of a reinforcement learning agent to systematically exploit weaknesses in the ASAG model is a clever approach that provides valuable insights.

However, the paper could have delved deeper into the specific ways in which the ASAG model was being exploited by the agent. While the authors mention that the agent was able to generate revised responses that scored highly but were not truly correct, more details on the specific tactics or patterns the agent used would have been helpful. Additionally, the paper could have discussed the potential implications of these findings for the design and evaluation of ASAG systems more broadly.

Further research could explore whether similar exploits exist in other ASAG models, or investigate approaches for making these models more robust to such adversarial attacks. Explainable AI techniques may also be valuable for understanding and improving the inner workings of ASAG systems.

Conclusion

This paper highlights the importance of going beyond simple accuracy metrics when evaluating automatic short answer grading (ASAG) models. The researchers used a deep reinforcement learning agent to systematically exploit weaknesses in an ASAG model, revealing that high agreement with human ratings does not necessarily indicate a robust or infallible system.

The findings suggest that ASAG models may be relying on superficial patterns in the text rather than truly understanding the content and quality of the responses. This underscores the need for more comprehensive evaluation methods and the exploration of techniques like adversarial testing to uncover potential flaws in these systems.

As AI-powered grading systems become more prevalent, it is crucial to ensure they are reliable and fair. The insights from this paper can inform the design and evaluation of future ASAG models, ultimately helping to improve the quality and transparency of automated assessment tools for educators and students.

Related Papers

Auditing an Automatic Grading Model with deep Reinforcement Learning

Aubrey Condor, Zachary Pardos

We explore the use of deep reinforcement learning to audit an automatic short answer grading (ASAG) model. Automatic grading may decrease the time burden of rating open-ended items for educators, but a lack of robust evaluation methods for these models can result in uncertainty of their quality. Current state-of-the-art ASAG models are configured to match human ratings from a training set, and researchers typically assess their quality with accuracy metrics that signify agreement between model and human scores. In this paper, we show that a high level of agreement to human ratings does not give sufficient evidence that an ASAG model is infallible. We train a reinforcement learning agent to revise student responses with the objective of achieving a high rating from an automatic grading model in the least number of revisions. By analyzing the agent's revised responses that achieve a high grade from the ASAG model but would not be considered high-scoring responses according to a scoring rubric, we discover ways in which the automated grader can be exploited, exposing shortcomings in the grading model.

5/14/2024

Beyond human subjectivity and error: a novel AI grading system

Alexandra Gobrecht, Felix Tuma, Moritz Moller, Thomas Zoller, Mark Zakhvatkin, Alexandra Wuttig, Holger Sommerfeldt, Sven Schutt

The grading of open-ended questions is a high-effort, high-impact task in education. Automating this task promises a significant reduction in workload for education professionals, as well as more consistent grading outcomes for students, by circumventing human subjectivity and error. While recent breakthroughs in AI technology might facilitate such automation, this has not been demonstrated at scale. In this paper, we introduce a novel automatic short answer grading (ASAG) system. The system is based on a fine-tuned open-source transformer model which we trained on a large set of exam data from university courses across a large range of disciplines. We evaluated the trained model's performance against held-out test data in a first experiment and found high accuracy levels across a broad spectrum of unseen questions, even in unseen courses. We further compared the performance of our model with that of certified human domain experts in a second experiment: we first assembled another test dataset from real historical exams - the historic grades contained in that data were awarded to students in a regulated, legally binding examination process; we therefore considered them as ground truth for our experiment. We then asked certified human domain experts and our model to grade the historic student answers again without disclosing the historic grades. Finally, we compared the grades thus obtained with the historic grades (our ground truth). We found that for the courses examined, the model deviated less from the official historic grades than the human re-graders - the model's median absolute error was 44% smaller than the human re-graders', implying that the model is more consistent than humans in grading. These results suggest that leveraging AI-enhanced grading can reduce human subjectivity, improve consistency and thus ultimately increase fairness.

5/8/2024

Explainable Automatic Grading with Neural Additive Models

Aubrey Condor, Zachary Pardos

The use of automatic short answer grading (ASAG) models may help alleviate the time burden of grading while encouraging educators to frequently incorporate open-ended items in their curriculum. However, current state-of-the-art ASAG models are large neural networks (NN) often described as black box, providing no explanation for which characteristics of an input are important for the produced output. This inexplicable nature can be frustrating to teachers and students when trying to interpret, or learn from an automatically-generated grade. To create a powerful yet intelligible ASAG model, we experiment with a type of model called a Neural Additive Model that combines the performance of a NN with the explainability of an additive model. We use a Knowledge Integration (KI) framework from the learning sciences to guide feature engineering to create inputs that reflect whether a student includes certain ideas in their response. We hypothesize that indicating the inclusion (or exclusion) of predefined ideas as features will be sufficient for the NAM to have good predictive power and interpretability, as this may guide a human scorer using a KI rubric. We compare the performance of the NAM with another explainable model, logistic regression, using the same features, and to a non-explainable neural model, DeBERTa, that does not require feature engineering.

5/2/2024

Towards LLM-based Autograding for Short Textual Answers

Johannes Schneider, Bernd Schenk, Christina Niklaus

Grading exams is an important, labor-intensive, subjective, repetitive, and frequently challenging task. The feasibility of autograding textual responses has greatly increased thanks to the availability of large language models (LLMs) such as ChatGPT and the substantial influx of data brought about by digitalization. However, entrusting AI models with decision-making roles raises ethical considerations, mainly stemming from potential biases and issues related to generating false information. Thus, in this manuscript, we provide an evaluation of a large language model for the purpose of autograding, while also highlighting how LLMs can support educators in validating their grading procedures. Our evaluation is targeted towards automatic short textual answers grading (ASAG), spanning various languages and examinations from two distinct courses. Our findings suggest that while out-of-the-box LLMs provide a valuable tool to provide a complementary perspective, their readiness for independent automated grading remains a work in progress, necessitating human oversight.

7/9/2024