Explainable Automatic Grading with Neural Additive Models

Read original: arXiv:2405.00489 - Published 5/2/2024 by Aubrey Condor, Zachary Pardos

Overview

This paper applies Neural Additive Models (NAMs) to the explainable automatic grading of student short-answer responses.
NAMs combine the flexibility of neural networks with the interpretability of additive models, allowing the model to provide clear explanations for its predictions.
The researchers apply NAMs to the task of grading short-answer questions, demonstrating their effectiveness on the RICEChem dataset.

Plain English Explanation

Work such as "The Road to Clarity: Exploring the Explainable AI World" discusses the importance of building AI systems that can explain their decision-making process. This is particularly crucial in high-stakes applications like education, where students and teachers need to understand how an automatic grading system arrives at its scores.

The researchers in this paper experiment with a type of AI model called a Neural Additive Model (NAM) that aims to provide these explanations. NAMs combine the power of neural networks, which can learn complex patterns in data, with the interpretability of additive models, which break down the overall prediction into contributions from different input features.

When applied to the task of automatically grading short-answer questions, NAMs can not only produce a grade, but also explain which parts of the student's response contributed most to that grade. This allows teachers and students to understand the reasoning behind the model's assessment, which can help them improve the quality of future responses.

The researchers tested their NAM approach on the RICEChem dataset, a collection of student responses to chemistry questions. They found that NAMs performed well in terms of grading accuracy, while also providing clear and meaningful explanations for their predictions.

Technical Explanation

The researchers propose using Neural Additive Models (NAMs) as a way to achieve explainable automatic grading. NAMs are a type of neural network that learns a linear combination of learned nonlinear functions of the input features.
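
As a reading aid, the additive structure can be written in the standard generalized additive form. The symbols here ($x_i$ for the $i$-th input feature, $f_i$ for its learned network, $\beta$ for an intercept, $g$ for a link function such as the logit, and $K$ for the number of features) are notation assumed for illustration and do not appear in this summary:

```latex
g\bigl(\mathbb{E}[y \mid x]\bigr) = \beta + \sum_{i=1}^{K} f_i(x_i)
```

Because each $f_i$ depends on a single feature, its output can be read directly as that feature's contribution to the prediction.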

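To apply NAMs to short-answer grading, the researchers first turn each student response into input features. According to the paper's abstract, a Knowledge Integration (KI) framework from the learning sciences guides this feature engineering, producing inputs that reflect whether a response includes certain predefined ideas. These features are fed into the NAM, which learns a nonlinear transformation of each feature and sums the results to produce a final grade prediction.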
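
The additive forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the network sizes, the random (untrained) weights, the sigmoid output, and the three binary "idea present" features are all assumptions made for the example, and training is omitted.

```python
import numpy as np

def init_feature_net(rng, hidden=8):
    """Random weights for one tiny per-feature network (1 -> hidden -> 1)."""
    return {
        "w1": rng.normal(size=(1, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(size=(hidden, 1)),
    }

def feature_net(params, x_col):
    """Apply one feature's network to a column of values: shape (n,) -> (n,)."""
    hidden = np.maximum(0.0, x_col[:, None] @ params["w1"] + params["b1"])  # ReLU
    return (hidden @ params["w2"])[:, 0]

def nam_forward(nets, bias, X):
    """Return per-feature contributions (n, k) and sigmoid probabilities (n,)."""
    contribs = np.stack(
        [feature_net(p, X[:, i]) for i, p in enumerate(nets)], axis=1
    )
    logits = bias + contribs.sum(axis=1)  # the additive step
    probs = 1.0 / (1.0 + np.exp(-logits))
    return contribs, probs

rng = np.random.default_rng(0)
# Hypothetical binary features: 1.0 if a response mentions a predefined idea.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0],
              [1.0, 1.0, 1.0]])
nets = [init_feature_net(rng) for _ in range(X.shape[1])]
contribs, probs = nam_forward(nets, bias=0.0, X=X)
```

Each column of `contribs` depends only on the matching column of `X`, which is the property that makes the model's output decomposable feature by feature.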

Importantly, the NAM architecture allows the researchers to extract explanations for the model's predictions. By analyzing the contribution of each input feature to the final grade, they can identify which aspects of the student's response were most influential in determining the overall score.
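
Reading off an explanation of this kind can be sketched as follows. The feature names, the intercept, and the hand-written functions standing in for trained per-feature networks are all hypothetical; a real NAM would learn nonlinear functions from data.

```python
import math

# Hypothetical stand-ins for trained per-feature networks. Each maps one
# feature value to its contribution to the logit.
shape_functions = {
    "mentions_idea_a": lambda x: 1.5 * x,
    "mentions_idea_b": lambda x: 0.4 * x,
    "mentions_idea_c": lambda x: -0.8 * x,
}
BIAS = -0.5  # assumed intercept, for illustration only

def explain(features):
    """Return the predicted probability and contributions ranked by magnitude."""
    contributions = {
        name: fn(features[name]) for name, fn in shape_functions.items()
    }
    logit = BIAS + sum(contributions.values())
    prob = 1.0 / (1.0 + math.exp(-logit))
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return prob, ranked

prob, ranked = explain(
    {"mentions_idea_a": 1, "mentions_idea_b": 0, "mentions_idea_c": 1}
)
for name, value in ranked:
    print(f"{name}: {value:+.2f}")
print(f"predicted probability of a high score: {prob:.3f}")
```

Sorting by absolute contribution puts the most influential features first, whether they raised or lowered the predicted grade.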

The researchers evaluate their NAM-based grading system on the RICEChem dataset, which contains short-answer responses to chemistry questions. They compare the NAM with another explainable model, logistic regression, using the same features, and with a non-explainable neural model, DeBERTa, that does not require feature engineering. The results show that NAMs achieve competitive grading accuracy while also offering clear and meaningful explanations for their predictions.

Critical Analysis

The researchers acknowledge several limitations of their work. First, the NAM approach may not be able to capture all the nuances and complexities of student responses, which could limit its grading accuracy compared to more powerful but less interpretable models.

Additionally, the researchers only evaluated their system on a single dataset, the RICEChem dataset, which may not be representative of the full range of short-answer questions and student responses that an automatic grading system would need to handle in practice.

Further research would be needed to assess the generalizability of the NAM approach to other educational domains and question types. It would also be valuable to gather feedback from teachers and students on the usefulness and clarity of the explanations provided by the NAM-based grading system.

Conclusion

This paper presents a promising approach to building explainable automatic grading systems using Neural Additive Models. By combining the flexibility of neural networks with the interpretability of additive models, NAMs can provide both accurate grades and clear explanations for their predictions.

The researchers demonstrate the effectiveness of their NAM-based system on the RICEChem dataset, suggesting that this technique could be a valuable tool for improving transparency and trust in AI-powered educational assessment. As the use of AI in education continues to grow, developing explainable systems like this will be crucial for ensuring that these technologies benefit students and teachers alike.

Related Papers

Explainable Automatic Grading with Neural Additive Models

Aubrey Condor, Zachary Pardos

The use of automatic short answer grading (ASAG) models may help alleviate the time burden of grading while encouraging educators to frequently incorporate open-ended items in their curriculum. However, current state-of-the-art ASAG models are large neural networks (NN) often described as black box, providing no explanation for which characteristics of an input are important for the produced output. This inexplicable nature can be frustrating to teachers and students when trying to interpret, or learn from an automatically-generated grade. To create a powerful yet intelligible ASAG model, we experiment with a type of model called a Neural Additive Model that combines the performance of a NN with the explainability of an additive model. We use a Knowledge Integration (KI) framework from the learning sciences to guide feature engineering to create inputs that reflect whether a student includes certain ideas in their response. We hypothesize that indicating the inclusion (or exclusion) of predefined ideas as features will be sufficient for the NAM to have good predictive power and interpretability, as this may guide a human scorer using a KI rubric. We compare the performance of the NAM with another explainable model, logistic regression, using the same features, and to a non-explainable neural model, DeBERTa, that does not require feature engineering.

5/2/2024

Auditing an Automatic Grading Model with deep Reinforcement Learning

Aubrey Condor, Zachary Pardos

We explore the use of deep reinforcement learning to audit an automatic short answer grading (ASAG) model. Automatic grading may decrease the time burden of rating open-ended items for educators, but a lack of robust evaluation methods for these models can result in uncertainty of their quality. Current state-of-the-art ASAG models are configured to match human ratings from a training set, and researchers typically assess their quality with accuracy metrics that signify agreement between model and human scores. In this paper, we show that a high level of agreement to human ratings does not give sufficient evidence that an ASAG model is infallible. We train a reinforcement learning agent to revise student responses with the objective of achieving a high rating from an automatic grading model in the least number of revisions. By analyzing the agent's revised responses that achieve a high grade from the ASAG model but would not be considered high-scoring responses according to a scoring rubric, we discover ways in which the automated grader can be exploited, exposing shortcomings in the grading model.

5/14/2024

Beyond human subjectivity and error: a novel AI grading system

Alexandra Gobrecht, Felix Tuma, Moritz Moller, Thomas Zoller, Mark Zakhvatkin, Alexandra Wuttig, Holger Sommerfeldt, Sven Schutt

The grading of open-ended questions is a high-effort, high-impact task in education. Automating this task promises a significant reduction in workload for education professionals, as well as more consistent grading outcomes for students, by circumventing human subjectivity and error. While recent breakthroughs in AI technology might facilitate such automation, this has not been demonstrated at scale. In this paper, we introduce a novel automatic short answer grading (ASAG) system. The system is based on a fine-tuned open-source transformer model which we trained on a large set of exam data from university courses across a large range of disciplines. We evaluated the trained model's performance against held-out test data in a first experiment and found high accuracy levels across a broad spectrum of unseen questions, even in unseen courses. We further compared the performance of our model with that of certified human domain experts in a second experiment: we first assembled another test dataset from real historical exams - the historic grades contained in that data were awarded to students in a regulated, legally binding examination process; we therefore considered them as ground truth for our experiment. We then asked certified human domain experts and our model to grade the historic student answers again without disclosing the historic grades. Finally, we compared the hence obtained grades with the historic grades (our ground truth). We found that for the courses examined, the model deviated less from the official historic grades than the human re-graders - the model's median absolute error was 44 % smaller than the human re-graders', implying that the model is more consistent than humans in grading. These results suggest that leveraging AI-enhanced grading can reduce human subjectivity, improve consistency and thus ultimately increase fairness.

5/8/2024

I understand why I got this grade: Automatic Short Answer Grading with Feedback

Dishank Aggarwal, Pushpak Bhattacharyya, Bhaskaran Raman

The demand for efficient and accurate assessment methods has intensified as education systems transition to digital platforms. Providing feedback is essential in educational settings and goes beyond simply conveying marks as it justifies the assigned marks. In this context, we present a significant advancement in automated grading by introducing Engineering Short Answer Feedback (EngSAF) -- a dataset of 5.8k student answers accompanied by reference answers and questions for the Automatic Short Answer Grading (ASAG) task. The EngSAF dataset is meticulously curated to cover a diverse range of subjects, questions, and answer patterns from multiple engineering domains. We leverage state-of-the-art large language models' (LLMs) generative capabilities with our Label-Aware Synthetic Feedback Generation (LASFG) strategy to include feedback in our dataset. This paper underscores the importance of enhanced feedback in practical educational settings, outlines dataset annotation and feedback generation processes, conducts a thorough EngSAF analysis, and provides different LLMs-based zero-shot and finetuned baselines for future comparison. Additionally, we demonstrate the efficiency and effectiveness of the ASAG system through its deployment in a real-world end-semester exam at the Indian Institute of Technology Bombay (IITB), showcasing its practical viability and potential for broader implementation in educational institutions.

7/19/2024