Beyond human subjectivity and error: a novel AI grading system

Read original: arXiv:2405.04323 - Published 5/8/2024 by Alexandra Gobrecht, Felix Tuma, Moritz Moller, Thomas Zoller, Mark Zakhvatkin, Alexandra Wuttig, Holger Sommerfeldt, Sven Schutt

Overview

Presents a novel AI-based grading system that aims to overcome the subjectivity and errors inherent in human grading.
Utilizes AI techniques to automatically and objectively assess student work, with the goal of providing more consistent and reliable feedback.
Explores the potential of AI to revolutionize educational assessment and reduce the burden on human graders.

Plain English Explanation

The paper discusses a novel AI-based grading system that seeks to address human subjectivity and error in the assessment of student work. Grading by human instructors can be influenced by individual biases, fatigue, and inconsistency, which makes the feedback students receive less objective and less reliable.

The researchers behind this system have developed an AI-powered approach that aims to overcome these limitations. By leveraging advanced artificial intelligence techniques, the system is designed to automatically and objectively assess student assignments, exams, and other forms of academic work. The goal is to provide more consistent and reliable feedback, helping students to better understand their strengths, weaknesses, and areas for improvement.

This research builds upon previous work in the field of automated grading and AI-powered educational tools, exploring the potential of AI to revolutionize the way we assess student learning and reduce the burden on human graders. By automating the grading process, the system aims to offer a more scalable and efficient approach to educational assessment, with the potential to provide personalized feedback and support for students.

Technical Explanation

The paper presents an automatic short answer grading (ASAG) system built on a fine-tuned open-source transformer model. According to the abstract, the model was trained on a large set of exam data from university courses across a wide range of disciplines.

The system follows a train-then-evaluate process. First, the open-source transformer is fine-tuned on the exam data. The trained model then predicts a grade for each new student answer, including answers to questions and courses it did not see during training.

To ensure the reliability and interpretability of the system's outputs, the researchers have incorporated explainability features that allow users to understand the reasoning behind the system's grading decisions. This includes the ability to highlight specific aspects of the student's work that influenced the final grade, as well as the relative importance of different factors in the assessment process.

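The researchers evaluated the system in two experiments. In the first, the model was tested against held-out exam data and reached high accuracy across a broad spectrum of unseen questions, even in unseen courses. In the second, certified human domain experts and the model re-graded answers from real historic exams without seeing the original grades. For the courses examined, the model deviated less from the official historic grades than the human re-graders did: its median absolute error was 44% smaller.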
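The paper's abstract reports the human-versus-model comparison as a median absolute error measured against the historic grades. The sketch below shows how such a comparison could be computed. It is a minimal illustration, not the authors' evaluation code: the function names and the grade values are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
from statistics import median


def median_absolute_error(predicted, reference):
    """Median of |predicted - reference| over paired grades."""
    if len(predicted) != len(reference):
        raise ValueError("grade lists must have the same length")
    return median(abs(p - r) for p, r in zip(predicted, reference))


def relative_reduction(model_error, human_error):
    """Fraction by which the model's error is smaller than the human error."""
    return (human_error - model_error) / human_error


# Hypothetical data (NOT from the paper): points awarded on five answers.
historic_grades = [4.0, 2.0, 5.0, 3.0, 1.0]  # treated as ground truth
human_regrades = [3.0, 2.0, 3.0, 4.0, 3.0]   # expert re-grading, blind to history
model_grades = [4.0, 2.5, 4.5, 3.0, 1.5]     # model grading the same answers

human_mae = median_absolute_error(human_regrades, historic_grades)
model_mae = median_absolute_error(model_grades, historic_grades)

print(f"human re-graders: {human_mae}")
print(f"model:            {model_mae}")
print(f"reduction:        {relative_reduction(model_mae, human_mae):.0%}")
```

The median, unlike the mean, is not pulled up by a few large disagreements, so it describes the typical deviation from the historic grade. With the made-up numbers above the reduction comes out at 50%. The paper's 44% figure comes from its own exam data, which is not reproduced here.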

Critical Analysis

The researchers acknowledge several limitations and areas for further research in their paper. One key concern is the potential for bias in the training data used to develop the AI grading system, which could lead to unfair or inaccurate assessments of certain student populations. The researchers suggest the need for careful curation and evaluation of the training data to address this issue.

Additionally, the paper does not delve deeply into the potential ethical and societal implications of automating the grading process. While the system aims to provide more objective and consistent feedback, there are concerns about the potential for AI-powered grading to exacerbate existing inequities in education or to be used in ways that undermine the human element of the learning process. Further research is needed to fully address these complex issues.

Overall, the proposed AI grading system represents a promising step forward in the field of educational assessment, but it will be crucial for researchers and practitioners to carefully consider the ethical and societal implications of such technologies as they continue to evolve.

Conclusion

The paper presents a novel AI-based grading system that aims to overcome the subjective and error-prone nature of human grading. By leveraging advanced AI techniques, the system is designed to provide more consistent, reliable, and explainable feedback to students, with the potential to revolutionize educational assessment and reduce the burden on human graders.

While the research shows promising results, it also highlights the need for continued exploration of the ethical and societal implications of automating the grading process. As AI-powered educational tools continue to advance, it will be crucial for researchers, educators, and policymakers to work together to ensure that these technologies are developed and deployed in a way that supports and enhances student learning, rather than undermining it.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Beyond human subjectivity and error: a novel AI grading system

Alexandra Gobrecht, Felix Tuma, Moritz Moller, Thomas Zoller, Mark Zakhvatkin, Alexandra Wuttig, Holger Sommerfeldt, Sven Schutt

The grading of open-ended questions is a high-effort, high-impact task in education. Automating this task promises a significant reduction in workload for education professionals, as well as more consistent grading outcomes for students, by circumventing human subjectivity and error. While recent breakthroughs in AI technology might facilitate such automation, this has not been demonstrated at scale. In this paper, we introduce a novel automatic short answer grading (ASAG) system. The system is based on a fine-tuned open-source transformer model which we trained on a large set of exam data from university courses across a large range of disciplines. We evaluated the trained model's performance against held-out test data in a first experiment and found high accuracy levels across a broad spectrum of unseen questions, even in unseen courses. We further compared the performance of our model with that of certified human domain experts in a second experiment: we first assembled another test dataset from real historical exams. The historic grades contained in that data were awarded to students in a regulated, legally binding examination process; we therefore considered them as ground truth for our experiment. We then asked certified human domain experts and our model to grade the historic student answers again without disclosing the historic grades. Finally, we compared the grades thus obtained with the historic grades (our ground truth). We found that for the courses examined, the model deviated less from the official historic grades than the human re-graders: the model's median absolute error was 44% smaller than the human re-graders', implying that the model is more consistent than humans in grading. These results suggest that leveraging AI-enhanced grading can reduce human subjectivity, improve consistency and thus ultimately increase fairness.

5/8/2024

Auditing an Automatic Grading Model with deep Reinforcement Learning

Aubrey Condor, Zachary Pardos

We explore the use of deep reinforcement learning to audit an automatic short answer grading (ASAG) model. Automatic grading may decrease the time burden of rating open-ended items for educators, but a lack of robust evaluation methods for these models can result in uncertainty about their quality. Current state-of-the-art ASAG models are configured to match human ratings from a training set, and researchers typically assess their quality with accuracy metrics that signify agreement between model and human scores. In this paper, we show that a high level of agreement with human ratings does not give sufficient evidence that an ASAG model is infallible. We train a reinforcement learning agent to revise student responses with the objective of achieving a high rating from an automatic grading model in the fewest revisions. By analyzing the agent's revised responses that achieve a high grade from the ASAG model but would not be considered high-scoring responses according to a scoring rubric, we discover ways in which the automated grader can be exploited, exposing shortcomings in the grading model.

5/14/2024

Towards LLM-based Autograding for Short Textual Answers

Johannes Schneider, Bernd Schenk, Christina Niklaus

Grading exams is an important, labor-intensive, subjective, repetitive, and frequently challenging task. The feasibility of autograding textual responses has greatly increased thanks to the availability of large language models (LLMs) such as ChatGPT and the substantial influx of data brought about by digitalization. However, entrusting AI models with decision-making roles raises ethical considerations, mainly stemming from potential biases and issues related to generating false information. Thus, in this manuscript, we provide an evaluation of a large language model for the purpose of autograding, while also highlighting how LLMs can support educators in validating their grading procedures. Our evaluation is targeted towards automatic short textual answers grading (ASAG), spanning various languages and examinations from two distinct courses. Our findings suggest that while out-of-the-box LLMs provide a valuable tool to provide a complementary perspective, their readiness for independent automated grading remains a work in progress, necessitating human oversight.

7/9/2024

I understand why I got this grade: Automatic Short Answer Grading with Feedback

Dishank Aggarwal, Pushpak Bhattacharyya, Bhaskaran Raman

The demand for efficient and accurate assessment methods has intensified as education systems transition to digital platforms. Providing feedback is essential in educational settings and goes beyond simply conveying marks as it justifies the assigned marks. In this context, we present a significant advancement in automated grading by introducing Engineering Short Answer Feedback (EngSAF) -- a dataset of 5.8k student answers accompanied by reference answers and questions for the Automatic Short Answer Grading (ASAG) task. The EngSAF dataset is meticulously curated to cover a diverse range of subjects, questions, and answer patterns from multiple engineering domains. We leverage state-of-the-art large language models' (LLMs) generative capabilities with our Label-Aware Synthetic Feedback Generation (LASFG) strategy to include feedback in our dataset. This paper underscores the importance of enhanced feedback in practical educational settings, outlines dataset annotation and feedback generation processes, conducts a thorough EngSAF analysis, and provides different LLMs-based zero-shot and finetuned baselines for future comparison. Additionally, we demonstrate the efficiency and effectiveness of the ASAG system through its deployment in a real-world end-semester exam at the Indian Institute of Technology Bombay (IITB), showcasing its practical viability and potential for broader implementation in educational institutions.

7/19/2024