Marking: Visual Grading with Highlighting Errors and Annotating Missing Bits

Read original: arXiv:2404.14301 - Published 4/23/2024 by Shashank Sonkar, Naiming Liu, Debshila B. Mallick, Richard G. Baraniuk

Overview

This paper presents a novel approach for visually grading student submissions and providing feedback by highlighting errors and annotating missing content.
The proposed system aims to streamline the grading process and improve the quality of feedback provided to students.
Key elements include visual error highlighting, annotation of missing content, and a user-friendly interface for instructors.

Plain English Explanation

The paper describes a new way for teachers to grade students' work and give feedback using visual cues. The system is designed to make the grading process easier and provide more helpful comments to students.

The main features of the system include:

Automatically highlighting areas in the student's work that have errors or mistakes.
Allowing the teacher to add notes and comments directly onto the student's work to point out what is missing or needs improvement.
Providing a user-friendly interface for the teacher to view the student's work and add their feedback.

The goal is to make the grading process more efficient and to give students clearer, more detailed feedback on their assignments. This could help students better understand where they need to improve and ultimately learn more effectively.

Technical Explanation

The paper introduces a new approach for visually grading student submissions and providing detailed feedback. The system draws on related work, such as "Evaluating Generative Language Models for Information Extraction as a Service" and "Augmenting NER Datasets with LLMs Towards Automated Refined Entity Annotations", to automatically identify errors and missing content in student work.

The key components include:

An interface that displays the student's work with visual highlighting of errors.
Annotation tools that allow instructors to provide feedback by marking up missing elements directly on the submission.
Integration with large language models, in the spirit of "AnnoLLM: Making Large Language Models to be Annotators", to assist with error detection and feedback generation.

The authors evaluated the system through a user study with instructors, who reported that it improved the efficiency and quality of their grading process compared to traditional methods. The visual feedback was seen as especially helpful for identifying and communicating issues in student work.
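
The paper's abstract (reproduced under Related Papers) frames Marking as an extension of Natural Language Inference: the gold answer is the premise, the student response is the hypothesis, and each segment of the response is labeled as entailment, contradiction, or neutral, with omissions from the gold answer reported separately. The following is a minimal sketch of that label mapping and a plain-text rendering of the result. It is not the authors' implementation: the names `Mark`, `MarkedSegment`, `mark_response`, and `render` are hypothetical, the example sentences are invented, and the NLI labels are assigned by hand where a trained model would normally supply them.

```python
from dataclasses import dataclass
from enum import Enum


class Mark(Enum):
    CORRECT = "correct"        # corresponds to NLI entailment
    INCORRECT = "incorrect"    # corresponds to NLI contradiction
    IRRELEVANT = "irrelevant"  # corresponds to NLI neutral


# Mapping from NLI labels to marking categories, following the abstract's framing.
NLI_TO_MARK = {
    "entailment": Mark.CORRECT,
    "contradiction": Mark.INCORRECT,
    "neutral": Mark.IRRELEVANT,
}


@dataclass
class MarkedSegment:
    text: str
    mark: Mark


def mark_response(segments_with_nli):
    """Convert (segment_text, nli_label) pairs into marked segments."""
    return [MarkedSegment(text, NLI_TO_MARK[label]) for text, label in segments_with_nli]


def render(marked, omissions):
    """Render one tagged line per response segment, then one per omitted point."""
    lines = [f"[{seg.mark.value.upper()}] {seg.text}" for seg in marked]
    lines += [f"[MISSING] {point}" for point in omissions]
    return "\n".join(lines)


# Hypothetical example: labels are hand-assigned, standing in for model predictions.
predicted = [
    ("Plants absorb sunlight through chlorophyll.", "entailment"),
    ("They release carbon dioxide as a result.", "contradiction"),
    ("My garden has many plants.", "neutral"),
]
omitted = ["Glucose is produced."]  # gold-answer point absent from the response

marked = mark_response(predicted)
report = render(marked, omitted)
print(report)
```

In a real pipeline, the hand-assigned labels would come from a fine-tuned classifier such as the BERT or RoBERTa models mentioned in the abstract, and the bracketed tags could be replaced by colored highlights in a grading interface.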

Critical Analysis

The paper presents a promising approach to enhancing the grading and feedback process, but there are some potential limitations and areas for further research:

The evaluation was relatively small-scale and focused on instructor usability. Additional studies are needed to assess the long-term impact on student learning and engagement.
The system's reliance on large language models raises concerns about the reliability and biases of the underlying technologies.
Further work is needed to refine the error detection and annotation capabilities to ensure accurate and constructive feedback.

Overall, the proposed visual grading system represents an innovative step toward improving the feedback loop between instructors and students. With continued research and refinement, it could become a valuable tool for enhancing educational outcomes.

Conclusion

This paper introduces a novel approach for visually grading student submissions and providing targeted feedback. By automatically highlighting errors and allowing instructors to annotate missing content, the system aims to streamline the grading process and deliver more effective guidance to students.

The key contributions include the design of a user-friendly interface, the integration of large language models to assist with error detection and feedback generation, and the evaluation of the system's impact on instructor workflows. While further research is needed to address potential limitations, the visual grading approach shows promise for improving learning outcomes and the overall educational experience.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Marking: Visual Grading with Highlighting Errors and Annotating Missing Bits

Shashank Sonkar, Naiming Liu, Debshila B. Mallick, Richard G. Baraniuk

In this paper, we introduce Marking, a novel grading task that enhances automated grading systems by performing an in-depth analysis of student responses and providing students with visual highlights. Unlike traditional systems that provide binary scores, marking identifies and categorizes segments of the student response as correct, incorrect, or irrelevant and detects omissions from gold answers. We introduce a new dataset meticulously curated by Subject Matter Experts specifically for this task. We frame Marking as an extension of the Natural Language Inference (NLI) task, which is extensively explored in the field of Natural Language Processing. The gold answer and the student response play the roles of premise and hypothesis in NLI, respectively. We subsequently train language models to identify entailment, contradiction, and neutrality from student response, akin to NLI, and with the added dimension of identifying omissions from gold answers. Our experimental setup involves the use of transformer models, specifically BERT and RoBERTa, and an intelligent training step using the e-SNLI dataset. We present extensive baseline results highlighting the complexity of the Marking task, which sets a clear trajectory for the upcoming study. Our work not only opens up new avenues for research in AI-powered educational assessment tools, but also provides a valuable benchmark for the AI in education community to engage with and improve upon in the future. The code and dataset can be found at https://github.com/luffycodes/marking.

4/23/2024

VariErr NLI: Separating Annotation Error from Human Label Variation

Leon Weber-Genzel, Siyao Peng, Marie-Catherine de Marneffe, Barbara Plank

Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white. To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation procedure with annotators explaining each label and subsequently judging the validity of label-explanation pairs. VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.

6/7/2024

Beyond human subjectivity and error: a novel AI grading system

Alexandra Gobrecht, Felix Tuma, Moritz Moller, Thomas Zoller, Mark Zakhvatkin, Alexandra Wuttig, Holger Sommerfeldt, Sven Schutt

The grading of open-ended questions is a high-effort, high-impact task in education. Automating this task promises a significant reduction in workload for education professionals, as well as more consistent grading outcomes for students, by circumventing human subjectivity and error. While recent breakthroughs in AI technology might facilitate such automation, this has not been demonstrated at scale. In this paper, we introduce a novel automatic short answer grading (ASAG) system. The system is based on a fine-tuned open-source transformer model which we trained on a large set of exam data from university courses across a large range of disciplines. We evaluated the trained model's performance against held-out test data in a first experiment and found high accuracy levels across a broad spectrum of unseen questions, even in unseen courses. We further compared the performance of our model with that of certified human domain experts in a second experiment: we first assembled another test dataset from real historical exams - the historic grades contained in that data were awarded to students in a regulated, legally binding examination process; we therefore considered them as ground truth for our experiment. We then asked certified human domain experts and our model to grade the historic student answers again without disclosing the historic grades. Finally, we compared the grades thus obtained with the historic grades (our ground truth). We found that for the courses examined, the model deviated less from the official historic grades than the human re-graders - the model's median absolute error was 44% smaller than the human re-graders', implying that the model is more consistent than humans in grading. These results suggest that leveraging AI-enhanced grading can reduce human subjectivity, improve consistency and thus ultimately increase fairness.

5/8/2024

Honest Students from Untrusted Teachers: Learning an Interpretable Question-Answering Pipeline from a Pretrained Language Model

Jacob Eisenstein, Daniel Andor, Bernd Bohnet, Michael Collins, David Mimno

Explainable question answering systems should produce not only accurate answers but also rationales that justify their reasoning and allow humans to check their work. But what sorts of rationales are useful and how can we train systems to produce them? We propose a new style of rationale for open-book question answering, called markup-and-mask, which combines aspects of extractive and free-text explanations. In the markup phase, the passage is augmented with free-text markup that enables each sentence to stand on its own outside the discourse context. In the masking phase, a sub-span of the marked-up passage is selected. To train a system to produce markup-and-mask rationales without annotations, we leverage in-context learning. Specifically, we generate silver annotated data by sending a series of prompts to a frozen pretrained language model, which acts as a teacher. We then fine-tune a smaller student model by training on the subset of rationales that led to correct answers. The student is honest in the sense that it is a pipeline: the rationale acts as a bottleneck between the passage and the answer, while the untrusted teacher operates under no such constraints. Thus, we offer a new way to build trustworthy pipeline systems from a combination of end-task annotations and frozen pretrained language models.

4/26/2024