Improving Automated Distractor Generation for Math Multiple-choice Questions with Overgenerate-and-rank

Read original: arXiv:2405.05144 - Published 5/15/2024 by Alexander Scarlatos, Wanyong Feng, Digory Smith, Simon Woodhead, Andrew Lan

Overview

The paper explores a method for automatically generating high-quality distractors (incorrect answer choices) for math multiple-choice questions (MCQs).
MCQs are widely used in math education, but generating effective distractors is challenging, especially at scale.
The proposed "overgenerate-and-rank" approach uses a language model to generate a large set of potential distractors, then trains a ranking model to identify the most plausible ones.
Experiments on real-world data and a human evaluation with math teachers show that ranking brings the generated distractors into closer alignment with human-authored ones, though teachers still prefer the human-authored distractors.

Plain English Explanation

Multiple-choice questions (MCQs) are a common way to test students' math knowledge, as they can be easily graded at a large scale. A key part of MCQs is the "distractors" - the incorrect answer choices that are designed to reflect common student mistakes or misconceptions.

Generating high-quality distractors can be challenging, especially when done automatically using large language models. In this paper, the researchers propose a new method to improve the distractors generated by language models.

Their approach is to "overgenerate" a large number of potential distractors, then use a separate "ranking" model to identify the most plausible ones. This way, the language model can quickly produce a wide range of options, and the ranking model can select the ones that are most likely to fool real students, based on patterns in existing human-created distractors.

The researchers tested this method on real-world data and had math teachers evaluate the results. They found that ranking made the generated distractors align more closely with the human-authored ones, even though the teachers still preferred the human-created options overall. This suggests the automated approach can be a useful tool to help teachers create high-quality MCQs, even if it doesn't fully match human performance.

Technical Explanation

The paper proposes a novel "overgenerate-and-rank" approach to automatically generating high-quality distractors for math multiple-choice questions (MCQs). The key components are:

Distractor Overgeneration: A large language model is used to quickly generate a large pool of potential distractors for a given math question.
Distractor Ranking: A separate ranking model is trained to predict how likely each generated distractor is to be selected by real students, based on patterns in existing human-authored distractors. This allows filtering the overgenerated options to identify the most plausible ones.
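The two components above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the authors' implementation: `overgenerate_and_rank`, `toy_generate`, and `toy_score` are hypothetical names, the toy generator stands in for sampling from a language model, and the toy scorer stands in for a trained ranking model that estimates how likely students are to select each distractor.

```python
from typing import Callable, List


def overgenerate_and_rank(
    question: str,
    correct_answer: str,
    generate: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    n_candidates: int = 10,
    k: int = 3,
) -> List[str]:
    """Sample many candidates, filter them, and keep the k highest-scoring ones."""
    candidates = generate(question, n_candidates)

    # Drop empty strings, duplicates, and anything equal to the correct answer.
    seen = set()
    unique = []
    for candidate in candidates:
        candidate = candidate.strip()
        if candidate and candidate != correct_answer and candidate not in seen:
            seen.add(candidate)
            unique.append(candidate)

    # Rank by the scorer's estimate, highest first, and keep the top k.
    ranked = sorted(unique, key=lambda c: score(question, c), reverse=True)
    return ranked[:k]


# Hypothetical stand-in for an LLM sampler: returns a fixed pool of candidates.
def toy_generate(question: str, n: int) -> List[str]:
    pool = ["7", "12", "1", "7", "43", "12", "0.75", "34"]
    return pool[:n]


# Hypothetical stand-in for a trained ranking model: made-up selection rates.
def toy_score(question: str, distractor: str) -> float:
    selection_rates = {"7": 0.30, "1": 0.05, "43": 0.20, "0.75": 0.10, "34": 0.02}
    return selection_rates.get(distractor, 0.0)


top = overgenerate_and_rank(
    "What is 3 x 4?", "12", toy_generate, toy_score, n_candidates=8, k=3
)
print(top)  # ['7', '43', '0.75']
```

Passing the generator and scorer in as functions keeps the two stages independent, so either one could be swapped out without touching the selection logic.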

The researchers evaluated this approach on a real-world dataset of math MCQs. They found that the ranked distractors generated by their method showed greater alignment with human-authored distractors, as measured by overlap and human evaluation. However, the human-authored distractors were still preferred overall, indicating there is room for improvement in the automated approach.
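One plausible way to quantify overlap with human-authored distractors is the fraction of human-authored distractors that also appear among the generated ones. The sketch below assumes exact string matching after trimming whitespace, and `proportional_overlap` is a hypothetical name; the paper's own alignment metrics may be defined differently.

```python
from typing import List


def proportional_overlap(generated: List[str], human: List[str]) -> float:
    """Fraction of human-authored distractors that also appear in the generated set."""
    human_set = {h.strip() for h in human}
    if not human_set:
        return 0.0
    generated_set = {g.strip() for g in generated}
    return len(human_set & generated_set) / len(human_set)


# Made-up example: two of the three human-authored distractors are recovered.
overlap = proportional_overlap(["7", "43", "0.75"], ["7", "43", "1"])
print(round(overlap, 3))  # 0.667
```

Exact matching is strict for math answers, since equivalent forms such as "0.5" and "1/2" would count as a miss; a real evaluation might normalize answers first.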

The paper also discusses limitations and areas for future work, such as incorporating more explicit math reasoning and using human feedback to iteratively refine the ranking model.

Critical Analysis

The researchers present a promising approach to address the challenge of automatically generating high-quality distractors for math MCQs. The "overgenerate-and-rank" method leverages the strengths of language models while using a separate ranking model to filter the outputs and align them better with human-authored distractors.

One key limitation noted in the paper is that the human-authored distractors are still preferred over the automatically generated ones, even after ranking. This suggests there are aspects of effective distractor design that the current approach does not fully capture. Further research may be needed to better understand the cognitive and pedagogical principles behind high-quality distractors.

Additionally, the paper focuses on generating distractors for math MCQs, but the approach could potentially be extended to other domains, such as reading comprehension or instruction-following tasks. Exploring the broader applicability of the method could be an interesting direction for future work.

Overall, this research represents a meaningful step towards more effective automated generation of MCQ distractors, with potential benefits for math education at scale. Continued refinement and evaluation of the approach could lead to further improvements in this important area.

Conclusion

This paper presents a novel "overgenerate-and-rank" method for automatically generating high-quality distractors for math multiple-choice questions (MCQs). By leveraging language models to quickly produce a wide range of potential distractors, and then using a separate ranking model to identify the most plausible options, the researchers were able to create distractors that more closely aligned with human-authored ones.

While the human-authored distractors were still preferred, the automated approach shows promise as a tool to assist teachers in creating effective MCQs at scale. Further research may be needed to better understand the cognitive and pedagogical factors that contribute to high-quality distractors, and to explore the broader applicability of the overgenerate-and-rank method beyond the math domain.

Overall, this work represents a meaningful step forward in the challenge of automating the generation of high-quality multiple-choice question distractors, which could have important implications for math education and assessment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving Automated Distractor Generation for Math Multiple-choice Questions with Overgenerate-and-rank

Alexander Scarlatos, Wanyong Feng, Digory Smith, Simon Woodhead, Andrew Lan

Multiple-choice questions (MCQs) are commonly used across all levels of math education since they can be deployed and graded at a large scale. A critical component of MCQs is the distractors, i.e., incorrect answers crafted to reflect student errors or misconceptions. Automatically generating them in math MCQs, e.g., with large language models, has been challenging. In this work, we propose a novel method to enhance the quality of generated distractors through overgenerate-and-rank, training a ranking model to predict how likely distractors are to be selected by real students. Experimental results on a real-world dataset and human evaluation with math teachers show that our ranking model increases alignment with human-authored distractors, although human-authored ones are still preferred over generated ones.

5/15/2024

Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models

Wanyong Feng, Jaewook Lee, Hunter McNichols, Alexander Scarlatos, Digory Smith, Simon Woodhead, Nancy Otero Ornelas, Andrew Lan

Multiple-choice questions (MCQs) are ubiquitous in almost all levels of education since they are easy to administer, grade, and are a reliable format in assessments and practices. One of the most important aspects of MCQs is the distractors, i.e., incorrect options that are designed to target common errors or misconceptions among real students. To date, the task of crafting high-quality distractors largely remains a labor and time-intensive process for teachers and learning content designers, which has limited scalability. In this work, we study the task of automated distractor generation in the domain of math MCQs and explore a wide variety of large language model (LLM)-based approaches, from in-context learning to fine-tuning. We conduct extensive experiments using a real-world math MCQ dataset and find that although LLMs can generate some mathematically valid distractors, they are less adept at anticipating common errors or misconceptions among real students.

4/19/2024

Math Multiple Choice Question Generation via Human-Large Language Model Collaboration

Jaewook Lee, Digory Smith, Simon Woodhead, Andrew Lan

Multiple choice questions (MCQs) are a popular method for evaluating students' knowledge due to their efficiency in administration and grading. Crafting high-quality math MCQs is a labor-intensive process that requires educators to formulate precise stems and plausible distractors. Recent advances in large language models (LLMs) have sparked interest in automating MCQ creation, but challenges persist in ensuring mathematical accuracy and addressing student errors. This paper introduces a prototype tool designed to facilitate collaboration between LLMs and educators for streamlining the math MCQ generation process. We conduct a pilot study involving math educators to investigate how the tool can help them simplify the process of crafting high-quality math MCQs. We found that while LLMs can generate well-formulated question stems, their ability to generate distractors that capture common student errors and misconceptions is limited. Nevertheless, a human-AI collaboration has the potential to enhance the efficiency and effectiveness of MCQ generation.

5/3/2024

Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration

Han-Cheng Yu, Yu-An Shih, Kin-Man Law, Kai-Yu Hsieh, Yu-Chen Cheng, Hsin-Chih Ho, Zih-An Lin, Wen-Chuan Hsu, Yao-Chung Fan

In this paper, we tackle the task of distractor generation (DG) for multiple-choice questions. Our study introduces two key designs. First, we propose retrieval augmented pretraining, which involves refining the language model pretraining to align it more closely with the downstream task of DG. Second, we explore the integration of knowledge graphs to enhance the performance of DG. Through experiments with benchmarking datasets, we show that our models significantly outperform the state-of-the-art results. Our best-performing model advances the F1@3 score from 14.80 to 16.47 on the MCQ dataset and from 15.92 to 16.50 on the SciQ dataset.

6/21/2024