AI-Assisted Generation of Difficult Math Questions

Read original: arXiv:2407.21009 - Published 9/4/2024 by Vedant Shah, Dingli Yu, Kaifeng Lyu, Simon Park, Nan Rosemary Ke, Michael Mozer, Yoshua Bengio, Sanjeev Arora, Anirudh Goyal

Overview

The paper presents a framework that combines the strengths of large language models (LLMs) and human input to generate a diverse array of challenging math questions.
The approach uses an LLM's metacognitive skills to extract core math skills from existing datasets, then generates novel questions by prompting the LLM with random pairs of these skills.
The resulting dataset, MATH^2, is harder than the original MATH dataset: all evaluated models score lower on MATH^2 than on MATH, and models score higher on MATH when MATH^2 questions are used as in-context examples.
The authors suggest this methodology could be applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight.

Plain English Explanation

The researchers recognized that while LLMs have strong mathematical reasoning capabilities, there is still a need for more diverse and challenging math questions to further test and develop these skills. Relying solely on human experts to generate these questions is time-consuming and costly.

To address this, the researchers developed a framework that combines the strengths of LLMs and human input. First, they leverage the LLM's metacognitive abilities to extract core math skills from existing datasets. These core skills then serve as the basis for generating novel and difficult math questions by prompting the LLM to combine random pairs of these skills.
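The skill-pairing step can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the authors' implementation: the skill names, the `sample_skill_pairs` and `build_prompt` helpers, and the prompt wording are all hypothetical, and the source does not say how pairs are sampled beyond calling them random.

```python
import itertools
import random


def sample_skill_pairs(skills, n_pairs, seed=0):
    """Draw n_pairs distinct unordered pairs of two different skills at random."""
    # Enumerate every unordered pair of distinct skills once.
    all_pairs = list(itertools.combinations(sorted(set(skills)), 2))
    if n_pairs > len(all_pairs):
        raise ValueError("requested more pairs than exist")
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return rng.sample(all_pairs, n_pairs)


def build_prompt(skill_a, skill_b):
    """Hypothetical prompt text; the paper's actual prompts are not shown in this summary."""
    return (
        "Write one challenging math question whose solution requires "
        f"both of these skills: {skill_a} and {skill_b}. "
        "Then give a complete step-by-step solution."
    )


# Illustrative skill names only (assumed, not taken from the paper).
skills = [
    "modular_arithmetic",
    "triangle_geometry",
    "counting_principles",
    "polynomial_factoring",
]
pairs = sample_skill_pairs(skills, n_pairs=3, seed=42)
for skill_a, skill_b in pairs:
    print(build_prompt(skill_a, skill_b))
```

Sampling from the set of unordered pairs guarantees that every prompt names two different skills and that no pair is repeated, which is one plausible way to keep the generated questions varied.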

Because each question must combine two different skills, writing such questions is an "out of distribution" task for both LLMs and humans: neither is used to producing questions of this kind. The framework uses an iterative process in which the LLM generates and refines the questions and solutions, and human annotators then verify and further improve them.
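The generate-and-refine loop described above might be organized along the following lines. This is only a sketch under assumptions: `Candidate`, `refine`, and the toy `critique`/`revise` functions are hypothetical stand-ins for LLM calls, and the round limit is an arbitrary choice, not a detail from the paper.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    question: str
    solution: str


def refine(
    candidate: Candidate,
    critique: Callable[[Candidate], Optional[str]],
    revise: Callable[[Candidate, str], Candidate],
    max_rounds: int = 3,
) -> Tuple[Candidate, List[str], bool]:
    """Alternate critique and revision until no issue is reported or rounds run out."""
    feedback_log: List[str] = []
    for _ in range(max_rounds):
        feedback = critique(candidate)  # None means "no issue found"
        if feedback is None:
            break
        feedback_log.append(feedback)
        candidate = revise(candidate, feedback)
    # Flag candidates that still have an open issue so a human annotator sees it.
    unresolved = critique(candidate) is not None
    return candidate, feedback_log, unresolved


# Toy stand-ins for the LLM calls (assumptions for illustration only).
def toy_critique(c: Candidate) -> Optional[str]:
    if "Justify" in c.question:
        return None
    return "Question does not ask for justification."


def toy_revise(c: Candidate, feedback: str) -> Candidate:
    return replace(c, question=c.question + " Justify your answer.")


draft = Candidate(question="Find x.", solution="x = 2")
final, log, unresolved = refine(draft, toy_critique, toy_revise)
print(final.question, log, unresolved)
```

In this sketch every candidate would still go to a human annotator afterwards; the feedback log and the `unresolved` flag simply give the annotator context on what the automated rounds changed.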

Applying this approach to the MATH dataset produced a new dataset called MATH^2, which contains higher-quality math questions. Two results support this: AI models perform worse on MATH^2 than on the original MATH dataset, and they perform better on MATH when MATH^2 questions are used as in-context examples.

The researchers believe this methodology could be applicable to other domains beyond mathematics that require structured reasoning, and it could also be used as part of a system for scalable oversight of AI systems.

Technical Explanation

The key elements of the research paper are as follows:

Motivation: Current LLM training treats mathematical reasoning as a core capability, which creates demand for diverse and challenging math questions. Relying solely on human experts is time-consuming and costly, while LLM-generated questions often lack the required diversity and difficulty.
Approach: The researchers developed a framework that combines the strengths of LLMs and human input. They leverage the LLM's metacognition skills to extract core math skills from existing datasets, such as the MATH dataset. These core skills serve as the basis for generating novel and difficult math questions by prompting the LLM with random pairs of skills.
Pipeline: The framework employs an iterative process where the LLM generates and refines the questions and solutions through multi-turn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via additional LLM interactions.
Evaluation: Applying this pipeline to the MATH dataset resulted in the MATH^2 dataset, which contains higher-quality math questions. This is evidenced by (a) lower performance of all models on MATH^2 than on MATH, and (b) higher performance on MATH when MATH^2 questions are used as in-context examples.
Insights: The researchers observed a striking relationship between models' performance on the two datasets: the success rate on MATH^2 is the square of the success rate on MATH, suggesting that successfully solving MATH^2 questions requires a nontrivial combination of two distinct math skills.
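The squared relationship has a simple probabilistic reading, sketched below. The reading is an assumption offered for intuition rather than a derivation from the paper: if a model applies each of two required skills correctly with probability p, independently, it solves the combined question with probability p × p. The accuracy values in the loop are made-up inputs, not reported results.

```python
def predicted_two_skill_accuracy(single_skill_accuracy: float) -> float:
    """Predicted success rate when two independent skills must both succeed."""
    if not 0.0 <= single_skill_accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return single_skill_accuracy ** 2


# Hypothetical MATH accuracies, chosen only to show the shape of the curve.
for acc in (0.9, 0.7, 0.5):
    predicted = predicted_two_skill_accuracy(acc)
    print(f"MATH accuracy {acc:.2f} -> predicted MATH^2 accuracy {predicted:.2f}")
```

Under this reading, the gap between the two datasets widens as a model's MATH accuracy falls: a drop from 0.9 to 0.5 on MATH corresponds to a drop from 0.81 to 0.25 in the predicted MATH^2 score.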

Critical Analysis

The researchers acknowledge that their methodology is focused on mathematics, but they believe it could be applicable to other domains requiring structured reasoning. However, they do not provide concrete examples or details on how it might be applied to other areas.

Additionally, the paper does not discuss potential limitations or caveats of the approach, such as the scalability of the human-in-the-loop component or the generalizability of the findings to other types of math skills or datasets.

It would be valuable for the researchers to explore the robustness of their approach by testing it on a wider range of math datasets or skill types, and to consider potential biases or shortcomings that could arise from the LLM-based skill extraction and question generation processes.

Conclusion

The research presents a promising framework that combines the strengths of LLMs and human input to generate a diverse array of challenging math questions. The resulting MATH^2 dataset exhibits higher difficulty compared to the original MATH dataset, suggesting the potential of this approach to push the boundaries of LLM mathematical reasoning capabilities.

While the focus is on mathematics, the researchers believe this methodology could be applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight systems for AI. Further exploration of the approach's limitations and generalizability could help solidify its impact and inform future research in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

AI-Assisted Generation of Difficult Math Questions

Vedant Shah, Dingli Yu, Kaifeng Lyu, Simon Park, Nan Rosemary Ke, Michael Mozer, Yoshua Bengio, Sanjeev Arora, Anirudh Goyal

Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. We leverage LLM metacognition skills [Didolkar et al., 2024] of a strong LLM to extract core skills from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. The use of two different skills within each question makes finding such questions an out of distribution task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multiturn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via further LLM interactions. Applying this pipeline on skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH$^2$ - a dataset of higher-quality math questions, as evidenced by: (a) Lower performance of all models on MATH$^2$ than on MATH (b) Higher performance on MATH when using MATH$^2$ questions as in-context examples. Although focused on mathematics, our methodology seems applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight. Also of interest is a striking relationship observed between models' performance on the new dataset: the success rate on MATH$^2$ is the square on MATH, suggesting that successfully solving the question in MATH$^2$ requires a nontrivial combination of two distinct math skills.

9/4/2024

Caught in the Quicksand of Reasoning, Far from AGI Summit: Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

Pengfei Hong, Navonil Majumder, Deepanway Ghosal, Somak Aditya, Rada Mihalcea, Soujanya Poria

Recent advancements in Large Language Models (LLMs) have showcased striking results on existing logical reasoning benchmarks, with some models even surpassing human performance. However, the true depth of their competencies and robustness in reasoning tasks remains an open question. To this end, in this paper, we focus on two popular reasoning tasks: arithmetic reasoning and code generation. Particularly, we introduce: (i) a general ontology of perturbations for maths and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets, MORE and CORE, respectively, of perturbed maths and coding problems to probe the limits of LLM capabilities in numeric reasoning and coding tasks. Through comprehensive evaluations of both closed-source and open-source LLMs, we show a significant performance drop across all the models against the perturbed questions, suggesting that the current LLMs lack robust problem solving skills and structured reasoning abilities in many areas, as defined by our ontology. We open source the datasets and source codes at: https://github.com/declare-lab/llm_robustness.

6/28/2024

Math Multiple Choice Question Generation via Human-Large Language Model Collaboration

Jaewook Lee, Digory Smith, Simon Woodhead, Andrew Lan

Multiple choice questions (MCQs) are a popular method for evaluating students' knowledge due to their efficiency in administration and grading. Crafting high-quality math MCQs is a labor-intensive process that requires educators to formulate precise stems and plausible distractors. Recent advances in large language models (LLMs) have sparked interest in automating MCQ creation, but challenges persist in ensuring mathematical accuracy and addressing student errors. This paper introduces a prototype tool designed to facilitate collaboration between LLMs and educators for streamlining the math MCQ generation process. We conduct a pilot study involving math educators to investigate how the tool can help them simplify the process of crafting high-quality math MCQs. We found that while LLMs can generate well-formulated question stems, their ability to generate distractors that capture common student errors and misconceptions is limited. Nevertheless, a human-AI collaboration has the potential to enhance the efficiency and effectiveness of MCQ generation.

5/3/2024

Adversarial Math Word Problem Generation

Roy Xie, Chengxuan Huang, Junlin Wang, Bhuwan Dhingra

Large language models (LLMs) have significantly transformed the educational landscape. As current plagiarism detection tools struggle to keep pace with LLMs' rapid advancements, the educational community faces the challenge of assessing students' true problem-solving abilities in the presence of LLMs. In this work, we explore a new paradigm for ensuring fair evaluation -- generating adversarial examples which preserve the structure and difficulty of the original questions aimed for assessment, but are unsolvable by LLMs. Focusing on the domain of math word problems, we leverage abstract syntax trees to structurally generate adversarial examples that cause LLMs to produce incorrect answers by simply editing the numeric values in the problems. We conduct experiments on various open- and closed-source LLMs, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability. We identify shared vulnerabilities among LLMs and propose a cost-effective approach to attack high-cost models. Additionally, we conduct automatic analysis to investigate the cause of failure, providing further insights into the limitations of LLMs.

6/18/2024