UnibucLLM: Harnessing LLMs for Automated Prediction of Item Difficulty and Response Time for Multiple-Choice Questions

2404.13343

Published 4/23/2024 by Ana-Cristina Rogoz, Radu Tudor Ionescu

Abstract

This work explores a novel data augmentation method based on Large Language Models (LLMs) for predicting item difficulty and response time of retired USMLE Multiple-Choice Questions (MCQs) in the BEA 2024 Shared Task. Our approach is based on augmenting the dataset with answers from zero-shot LLMs (Falcon, Meditron, Mistral) and employing transformer-based models based on six alternative feature combinations. The results suggest that predicting the difficulty of questions is more challenging. Notably, our top performing methods consistently include the question text, and benefit from the variability of LLM answers, highlighting the potential of LLMs for improving automated assessment in medical licensing exams. We make our code available at https://github.com/ana-rogoz/BEA-2024.

Overview

This paper introduces UnibucLLM, an approach that uses large language models (LLMs) to help predict the difficulty and response time of multiple-choice questions.
The approach augments a dataset of retired USMLE questions from the BEA 2024 Shared Task with answers from zero-shot LLMs (Falcon, Meditron, Mistral), then trains transformer-based models on six alternative feature combinations.
Predictions of this kind could be useful for building adaptive assessments that adjust the difficulty level based on a student's performance.

Plain English Explanation

The researchers developed an approach called UnibucLLM that looks at multiple-choice questions and predicts how difficult they are and how long it would take a test-taker to answer them. To do this, they first ask several language models to answer each question, add those answers to the dataset, and then train prediction models on the question text together with the added answers.

The idea is that this could help create more personalized and adaptive tests, where the difficulty of the questions adjusts based on how the student is performing. For example, if the student is getting questions right quickly, the test could start presenting harder questions. And if the student is struggling, it could give easier questions.

This kind of adaptive testing could be more engaging for students and provide a better assessment of their knowledge, compared to a one-size-fits-all test. The UnibucLLM model tries to automate this process by using powerful language models to analyze the questions and make difficulty and time predictions.
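
The adaptive behaviour described above can be sketched as a simple selection rule. This is a minimal illustration, not something taken from the paper: it assumes each question already carries a predicted difficulty on a 0 to 1 scale and a predicted response time in seconds, and the names `Question`, `next_target_difficulty`, `pick_next` and the step size of 0.1 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Question:
    qid: str
    predicted_difficulty: float   # assumed scale: 0.0 (easy) to 1.0 (hard)
    predicted_seconds: float      # assumed predicted response time


def next_target_difficulty(current: float, correct: bool, seconds_taken: float,
                           expected_seconds: float, step: float = 0.1) -> float:
    """Raise the target after a quick correct answer, lower it after a miss."""
    if correct and seconds_taken <= expected_seconds:
        target = current + step
    elif not correct:
        target = current - step
    else:
        target = current          # correct but slow: stay at the same level
    # Keep the target inside the assumed 0..1 difficulty scale.
    return min(1.0, max(0.0, target))


def pick_next(pool: list[Question], target: float) -> Question:
    """Choose the question in the pool whose predicted difficulty is closest to the target."""
    if not pool:
        raise ValueError("question pool is empty")
    return min(pool, key=lambda q: abs(q.predicted_difficulty - target))
```

A caller would track the current target, update it after each response with `next_target_difficulty`, and pass the remaining questions to `pick_next`. Real adaptive tests typically rely on calibrated psychometric models rather than a fixed step, so this rule is only meant to make the idea concrete.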

Technical Explanation

The paper introduces UnibucLLM, an approach that leverages large language models (LLMs) to predict the difficulty and response time of multiple-choice questions. According to the abstract, the dataset of retired USMLE questions from the BEA 2024 Shared Task is augmented with answers produced by zero-shot LLMs (Falcon, Meditron, Mistral), and transformer-based models are then trained on six alternative feature combinations.

The reported results suggest that predicting item difficulty is more challenging than predicting response time. The top-performing methods consistently include the question text and benefit from the variability of the LLM answers.
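
As a rough illustration of the augmentation idea in the abstract, the sketch below joins a question's text with answers from several zero-shot LLMs into alternative input strings, and computes a simple agreement score as one possible proxy for answer variability. The field names, the three combinations shown, the `[SEP]` separator and the agreement measure are all assumptions made for illustration; the paper's six actual feature combinations are not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Item:
    """One multiple-choice question plus hypothetical LLM answers."""
    question: str
    options: dict[str, str]                                    # option key -> option text
    llm_answers: dict[str, str] = field(default_factory=dict)  # model name -> chosen option key


def build_inputs(item: Item, sep: str = " [SEP] ") -> dict[str, str]:
    """Return a few illustrative feature combinations as plain strings."""
    options_text = " ".join(f"({k}) {v}" for k, v in sorted(item.options.items()))
    answers_text = " ".join(
        f"{model}: {answer}" for model, answer in sorted(item.llm_answers.items())
    )
    return {
        "question": item.question,
        "question+options": item.question + sep + options_text,
        "question+options+llm_answers": item.question + sep + options_text + sep + answers_text,
    }


def answer_agreement(item: Item) -> float:
    """Fraction of LLMs that picked the most common answer (1.0 means full agreement)."""
    if not item.llm_answers:
        return 0.0
    counts = Counter(item.llm_answers.values())
    return counts.most_common(1)[0][1] / len(item.llm_answers)
```

Each string returned by `build_inputs` could be tokenized and passed to a transformer regressor for difficulty or response time. Comparing models trained on the different keys is one way to see whether the LLM answers add signal beyond the question text.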

The model's outputs could be used to create adaptive assessments that adjust question difficulty based on a student's performance, potentially improving the assessment experience and providing more accurate measurements of student knowledge.

Critical Analysis

The paper provides a promising approach for automating the prediction of multiple-choice question difficulty and response time using LLMs. However, a few potential limitations are worth noting:

The model was trained and evaluated on a specific dataset of questions, so its generalizability to other domains or question types is unclear. Further testing on more diverse datasets would be needed.
The paper does not address potential biases that could be present in the training data, which could lead to biased predictions for certain demographic groups.
While the adaptive assessment concept is compelling, the paper does not provide empirical evidence of the real-world impact on student learning or assessment quality.

Overall, the UnibucLLM model represents an interesting step towards automating multiple-choice question analysis, but additional research is needed to fully understand its limitations and potential benefits in educational settings.

Conclusion

This paper introduces the UnibucLLM model, which uses large language models to automatically predict the difficulty and response time of multiple-choice questions. The model could enable the creation of more adaptive assessments that adjust question difficulty based on student performance, potentially improving the assessment experience and providing better insights into student knowledge.

While the approach seems promising, further research is needed to address potential biases and generalizability issues. Overall, UnibucLLM represents an interesting application of large language models in educational technology, with implications for the design of more personalized and effective assessments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain

Burcu Sayin, Pasquale Minervini, Jacopo Staiano, Andrea Passerini

We explore the potential of Large Language Models (LLMs) to assist and potentially correct physicians in medical decision-making tasks. We evaluate several LLMs, including Meditron, Llama2, and Mistral, to analyze the ability of these models to interact effectively with physicians across different scenarios. We consider questions from PubMedQA and several tasks, ranging from binary (yes/no) responses to long answer generation, where the answer of the model is produced after an interaction with a physician. Our findings suggest that prompt design significantly influences the downstream accuracy of LLMs and that LLMs can provide valuable feedback to physicians, challenging incorrect diagnoses and contributing to more accurate decision-making. For example, when the physician is accurate 38% of the time, Mistral can produce the correct answer, improving accuracy up to 74% depending on the prompt being used, while Llama2 and Meditron models exhibit greater sensitivity to prompt choice. Our analysis also uncovers the challenges of ensuring that LLM-generated suggestions are pertinent and useful, emphasizing the need for further research in this area.

5/7/2024

cs.CL cs.AI

Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data

Maxime Griot, Jean Vanderdonckt, Demet Yuksel, Coralie Hemptinne

Large Language Models (LLMs) like ChatGPT demonstrate significant potential in the medical field, often evaluated using multiple-choice questions (MCQs) similar to those found on the USMLE. Despite their prevalence in medical education, MCQs have limitations that might be exacerbated when assessing LLMs. To evaluate the effectiveness of MCQs in assessing the performance of LLMs, we developed a fictional medical benchmark focused on a non-existent gland, the Glianorex. This approach allowed us to isolate the knowledge of the LLM from its test-taking abilities. We used GPT-4 to generate a comprehensive textbook on the Glianorex in both English and French and developed corresponding multiple-choice questions in both languages. We evaluated various open-source, proprietary, and domain-specific LLMs using these questions in a zero-shot setting. The models achieved average scores around 67%, with minor performance differences between larger and smaller models. Performance was slightly higher in English than in French. Fine-tuned medical models showed some improvement over their base versions in English but not in French. The uniformly high performance across models suggests that traditional MCQ-based benchmarks may not accurately measure LLMs' clinical knowledge and reasoning abilities, instead highlighting their pattern recognition skills. This study underscores the need for more robust evaluation methods to better assess the true capabilities of LLMs in medical contexts.

6/5/2024

cs.CL cs.AI cs.LG

Math Multiple Choice Question Generation via Human-Large Language Model Collaboration

Jaewook Lee, Digory Smith, Simon Woodhead, Andrew Lan

Multiple choice questions (MCQs) are a popular method for evaluating students' knowledge due to their efficiency in administration and grading. Crafting high-quality math MCQs is a labor-intensive process that requires educators to formulate precise stems and plausible distractors. Recent advances in large language models (LLMs) have sparked interest in automating MCQ creation, but challenges persist in ensuring mathematical accuracy and addressing student errors. This paper introduces a prototype tool designed to facilitate collaboration between LLMs and educators for streamlining the math MCQ generation process. We conduct a pilot study involving math educators to investigate how the tool can help them simplify the process of crafting high-quality math MCQs. We found that while LLMs can generate well-formulated question stems, their ability to generate distractors that capture common student errors and misconceptions is limited. Nevertheless, a human-AI collaboration has the potential to enhance the efficiency and effectiveness of MCQ generation.

5/3/2024

cs.CL

Can multiple-choice questions really be useful in detecting the abilities of LLMs?

Wangyue Li, Liangzhi Li, Tong Xiang, Xiao Liu, Wei Deng, Noa Garcia

Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs) due to their simplicity and efficiency. However, there are concerns about whether MCQs can truly measure LLM's capabilities, particularly in knowledge-intensive scenarios where long-form generation (LFG) answers are required. The misalignment between the task and the evaluation method demands a thoughtful analysis of MCQ's efficacy, which we undertake in this paper by evaluating nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English. We identify a significant issue: LLMs exhibit an order sensitivity in bilingual MCQs, favoring answers located at specific positions, i.e., the first position. We further quantify the gap between MCQs and long-form generation questions (LFGQs) by comparing their direct outputs, token logits, and embeddings. Our results reveal a relatively low correlation between answers from MCQs and LFGQs for identical questions. Additionally, we propose two methods to quantify the consistency and confidence of LLMs' output, which can be generalized to other QA evaluation benchmarks. Notably, our analysis challenges the idea that the higher the consistency, the greater the accuracy. We also find MCQs to be less reliable than LFGQs in terms of expected calibration error. Finally, the misalignment between MCQs and LFGQs is not only reflected in the evaluation performance but also in the embedding space. Our code and models can be accessed at https://github.com/Meetyou-AI-Lab/Can-MC-Evaluate-LLMs.

5/24/2024

cs.CL