Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models

Read original: arXiv:2408.10947 - Published 8/21/2024 by Yuyan Chen, Chenwei Wu, Songzhou Yan, Panjun Liu, Haoyu Zhou, Yanghua Xiao

Overview

This paper introduces Dr.Academy, a benchmark for evaluating the questioning capability of large language models in an educational context.
The benchmark consists of a diverse set of questions across different subject areas, designed to assess a model's ability to engage in productive educational dialogues.
The paper presents the benchmark dataset, task setups, and initial experiments with several prominent language models.

Plain English Explanation

The researchers have developed a new benchmark called Dr.Academy to evaluate how well large language models can engage in educational conversations. The benchmark includes a collection of questions across various subjects, designed to test a model's ability to ask relevant and insightful questions that could help a student learn.

The goal is to see how capable these powerful language models are at adapting to an educational context and supporting interactive learning, beyond just generating responses. This could be an important step in leveraging language models for educational purposes.

The researchers have tested several prominent language models on the Dr.Academy benchmark and provide initial insights into their strengths and limitations in this domain. This work contributes to our understanding of how large language models can be applied in educational settings.

Technical Explanation

The paper introduces the Dr.Academy benchmark, which evaluates the questioning capability of large language models in an educational context. According to the paper's abstract, the models are prompted to generate educational questions, and those questions are assessed using Anderson and Krathwohl's taxonomy across general, monodisciplinary, and interdisciplinary domains.

The questions are structured to assess a model's ability to engage in productive educational dialogues, such as asking clarifying questions, probing for deeper understanding, or suggesting related topics for exploration. The researchers have created several task setups, including open-ended question generation, multiple-choice question answering, and interactive question-answering.
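As a rough illustration of how question generation could be tied to cognitive levels, the sketch below builds one prompt per level of Anderson and Krathwohl's taxonomy, which the paper's abstract names as its framework. The six level names are the standard ones from that taxonomy; the `QuestionRequest` structure, the `build_prompt` function, and the prompt wording are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass

# Cognitive levels of Anderson and Krathwohl's revised taxonomy, lowest to highest.
TAXONOMY_LEVELS = ("remember", "understand", "apply", "analyze", "evaluate", "create")


@dataclass(frozen=True)
class QuestionRequest:
    """One question-generation request (hypothetical structure, not from the paper)."""
    context: str  # source passage the question should be about
    level: str    # one of TAXONOMY_LEVELS


def build_prompt(request: QuestionRequest) -> str:
    """Render a request as an instruction for an LLM. The wording is an assumption."""
    if request.level not in TAXONOMY_LEVELS:
        raise ValueError(f"unknown taxonomy level: {request.level!r}")
    return (
        "You are a teacher. Read the passage below and write one question "
        f"that targets the '{request.level}' level of Anderson and Krathwohl's taxonomy.\n\n"
        f"Passage:\n{request.context}"
    )


def prompts_for_all_levels(context: str) -> list[str]:
    """Build one prompt per taxonomy level for the same passage."""
    return [build_prompt(QuestionRequest(context, level)) for level in TAXONOMY_LEVELS]
```

Sending each prompt to a model and collecting the outputs would give one candidate question per cognitive level for a passage, which is one plausible way to probe whether a model can vary the depth of its questions.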

The paper presents the dataset creation process, including the curation of questions from various educational resources and the incorporation of different levels of difficulty and cognitive complexity. To assess the educational quality of the generated questions, the researchers apply four metrics: relevance, coverage, representativeness, and consistency.
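The paper's abstract names four metrics for judging generated questions: relevance, coverage, representativeness, and consistency. The sketch below shows one way per-question scores on those four metrics could be stored and averaged. The `QuestionScores` class, the [0, 1] score range, and the equal-weight average are assumptions made for illustration; the source does not say how the paper scales or combines the metrics.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class QuestionScores:
    """Scores for one generated question (hypothetical container, assumed [0, 1] range)."""
    relevance: float
    coverage: float
    representativeness: float
    consistency: float

    def __post_init__(self) -> None:
        # Reject scores outside the assumed range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def overall(self) -> float:
        """Equal-weight average of the four metrics (an assumption, not the paper's formula)."""
        return mean(getattr(self, f.name) for f in fields(self))


def average_scores(scores: list[QuestionScores]) -> dict[str, float]:
    """Average each metric over a set of scored questions."""
    if not scores:
        raise ValueError("need at least one scored question")
    return {
        f.name: mean(getattr(s, f.name) for s in scores)
        for f in fields(QuestionScores)
    }
```

Keeping the four metrics as separate fields, rather than collapsing them into a single number early, makes it possible to see which aspect of question quality a model is weak on.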

The paper reports the results of initial experiments with several prominent language models. According to its abstract, GPT-4 demonstrates significant potential in teaching general, humanities, and science courses, while Claude2 appears more apt as an interdisciplinary teacher, and the automatic scores align with human judgments. The findings highlight both the strengths and limitations of these models in the educational context.
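A per-domain comparison of this kind can be sketched as a small lookup that picks the top-scoring model in each domain. The function name `best_model_per_domain` and the nested-dictionary layout are assumptions for this example, and the model names and numbers in the usage note are placeholders, not results from the paper.

```python
def best_model_per_domain(results: dict[str, dict[str, float]]) -> dict[str, str]:
    """Map each domain to its highest-scoring model.

    `results[model][domain]` is that model's score in that domain.
    On a tie, the model that appears first in `results` is kept.
    """
    best: dict[str, str] = {}
    for model, by_domain in results.items():
        for domain, score in by_domain.items():
            current = best.get(domain)
            if current is None or score > results[current][domain]:
                best[domain] = model
    return best
```

For example, with placeholder scores `{"model_a": {"general": 0.8, "interdisciplinary": 0.6}, "model_b": {"general": 0.7, "interdisciplinary": 0.9}}`, the function maps `"general"` to `"model_a"` and `"interdisciplinary"` to `"model_b"`.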

Critical Analysis

The Dr.Academy benchmark is a valuable contribution to the field of educational technology, as it provides a standardized way to evaluate the capabilities of large language models in engaging with students and supporting interactive learning. The diverse set of questions and the incorporation of different task setups allow for a comprehensive assessment of a model's questioning abilities.

However, the paper acknowledges several limitations and areas for further research. For example, the benchmark is currently limited to English-language questions, and there is a need to explore its applicability in other languages and cultural contexts. Additionally, the evaluation metrics used in the initial experiments may not capture all aspects of effective educational dialogue, and further refinement of the assessment criteria could be beneficial.

Another potential limitation is the reliance on existing language models, which may not be specifically designed or fine-tuned for educational applications. Exploring the development of models that are tailored for educational tasks could lead to significant improvements in their performance on the Dr.Academy benchmark.

Moreover, the paper does not address the potential ethical and societal implications of using large language models in educational settings, such as issues related to bias, privacy, and fairness. As these models become more prevalent in educational technologies, it is crucial to take these issues into account.

Conclusion

The Dr.Academy benchmark is a valuable contribution to the field of educational technology, as it provides a standardized way to evaluate the questioning capability of large language models in an educational context. The diverse set of questions and the incorporation of different task setups allow for a comprehensive assessment of a model's ability to engage in productive educational dialogues.

The initial experiments with prominent language models highlight both the strengths and limitations of these models in the educational domain, and the findings can inform the development of more effective educational technologies leveraging large language models. As the field continues to evolve, it will be essential to address the limitations and potential ethical considerations identified in the paper, in order to ensure that these technologies are used responsibly and equitably to support student learning and growth.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models

Yuyan Chen, Chenwei Wu, Songzhou Yan, Panjun Liu, Haoyu Zhou, Yanghua Xiao

Teachers are important to imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs' capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in teaching remains largely unexplored. In teaching, questioning is a key skill that guides students to analyze, evaluate, and synthesize core concepts and principles. Therefore, our research introduces a benchmark to evaluate the questioning capability in education as a teacher of LLMs through evaluating their generated educational questions, utilizing Anderson and Krathwohl's taxonomy across general, monodisciplinary, and interdisciplinary domains. We shift the focus from LLMs as learners to LLMs as educators, assessing their teaching capability through guiding them to generate questions. We apply four metrics, including relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs' outputs. Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses; Claude2 appears more apt as an interdisciplinary teacher. Furthermore, the automatic scores align with human perspectives.

8/21/2024

Comparison of Large Language Models for Generating Contextually Relevant Questions

Ivo Lodovico Molina, Valdemar Švábenský, Tsubasa Minematsu, Li Chen, Fumiya Okubo, Atsushi Shimada

This study explores the effectiveness of Large Language Models (LLMs) for Automatic Question Generation in educational settings. Three LLMs are compared in their ability to create questions from university slide text without fine-tuning. Questions were obtained in a two-step pipeline: first, answer phrases were extracted from slides using Llama 2-Chat 13B; then, the three models generated questions for each answer. To analyze whether the questions would be suitable in educational applications for students, a survey was conducted with 46 students who evaluated a total of 246 questions across five metrics: clarity, relevance, difficulty, slide relation, and question-answer alignment. Results indicate that GPT-3.5 and Llama 2-Chat 13B outperform Flan T5 XXL by a small margin, particularly in terms of clarity and question-answer alignment. GPT-3.5 especially excels at tailoring questions to match the input answers. The contribution of this research is the analysis of the capacity of LLMs for Automatic Question Generation in education.

9/17/2024

Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education

Owen Henkel, Adam Boxer, Libby Hills, Bill Roberts

This paper reports on a series of experiments with a novel dataset evaluating how well Large Language Models (LLMs) can mark (i.e. grade) open text responses to short answer questions. Specifically, we explore how well different combinations of GPT version and prompt engineering strategies performed at marking real student answers to short answer questions across different domain areas (Science and History) and grade levels (spanning ages 5-16) using a new, never-used-before dataset from Carousel, a quizzing platform. We found that GPT-4 with basic few-shot prompting performed well (Kappa, 0.70) and, importantly, very close to human-level performance (0.75). This research builds on prior findings that GPT-4 could reliably score short answer reading comprehension questions at a performance level very close to that of expert human raters. The proximity to human-level performance across a variety of subjects and grade levels suggests that LLMs could be a valuable tool for supporting low-stakes formative assessment tasks in K-12 education and has important implications for real-world education delivery.

5/7/2024

Exploring the Capabilities of Prompted Large Language Models in Educational and Assessment Applications

Subhankar Maity, Aniket Deroy, Sudeshna Sarkar

In the era of generative artificial intelligence (AI), the fusion of large language models (LLMs) offers unprecedented opportunities for innovation in the field of modern education. We embark on an exploration of prompted LLMs within the context of educational and assessment applications to uncover their potential. Through a series of carefully crafted research questions, we investigate the effectiveness of prompt-based techniques in generating open-ended questions from school-level textbooks, assess their efficiency in generating open-ended questions from undergraduate-level technical textbooks, and explore the feasibility of employing a chain-of-thought inspired multi-stage prompting approach for language-agnostic multiple-choice question (MCQ) generation. Additionally, we evaluate the ability of prompted LLMs for language learning, exemplified through a case study in the low-resource Indian language Bengali, to explain Bengali grammatical errors. We also evaluate the potential of prompted LLMs to assess human resource (HR) spoken interview transcripts. By juxtaposing the capabilities of LLMs with those of human experts across various educational tasks and domains, our aim is to shed light on the potential and limitations of LLMs in reshaping educational practices.

5/21/2024