Research on the Application of Large Language Models in Automatic Question Generation: A Case Study of ChatGLM in the Context of High School Information Technology Curriculum

Read original: arXiv:2408.11539 - Published 8/22/2024 by Yanxin Chen, Ling He

Overview

This study explores the use of the large language model (LLM) ChatGLM for automatically generating high school information technology exam questions.
Researchers designed prompts to guide ChatGLM in producing diverse questions, which were then evaluated by domain experts.
The evaluation focused on factors like alignment with teaching content, embodiment of core competencies, clarity of questions, and teachers' willingness to use them.

Plain English Explanation

The researchers wanted to see how well the ChatGLM language model could generate exam questions for high school information technology classes. They used special prompts to get ChatGLM to create a variety of questions, and then had experts review the questions.

The experts looked at how well the questions matched the course material, how well they tested key skills, how clear the questions were, and whether teachers would be willing to use them in their classes. The ChatGLM-generated questions were rated clearer than human-written questions, and teachers were more willing to use them. There was no significant difference, however, in how well the two sets of questions matched the course content or tested the target skills.

This suggests that ChatGLM could make exam question generation more efficient and reduce the workload for teachers. The researchers suggest that future work further optimize the model so that it keeps the fit with course content and core competencies high while improving question clarity and teachers' willingness to use the questions.

Technical Explanation

The researchers used prompt engineering strategies to guide the ChatGLM language model in generating diverse exam questions for high school information technology courses. They then had domain experts evaluate the questions across four dimensions:

Hitting: The degree of alignment between the questions and the teaching content.
Fitting: How well the questions embodied the core competencies being tested.
Clarity: How explicitly and unambiguously the questions were described.
Willing to use: The extent to which teachers would be willing to use the questions in their teaching.
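
The prompting and rating setup described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the prompt wording, the `build_prompt` helper, the `ExpertRating` class, and the 1-5 rating scale are all assumptions made for the example. The source only names the four dimensions.

```python
from dataclasses import dataclass

# Hypothetical 1-5 scale; the source does not state the scale the experts used.
SCALE_MIN, SCALE_MAX = 1, 5

# The four evaluation dimensions named in the study.
DIMENSIONS = ("hitting", "fitting", "clarity", "willing_to_use")


def build_prompt(topic: str, competency: str, question_type: str, count: int = 1) -> str:
    """Assemble an illustrative question-generation prompt (wording is assumed)."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return (
        "You are a high school information technology teacher.\n"
        f"Write {count} {question_type} exam question(s) on the topic: {topic}.\n"
        f"Each question should assess the core competency: {competency}.\n"
        "State each question clearly and unambiguously, and include the answer."
    )


@dataclass(frozen=True)
class ExpertRating:
    """One expert's scores for one question on the four dimensions."""
    question_id: str
    source: str  # "chatglm" or "human"
    hitting: int
    fitting: int
    clarity: int
    willing_to_use: int

    def __post_init__(self) -> None:
        # Reject scores outside the assumed rating scale.
        for name in DIMENSIONS:
            score = getattr(self, name)
            if not SCALE_MIN <= score <= SCALE_MAX:
                raise ValueError(f"{name} must be between {SCALE_MIN} and {SCALE_MAX}")


def mean_by_dimension(ratings: list[ExpertRating]) -> dict[str, float]:
    """Average each dimension over a non-empty list of ratings."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return {d: sum(getattr(r, d) for r in ratings) / len(ratings) for d in DIMENSIONS}
```

Keeping the dimension names in one tuple means the validation and the averaging stay in sync if the rubric changes. The prompt string would be sent to whatever model interface is in use; no ChatGLM API call is shown here because the source does not describe one.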

The results showed that the ChatGLM-generated questions outperformed human-written questions in terms of clarity and teachers' willingness to use them. However, there was no significant difference in hit rate and fit.
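
One way to check whether a difference between two groups of ratings is significant is a two-sided permutation test on the difference in means. The sketch below is illustrative only: the source does not say which statistical test the authors used, and the function name, the sample scores, and the number of permutations are assumptions.

```python
import random
from statistics import mean


def permutation_test(a: list[float], b: list[float],
                     n_permutations: int = 10_000, seed: int = 0) -> float:
    """Return a two-sided p-value for the difference in means between a and b."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled scores to two groups of the original sizes.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up clarity scores for illustration; these are not the study's data.
chatglm_clarity = [5, 4, 5, 4, 5, 5, 4, 5]
human_clarity = [3, 4, 3, 2, 4, 3, 3, 4]
p_value = permutation_test(chatglm_clarity, human_clarity)
print(f"p = {p_value:.4f}")
```

A permutation test is used here because it makes no assumption about the distribution of ordinal rating scores. A small p-value would indicate that a gap this large is unlikely under random group assignment.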

These findings suggest that ChatGLM has the potential to enhance the efficiency of exam question generation and reduce the burden on teachers. However, future research may need to explore further optimizations to the model to maintain high fit and hit rates while improving the clarity and teachers' willingness to use the questions.

Critical Analysis

While ChatGLM showed promise in generating clear, teacher-friendly exam questions, its questions did not significantly differ from human-written ones in alignment with course content or embodiment of core competencies. The researchers suggest that future work explore further optimizations to the model so that it maintains high hit rates and fit while improving clarity and teachers' willingness to use the questions.

Additionally, the study focused on a specific domain (high school information technology) and may not generalize to other subject areas or educational levels. Replicating the study in different contexts would help validate the findings and identify any unique considerations for other subject areas.

It's also worth noting that the evaluation of the questions was done by domain experts, which could introduce some subjectivity. Incorporating more objective measures, such as student performance on the generated questions, could provide additional insights into the quality and effectiveness of the questions.
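
Student performance can be turned into objective item statistics. The sketch below computes two standard classical test theory measures: the difficulty index (the proportion of students answering correctly) and a simple upper-lower discrimination index. It illustrates the suggestion above and is not something the study reports; the function names and the default 27% group fraction are assumptions.

```python
def difficulty_index(responses: list[bool]) -> float:
    """Proportion of students who answered the item correctly (0.0 to 1.0)."""
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(responses) / len(responses)


def discrimination_index(item_correct: list[bool], total_scores: list[float],
                         fraction: float = 0.27) -> float:
    """Upper-lower discrimination: p(correct | top group) - p(correct | bottom group)."""
    if len(item_correct) != len(total_scores) or not item_correct:
        raise ValueError("inputs must be non-empty and the same length")
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    group_size = max(1, round(len(item_correct) * fraction))
    # Sort student indices by total test score, lowest first.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    lower = order[:group_size]
    upper = order[-group_size:]
    p_upper = sum(item_correct[i] for i in upper) / group_size
    p_lower = sum(item_correct[i] for i in lower) / group_size
    return p_upper - p_lower
```

A discrimination value near 1 means high-scoring students answer the item correctly far more often than low-scoring students; a value near 0 means the item does not separate the two groups.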

Conclusion

This study demonstrates the potential of large language models like ChatGLM to enhance the efficiency of exam question generation and reduce the workload for teachers. The ChatGLM-generated questions showed strengths in clarity and teacher acceptance, but they showed no significant advantage over human-written questions in alignment with course content and core competencies.

Further research and optimization of the prompting strategies or the language model itself could help address these limitations, paving the way for more effective AI-powered educational assessment systems. By leveraging the capabilities of large language models, educators may be able to streamline question generation and focus more on other aspects of teaching and learning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Research on the Application of Large Language Models in Automatic Question Generation: A Case Study of ChatGLM in the Context of High School Information Technology Curriculum

Yanxin Chen, Ling He

This study investigates the application effectiveness of the Large Language Model (LLM) ChatGLM in the automated generation of high school information technology exam questions. Through meticulously designed prompt engineering strategies, the model is guided to generate diverse questions, which are then comprehensively evaluated by domain experts. The evaluation dimensions include Hitting (the degree of alignment with teaching content), Fitting (the degree of embodiment of core competencies), Clarity (the explicitness of question descriptions), and Willing to use (the teacher's willingness to use the question in teaching). The results indicate that ChatGLM outperforms human-generated questions in terms of clarity and teachers' willingness to use, although there is no significant difference in hit rate and fit. This finding suggests that ChatGLM has the potential to enhance the efficiency of question generation and alleviate the burden on teachers, providing a new perspective for the future development of educational assessment systems. Future research could explore further optimizations to the ChatGLM model to maintain high fit and hit rates while improving the clarity of questions and teachers' willingness to use them.

8/22/2024

Application of Large Language Models in Automated Question Generation: A Case Study on ChatGLM's Structured Questions for National Teacher Certification Exams

Ling He, Yanxin Chen, Xiaoqiang Hu

This study delves into the application potential of the large language model (LLM) ChatGLM in the automatic generation of structured questions for National Teacher Certification Exams (NTCE). Through meticulously designed prompt engineering, we guided ChatGLM to generate a series of simulated questions and conducted a comprehensive comparison with questions recollected from past examinees. To ensure the objectivity and professionalism of the evaluation, we invited experts in the field of education to assess these questions and their scoring criteria. The research results indicate that the questions generated by ChatGLM exhibit a high level of rationality, scientificity, and practicality similar to those of the real exam questions across most evaluation criteria, demonstrating the model's accuracy and reliability in question generation. Nevertheless, the study also reveals limitations in the model's consideration of various rating criteria when generating questions, suggesting the need for further optimization and adjustment. This research not only validates the application potential of ChatGLM in the field of educational assessment but also provides crucial empirical support for the development of more efficient and intelligent educational automated generation systems in the future.

8/21/2024

Comparison of Large Language Models for Generating Contextually Relevant Questions

Ivo Lodovico Molina, Valdemar Švábenský, Tsubasa Minematsu, Li Chen, Fumiya Okubo, Atsushi Shimada

This study explores the effectiveness of Large Language Models (LLMs) for Automatic Question Generation in educational settings. Three LLMs are compared in their ability to create questions from university slide text without fine-tuning. Questions were obtained in a two-step pipeline: first, answer phrases were extracted from slides using Llama 2-Chat 13B; then, the three models generated questions for each answer. To analyze whether the questions would be suitable in educational applications for students, a survey was conducted with 46 students who evaluated a total of 246 questions across five metrics: clarity, relevance, difficulty, slide relation, and question-answer alignment. Results indicate that GPT-3.5 and Llama 2-Chat 13B outperform Flan T5 XXL by a small margin, particularly in terms of clarity and question-answer alignment. GPT-3.5 especially excels at tailoring questions to match the input answers. The contribution of this research is the analysis of the capacity of LLMs for Automatic Question Generation in education.

9/17/2024

The Future of Learning: Large Language Models through the Lens of Students

He Zhang, Jingyi Xie, Chuhao Wu, Jie Cai, ChanMin Kim, John M. Carroll

As Large-Scale Language Models (LLMs) continue to evolve, they demonstrate significant enhancements in performance and an expansion of functionalities, impacting various domains, including education. In this study, we conducted interviews with 14 students to explore their everyday interactions with ChatGPT. Our preliminary findings reveal that students grapple with the dilemma of utilizing ChatGPT's efficiency for learning and information seeking, while simultaneously experiencing a crisis of trust and ethical concerns regarding the outcomes and broader impacts of ChatGPT. The students perceive ChatGPT as being more human-like compared to traditional AI. This dilemma, characterized by mixed emotions, inconsistent behaviors, and an overall positive attitude towards ChatGPT, underscores its potential for beneficial applications in education and learning. However, we argue that despite its human-like qualities, the advanced capabilities of such intelligence might lead to adverse consequences. Therefore, it's imperative to approach its application cautiously and strive to mitigate potential harms in future developments.

7/18/2024