Evaluating Language Models for Generating and Judging Programming Feedback

Read original: arXiv:2407.04873 - Published 7/9/2024 by Charles Koutcheme, Nicola Dainese, Arto Hellas, Sami Sarsa, Juho Leinonen, Syed Ashraf, Paul Denny

Overview

This paper evaluates the ability of large language models (LLMs), both open-source and proprietary, to generate and judge programming feedback.
The researchers explore how well LLMs can provide useful feedback on programming exercises, which could be valuable for education and training.
They test LLMs on their ability to generate feedback and also to evaluate the quality of human-written feedback.

Plain English Explanation

The researchers in this paper wanted to see how well open source language models can be used to give feedback on programming assignments. This could be really helpful for things like online programming courses, where students need feedback but it's hard for instructors to review everything.

The researchers tested the models in two ways. First, they had the models try to generate their own feedback on some programming exercises. This showed how well the models could come up with useful comments and suggestions on their own. Second, they had the models judge the quality of feedback that was written by humans. This let the researchers see if the models could identify good feedback, which could be useful for automatically evaluating student work.

Overall, the results suggest that large language models can provide useful programming feedback and evaluate the quality of that feedback. This could be a big help for things like online programming courses, where automated feedback and assessment are really important. Of course, the models aren't perfect, and the researchers discuss some of the limitations and areas for further research.

Technical Explanation

The paper explores the use of large language models (LLMs) for generating and evaluating programming feedback. The researchers tested two main capabilities:

Feedback generation: They had several LLMs, both open-source models such as Meta's Llama3 and proprietary models such as GPT-4o, generate feedback on students' submissions to introductory Python programming exercises.
Feedback evaluation: The researchers also had LLMs judge the quality of programming feedback, scoring it on its accuracy, specificity, and helpfulness.
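
The two capabilities above can be illustrated with a minimal sketch. It is not the authors' pipeline: the prompt wording, the 1-to-5 scale, the JSON reply format, and every function name here are assumptions made for illustration, and `complete` stands in for whichever model client is used (any callable that takes a prompt and returns the model's text).

```python
import json
from typing import Callable, Dict

# Criteria named in the summary above; the rubric used in the paper may differ.
CRITERIA = ("accuracy", "specificity", "helpfulness")


def build_generation_prompt(exercise: str, student_code: str) -> str:
    """Build a feedback-generation prompt (hypothetical wording)."""
    return (
        "You are a programming tutor.\n"
        f"Exercise:\n{exercise}\n\n"
        f"Student submission:\n{student_code}\n\n"
        "Explain what is wrong and how to fix it, without giving the full solution."
    )


def build_judge_prompt(exercise: str, student_code: str, feedback: str) -> str:
    """Build a judging prompt that asks for one integer score per criterion."""
    criteria = ", ".join(CRITERIA)
    return (
        f"Exercise:\n{exercise}\n\n"
        f"Student submission:\n{student_code}\n\n"
        f"Feedback to assess:\n{feedback}\n\n"
        f"Rate the feedback on {criteria}, each as an integer from 1 to 5.\n"
        "Answer with a JSON object only, using the criteria names as keys."
    )


def parse_judge_scores(raw: str) -> Dict[str, int]:
    """Parse the judge's JSON reply; raise ValueError if it is malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores: Dict[str, int] = {}
    for name in CRITERIA:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"missing or out-of-range score for {name!r}")
        scores[name] = value
    return scores


def judge_feedback(
    complete: Callable[[str], str],
    exercise: str,
    student_code: str,
    feedback: str,
) -> Dict[str, int]:
    """Send the judging prompt to a model and return the validated scores."""
    prompt = build_judge_prompt(exercise, student_code, feedback)
    return parse_judge_scores(complete(prompt))
```

Keeping the model behind a plain callable means the same code can be pointed at an open-source or a proprietary model, which is the comparison the paper is about. Strict parsing matters because judge models do not always follow the requested output format.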

To evaluate the models' performance, the researchers recruited human raters to assess the quality of the generated and evaluated feedback. This allowed them to compare the model outputs to human judgments.

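The results showed that the LLMs were able to generate feedback that was rated as moderately helpful by the human raters. The models also demonstrated the ability to accurately evaluate the quality of human-written feedback, with their scores correlating well with the human ratings. According to the paper's abstract, state-of-the-art open-source models (Meta's Llama3) were almost on par with proprietary models (GPT-4o) in both generating and assessing feedback, and the authors also report that smaller LLMs handle these tasks efficiently.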
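
One common way to quantify how well judge scores track human ratings is a rank correlation. The sketch below computes Spearman's coefficient in plain Python. It is an illustration only: the summary does not say which agreement statistic the authors used, and the function names are chosen here.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(judge_scores, human_ratings)` on two aligned lists returns a value between -1 and 1, where values near 1 mean the judge orders the feedback much as the human raters do. A rank-based statistic is a reasonable choice for ordinal rating scales, although agreement measures such as Cohen's kappa are also widely used for this purpose.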

The paper discusses several implications of these findings, including the potential for using LLMs to assist in the automated assessment of programming skills and the ability to leverage LLMs to generate personalized programming exercises and feedback. The researchers also acknowledge the limitations of their approach, such as the need for further research on model robustness and the potential for biases in the feedback.

Critical Analysis

The paper provides a comprehensive evaluation of LLMs for generating and judging programming feedback, which is an important step towards leveraging large language models for automated programming education. The researchers have designed thoughtful experiments to assess the models' capabilities and limitations.

One potential concern is the reliance on human raters to evaluate the quality of the generated and evaluated feedback. While this approach is reasonable, it introduces the possibility of human biases and inconsistencies in the assessment. It would be valuable to explore more objective metrics for feedback quality, such as the ability of the feedback to actually improve student performance on programming tasks.
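
As a rough illustration of such an outcome-based metric, the sketch below computes the share of initially failing submissions that pass the exercise's tests after the student has seen feedback. The `Attempt` record and the `repair_rate` function are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Attempt:
    """One student's result before and after receiving feedback (hypothetical)."""
    student_id: str
    passed_before: bool
    passed_after: bool


def repair_rate(attempts: Iterable[Attempt]) -> float:
    """Share of initially failing submissions that pass after feedback."""
    failing = [a for a in attempts if not a.passed_before]
    if not failing:
        raise ValueError("no initially failing submissions to evaluate")
    return sum(a.passed_after for a in failing) / len(failing)
```

On its own this number does not show that the feedback caused the improvement, since students may also fix their code without help. Comparing the repair rate of a group that received feedback against a control group that did not would give a fairer picture.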

Additionally, the paper does not delve deeply into the potential ethical implications of using LLMs for programming feedback and assessment. There are concerns around AI-generated feedback potentially reinforcing biases or providing inaccurate guidance. Further research is needed to ensure that these systems are transparent, accountable, and equitable.

Overall, the paper presents a solid foundation for understanding the capabilities and limitations of LLMs in the context of programming feedback. However, continued research and careful consideration of the ethical implications will be crucial as these technologies are further developed and deployed in educational settings.

Conclusion

The paper's results suggest that large language models can be used effectively to generate and evaluate programming feedback. The researchers have shown that LLMs can produce feedback that is rated as moderately helpful by human raters, and can also accurately judge the quality of human-written feedback.

These findings suggest that LLMs could be a valuable tool for automating and scaling programming education and assessment. By leveraging the language understanding and generation capabilities of these models, it may be possible to provide personalized, real-time feedback to students and assist instructors in evaluating programming assignments.

However, the researchers also highlight the need for further research to address the limitations of their approach, such as the potential for biases and the need for more objective evaluation metrics. Additionally, careful consideration of the ethical implications of using LLMs in this context will be crucial to ensure that these systems are fair, transparent, and beneficial to all students.

Overall, this paper represents an important step towards utilizing the power of large language models to improve programming education and assessment. The findings and insights presented here will likely spur further advancements in this promising area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Evaluating Language Models for Generating and Judging Programming Feedback

Charles Koutcheme, Nicola Dainese, Arto Hellas, Sami Sarsa, Juho Leinonen, Syed Ashraf, Paul Denny

The emergence of large language models (LLMs) has transformed research and practice in a wide range of domains. Within the computing education research (CER) domain, LLMs have received plenty of attention especially in the context of learning programming. Much of the work on LLMs in CER has however focused on applying and evaluating proprietary models. In this article, we evaluate the efficiency of open-source LLMs in generating high-quality feedback for programming assignments, and in judging the quality of the programming feedback, contrasting the results against proprietary models. Our evaluations on a dataset of students' submissions to Python introductory programming exercises suggest that the state-of-the-art open-source LLMs (Meta's Llama3) are almost on-par with proprietary models (GPT-4o) in both the generation and assessment of programming feedback. We further demonstrate the efficiency of smaller LLMs in the tasks, and highlight that there are a wide range of LLMs that are accessible even for free for educators and practitioners.

7/9/2024

Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge

Charles Koutcheme, Nicola Dainese, Sami Sarsa, Arto Hellas, Juho Leinonen, Paul Denny

Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts. However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models. This has sparked considerable interest in the use of open source LLMs in education, but the quality of the feedback that such open models can produce remains understudied. This is a concern as providing flawed or misleading generated feedback could be detrimental to student learning. Inspired by recent work that has utilised very powerful LLMs, such as GPT-4, to evaluate the outputs produced by less powerful models, we conduct an automated analysis of the quality of the feedback produced by several open source models using a dataset from an introductory programming course. First, we investigate the viability of employing GPT-4 as an automated evaluator by comparing its evaluations with those of a human expert. We observe that GPT-4 demonstrates a bias toward positively rating feedback while exhibiting moderate agreement with human raters, showcasing its potential as a feedback evaluator. Second, we explore the quality of feedback generated by several leading open-source LLMs by using GPT-4 to evaluate the feedback. We find that some models offer competitive performance with popular proprietary LLMs, such as ChatGPT, indicating opportunities for their responsible use in educational settings.

5/9/2024

Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation

Nachiket Kotalwar, Alkis Gotovos, Adish Singla

Generative AI and large language models hold great promise in enhancing programming education by generating individualized feedback and hints for learners. Recent works have primarily focused on improving the quality of generated feedback to achieve human tutors' quality. While quality is an important performance criterion, it is not the only criterion to optimize for real-world educational deployments. In this paper, we benchmark language models for programming feedback generation across several performance criteria, including quality, cost, time, and data privacy. The key idea is to leverage recent advances in the new paradigm of in-browser inference that allow running these models directly in the browser, thereby providing direct benefits across cost and data privacy. To boost the feedback quality of small models compatible with in-browser inference engines, we develop a fine-tuning pipeline based on GPT-4 generated synthetic data. We showcase the efficacy of fine-tuned Llama3-8B and Phi3-3.8B 4-bit quantized models using WebLLM's in-browser inference engine on three different Python programming datasets. We will release the full implementation along with a web app and datasets to facilitate further research on in-browser language models.

6/10/2024

Evaluation of the Programming Skills of Large Language Models

Luc Bryan Heitz, Joun Chamas, Christopher Scherb

The advent of Large Language Models (LLM) has revolutionized the efficiency and speed with which tasks are completed, marking a significant leap in productivity through technological innovation. As these chatbots tackle increasingly complex tasks, the challenge of assessing the quality of their outputs has become paramount. This paper critically examines the output quality of two leading LLMs, OpenAI's ChatGPT and Google's Gemini AI, by comparing the quality of programming code generated in both their free versions. Through the lens of a real-world example coupled with a systematic dataset, we investigate the code quality produced by these LLMs. Given their notable proficiency in code generation, this aspect of chatbot capability presents a particularly compelling area for analysis. Furthermore, the complexity of programming code often escalates to levels where its verification becomes a formidable task, underscoring the importance of our study. This research aims to shed light on the efficacy and reliability of LLMs in generating high-quality programming code, an endeavor that has significant implications for the field of software development and beyond.

5/24/2024