Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation

2406.05053

Published 6/10/2024 by Nachiket Kotalwar, Alkis Gotovos, Adish Singla

Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation

Abstract

Generative AI and large language models hold great promise in enhancing programming education by generating individualized feedback and hints for learners. Recent works have primarily focused on improving the quality of generated feedback to achieve human tutors' quality. While quality is an important performance criterion, it is not the only criterion to optimize for real-world educational deployments. In this paper, we benchmark language models for programming feedback generation across several performance criteria, including quality, cost, time, and data privacy. The key idea is to leverage recent advances in the new paradigm of in-browser inference that allow running these models directly in the browser, thereby providing direct benefits across cost and data privacy. To boost the feedback quality of small models compatible with in-browser inference engines, we develop a fine-tuning pipeline based on GPT-4 generated synthetic data. We showcase the efficacy of fine-tuned Llama3-8B and Phi3-3.8B 4-bit quantized models using WebLLM's in-browser inference engine on three different Python programming datasets. We will release the full implementation along with a web app and datasets to facilitate further research on in-browser language models.

Create account to get full access

Overview

This paper explores the use of large language models (LLMs) for providing feedback on programming assignments.
The researchers evaluate the performance of several state-of-the-art LLMs, including GPT-3, CodeBERT, and InstructGPT, on a new benchmark dataset called CodeEditorBench.
The goal is to assess the ability of these LLMs to provide useful feedback to students on their programming assignments, which could potentially improve learning outcomes.

Plain English Explanation

The paper looks at how well large language models, which are AI systems trained on vast amounts of text data, can provide feedback on programming assignments. The researchers created a new dataset called CodeEditorBench that they used to test several state-of-the-art language models, including GPT-3, CodeBERT, and InstructGPT.

The idea is that these powerful language models could potentially be used to give students helpful feedback on their programming assignments, which could in turn improve their learning and help them become better programmers. The researchers wanted to see how well the models performed on this task and identify any areas for improvement.

Technical Explanation

The paper first reviews prior work on using large language models for programming tasks, such as evaluating programming skills and providing automatic scoring and feedback.

The core of the paper is the evaluation of several state-of-the-art language models on the new CodeEditorBench dataset. This dataset consists of programming assignments and corresponding human-written feedback. The researchers fine-tuned the language models on this data and then had them generate feedback on unseen programming assignments.

The models were evaluated on metrics like the quality and usefulness of the generated feedback, as well as how well it aligned with human-written feedback. The results showed that the models, especially CodeBERT and InstructGPT, were able to provide feedback that was reasonably useful and aligned with human feedback.

Critical Analysis

The paper acknowledges several limitations of the current work. The CodeEditorBench dataset, while a valuable resource, is still relatively small compared to the scale of programming assignments used in real-world educational settings. Additionally, the feedback generated by the models, while promising, still falls short of the nuance and context-awareness of human-written feedback.

Further research is needed to improve the models' ability to understand the deeper intentions and reasoning behind programming assignments, and to generate feedback that is more tailored to individual students' needs and learning styles. Incorporating more interactive and iterative feedback mechanisms could also enhance the models' effectiveness.

Conclusion

This paper demonstrates the potential of large language models to provide useful feedback on programming assignments, which could ultimately improve student learning and programming skills. The results are promising, but there is still room for improvement, particularly in capturing the full depth and context of programming tasks and feedback.

As language models continue to advance, integrating them into educational tools and platforms could revolutionize the way programming is taught and learned. However, careful consideration of the models' limitations and further research into more sophisticated feedback generation are crucial to realizing the full potential of this technology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge

Charles Koutcheme, Nicola Dainese, Sami Sarsa, Arto Hellas, Juho Leinonen, Paul Denny

Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts. However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models. This has sparked considerable interest in the use of open source LLMs in education, but the quality of the feedback that such open models can produce remains understudied. This is a concern as providing flawed or misleading generated feedback could be detrimental to student learning. Inspired by recent work that has utilised very powerful LLMs, such as GPT-4, to evaluate the outputs produced by less powerful models, we conduct an automated analysis of the quality of the feedback produced by several open source models using a dataset from an introductory programming course. First, we investigate the viability of employing GPT-4 as an automated evaluator by comparing its evaluations with those of a human expert. We observe that GPT-4 demonstrates a bias toward positively rating feedback while exhibiting moderate agreement with human raters, showcasing its potential as a feedback evaluator. Second, we explore the quality of feedback generated by several leading open-source LLMs by using GPT-4 to evaluate the feedback. We find that some models offer competitive performance with popular proprietary LLMs, such as ChatGPT, indicating opportunities for their responsible use in educational settings.

5/9/2024

cs.CL cs.AI cs.CY

💬

Evaluation of the Programming Skills of Large Language Models

Luc Bryan Heitz, Joun Chamas, Christopher Scherb

The advent of Large Language Models (LLM) has revolutionized the efficiency and speed with which tasks are completed, marking a significant leap in productivity through technological innovation. As these chatbots tackle increasingly complex tasks, the challenge of assessing the quality of their outputs has become paramount. This paper critically examines the output quality of two leading LLMs, OpenAI's ChatGPT and Google's Gemini AI, by comparing the quality of programming code generated in both their free versions. Through the lens of a real-world example coupled with a systematic dataset, we investigate the code quality produced by these LLMs. Given their notable proficiency in code generation, this aspect of chatbot capability presents a particularly compelling area for analysis. Furthermore, the complexity of programming code often escalates to levels where its verification becomes a formidable task, underscoring the importance of our study. This research aims to shed light on the efficacy and reliability of LLMs in generating high-quality programming code, an endeavor that has significant implications for the field of software development and beyond.

5/24/2024

cs.SE cs.CL cs.CR

UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback

Jason Wu, Eldon Schoop, Alan Leung, Titus Barik, Jeffrey P. Bigham, Jeffrey Nichols

Large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model. In this paper, we explore the use of automated feedback (compilers and multi-modal models) to guide LLMs to generate high-quality UI code. Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset using an original model, applying automated tools to aggressively filter, score, and de-duplicate the data into a refined higher quality dataset. The original LLM is improved by finetuning on this refined dataset. We applied our approach to several open-source LLMs and compared the resulting performance to baseline models with both automated metrics and human preferences. Our evaluation shows the resulting models outperform all other downloadable baselines and approach the performance of larger proprietary models.

6/13/2024

cs.CL cs.HC cs.SE

Investigating Automatic Scoring and Feedback using Large Language Models

Gloria Ashiya Katuka, Alexander Gain, Yen-Yun Yu

Automatic grading and feedback have been long studied using traditional machine learning and deep learning techniques using language models. With the recent accessibility to high performing large language models (LLMs) like LLaMA-2, there is an opportunity to investigate the use of these LLMs for automatic grading and feedback generation. Despite the increase in performance, LLMs require significant computational resources for fine-tuning and additional specific adjustments to enhance their performance for such tasks. To address these issues, Parameter Efficient Fine-tuning (PEFT) methods, such as LoRA and QLoRA, have been adopted to decrease memory and computational requirements in model fine-tuning. This paper explores the efficacy of PEFT-based quantized models, employing classification or regression head, to fine-tune LLMs for automatically assigning continuous numerical grades to short answers and essays, as well as generating corresponding feedback. We conducted experiments on both proprietary and open-source datasets for our tasks. The results show that prediction of grade scores via finetuned LLMs are highly accurate, achieving less than 3% error in grade percentage on average. For providing graded feedback fine-tuned 4-bit quantized LLaMA-2 13B models outperform competitive base models and achieve high similarity with subject matter expert feedback in terms of high BLEU and ROUGE scores and qualitatively in terms of feedback. The findings from this study provide important insights into the impacts of the emerging capabilities of using quantization approaches to fine-tune LLMs for various downstream tasks, such as automatic short answer scoring and feedback generation at comparatively lower costs and latency.

5/2/2024

cs.CL cs.LG