DebugBench: Evaluating Debugging Capability of Large Language Models

Read original: arXiv:2401.04621 - Published 6/7/2024 by Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu and 1 other

Overview

Researchers investigate the debugging capability of large language models (LLMs), which is a critical component of programming proficiency.
Previous evaluations of LLMs' debugging ability are limited by data leakage risks, dataset scale, and bug variety.
The researchers introduce DebugBench, a new LLM debugging benchmark with 4,253 instances covering various bug categories and programming languages.
They evaluate the debugging performance of both commercial and open-source LLMs in a zero-shot scenario.

Plain English Explanation

Large language models (LLMs) have shown impressive coding capabilities, but their ability to debug code has not been explored as much. Previous evaluations of LLMs' debugging skills had some limitations, such as the risk of data leakage, the size of the dataset, and the types of bugs tested.

To address these shortcomings, the researchers created a new benchmark called DebugBench. This benchmark consists of 4,253 instances that cover four major bug categories and 18 minor bug types in C++, Java, and Python. They collected code snippets from the LeetCode community, added bugs to the code using GPT-4, and thoroughly checked the quality of the dataset.

The researchers then evaluated the debugging performance of two commercial and four open-source LLMs in a zero-shot scenario, meaning the models were prompted directly, without worked examples or task-specific fine-tuning. They found that while the closed-source models performed better than the open-source ones, all the models had lower pass rates than humans. Debugging difficulty also varied with the bug category, with some categories proving more challenging than others. Additionally, the researchers found that runtime feedback has a clear effect on debugging performance, but it does not always help.

As an extension, the researchers also compared the LLMs' debugging and code generation capabilities, and found a strong correlation between the two for the closed-source models.

Overall, this research provides valuable insights into the debugging capabilities of LLMs, which can help in the development of more robust and capable language models for programming tasks.

Technical Explanation

The researchers introduce DebugBench, a new benchmark for evaluating the debugging capability of large language models (LLMs). This benchmark consists of 4,253 instances that cover four major bug categories and 18 minor bug types in C++, Java, and Python. To construct DebugBench, the researchers collected code snippets from the LeetCode community and used GPT-4 to introduce bugs into the source data. They then performed rigorous quality checks to ensure the validity of the dataset.
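To make the shape of the benchmark concrete, here is a minimal sketch of how one instance could be represented and tallied by category. The field names and the placeholder labels (category_a, type_1, and so on) are assumptions made for illustration; they are not the dataset's actual schema or category names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DebugInstance:
    """One benchmark item (hypothetical schema, not the dataset's real one)."""
    language: str        # C++, Java, or Python in the benchmark
    major_category: str  # one of the four major bug categories
    minor_type: str      # one of the 18 minor bug types
    buggy_code: str      # snippet after a bug was implanted
    fixed_code: str      # the original, working solution


def count_by(instances, field):
    """Count instances per distinct value of the named field."""
    return Counter(getattr(inst, field) for inst in instances)


# Toy records with placeholder labels; none of this comes from the dataset.
toy = [
    DebugInstance("python", "category_a", "type_1", "retrun x", "return x"),
    DebugInstance("java", "category_a", "type_2", "int x = 1", "int x = 1;"),
    DebugInstance("cpp", "category_b", "type_3", "i <= n", "i < n"),
]
print(count_by(toy, "major_category"))
```

Keeping the major category and minor type as separate fields makes it easy to report results at either level of granularity.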

The researchers evaluated the debugging performance of two commercial and four open-source LLMs in a zero-shot scenario, meaning the models were prompted directly, without worked examples or task-specific fine-tuning. They found that while the closed-source models debugged better than the open-source models, all the models had lower pass rates than human programmers.
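The pass rate can be sketched as follows, assuming a fix counts as passing only when it produces the expected output on every test case for its problem. That criterion, the function names, and the toy fixes below are assumptions for illustration rather than the benchmark's actual harness.

```python
def passes_all(candidate, test_cases):
    """Return True only if candidate gives the expected output on every case."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash on any test case counts as a failed fix.
            return False
    return True


def pass_rate(outcomes):
    """Fraction of instances whose proposed fix passed all of its tests."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


# Toy example: two proposed "fixes" for a function that should add two numbers.
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]


def fix_a(x, y):
    return x + y


def fix_b(x, y):
    return x - y  # still buggy


outcomes = [passes_all(fix_a, tests), passes_all(fix_b, tests)]
print(pass_rate(outcomes))
```

Under this all-or-nothing rule, a fix that passes most but not all test cases contributes nothing to the score.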

The researchers also found that the complexity of the bugs significantly affected the models' debugging performance, with some bug categories being more challenging than others. Additionally, they observed that incorporating runtime feedback can have a mixed impact on the models' debugging abilities, sometimes helping and sometimes not.
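One way to picture runtime feedback is as an extra section appended to the debugging prompt. The sketch below runs a Python snippet, captures any error message, and optionally includes it in the prompt. The prompt wording and function names are hypothetical, and a real harness would run untrusted code in a sandbox rather than with a bare exec.

```python
import traceback


def capture_runtime_feedback(code):
    """Run a Python snippet; return its error message, or None if it ran cleanly.

    Illustration only: a real harness would sandbox untrusted code.
    """
    try:
        exec(compile(code, "<snippet>", "exec"), {})
    except Exception as exc:
        return "".join(traceback.format_exception_only(type(exc), exc)).strip()
    return None


def build_prompt(buggy_code, feedback=None):
    """Assemble a zero-shot debugging prompt, with feedback appended if given."""
    parts = [
        "Fix the bug in the following code and return the corrected code.",
        "",
        buggy_code,
    ]
    if feedback:
        parts += ["", "Runtime feedback:", feedback]
    return "\n".join(parts)


snippet = "total = 1 / 0"
print(build_prompt(snippet, capture_runtime_feedback(snippet)))
```

A crash message like this points directly at some bugs, but a snippet that runs cleanly and returns a wrong answer yields no message at all, which is one plausible reason such feedback would not always help.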

As an extension, the researchers compared the LLMs' debugging and code generation capabilities, revealing a strong correlation between the two for the closed-source models.
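A correlation of this kind can be quantified with a coefficient such as Pearson's r. The summary does not say which statistic the researchers used, so the choice of Pearson here is an assumption, and the scores below are made-up numbers for illustration only, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Invented per-group scores for one hypothetical model (not paper data).
debugging_scores = [0.2, 0.5, 0.9]
generation_scores = [0.1, 0.6, 0.8]
print(round(pearson(debugging_scores, generation_scores), 3))
```

Values near 1 indicate that problems a model debugs well are also ones it generates solutions for well; values near 0 indicate little linear relationship.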

Critical Analysis

The researchers have made a valuable contribution to the field by introducing DebugBench, a comprehensive benchmark for evaluating the debugging capabilities of LLMs. This benchmark addresses the limitations of previous evaluations, such as the risk of data leakage, the scale of the dataset, and the variety of tested bugs.

However, the researchers acknowledged that their study is limited to a zero-shot scenario, and it would be interesting to see how the models' performance might improve with fine-tuning or additional training on debugging tasks. Additionally, the abstract does not name the specific commercial and open-source models that were evaluated, which makes it difficult to draw more nuanced conclusions from it about the factors that contribute to better debugging performance.

While the researchers found that incorporating runtime feedback can have a mixed impact on the models' debugging abilities, they did not delve deeper into the underlying reasons for this observation. Further investigation into the mechanisms and limitations of using runtime feedback for debugging tasks could yield additional insights.

Moreover, the researchers' comparison of LLMs' debugging and code generation capabilities is intriguing, but more research is needed to fully understand the relationship between these two key programming skills. Exploring this connection in the context of different model architectures and training approaches could provide a more comprehensive understanding of the strengths and weaknesses of LLMs in programming tasks.

Conclusion

The introduction of DebugBench by the researchers is a significant step forward in evaluating the debugging capabilities of large language models (LLMs). This benchmark addresses the limitations of previous evaluations and provides a more comprehensive and reliable way to assess this critical component of programming proficiency.

The researchers' findings offer valuable insights into the current state of LLMs' debugging performance, highlighting the differences between commercial and open-source models, as well as the impact of bug complexity and runtime feedback. These insights can inform the development of more robust and capable language models for programming tasks, ultimately enhancing the overall capabilities of LLMs in the field of software engineering.

As the research in this area continues to evolve, further exploration of topics such as fine-tuning, the impact of model architecture, and the relationship between debugging and code generation could lead to even deeper understanding and more significant advancements in the debugging capabilities of large language models.

Related Papers

DebugBench: Evaluating Debugging Capability of Large Language Models

Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu, Maosong Sun

Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs' debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce 'DebugBench', an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and four open-source models in a zero-shot scenario. We find that (1) while closed-source models exhibit inferior debugging performance compared to humans, open-source models attain relatively lower pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance which is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.

6/7/2024

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi LI, Ruibo Liu, Yue Wang, Shuyue Guo, Xingwei Qu, Xiang Yue, Ge Zhang, Wenhu Chen, Jie Fu

Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluation of 19 LLMs reveals that closed-source models (particularly Gemini-Ultra and GPT-4), outperform open-source models in CodeEditorBench, highlighting differences in model performance based on problem types and prompt sensitivities. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners.

4/9/2024

Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement

Weiqing Yang, Hanbin Wang, Zhenghao Liu, Xinze Li, Yukun Yan, Shuo Wang, Yu Gu, Minghe Yu, Zhiyuan Liu, Ge Yu

Debugging is a vital aspect of software development, yet the debugging capabilities of Large Language Models (LLMs) remain largely unexplored. This paper first introduces DEBUGEVAL, a comprehensive benchmark designed to evaluate the debugging capabilities of LLMs. DEBUGEVAL collects data from existing high-quality datasets and designs four different tasks to evaluate the debugging effectiveness, including BUG Localization, BUG Identification, Code Review, and Code Repair. Additionally, to enhance the code debugging ability of LLMs, this paper proposes a CoMmunicative Agent BaSed DaTa REfinement FRamework (MASTER), which generates the refined code debugging data for supervised finetuning. Specifically, MASTER employs the Code Quizzer to generate refined data according to the defined tasks of DEBUGEVAL. Then the Code Learner acts as a critic and reserves the generated problems that it can not solve. Finally, the Code Teacher provides a detailed Chain-of-Thought based solution to deal with the generated problem. We collect the synthesized data and finetune the Code Learner to enhance the debugging ability and conduct the NeuDebugger model. Our experiments evaluate various LLMs and NeuDebugger in the zero-shot setting on DEBUGEVAL. Experimental results demonstrate that these 7B-scale LLMs have weaker debugging capabilities, even these code-oriented LLMs. On the contrary, these larger models (over 70B) show convincing debugging ability. Our further analyses illustrate that MASTER is an effective method to enhance the code debugging ability by synthesizing data for Supervised Fine-Tuning (SFT) LLMs.

8/12/2024

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, Muling Wu, Mingxu Chai, Jessica Fan, Caishuang Huang, Yunbo Tao, Yan Liu, Enyu Zhou, Ming Zhang, Yuhao Zhou, Yueming Wu, Rui Zheng, Ming Wen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang

The increasing development of large language models (LLMs) in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundaries of these existing methods. To bridge this gap, we conducted an extensive empirical study evaluating the performance of three leading closed-source LLMs and four popular open-source LLMs on three commonly used benchmarks. Our investigation, which evaluated the length, cyclomatic complexity and API number of the generated code, revealed that these LLMs face challenges in generating successful code for more complex problems, and tend to produce code that is shorter yet more complicated as compared to canonical solutions. Additionally, we developed a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. Furthermore, to better understand the performance of LLMs in real-world projects, we manually created a real-world benchmark comprising 140 code generation tasks. Our analysis highlights distinct differences in bug distributions between actual scenarios and existing benchmarks. Finally, we propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback. Experimental results demonstrate that our approach can significantly mitigate bugs and increase the passing rate by 29.2% after two iterations, indicating substantial potential for LLMs to handle more complex problems.

7/9/2024