AutoCodeRover: Autonomous Program Improvement

2404.05427

Published 4/16/2024 by Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, Abhik Roychoudhury


Abstract

Researchers have made significant progress in automating the software development process in the past decades. Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use LLM-based programming assistants to achieve automated coding. Nevertheless, software engineering involves the process of program improvement apart from coding, specifically to enable software maintenance (e.g. bug fixing) and software evolution (e.g. feature additions). In this paper, we propose an automated approach for solving GitHub issues to autonomously achieve program improvement. In our approach, called AutoCodeRover, LLMs are combined with sophisticated code search capabilities, ultimately leading to a program modification or patch. In contrast to recent LLM agent approaches from AI researchers and practitioners, our outlook is more software engineering oriented. We work on a program representation (abstract syntax tree) as opposed to viewing a software project as a mere collection of files. Our code search exploits the program structure in the form of classes/methods to enhance the LLM's understanding of the issue's root cause, and effectively retrieve a context via iterative search. The use of spectrum-based fault localization using tests further sharpens the context, as long as a test suite is available. Experiments on SWE-bench-lite, which consists of 300 real-life GitHub issues, show increased efficacy in solving GitHub issues (22-23% on SWE-bench-lite). On the full SWE-bench consisting of 2294 GitHub issues, AutoCodeRover solved around 16% of issues, which is higher than the efficacy of the recently reported AI software engineer Devin from Cognition Labs, while taking time comparable to Devin. We posit that our workflow enables autonomous software engineering, where, in future, auto-generated code from LLMs can be autonomously improved.


Overview

This paper introduces a system called "AutoCodeRover" that aims to autonomously improve programs by resolving GitHub issues.
The system combines large language models (LLMs) with code search to locate the likely root cause of an issue and produce a program modification, or patch, covering both bug fixes and feature additions.
Key aspects of the research include working on the abstract syntax tree rather than on a flat collection of files, iterative context retrieval at the level of classes and methods, and spectrum-based fault localization when a test suite is available.

Plain English Explanation

The paper describes a new AI-powered system called AutoCodeRover that can automatically improve computer programs. Instead of developers having to manually fix bugs or enhance code, this system uses advanced machine learning models to detect problems and make improvements on its own.

The core idea is to combine large language models, which are AI systems trained on massive amounts of text data, with code search that understands how a program is structured. Rather than treating a project as a pile of files, the system looks up specific classes and methods, gathers the code most relevant to the reported issue over several rounds of searching, and then has the language model write a patch.

This automation could help save developers a significant amount of time and effort, and ultimately lead to software that is more reliable and performant. By leveraging the complementary strengths of language models and other AI approaches, AutoCodeRover aims to push the boundaries of what's possible in autonomous programming.

Technical Explanation

The paper first provides background on the challenges of program repair and the limitations of existing approaches. It then introduces the AutoCodeRover system, which uses a multi-agent architecture to tackle these problems.

At the core of the system are large language models paired with code search capabilities. According to the abstract, AutoCodeRover works on a program representation (the abstract syntax tree) rather than on raw files, and its search exploits program structure in the form of classes and methods. The model issues searches iteratively, using each result to decide what to look up next, until it has retrieved enough context to understand the root cause of the issue.
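The abstract describes code search that exploits program structure in the form of classes and methods. The following is a minimal sketch of what such a structure-aware lookup could look like, using Python's standard `ast` module. The function names `search_method_in_class` and `list_class_signatures`, and the `SOURCE` sample, are hypothetical illustrations and are not taken from the paper.

```python
import ast

# Hypothetical sample project source, used only for illustration.
SOURCE = '''
class Cart:
    def add(self, item):
        self.items.append(item)

    def total(self):
        return sum(i.price for i in self.items)

def helper():
    return 1
'''

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def search_method_in_class(source, class_name, method_name):
    """Return the source text of class_name.method_name, or None if absent."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            for item in node.body:
                if isinstance(item, FUNC_TYPES) and item.name == method_name:
                    # Slice the original text using the node's position info.
                    return ast.get_source_segment(source, item)
    return None


def list_class_signatures(source):
    """Map each class name to the methods defined directly in its body."""
    tree = ast.parse(source)
    return {
        node.name: [m.name for m in node.body if isinstance(m, FUNC_TYPES)]
        for node in ast.walk(tree)
        if isinstance(node, ast.ClassDef)
    }
```

Calling `list_class_signatures(SOURCE)` gives a compact outline that a model could read first, and `search_method_in_class(SOURCE, "Cart", "total")` then returns only that method's body. Returning a small, targeted snippet instead of a whole file is the design point: it keeps the retrieved context short enough to reason about.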

With the retrieved context, AutoCodeRover generates a patch for the issue. The authors evaluate the system on SWE-bench-lite (300 real-life GitHub issues), where it resolves 22-23% of issues, and on the full SWE-bench (2294 issues), where it resolves around 16%. The abstract reports that this is higher than the efficacy of Devin from Cognition Labs, at comparable running time.
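The abstract notes that spectrum-based fault localization (SBFL) using tests further sharpens the context when a test suite is available. As a rough illustration, the sketch below scores program elements with the Ochiai formula, one widely used SBFL metric; the source does not state which formula AutoCodeRover uses, so that choice is an assumption, as are the function names and the sample data.

```python
import math


def ochiai_scores(coverage, outcomes):
    """Score each program element by how strongly it correlates with failures.

    coverage: test name -> set of program elements the test executes
    outcomes: test name -> True if the test passed, False if it failed
    """
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    elements = set().union(*coverage.values())
    scores = {}
    for elem in elements:
        # ef / ep: failing / passing tests that execute this element.
        ef = sum(1 for t, cov in coverage.items() if elem in cov and not outcomes[t])
        ep = sum(1 for t, cov in coverage.items() if elem in cov and outcomes[t])
        denom = math.sqrt(total_failed * (ef + ep))
        scores[elem] = ef / denom if denom else 0.0
    return scores


def rank(scores):
    """Order elements from most to least suspicious (ties broken by name)."""
    return sorted(scores, key=lambda e: (-scores[e], e))


# Hypothetical coverage data: one failing test out of three.
coverage = {
    "test_add": {"Cart.add"},
    "test_total_ok": {"Cart.add", "Cart.total"},
    "test_total_discount": {"Cart.add", "Cart.total", "Cart.apply_discount"},
}
outcomes = {"test_add": True, "test_total_ok": True, "test_total_discount": False}

scores = ochiai_scores(coverage, outcomes)
ranking = rank(scores)
```

In this sample, `Cart.apply_discount` is executed only by the failing test, so it ranks first, while `Cart.add`, which every passing test also executes, ranks last. A ranking like this can steer the search toward the methods most likely to contain the fault.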

Critical Analysis

The paper presents a compelling vision for automating program improvement, but the reported results also show how much remains to be done. Resolution rates of around 16% on the full SWE-bench and 22-23% on SWE-bench-lite leave most issues unresolved, and the fault-localization step only helps when a test suite is available.

Additionally, while the language models provide powerful semantic understanding, they may struggle with the more formal, mathematical aspects of program logic. Integrating AutoCodeRover with specialized formal methods and verification techniques could help address this challenge.

There are also open questions around the robustness and trustworthiness of the system's outputs. Developers will need to have confidence that the automatically generated improvements are correct and do not introduce new issues. Further research on techniques like self-organized agents and adversarial testing could help address these concerns.

Conclusion

Overall, the AutoCodeRover system represents a promising step towards more autonomous and intelligent software development. By combining the strengths of large language models and other AI approaches, the researchers have demonstrated a novel way to automate the tedious and error-prone task of program improvement.

While significant challenges remain, the potential benefits of this technology are substantial. If realized, AutoCodeRover could dramatically enhance developer productivity, software quality, and the pace of innovation in the software industry. The research lays the groundwork for a future where AI-powered systems actively collaborate with human developers to create better software, faster.

Related Papers


Automatic Programming: Large Language Models and Beyond

Michael R. Lyu, Baishakhi Ray, Abhik Roychoudhury, Shin Hwei Tan, Patanamon Thongtanunam

Automatic programming has seen increasing popularity due to the emergence of tools like GitHub Copilot which rely on Large Language Models (LLMs). At the same time, automatically generated code faces challenges during deployment due to concerns around quality and trust. In this article, we study automated coding in a general sense and study the concerns around code quality, security and related issues of programmer responsibility. These are key issues for organizations while deciding on the usage of automatically generated code. We discuss how advances in software engineering such as program repair and analysis can enable automatic programming. We conclude with a forward-looking view, focusing on the programming environment of the near future, where programmers may need to switch to different roles to fully utilize the power of automatic programming. Automated repair of automatically generated programs from LLMs can help produce higher assurance code from LLMs, along with evidence of assurance.

5/6/2024

cs.SE cs.AI cs.LG


Large Language Models Synergize with Automated Machine Learning

Jinglue Xu, Zhen Liu, Nagar Anthel Venkatesh Suryanarayanan, Hitoshi Iba

Recently, code generation driven by large language models (LLMs) has become increasingly popular. However, automatically generating code for machine learning (ML) tasks still poses significant challenges. This paper explores the limits of program synthesis for ML by combining LLMs and automated machine learning (autoML). Specifically, our goal is to fully automate the code generation process for the entire ML workflow, from data preparation to modeling and post-processing, utilizing only textual descriptions of the ML tasks. To manage the length and diversity of ML programs, we propose to break each ML program into smaller, manageable parts. Each part is generated separately by the LLM, with careful consideration of their compatibilities. To implement the approach, we design a testing technique for ML programs. Furthermore, our approach enables integration with autoML. In our approach, autoML serves to numerically assess and optimize the ML programs generated by LLMs. LLMs, in turn, help to bridge the gap between theoretical, algorithm-centered autoML and practical autoML applications. This mutual enhancement underscores the synergy between LLMs and autoML in program synthesis for ML. In experiments across various ML tasks, our method outperforms existing methods in 10 out of 12 tasks for generating ML programs. In addition, autoML significantly improves the performance of the generated ML programs. In the experiments, our method, Text-to-ML, achieves fully automated synthesis of the entire ML pipeline based solely on textual descriptions of the ML tasks.

5/8/2024

cs.SE cs.AI cs.LG cs.PL

Lemur: Integrating Large Language Models in Automated Program Verification

Haoze Wu, Clark Barrett, Nina Narodytska

The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of transition rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure and demonstrate practical improvements on a set of synthetic and competition benchmarks.

4/26/2024

cs.FL cs.AI cs.LG cs.LO

Peer-aided Repairer: Empowering Large Language Models to Repair Advanced Student Assignments

Qianhui Zhao, Fang Liu, Li Zhang, Yang Liu, Zhen Yan, Zhenghao Chen, Yufei Zhou, Jing Jiang, Ge Li

Automated generation of feedback on programming assignments holds significant benefits for programming education, especially when it comes to advanced assignments. Automated Program Repair techniques, especially Large Language Model based approaches, have gained notable recognition for their potential to fix introductory assignments. However, the programs used for evaluation are relatively simple. It remains unclear how existing approaches perform in repairing programs from higher-level programming courses. To address these limitations, we curate a new advanced student assignment dataset named Defects4DS from a higher-level programming course. Subsequently, we identify the challenges related to fixing bugs in advanced assignments. Based on the analysis, we develop a framework called PaR that is powered by the LLM. PaR works in three phases: Peer Solution Selection, Multi-Source Prompt Generation, and Program Repair. Peer Solution Selection identifies the closely related peer programs based on lexical, semantic, and syntactic criteria. Then Multi-Source Prompt Generation adeptly combines multiple sources of information to create a comprehensive and informative prompt for the last Program Repair stage. The evaluation on Defects4DS and another well-investigated ITSP dataset reveals that PaR achieves a new state-of-the-art performance, demonstrating impressive improvements of 19.94% and 15.2% in repair rate compared to prior state-of-the-art LLM- and symbolic-based approaches, respectively.

4/3/2024

cs.SE cs.AI