Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository

2405.01573

Published 6/6/2024 by Ajinkya Deshpande, Anmol Agarwal, Shashank Shet, Arun Iyer, Aditya Kanade, Ramakrishna Bairi, Suresh Parthasarathy

cs.SE cs.AI

Abstract

LLMs have demonstrated significant potential in code generation tasks, achieving promising results at the function or statement level across various benchmarks. However, the complexities associated with creating code artifacts like classes, particularly within the context of real-world software repositories, remain underexplored. Prior research treats class-level generation as an isolated task, neglecting the intricate dependencies and interactions that characterize real-world software environments. To address this gap, we introduce RepoClassBench, a comprehensive benchmark designed to rigorously evaluate LLMs in generating complex, class-level code within real-world repositories. RepoClassBench includes Natural Language to Class generation tasks across Java, Python, and C# from a selection of repositories. We ensure that each class in our dataset not only has cross-file dependencies within the repository but also includes corresponding test cases to verify its functionality. We find that current models struggle with the realistic challenges posed by our benchmark, primarily due to their limited exposure to relevant repository contexts. To address this shortcoming, we introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate and reason about repository-level context in an agent-based framework. Our experiments demonstrate that RRR significantly outperforms existing baselines on RepoClassBench, showcasing its effectiveness across programming languages and under various settings. Our findings emphasize the critical need for code-generation benchmarks to incorporate repo-level dependencies to more accurately reflect the complexities of software development. Our work shows the benefits of leveraging specialized tools to enhance LLMs' understanding of repository context. We plan to make our dataset and evaluation harness public.

Overview

The paper explores the challenges of using large language models (LLMs) for generating complex, class-level code within real-world software repositories.
It introduces a new benchmark, RepoClassBench, that evaluates LLMs on natural language to class generation tasks across Java, Python, and C#, incorporating repository-level dependencies and test cases.
The paper finds that current models struggle with the realistic challenges posed by the benchmark, and proposes a novel approach called Retrieve-Repotools-Reflect (RRR) to enhance LLMs' understanding of repository context.

Plain English Explanation

Large language models (LLMs) have shown promising results in generating code at the function or statement level. However, the complexities of creating entire classes of code, especially within the context of real-world software projects, remain underexplored.

The researchers created a new benchmark called RepoClassBench to evaluate how well LLMs can generate complex, class-level code within actual software repositories. The benchmark includes natural language to class generation tasks in Java, Python, and C#. Each class in it depends on other files in its repository and comes with test cases that verify its behavior.

The researchers found that current LLMs struggle with the realistic challenges posed by RepoClassBench, primarily because these models have limited exposure to the relevant repository-level context. To address this, the researchers developed a new approach called Retrieve-Repotools-Reflect (RRR). RRR equips LLMs with static analysis tools to help them better navigate and reason about the repository context in an iterative, agent-based framework.

The experiments showed that RRR significantly outperforms existing baselines on RepoClassBench, demonstrating its effectiveness across programming languages and in various settings. This research highlights the importance of benchmarks that incorporate repository-level dependencies to more accurately reflect the complexities of real-world software development. It also illustrates the benefits of leveraging specialized tools to enhance LLMs' understanding of the software development environment.

Technical Explanation

The paper introduces RepoClassBench, a new benchmark designed to rigorously evaluate LLMs in generating complex, class-level code within real-world software repositories. Unlike previous research that often treats class-level generation as an isolated task, RepoClassBench ensures that each class in the dataset has cross-file dependencies within the repository and includes corresponding test cases to verify its functionality.
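The two inclusion criteria described above (cross-file dependencies and accompanying test cases) can be made concrete with a small sketch. Everything here is an assumption for illustration: the `ClassGenTask` record, its field names, and the `is_eligible` filter are hypothetical and are not taken from the paper or its evaluation harness.

```python
from dataclasses import dataclass, field


@dataclass
class ClassGenTask:
    """Hypothetical record for one natural-language-to-class task."""
    repo: str
    target_file: str   # file in which the class should be generated
    description: str   # natural language specification of the class
    # Repository files that the reference class refers to.
    cross_file_deps: list[str] = field(default_factory=list)
    # Test files that exercise the class.
    test_files: list[str] = field(default_factory=list)


def is_eligible(task: ClassGenTask) -> bool:
    """Keep a task only if it depends on other files and has tests."""
    deps_elsewhere = [d for d in task.cross_file_deps if d != task.target_file]
    return bool(deps_elsewhere) and bool(task.test_files)


def filter_tasks(tasks: list[ClassGenTask]) -> list[ClassGenTask]:
    """Drop candidates that are self-contained or untested."""
    return [t for t in tasks if is_eligible(t)]
```

Under this assumed filter, a class that only references its own file, or one without any tests, would be excluded from the benchmark.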

The researchers evaluate the performance of current LLMs on the RepoClassBench tasks and find that they struggle with the realistic challenges posed by the benchmark. To address this limitation, the paper proposes a novel approach called Retrieve-Repotools-Reflect (RRR). RRR is an agent-based framework that equips LLMs with static analysis tools to iteratively navigate and reason about the repository-level context.
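A minimal sketch of such an iterative loop is shown below, under stated assumptions. The summary does not name the tools RRR uses or how they are wired together, so the toy repository `REPO`, the single tool `get_class_methods`, the `iterative_generate` function, and the three stub callables are all hypothetical. In a real system, an LLM would play the roles of `generate` and `pick_tools`, and a test or compiler run would play the role of `oracle`. The initial retrieval step is omitted for brevity.

```python
import ast

# Toy in-memory "repository" (made up for this sketch): path -> source text.
REPO = {
    "shapes/base.py": (
        "class Shape:\n"
        "    def area(self):\n"
        "        raise NotImplementedError\n"
    ),
}


def get_class_methods(class_name: str) -> list[str]:
    """Toy static-analysis tool: list the methods of a class found in REPO."""
    for source in REPO.values():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.ClassDef) and node.name == class_name:
                return [n.name for n in node.body if isinstance(n, ast.FunctionDef)]
    return []


TOOLS = {"get_class_methods": get_class_methods}


def iterative_generate(description, generate, oracle, pick_tools, max_rounds=3):
    """Generate code, then alternate feedback, tool calls, and regeneration."""
    notes: list[str] = []
    code = generate(description, notes)
    for _ in range(max_rounds):
        feedback = oracle(code)
        if not feedback:  # an empty string means the oracle is satisfied
            break
        for tool_name, arg in pick_tools(feedback):
            result = TOOLS[tool_name](arg)
            notes.append(f"{tool_name}({arg!r}) -> {result}")
        notes.append(f"feedback: {feedback}")
        code = generate(description, notes)
    return code, notes


# Stubs standing in for the LLM and the test oracle.
def stub_generate(description, notes):
    # Guesses a wrong method name until a tool result reveals 'area'.
    method = "area" if any("'area'" in n for n in notes) else "get_area"
    return f"class Square(Shape):\n    def {method}(self):\n        return 4\n"


def stub_oracle(code):
    return "" if "def area" in code else "Square does not override Shape.area"


def stub_pick_tools(feedback):
    return [("get_class_methods", "Shape")]


code, notes = iterative_generate(
    "A square shape", stub_generate, stub_oracle, stub_pick_tools
)
```

In this toy run the first attempt fails the oracle, the tool call exposes the `area` method of `Shape`, and the second attempt uses that information. The point of the design is that repository facts reach the model on demand instead of being packed into the initial prompt.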

The experiments demonstrate that RRR significantly outperforms existing baselines on RepoClassBench, showcasing its effectiveness across programming languages (Java, Python, and C#) and in various settings. This work emphasizes the need for benchmarks that incorporate repository-level dependencies to more accurately reflect the complexities of software development, and illustrates the benefits of leveraging specialized tools to enhance LLMs' understanding of the software development environment.

Critical Analysis

The paper acknowledges that while LLMs have shown promising results in code generation tasks at the function or statement level, the complexities associated with creating class-level code artifacts within real-world software repositories remain underexplored.

The introduction of RepoClassBench is a valuable contribution, as it provides a more realistic and challenging benchmark for evaluating LLMs in class-level code generation tasks. By incorporating repository-level dependencies and test cases, the benchmark better reflects the complexities of software development than previous approaches that treated class-level generation as an isolated task.

However, the paper does not delve into potential limitations or caveats of the RepoClassBench dataset itself. For example, it would be useful to understand the diversity and representativeness of the selected repositories, as well as any potential biases or idiosyncrasies in the dataset that could impact the generalizability of the results.

Additionally, while the Retrieve-Repotools-Reflect (RRR) approach demonstrates promising performance on the benchmark, the paper could have provided more details on the specific static analysis tools used and how they were integrated into the agent-based framework. This information would help readers better understand the technical implementation and potential areas for further refinement or extension.

Overall, the research presented in this paper is a valuable step towards addressing the challenges of using LLMs for complex, class-level code generation in real-world software development contexts. The introduction of RepoClassBench and the RRR approach represent important advancements in the field, and the findings highlight the need for continued research in this area.

Conclusion

This paper tackles the challenge of using large language models (LLMs) for generating complex, class-level code within the context of real-world software repositories. The researchers introduce a new benchmark, RepoClassBench, which evaluates LLMs on natural language to class generation tasks across Java, Python, and C#, incorporating repository-level dependencies and test cases.

The paper's key findings demonstrate that current LLMs struggle with the realistic challenges posed by the RepoClassBench benchmark, primarily due to their limited exposure to relevant repository-level context. To address this limitation, the researchers propose a novel approach called Retrieve-Repotools-Reflect (RRR), which equips LLMs with static analysis tools to enhance their understanding of the software development environment.

The experiments show that RRR significantly outperforms existing baselines on RepoClassBench, underscoring the importance of benchmarks that incorporate repository-level dependencies to more accurately reflect the complexities of real-world software development. This research also illustrates the potential benefits of leveraging specialized tools to improve LLMs' capabilities in complex, context-dependent tasks, such as class-level code generation within software repositories.

The findings of this paper have important implications for the continued development and application of LLMs in software engineering tasks, as well as for the design of more representative benchmarks for evaluating these models in real-world settings.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Enhancing Repository-Level Code Generation with Integrated Contextual Information

Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang

Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, repository-level code generation presents unique challenges, particularly due to the need to utilize information spread across multiple files within a repository. Existing retrieval-based approaches sometimes fall short as they are limited in obtaining a broader and deeper repository context. In this paper, we present CatCoder, a novel code generation framework designed for statically typed programming languages. CatCoder enhances repository-level code generation by integrating relevant code and type context. Specifically, it leverages static analyzers to extract type dependencies and merges this information with retrieved code to create comprehensive prompts for LLMs. To evaluate the effectiveness of CatCoder, we adapt and construct benchmarks that include 199 Java tasks and 90 Rust tasks. The results show that CatCoder outperforms the RepoCoder baseline by up to 17.35%, in terms of pass@k score. Furthermore, the generalizability of CatCoder is assessed using various LLMs, including both code-specialized models and general-purpose models. Our findings indicate consistent performance improvements across all models, which underlines the practicality of CatCoder.

6/6/2024

cs.SE cs.AI

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yin Fang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein

Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), requiring a deeper comprehension of complex file interactions. Also, recently, people have developed LLM agents that attempt to interact with repository code (e.g., compiling and evaluating its execution), prompting the need to evaluate their performance. These gaps have motivated our development of ML-Bench, a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. Addressing the need for LLMs to interpret long code contexts and translate instructions into precise, executable scripts, ML-Bench encompasses 9,641 annotated examples across 18 GitHub repositories, challenging LLMs to accommodate user-specified arguments and documentation intricacies effectively. To evaluate both LLMs and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment. Our findings indicate that while GPT-4o leads with a Pass@5 rate surpassing 50%, there remains significant scope for improvement, highlighted by issues such as hallucinated outputs and difficulties with bash script generation. Notably, in the more demanding ML-Agent-Bench, GPT-4o achieves a 76.47% success rate, reflecting the efficacy of iterative action and feedback in complex task resolution. Our code, dataset, and models are available at https://github.com/gersteinlab/ML-bench.

6/19/2024

cs.CL cs.AI

REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark

Nam Le Hai, Dung Manh Nguyen, Nghi D. Q. Bui

The ability of CodeLLMs to generate executable and functionally correct code at the repository-level scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository-level scale. RepoExec focuses on three main aspects: executability, functional correctness through automated test case generation with high coverage rate, and carefully crafted cross-file contexts to accurately generate code. Our work explores a controlled scenario where developers specify necessary code dependencies, challenging the model to integrate these accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel in utilizing provided dependencies and demonstrating debugging capabilities. We also introduce a new instruction-tuned dataset that focuses on code dependencies and demonstrate that CodeLLMs fine-tuned on our dataset have a better capability to leverage these dependencies effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code can be found at https://github.com/FSoft-AI4Code/RepoExec.

6/21/2024

cs.SE cs.AI

Code Agents are State of the Art Software Testers

Niels Mündler, Mark Niklas Müller, Jingxuan He, Martin Vechev

Rigorous software testing is crucial for developing and maintaining high-quality code, making automated test generation a promising avenue for both improving software quality and boosting the effectiveness of code generation methods. However, while code generation with Large Language Models (LLMs) is an extraordinarily active research area, test generation remains relatively unexplored. We address this gap and investigate the capability of LLM-based Code Agents for formalizing user issues into test cases. To this end, we propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth patches, and golden tests. We find that LLMs generally perform surprisingly well at generating relevant test cases with Code Agents designed for code repair exceeding the performance of systems designed specifically for test generation. Further, as test generation is a similar but more structured task than code generation, it allows for a more fine-grained analysis using fail-to-pass rate and coverage metrics, providing a dual metric for analyzing systems designed for code repair. Finally, we find that generated tests are an effective filter for proposed code fixes, doubling the precision of SWE-Agent.

6/21/2024

cs.SE cs.AI cs.LG