Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM

2402.00097

Published 4/4/2024 by Gabriel Ryan, Siddhartha Jain, Mingyue Shang, Shiqi Wang, Xiaofei Ma, Murali Krishna Ramanathan, Baishakhi Ray

cs.SE cs.LG


Abstract

Testing plays a pivotal role in ensuring software quality, yet conventional Search Based Software Testing (SBST) methods often struggle with complex software units, achieving suboptimal test coverage. Recent works using large language models (LLMs) for test generation have focused on improving generation quality through optimizing the test generation context and correcting errors in model outputs, but use fixed prompting strategies that prompt the model to generate tests without additional guidance. As a result, LLM-generated test suites still suffer from low coverage. In this paper, we present SymPrompt, a code-aware prompting strategy for LLMs in test generation. SymPrompt's approach is based on recent work that demonstrates LLMs can solve more complex logical problems when prompted to reason about the problem in a multi-step fashion. We apply this methodology to test generation by deconstructing the test suite generation process into a multi-stage sequence, each stage of which is driven by a specific prompt aligned with the execution paths of the method under test, and exposing relevant type and dependency focal context to the model. Our approach enables pretrained LLMs to generate more complete test cases without any additional training. We implement SymPrompt using the TreeSitter parsing framework and evaluate on a benchmark of challenging methods from open source Python projects. SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2. Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.


Overview

The paper explores using large language models (LLMs) for automated test generation in a regression setting, where tests are generated so that later changes to the software can be checked for breakage.
The proposed approach, SymPrompt, is a code-aware prompting strategy: it breaks test suite generation into multiple stages, each driven by a prompt aligned with an execution path of the method under test, and exposes relevant type and dependency context to the model.
The researchers evaluated SymPrompt on a benchmark of challenging methods from open source Python projects, comparing it with baseline prompting strategies on CodeGen2 and GPT-4.

Plain English Explanation

Developing software often involves making changes to fix issues or add new features. When changes are made, it's important to thoroughly test the software to make sure nothing has broken or stopped working as expected. This is called regression testing.

Traditionally, regression testing has been a manual and time-consuming process, where human testers write new test cases by hand. However, this can be a difficult and error-prone task, especially as software becomes more complex.

The researchers in this paper explored using large language models (LLMs), which are AI systems that are trained on vast amounts of text data, to automatically generate new test cases for regression testing. LLMs have shown impressive abilities in tasks like generating human-like text, so the researchers wanted to see if they could also be used to create effective test cases.

The key insight of their approach, a code-aware prompting strategy called SymPrompt, is to build the prompts from the structure of the code being tested, in particular the execution paths through it. This steers the LLM toward tests that exercise parts of the code a generic prompt would likely miss.

Imagine you're trying to write a new test case for a calculator app. If you just ask the LLM to "write a test case for the calculator app," it might generate a test case that doesn't actually cover important functionality, like handling negative numbers or division by zero. But if you also tell it which path through the code each test should exercise, and give it the relevant context about the code, the LLM can generate a more targeted and effective test case.
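To make the calculator example concrete, here is a minimal Python sketch. The `divide` function, the test classes, and the `run_suite` helper are all hypothetical names invented for illustration; none of them come from the paper. The sketch contrasts the single happy-path test an unguided prompt might produce with the branch-targeted tests that a code-aware prompt is meant to elicit.

```python
import unittest


def divide(a: float, b: float) -> float:
    """Hypothetical calculator function with two execution paths."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    return a / b


class GenericTest(unittest.TestCase):
    """The kind of test an unguided prompt might produce: happy path only."""

    def test_divide(self):
        self.assertEqual(divide(6, 3), 2)


class TargetedTests(unittest.TestCase):
    """Tests aimed at the cases the generic test misses."""

    def test_divide_by_zero_raises(self):
        # Exercises the `b == 0` branch.
        with self.assertRaises(ZeroDivisionError):
            divide(1, 0)

    def test_divide_negative_operand(self):
        # Exercises the normal branch with a negative input.
        self.assertEqual(divide(-6, 3), -2)


def run_suite(case: type) -> unittest.TestResult:
    """Run one TestCase class and return its result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Calling `run_suite(TargetedTests)` runs both targeted tests. The generic test alone never reaches the `b == 0` branch, which is the kind of coverage gap the approach is designed to close.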

Technical Explanation

The paper presents SymPrompt, a code-aware prompting strategy for using LLMs to generate test cases in a regression testing setting. The key idea is to give the LLM not just the task of generating a test case, but also guidance derived from the code under test: prompts aligned with the execution paths of the method, plus relevant type and dependency context.

Specifically, the researchers use pretrained LLMs as they are, without any additional training or fine-tuning. The approach builds on recent work showing that LLMs can solve more complex logical problems when prompted to reason in multiple steps.

Test suite generation is broken into a multi-stage sequence. Each stage is driven by a prompt aligned with an execution path of the method under test, and the prompt exposes relevant type and dependency focal context to the model. The authors implement this using the TreeSitter parsing framework.
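As a rough illustration of building one prompt per execution path of the method under test, the sketch below enumerates paths through a small function and emits a prompt for each. It makes several assumptions: it uses Python's standard `ast` module as a stand-in for TreeSitter, it only models `if`, `return`, and `raise` (loops and exception handlers are ignored), and the prompt wording, the `divide` example, and the helper names `enumerate_paths` and `build_prompts` are hypothetical. It is not the paper's implementation or its actual prompts.

```python
import ast
import textwrap

# Hypothetical method under test, used only for illustration.
SOURCE = textwrap.dedent('''
    def divide(a, b):
        if b == 0:
            raise ZeroDivisionError("division by zero")
        if a < 0:
            return -(-a / b)
        return a / b
''')


def enumerate_paths(stmts, prefix=()):
    """Yield one tuple of branch constraints per path through `stmts`.

    Simplification: only `if`, `return`, and `raise` affect control flow.
    """
    for i, stmt in enumerate(stmts):
        if isinstance(stmt, (ast.Return, ast.Raise)):
            yield prefix  # the path ends here
            return
        if isinstance(stmt, ast.If):
            cond = ast.unparse(stmt.test)  # requires Python 3.9+
            rest = stmts[i + 1:]
            # Fork: one continuation where the condition holds, one where it does not.
            yield from enumerate_paths(stmt.body + rest, prefix + (cond,))
            yield from enumerate_paths(stmt.orelse + rest, prefix + (f"not ({cond})",))
            return
    yield prefix  # fell off the end of the block


def build_prompts(source):
    """Return one prompt string per enumerated path of the first function in `source`."""
    func = ast.parse(source).body[0]
    prompts = []
    for constraints in enumerate_paths(func.body):
        conds = " and ".join(constraints) or "no branch conditions"
        prompts.append(
            f"{source.strip()}\n\n"
            f"# Write a test for `{func.name}` that follows the path where: {conds}\n"
        )
    return prompts
```

For the `divide` example, `build_prompts(SOURCE)` yields three prompts, one per path. Each prompt would then be sent to the model in its own generation stage, so that every path gets a test written specifically for it.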

The researchers evaluated the approach on a benchmark of challenging methods from open source Python projects and compared it to baseline prompting strategies. For CodeGen2, SymPrompt increased correct test generations by a factor of 5 and relative coverage by 26%. Applied to GPT-4, it improved coverage by over 2x compared to the baselines.
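Coverage gains of this kind are usually reported as relative improvements over a baseline. The short sketch below shows that arithmetic, assuming the standard definition (new minus baseline, divided by baseline). The coverage values are hypothetical, chosen only to illustrate the calculation; they are not figures from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative change from baseline to improved (0.26 means +26%)."""
    if baseline <= 0:
        raise ValueError("baseline coverage must be positive")
    return (improved - baseline) / baseline


# Hypothetical coverage fractions, not results from the paper.
baseline_cov = 0.50
guided_cov = 0.63
gain = relative_improvement(baseline_cov, guided_cov)
print(f"relative gain: {gain:.0%}")  # prints "relative gain: 26%"
```

Under this definition, a relative gain of 26% on a 50% baseline means 63% absolute coverage, and a 2x improvement corresponds to a relative gain of 100%.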

Critical Analysis

The paper presents a promising approach to leveraging LLMs for automated test case generation, and the results demonstrate the value of incorporating code structure and coverage information into the prompting process. However, there are a few potential limitations and areas for further research:

The evaluation targets individual methods drawn from open source Python projects, so it's unclear how the approach would scale to larger, more complex codebases in real-world settings.
The reported evaluation covers Python only, so it's unclear how well the approach would work for other programming languages.
The summary here has no information on the computational cost and time of the multi-stage prompting process, which could be an important practical consideration.
The paper focuses on generating test cases, but does not address the problem of automatically verifying and assessing the generated tests, which is another key challenge in test automation.

Overall, SymPrompt represents an interesting and promising direction for leveraging LLMs in software engineering tasks. Further research is needed to understand the broader applicability and practical considerations of this technique.

Conclusion

This paper presents SymPrompt, a code-aware prompting strategy that uses large language models to automatically generate test cases for regression testing. By aligning prompts with the execution paths of the method under test and supplying relevant type and dependency context, it gets pretrained LLMs to generate more complete test cases than baseline prompting strategies, without any additional training.

The results of the experiments demonstrate the potential of this approach to improve the efficiency and effectiveness of regression testing, which is a crucial task in software development. While there are still some limitations and areas for further research, this work represents an important step towards leveraging the power of large language models to automate and enhance software engineering processes.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper for details.
