Don't Complete It! Preventing Unhelpful Code Completion for Productive and Sustainable Neural Code Completion Systems

Read original: arXiv:2209.05948 - Published 8/12/2024 by Zhensu Sun, Xiaoning Du, Fu Song, Shangwen Wang, Mingze Ni, Li Li, David Lo

Overview

Large pre-trained language models are widely used in neural code completion systems.
However, around 70% of the code completions that GitHub Copilot displays are not accepted by developers.
This means the models' assistance to developer productivity is limited, and may even increase developers' workload.
Additionally, the high cost of running these large models is a significant waste of computing resources and energy, which goes against the principle of sustainable AI development.

Plain English Explanation

The paper focuses on a problem with large pre-trained language models used for neural code completion. These models can generate code completions as developers type, but a large portion of the completions (around 70%) are not actually accepted by the developers. This means the models are not providing very helpful assistance, and may even be making the developers' jobs more difficult by generating unhelpful suggestions.

The high cost of running these large models is also a significant problem, as it wastes a lot of computing resources and energy. This goes against the goal of developing AI technologies in a sustainable way.

The researchers wanted to understand why the models are generating so many unhelpful code completions, and find a way to prevent this from happening in a more cost-effective manner.

Technical Explanation

The researchers first investigated the "prompts" (the code context that the model uses to generate a completion) that led to unhelpful code completions, which they call "low-return" prompts. They identified four observable patterns in these prompts. Each pattern lacks information the model needs, so the problem is difficult to address by improving the model's accuracy alone. It also means that low-return prompts can be identified from the prompts themselves.

Motivated by this finding, the researchers proposed an "early-rejection" mechanism to identify and reject the low-return prompts before sending them to the large language model. This would prevent the model from generating unhelpful completions, saving computing resources and energy.
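The control flow of such a gate can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the class `EarlyRejectionGate`, the threshold value, the token-count heuristic in `toy_estimator`, and the `fake_model` stand-in for the large model are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EarlyRejectionGate:
    """Hypothetical gate that sits in front of an expensive completion model."""
    estimator: Callable[[str], float]  # assumed: prompt -> estimated quality in [0, 1]
    complete: Callable[[str], str]     # assumed: stands in for the large-model call
    threshold: float = 0.5             # assumed cut-off, would be tuned in practice

    def request(self, prompt: str) -> Optional[str]:
        # Score the prompt alone; the large model is not involved yet.
        if self.estimator(prompt) < self.threshold:
            return None  # rejected early: no completion is generated
        return self.complete(prompt)


def toy_estimator(prompt: str) -> float:
    # Toy heuristic (not from the paper): more context tokens, higher score.
    return min(len(prompt.split()) / 10.0, 1.0)


model_calls: List[str] = []


def fake_model(prompt: str) -> str:
    # Records each call so we can see which prompts reached the "model".
    model_calls.append(prompt)
    return prompt + " <completion>"


gate = EarlyRejectionGate(estimator=toy_estimator, complete=fake_model)
short_result = gate.request("x =")
long_result = gate.request("def add(a, b): return a + b # sum of")
```

In this sketch the short prompt is turned away before any model call, while the longer prompt is forwarded. A real estimator would be a learned model, but the gating logic around it would look much the same.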

The researchers explored five different types of "estimators" (predictive models) that estimate, from the prompt alone, the quality of the completion it would receive, and so determine which prompts to reject. Their experiments showed that the estimator could reject 20% of the code completion requests with 97.4% precision, meaning that nearly all of the rejected prompts would indeed have received unhelpful completions.
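To make the "reject 20% at a given precision" figure concrete, the sketch below computes the precision of rejection at a fixed rejection rate. It assumes precision means the share of rejected prompts whose completions were in fact unhelpful; the function name `rejection_precision` and the toy scores and labels are invented for illustration and are not data from the paper.

```python
from typing import Sequence


def rejection_precision(
    scores: Sequence[float],
    unhelpful: Sequence[bool],
    reject_rate: float,
) -> float:
    """Reject the lowest-scoring fraction of prompts and return the share
    of rejected prompts that were truly unhelpful (assumed definition)."""
    if len(scores) != len(unhelpful):
        raise ValueError("scores and labels must have the same length")
    n_reject = int(len(scores) * reject_rate)
    if n_reject == 0:
        raise ValueError("reject_rate is too small to reject any prompt")
    # Indices sorted by estimated quality, lowest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rejected = order[:n_reject]
    return sum(unhelpful[i] for i in rejected) / n_reject


# Toy data: estimated quality per prompt, and whether the completion
# turned out to be unhelpful (for example, not accepted).
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.05, 0.95]
unhelpful = [True, False, True, False, False, False, True, False, True, False]

precision_at_20 = rejection_precision(scores, unhelpful, 0.2)
precision_at_40 = rejection_precision(scores, unhelpful, 0.4)
```

Raising the rejection rate saves more model calls but pulls in prompts the estimator is less sure about, so precision typically drops, as it does between the two rates in this toy data.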

Critical Analysis

The paper does a thorough job of identifying and quantifying the problem of unhelpful code completions generated by large language models. The researchers' analysis of the low-return prompts provides valuable insights into the underlying issues, which go beyond simply improving the model's accuracy.

The proposed early-rejection mechanism is a promising solution, as it could significantly reduce the waste of computing resources and energy without severely impacting the overall code completion service. However, the paper does not explore the potential impact on developer productivity or user experience, which would be important to consider in a real-world deployment.

Additionally, the researchers note that the estimators used in the early-rejection mechanism were trained on a limited dataset, so further research may be needed to ensure the approach generalizes well to a wider range of code completion scenarios.

Conclusion

This paper highlights an important, yet overlooked, issue with the widespread use of large pre-trained language models in neural code completion systems. By identifying the patterns in low-return prompts and proposing an early-rejection mechanism, the researchers have taken a significant step towards addressing the waste and inefficiency of these models.

The findings and proposed solution have implications for the sustainable development of AI technologies, as well as the overall productivity and user experience of developers using code completion tools. Further research and development in this area could lead to more efficient and effective neural code completion systems that better serve the needs of software developers.

Related Papers

Don't Complete It! Preventing Unhelpful Code Completion for Productive and Sustainable Neural Code Completion Systems

Zhensu Sun, Xiaoning Du, Fu Song, Shangwen Wang, Mingze Ni, Li Li, David Lo

Currently, large pre-trained language models are widely applied in neural code completion systems. Though large code models significantly outperform their smaller counterparts, around 70% of displayed code completions from Github Copilot are not accepted by developers. Being reviewed but not accepted, their help to developer productivity is considerably limited and may conversely aggravate the workload of developers, as the code completions are automatically and actively generated in state-of-the-art code completion systems as developers type out once the service is enabled. Even worse, considering the high cost of the large code models, it is a huge waste of computing resources and energy, which severely goes against the sustainable development principle of AI technologies. However, such waste has never been realized, not to mention effectively addressed, in the research community for neural code completion. Hence, preventing such unhelpful code completions from happening in a cost-friendly way is of urgent need. To fill this significant gap, we first investigate the prompts of unhelpful code completions, called low-return prompts. We empirically identify four observable patterns in low-return prompts, each lacking necessary information, making it difficult to address through enhancements to the model's accuracy alone. This demonstrates the feasibility of identifying such low-return prompts based on the prompts themselves. Motivated by this finding, we propose an early-rejection mechanism to turn down low-return prompts by foretelling the code completion qualities. The prompts that are estimated to receive unhelpful code completions will not be sent to the model. Furthermore, we investigated five types of estimators to demonstrate the feasibility of the mechanism. The experimental results show that the estimator can reject 20% of code completion requests with a 97.4% Precision.

8/12/2024

A Transformer-Based Approach for Smart Invocation of Automatic Code Completion

Aral de Moor, Arie van Deursen, Maliheh Izadi

Transformer-based language models are highly effective for code completion, with much research dedicated to enhancing the content of these completions. Despite their effectiveness, these models come with high operational costs and can be intrusive, especially when they suggest too often and interrupt developers who are concentrating on their work. Current research largely overlooks how these models interact with developers in practice and neglects to address when a developer should receive completion suggestions. To tackle this issue, we developed a machine learning model that can accurately predict when to invoke a code completion tool given the code context and available telemetry data. To do so, we collect a dataset of 200k developer interactions with our cross-IDE code completion plugin and train several invocation filtering models. Our results indicate that our small-scale transformer model significantly outperforms the baseline while maintaining low enough latency. We further explore the search space for integrating additional telemetry data into a pre-trained transformer directly and obtain promising results. To further demonstrate our approach's practical potential, we deployed the model in an online environment with 34 developers and provided real-world insights based on 74k actual invocations.

5/24/2024

Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach

Yao Wan, Guanghua Wan, Shijie Zhang, Hongyu Zhang, Pan Zhou, Hai Jin, Lichao Sun

Recent years have witnessed significant progress in developing deep learning-based models for automated code completion. Although using source code in GitHub has been a common practice for training deep-learning-based models for code completion, it may induce some legal and ethical issues such as copyright infringement. In this paper, we investigate the legal and ethical issues of current neural code completion models by answering the following question: Is my code used to train your neural code completion model? To this end, we tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks to a more challenging task of code completion. In particular, since the target code completion models perform as opaque black boxes, preventing access to their training data and parameters, we opt to train multiple shadow models to mimic their behavior. The acquired posteriors from these shadow models are subsequently employed to train a membership classifier. Subsequently, the membership classifier can be effectively employed to deduce the membership status of a given code sample based on the output of a target code completion model. We comprehensively evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models, (i.e., LSTM-based, CodeGPT, CodeGen, and StarCoder). Experimental results reveal that the LSTM-based and CodeGPT models suffer the membership leakage issue, which can be easily detected by our proposed membership inference approach with an accuracy of 0.842, and 0.730, respectively. Interestingly, our experiments also show that the data membership of current large language models of code, e.g., CodeGen and StarCoder, is difficult to detect, leaving ampler space for further improvement. Finally, we also try to explain the findings from the perspective of model memorization.

9/10/2024

Optimizing Large Language Models for OpenAPI Code Completion

Bohdan Petryshyn, Mantas Lukoševičius

Recent advancements in Large Language Models (LLMs) and their utilization in code generation tasks have significantly reshaped the field of software development. Despite the remarkable efficacy of code completion solutions in mainstream programming languages, their performance lags when applied to less ubiquitous formats such as OpenAPI definitions. This study evaluates the OpenAPI completion performance of GitHub Copilot, a prevalent commercial code completion tool, and proposes a set of task-specific optimizations leveraging Meta's open-source model Code Llama. A semantics-aware OpenAPI completion benchmark proposed in this research is used to perform a series of experiments through which the impact of various prompt-engineering and fine-tuning techniques on the Code Llama model's performance is analyzed. The fine-tuned Code Llama model reaches a peak correctness improvement of 55.2% over GitHub Copilot despite utilizing 25 times fewer parameters than the commercial solution's underlying Codex model. Additionally, this research proposes an enhancement to a widely used code infilling training technique, addressing the issue of underperformance when the model is prompted with context sizes smaller than those used during training. The dataset, the benchmark, and the model fine-tuning code are made publicly available.

6/12/2024