MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution

arXiv: 2403.17927

Published 6/28/2024 by Wei Tao, Yucheng Zhou, Yanlin Wang, Wenqiang Zhang, Hongyu Zhang, Yu Cheng

Abstract

In software development, resolving the emergent issues within GitHub repositories is a complex challenge that involves not only the incorporation of new code but also the maintenance of existing code. Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving GitHub issues, particularly at the repository level. To overcome this challenge, we empirically study the reason why LLMs fail to resolve GitHub issues and analyze the major factors. Motivated by the empirical findings, we propose a novel LLM-based Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution: Manager, Repository Custodian, Developer, and Quality Assurance Engineer agents. This framework leverages the collaboration of various agents in the planning and coding process to unlock the potential of LLMs to resolve GitHub issues. In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude-2. MAGIS can resolve 13.94% of GitHub issues, significantly outperforming the baselines. Specifically, MAGIS achieves an eight-fold increase in resolved ratio over the direct application of GPT-4, the advanced LLM.

Overview

Introduces a new framework called MAGIS (LLM-Based Multi-Agent Framework for GitHub Issue ReSolution) for resolving issues on the GitHub platform
Leverages large language models (LLMs) and a multi-agent system to automate the GitHub issue resolution process
Performs an empirical study to evaluate the effectiveness of MAGIS compared to other approaches

Plain English Explanation

MAGIS is a new system that aims to make it easier to resolve issues on the GitHub platform, which is a popular website used by software developers to collaborate on code projects. The key idea behind MAGIS is to use large language models - powerful AI models that can understand and generate human-like text - to automate parts of the issue resolution process.

Traditionally, when a software developer encounters a problem or "issue" with a project on GitHub, they have to manually describe the issue, interact with other developers to understand and fix it, and then provide a solution. MAGIS aims to streamline this process by using a team of AI "agents" that can work together to understand the issue, propose solutions, and coordinate the overall resolution workflow.

The researchers behind MAGIS conducted an empirical study to evaluate how well their system performs compared to using LLMs directly. On the SWE-bench benchmark, MAGIS resolved 13.94% of the GitHub issues, an eight-fold increase in resolved ratio over GPT-4 applied directly, suggesting that this type of multi-agent AI framework could be a useful tool for software development teams.

Technical Explanation

The core of the MAGIS framework is a multi-agent system that leverages large language models (LLMs) to automate the planning and coding work involved in resolving GitHub issues. The system consists of four specialized agents customized for software evolution, each responsible for a different part of the resolution process:

Manager: Breaks the issue down into tasks and coordinates the work of the other agents.
Repository Custodian: Locates the files in the repository that are relevant to the issue.
Developer: Writes the code changes needed to carry out each task.
Quality Assurance Engineer: Reviews the Developer's code changes and gives feedback before they are accepted.
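The division of labor can be illustrated with a minimal Python sketch built around the four roles named in the abstract (Manager, Repository Custodian, Developer, Quality Assurance Engineer). It is not the authors' implementation: every function name, the Task type, the keyword-overlap file lookup, and the stubbed LLM callable are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
LLM = Callable[[str], str]


@dataclass
class Task:
    path: str              # file the task applies to
    description: str       # what should change in that file
    patch: str = ""        # proposed change, filled in by the developer step
    approved: bool = False


def custodian_locate(issue: str, repo: Dict[str, str]) -> List[str]:
    """Repository Custodian role: pick files that share words with the issue.
    Keyword overlap is a placeholder, not the paper's retrieval method."""
    words = {w.lower() for w in issue.split() if len(w) > 3}
    return [path for path, text in repo.items()
            if any(w in text.lower() for w in words)]


def manager_plan(issue: str, paths: List[str]) -> List[Task]:
    """Manager role: turn the issue into one task per candidate file."""
    return [Task(path=p, description=f"Address in {p}: {issue}") for p in paths]


def developer_patch(task: Task, llm: LLM) -> None:
    """Developer role: ask the model for a change to the task's file."""
    task.patch = llm(f"Edit {task.path}. {task.description}")


def qa_review(task: Task, llm: LLM) -> None:
    """Quality Assurance Engineer role: accept or reject the proposed change."""
    verdict = llm(f"Review this change to {task.path}:\n{task.patch}")
    task.approved = verdict.strip().lower().startswith("approve")


def resolve_issue(issue: str, repo: Dict[str, str], llm: LLM,
                  max_rounds: int = 2) -> List[Task]:
    """Run locate -> plan -> (develop -> review) with a bounded retry loop."""
    tasks = manager_plan(issue, custodian_locate(issue, repo))
    for task in tasks:
        for _ in range(max_rounds):
            developer_patch(task, llm)
            qa_review(task, llm)
            if task.approved:
                break
    return tasks


# Usage with a fake model, so the sketch runs without any API access.
def fake_llm(prompt: str) -> str:
    return "approve" if prompt.startswith("Review") else "# patch placeholder"


repo = {"parser.py": "def parse(text): ...", "README.md": "Project docs"}
tasks = resolve_issue("parse crashes on empty input", repo, fake_llm)
print([(t.path, t.approved) for t in tasks])
```

Keeping each role as a separate function makes the hand-offs explicit: the review step only sees the patch, and the bounded retry loop is one simple way to model a developer revising work after feedback.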

The researchers evaluated MAGIS on the SWE-bench benchmark, comparing it with the direct application of popular LLMs, including GPT-3.5, GPT-4, and Claude-2. MAGIS resolved 13.94% of the GitHub issues, an eight-fold increase in resolved ratio over GPT-4 applied directly, demonstrating the potential benefits of a multi-agent framework tailored to software evolution.
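To make the headline figures from the abstract concrete, the short Python sketch below defines a resolved ratio and derives the GPT-4 baseline implied by the reported numbers. The function name is hypothetical, and reading "eight-fold" as exactly 8 is an assumption, so the derived baseline is only approximate.

```python
def resolved_ratio(resolved: int, total: int) -> float:
    """Percentage of benchmark issues that a system resolves."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * resolved / total


magis_ratio = 13.94   # percent, as reported in the abstract
fold_increase = 8     # "eight-fold", taken as exactly 8 (assumption)

# Baseline implied by the two reported figures, not a number quoted from the paper.
implied_gpt4_ratio = magis_ratio / fold_increase
print(f"Implied GPT-4 resolved ratio: about {implied_gpt4_ratio:.2f}%")
```

Under that reading, direct use of GPT-4 would resolve fewer than 2 in every 100 issues, which shows how low the starting point was.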

Critical Analysis

The MAGIS paper presents a promising approach to automating GitHub issue resolution, but it also acknowledges several limitations and areas for further research. One key concern is the reliance on large language models, which can be opaque and difficult to interpret, potentially making it challenging to understand and debug the system's decision-making process.

Additionally, the empirical evaluation was limited to issues drawn from the SWE-bench benchmark, and it's unclear how well MAGIS would perform on a broader range of problems or in real-world production environments. The researchers also note the need for more advanced techniques to handle complex, multi-step solutions, as well as the potential for bias and fairness issues in the system's outputs.

Overall, the MAGIS framework represents an interesting step forward in the application of multi-agent AI systems to software engineering tasks. However, further research and development will be needed to address the system's current limitations and fully realize its potential benefits for GitHub issue resolution and beyond.

Conclusion

The MAGIS framework proposes a novel approach to automating the resolution of GitHub issues by leveraging large language models and a multi-agent system. The empirical study conducted by the researchers suggests that this type of domain-specific AI framework can outperform the direct application of LLMs such as GPT-4, highlighting the potential for AI-powered tools to enhance software development workflows.

While MAGIS shows promise, the paper also identifies several areas for improvement, such as addressing the interpretability and scalability of the system. As the field of AI continues to advance, solutions like MAGIS may become increasingly valuable for software teams, helping to streamline the issue resolution process and free up developers to focus on more complex and creative tasks.

This summary was produced with help from an AI and may contain inaccuracies.

Related Papers

Code Agents are State of the Art Software Testers

Niels Mündler, Mark Niklas Müller, Jingxuan He, Martin Vechev

Rigorous software testing is crucial for developing and maintaining high-quality code, making automated test generation a promising avenue for both improving software quality and boosting the effectiveness of code generation methods. However, while code generation with Large Language Models (LLMs) is an extraordinarily active research area, test generation remains relatively unexplored. We address this gap and investigate the capability of LLM-based Code Agents for formalizing user issues into test cases. To this end, we propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth patches, and golden tests. We find that LLMs generally perform surprisingly well at generating relevant test cases with Code Agents designed for code repair exceeding the performance of systems designed specifically for test generation. Further, as test generation is a similar but more structured task than code generation, it allows for a more fine-grained analysis using fail-to-pass rate and coverage metrics, providing a dual metric for analyzing systems designed for code repair. Finally, we find that generated tests are an effective filter for proposed code fixes, doubling the precision of SWE-Agent.

6/21/2024

cs.SE cs.AI cs.LG

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yin Fang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein

Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), requiring a deeper comprehension of complex file interactions. Also, recently, people have developed LLM agents that attempt to interact with repository code (e.g., compiling and evaluating its execution), prompting the need to evaluate their performance. These gaps have motivated our development of ML-Bench, a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. Addressing the need for LLMs to interpret long code contexts and translate instructions into precise, executable scripts, ML-Bench encompasses 9,641 annotated examples across 18 GitHub repositories, challenging LLMs to accommodate user-specified arguments and documentation intricacies effectively. To evaluate both LLMs and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment. Our findings indicate that while GPT-4o leads with a Pass@5 rate surpassing 50%, there remains significant scope for improvement, highlighted by issues such as hallucinated outputs and difficulties with bash script generation. Notably, in the more demanding ML-Agent-Bench, GPT-4o achieves a 76.47% success rate, reflecting the efficacy of iterative action and feedback in complex task resolution. Our code, dataset, and models are available at https://github.com/gersteinlab/ML-bench.

6/19/2024

cs.CL cs.AI

Domain-specific ReAct for physics-integrated iterative modeling: A case study of LLM agents for gas path analysis of gas turbines

Tao Song, Yuwei Fan, Chenlong Feng, Keyu Song, Chao Liu, Dongxiang Jiang

This study explores the application of large language models (LLMs) with callable tools in the energy and power engineering domain, focusing on gas path analysis of gas turbines. We developed a dual-agent tool-calling process to integrate expert knowledge, predefined tools, and LLM reasoning. We evaluated various LLMs, including Llama3, Qwen1.5 and GPT. Smaller models struggled with tool usage and parameter extraction, while larger models demonstrated favorable capabilities. All models faced challenges with complex, multi-component problems. Based on the test results, we infer that LLMs with nearly 100 billion parameters could meet professional scenario requirements with fine-tuning and advanced prompt design. Continued development is likely to enhance their accuracy and effectiveness, paving the way for more robust AI-driven solutions.

6/13/2024

cs.AI cs.CE cs.LG

Can Github issues be solved with Tree Of Thoughts?

Ricardo La Rosa, Corey Hulse, Bangdi Liu

While there have been extensive studies in code generation by large language models (LLMs), where benchmarks like HumanEval have been surpassed with an impressive 96.3% success rate, these benchmarks predominantly judge a model's performance on basic function-level code generation and lack the critical thinking and concept of scope required of real-world scenarios such as solving GitHub issues. This research introduces the application of the Tree of Thoughts (ToT) language model reasoning framework for enhancing the decision-making and problem-solving abilities of LLMs for this complex task. Compared to traditional input-output (IO) prompting and Retrieval Augmented Generation (RAG) techniques, ToT is designed to improve performance by facilitating a structured exploration of multiple reasoning trajectories and enabling self-assessment of potential solutions. We experimentally deploy ToT in tackling a GitHub issue contained within an instance of the SWE-bench. However, our results reveal that the ToT framework alone is not enough to give LLMs the critical reasoning capabilities to outperform existing methods. In this paper we analyze the potential causes of these shortcomings and identify key areas for improvement such as deepening the thought process and introducing agentic capabilities. The insights of this research are aimed at informing future directions for refining the application of ToT and better harnessing the potential of LLMs in real-world problem-solving scenarios.

5/24/2024

cs.SE cs.AI