Magicoder: Empowering Code Generation with OSS-Instruct

Read original: arXiv:2312.02120 - Published 6/10/2024 by Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, Lingming Zhang

Overview

Magicoder is a series of open-source Large Language Models (LLMs) for code that significantly close the gap with top code models while having no more than 7B parameters.
The models are trained on 75K synthetic instruction examples generated with OSS-Instruct, a novel approach that prompts an LLM with open-source code snippets to produce diverse instruction data.
The goal is to mitigate the inherent bias of LLM-generated synthetic data by grounding generation in open-source references, which yields more realistic and controllable data.
The Magicoder models, including the enhanced MagicoderS, substantially outperform state-of-the-art code models of similar or even larger size on a wide range of coding benchmarks, with MagicoderS-CL-7B even surpassing ChatGPT on HumanEval+.
OSS-Instruct opens a new direction for creating diverse synthetic instruction data for code using open-source resources.

Plain English Explanation

Magicoder is a collection of powerful AI models for writing code that are freely available for anyone to use. These models are trained on a large amount of synthetic (artificially created) data using a new approach called OSS-Instruct. OSS-Instruct taps into the wealth of open-source code snippets online to generate diverse and realistic training data for the models.

The key idea is to overcome the inherent biases that can creep into synthetic data generated by AI models. By drawing on the vast repository of real-world code examples, Magicoder models can learn to write code that is more natural and applicable to real-world problems.

Despite having no more than 7 billion parameters, the Magicoder series, especially the enhanced MagicoderS, outperforms code models of similar or even larger size on a wide range of coding tasks. One model, MagicoderS-CL-7B, even surpasses ChatGPT on the HumanEval+ benchmark.

Overall, the Magicoder project demonstrates a new and promising way to create powerful AI assistants for coding by leveraging the abundance of open-source code available online. This could have significant implications for improving the capabilities of code generation models.

Technical Explanation

The researchers introduce Magicoder, a series of open-source Large Language Models (LLMs) for code that significantly close the gap with top code models while having no more than 7 billion parameters. The Magicoder models are trained on 75,000 synthetic instruction examples generated with a novel approach called OSS-Instruct.

OSS-Instruct uses open-source code snippets as seeds: an LLM is prompted with a snippet and asked to generate instruction data inspired by it, which yields diverse and realistic training data. This contrasts with generating synthetic data from an LLM alone, which inherits the model's own biases. By drawing on real-world code examples, the Magicoder models can learn to generate code that is more applicable to practical problems.
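The seed-and-prompt idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the prompt wording, the snippet length limits, and the names `extract_seed`, `build_prompt`, and `generate_examples` are all assumptions made for this example, and the teacher model is passed in as a plain callable rather than a real API client.

```python
import random
from typing import Callable, Dict, List

# Hypothetical prompt wording; the paper's actual template may differ.
PROMPT_TEMPLATE = (
    "Below is a code snippet taken from an open-source project.\n"
    "Use it as inspiration to write a self-contained coding problem "
    "and a correct solution.\n\n"
    "Snippet:\n{seed}\n\n"
    "Respond with a [Problem] section followed by a [Solution] section."
)


def extract_seed(source: str, rng: random.Random,
                 min_lines: int = 1, max_lines: int = 15) -> str:
    """Pick a random run of consecutive lines from one source file."""
    lines = source.splitlines()
    if not lines:
        raise ValueError("source is empty")
    # Clamp both bounds to the file length (assumed limits, not the paper's).
    length = rng.randint(min(min_lines, len(lines)), min(max_lines, len(lines)))
    start = rng.randint(0, len(lines) - length)
    return "\n".join(lines[start:start + length])


def build_prompt(seed: str) -> str:
    """Wrap the seed snippet in the instruction-generation prompt."""
    return PROMPT_TEMPLATE.format(seed=seed)


def generate_examples(sources: List[str],
                      teacher: Callable[[str], str],
                      rng: random.Random) -> List[Dict[str, str]]:
    """Produce one raw training example per source file.

    `teacher` stands in for whatever LLM generates the data.
    """
    examples = []
    for source in sources:
        seed = extract_seed(source, rng)
        examples.append({"seed": seed, "response": teacher(build_prompt(seed))})
    return examples
```

Because each prompt is anchored to a different real snippet, the diversity of the generated problems comes from the code corpus rather than from the teacher model's own preferences, which is the intuition behind the bias-mitigation claim.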

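The Magicoder series, including the enhanced MagicoderS, substantially outperforms state-of-the-art code models of similar or even larger size on a wide range of coding benchmarks. MagicoderS is built by combining OSS-Instruct with other data generation methods such as Evol-Instruct, which is possible because the approaches are orthogonal. Notably, MagicoderS-CL-7B, which is based on CodeLlama, even surpasses ChatGPT on the HumanEval+ benchmark (66.5 vs. 65.9 in pass@1).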
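HumanEval+ results are reported as pass@1, the share of problems solved by a single generated solution. As a rough sketch of how such a score is commonly computed, the code below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass the tests. The function names are hypothetical, and this does not claim to reproduce the paper's exact evaluation setup.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass all tests, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With one sample per problem (n = 1, k = 1), the estimate reduces to the fraction of problems whose single solution passes, so `benchmark_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 1)], 1)` evaluates to 0.75.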

The researchers argue that OSS-Instruct opens a new direction for crafting diverse synthetic instruction data for code generation models using abundant open-source references.

Critical Analysis

The Magicoder research presents a promising approach to improving the performance of code generation models at a modest size of 7B parameters or fewer. Using OSS-Instruct to ground data generation in open-source code snippets is an innovative way to address the potential biases in synthetic data.

However, the paper does not delve deeply into the specific limitations of the Magicoder models or the OSS-Instruct approach. It would be helpful to understand the types of biases or inaccuracies that may still persist in the generated code, even with the use of open-source references.

Additionally, the researchers could explore the potential challenges in scaling the OSS-Instruct approach, such as curating and processing a vast amount of open-source code. This could provide valuable insights for the broader research community working on improving code generation capabilities.

Overall, the Magicoder research represents an exciting step forward in the development of more accessible and capable code generation models. Continued exploration of this approach, along with a deeper analysis of its limitations and potential challenges, could lead to further advancements in the field.

Conclusion

The Magicoder project introduces a series of open-source Large Language Models for code that significantly close the gap with top models while having no more than 7B parameters. By leveraging open-source code snippets through the novel OSS-Instruct approach, the researchers mitigate the inherent bias of LLM-generated synthetic data and produce more realistic and controllable training data.

The Magicoder models, including the enhanced MagicoderS, have demonstrated impressive results on a wide range of coding benchmarks, with MagicoderS-CL-7B even surpassing ChatGPT on HumanEval+. This suggests that the OSS-Instruct approach holds promise for advancing the capabilities of code generation models.

As the research community continues to explore the potential of large language models for coding, the Magicoder project serves as an inspiring example of how open-source resources can be leveraged to create powerful and accessible AI assistants for software development.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Magicoder: Empowering Code Generation with OSS-Instruct

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, Lingming Zhang

We introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code that significantly closes the gap with top code models while having no more than 7B parameters. Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets to generate diverse instruction data for code. Our main motivation is to mitigate the inherent bias of the synthetic data generated by LLMs through the wealth of open-source references for the production of more realistic and controllable data. The orthogonality of OSS-Instruct and other data generation methods like Evol-Instruct further enables us to build an enhanced MagicoderS. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks. Notably, MagicoderS-CL-7B based on CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1). Overall, OSS-Instruct opens a new direction for crafting diverse synthetic instruction data for code using abundant open-source references.

6/10/2024

InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct

Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen

Recent advancements in open-source code large language models (LLMs) have demonstrated remarkable coding abilities by fine-tuning on the data generated from powerful closed-source LLMs such as GPT-3.5 and GPT-4 for instruction tuning. This paper explores how to further improve an instruction-tuned code LLM by generating data from itself rather than querying closed-source LLMs. Our key observation is the misalignment between the translation of formal and informal languages: translating formal language (i.e., code) to informal language (i.e., natural language) is more straightforward than the reverse. Based on this observation, we propose INVERSE-INSTRUCT, which summarizes instructions from code snippets instead of the reverse. Specifically, given an instruction tuning corpus for code and the resulting instruction-tuned code LLM, we ask the code LLM to generate additional high-quality instructions for the original corpus through code summarization and self-evaluation. Then, we fine-tune the base LLM on the combination of the original corpus and the self-generated one, which yields a stronger instruction-tuned LLM. We present a series of code LLMs named InverseCoder, which surpasses the performance of the original code LLMs on a wide range of benchmarks, including Python text-to-code generation, multilingual coding, and data-science code generation.

7/9/2024

WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning

Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, Qiufeng Yin

Recent work demonstrates that, after instruction tuning, Code Large Language Models (Code LLMs) can obtain impressive capabilities to address a wide range of code-related tasks. However, current instruction tuning methods for Code LLMs mainly focus on the traditional code generation task, resulting in poor performance in complex multi-task scenarios. In this paper, we concentrate on multiple code-related tasks and present WaveCoder, a series of Code LLMs trained with Widespread And Versatile Enhanced instruction data. To enable the models to tackle complex code-related tasks, we propose a method to stably generate diverse, high-quality instruction data from open source code dataset in multi-task scenarios and obtain CodeSeaXDataset, a dataset comprising 19,915 instruction instances across 4 code-related tasks, which is aimed at improving the generalization ability of Code LLM. Our experiments demonstrate that WaveCoder models significantly outperform other open-source models in terms of the generalization ability across different code-related tasks. Moreover, WaveCoder-Ultra-6.7B presents the state-of-the-art generalization abilities on a wide range of code-related tasks.

6/10/2024

AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

Zifan Song, Yudong Wang, Wenwei Zhang, Kuikun Liu, Chengqi Lyu, Demin Song, Qipeng Guo, Hang Yan, Dahua Lin, Kai Chen, Cairong Zhao

Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may insufficiently elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data. To achieve this, we pioneer to unveil inherent conflicts among the various styles and qualities in multi-source code corpora and introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence.

5/30/2024