Scaling Granite Code Models to 128K Context

Read original: arXiv:2407.13739 - Published 7/19/2024 by Matt Stallone, Vaibhav Saxena, Leonid Karlinsky, Bridget McGinn, Tim Bula, Mayank Mishra, Adriana Meza Soria, Gaoyuan Zhang, Aditya Prasad, Yikang Shen and 12 others

Scaling Granite Code Models to 128K Context

Overview

This paper presents a new approach for scaling Granite code models to a 128K context length, which allows them to process and generate much longer text compared to previous models.
The authors introduce several key innovations, including efficient long context embeddings, a novel hierarchical transformer architecture, and techniques for training large models on limited computational resources.
The resulting models demonstrate strong performance on a range of long-form code generation and understanding benchmarks, outperforming previous state-of-the-art approaches.

Plain English Explanation

The paper describes a new way to build AI models that can work with much longer pieces of code or text than before. Previous AI models for code and language were limited in the amount of context they could handle, often only a few thousand words.

The researchers developed several new techniques to allow their "Granite" models to work with up to 128,000 words of context. This is a huge increase that opens up new possibilities for more complex and nuanced language understanding and generation.

Some of the key innovations include:

Efficient long context embeddings: A new way to efficiently represent long pieces of text so the model can understand them.
Hierarchical transformer architecture: A model design that can effectively process extremely long inputs.
Techniques for training large models on limited computational resources: Methods to make it feasible to train these big models even on moderate hardware.

The resulting Granite models showed strong performance on benchmarks testing their ability to generate and understand long-form code and text. This represents a significant advance in the field of natural language AI.

Technical Explanation

The paper introduces a new family of "Granite" code models that can handle context lengths up to 128,000 tokens, a major increase over prior state-of-the-art models.

Key technical innovations include:

Efficient long context embeddings: The authors propose a new embedding scheme that can compactly represent long sequences of text, enabling the model to process vast contexts efficiently.
Hierarchical transformer architecture: The model uses a multi-scale transformer design, with higher-level transformers operating on compressed representations of lower-level context. This allows the model to effectively capture long-range dependencies.
Training techniques for large models: The authors demonstrate methods to train these massive models on limited computational resources, including gradient checkpointing and mixed precision training.

Experiments show the Granite models outperforming prior work on a range of long-form code generation and understanding benchmarks. The Granite model family also demonstrates strong zero-shot transfer capabilities, suggesting the models have learned powerful general representations.

Critical Analysis

The paper makes a compelling case for the importance of long context modeling and presents a promising new approach. However, some potential limitations and areas for further research are worth considering:

The experiments focus primarily on code-related tasks, so it's unclear how well the models would generalize to other long-form domains like books, academic papers, or lengthy dialogues.
The authors acknowledge that training these massive models is computationally intensive, which could limit their accessibility and practical deployability, especially for smaller organizations.
While the hierarchical transformer architecture is innovative, its performance may be sensitive to hyperparameter choices and architectural details that are not fully explored in the paper.
The authors do not provide much analysis on the types of long-range dependencies the Granite models are able to capture, nor do they investigate potential biases or failure modes of the system.

Overall, this work represents a valuable contribution to the field of long context modeling, but further research is needed to fully understand the capabilities and limitations of this approach.

Conclusion

The Granite code model family introduced in this paper demonstrates a significant advance in the ability of AI systems to process and generate long-form text and code. By innovating on efficient long context embeddings, hierarchical transformer architectures, and training techniques for large models, the authors have pushed the boundaries of what is possible with language AI.

While there are still some open questions and areas for further research, the strong performance of the Granite models on long-form benchmarks suggests they could enable a wide range of new applications, from more sophisticated code assistants to AI systems capable of engaging in nuanced, extended dialogue. This work represents an important step forward in the quest to build AI systems that can truly understand and interact with the world at human-level scales of context and complexity.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Scaling Granite Code Models to 128K Context

Matt Stallone, Vaibhav Saxena, Leonid Karlinsky, Bridget McGinn, Tim Bula, Mayank Mishra, Adriana Meza Soria, Gaoyuan Zhang, Aditya Prasad, Yikang Shen, Saptha Surendran, Shanmukha Guttula, Hima Patel, Parameswaran Selvam, Xuan-Hong Dang, Yan Koyfman, Atin Sood, Rogerio Feris, Nirmit Desai, David D. Cox, Ruchir Puri, Rameswar Panda

This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling context length of Granite 3B/8B code models from 2K/4K to 128K consists of a light-weight continual pretraining by gradually increasing its RoPE base frequency with repository-level file packing and length-upsampled long-context data. Additionally, we also release instruction-tuned models with long-context support which are derived by further finetuning the long context base models on a mix of permissively licensed short and long-context instruction-response pairs. While comparing to the original short-context Granite code models, our long-context models achieve significant improvements on long-context tasks without any noticeable performance degradation on regular code completion benchmarks (e.g., HumanEval). We release all our long-context Granite code models under an Apache 2.0 license for both research and commercial use.

7/19/2024

LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models

Liang Zhao, Tianwen Wei, Liang Zeng, Cheng Cheng, Liu Yang, Peng Cheng, Lijie Wang, Chenxia Li, Xuejie Wu, Bo Zhu, Yimeng Gan, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou

We introduce LongSkywork, a long-context Large Language Model (LLM) capable of processing up to 200,000 tokens. We provide a training recipe for efficiently extending context length of LLMs. We identify that the critical element in enhancing long-context processing capability is to incorporate a long-context SFT stage following the standard SFT stage. A mere 200 iterations can convert the standard SFT model into a long-context model. To reduce the effort in collecting and annotating data for long-context language modeling, we develop two novel methods for creating synthetic data. These methods are applied during the continual pretraining phase as well as the Supervised Fine-Tuning (SFT) phase, greatly enhancing the training efficiency of our long-context LLMs. Our findings suggest that synthetic long-context SFT data can surpass the performance of data curated by humans to some extent. LongSkywork achieves outstanding performance on a variety of long-context benchmarks. In the Needle test, a benchmark for long-context information retrieval, our models achieved perfect accuracy across multiple context spans. Moreover, in realistic application scenarios, LongSkywork-13B demonstrates performance on par with Claude2.1, the leading long-context model, underscoring the effectiveness of our proposed methods.

6/4/2024

Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis

Yao Fu

Transformer-based long context generative models power emerging AI applications like hour-long video understanding and project-level coding agent. Deploying long context transformers (e.g., 100K to 10M tokens) is prohibitively expensive compared to short context (e.g., 4K tokens) model variants. Reducing the cost of long-context transformers is becoming a pressing research and engineering challenge starting from the year of 2024. This work describes a concurrent programming framework for quantitatively analyzing the efficiency challenges in serving multiple long-context requests under limited size of GPU high-bandwidth memory (HBM) regime. We give a detailed analysis of how all additional computational costs, compared to 4K context, trace back to textit{one single source: the large size of the KV cache}. We use a 34B GPT-3.5 level model of 50K context on A100 NVLink as a running example, and describe how its large KV cache causes four types of deployment challenges: (1) prefilling long inputs takes much longer compute time and GPU memory than short inputs; (2) after prefilling, the large KV cache residing on the GPU HBM substantially restricts the number of concurrent users being served; (3) during decoding, repeatedly reading the KV cache from HBM to SM largely increases latency; (4) when KV cache memory overflows, swapping it from HBM to DDR causes significant context switching latency. We use this framework to analyze existing works and identify possibilities of combining them to build end-to-end systems. Overall, this work offers a foundational framework for analyzing long context transformer deployment and identifies directions towards reducing the inference cost of 1M context to be as cheap as 4K.

5/16/2024

🤔

Granite Code Models: A Family of Open Foundation Models for Code Intelligence

Mayank Mishra, Matt Stallone, Gaoyuan Zhang, Yikang Shen, Aditya Prasad, Adriana Meza Soria, Michele Merler, Parameswaran Selvam, Saptha Surendran, Shivdeep Singh, Manish Sethi, Xuan-Hong Dang, Pengyuan Li, Kun-Lung Wu, Syed Zawad, Andrew Coleman, Matthew White, Mark Lewis, Raju Pavuluri, Yan Koyfman, Boris Lublinsky, Maximilien de Bayser, Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Yi Zhou, Chris Johnson, Aanchal Goyal, Hima Patel, Yousaf Shah, Petros Zerfos, Heiko Ludwig, Asim Munawar, Maxwell Crouse, Pavan Kapanipathi, Shweta Salaria, Bob Calio, Sophia Wen, Seetharami Seelam, Brian Belgodere, Carlos Fonseca, Amith Singhee, Nirmit Desai, David D. Cox, Ruchir Puri, Rameswar Panda

Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generative tasks, trained with code written in 116 programming languages. The Granite Code models family consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device memory-constrained use cases. Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reaches state-of-the-art performance among available open-source code LLMs. The Granite Code model family was optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g. code generation, fixing and explanation), making it a versatile all around code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use.

5/8/2024