InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining

Read original: arXiv:2310.07713 - Published 5/30/2024 by Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro

🤿

Overview

This paper introduces Retro 48B, the largest language model pretrained with retrieval capabilities.
Retro 48B, a 43B parameter GPT model, is further pretrained using the Retro augmentation method, which retrieves information from a large external database.
The authors demonstrate that Retro 48B outperforms a standard 43B GPT model in terms of perplexity and performance on a variety of zero-shot tasks, with only a modest increase in training time.
The paper also shows that the decoder of Retro 48B can be used as a standalone model, achieving comparable results to the full Retro 48B architecture.

Plain English Explanation

The research paper discusses a new, very large language model called Retro 48B that has been pretrained using a technique called "retrieval augmentation". Retrieval-augmented language models are designed to improve the accuracy and factual knowledge of language models by allowing them to access external information sources during training and inference.

Retro 48B is a 43 billion parameter model that was further trained on an additional 100 billion tokens of text data, with the ability to retrieve relevant information from a 1.2 trillion token database. This gave the model a significant boost in performance compared to a standard 43 billion parameter GPT model trained on the same 1.2 trillion token dataset.

The authors show that Retro 48B outperforms the standard GPT model on a variety of zero-shot tasks, including question answering, reading comprehension, and summarization. Remarkably, they also found that you can remove the "encoder" part of the Retro 48B model and just use the "decoder" part, while still achieving comparable results. This suggests that the pretraining process has imbued the decoder with powerful capabilities that can be leveraged even without the full Retro architecture.

Overall, this research demonstrates the benefits of scaling up retrieval-augmented language models and provides a new, highly capable model that can be used for a variety of natural language processing tasks. It's an exciting step towards building even larger and more capable language models that can better leverage external information sources.

Technical Explanation

The key technical contributions of this paper are:

Pretraining Retro 48B: The authors start with a 43 billion parameter GPT model and further pretrain it using the Retro augmentation method. This involves retrieving relevant information from a 1.2 trillion token database and incorporating that into the training process. This results in the Retro 48B model, which has 48 billion total parameters.
Improved Performance: Retro 48B significantly outperforms the standard 43 billion parameter GPT model in terms of perplexity, with only a 2.58% increase in GPU training time. This demonstrates the scaling potential of the Retro pretraining approach.
Instruction Tuning: The authors further fine-tune Retro 48B using "instruction tuning", which involves training the model to follow natural language instructions. This results in the "InstructRetro" model, which shows large performance gains over an instruction-tuned standard GPT model on a variety of zero-shot tasks.
Decoder-Only Architecture: Surprisingly, the authors find that you can remove the encoder from the InstructRetro architecture and just use the decoder, while still achieving comparable results. This suggests that the Retro pretraining process has imbued the decoder with powerful capabilities that can be leveraged independently.

The key insights from this work are that scaling up retrieval-augmented language models can lead to significant performance gains, and that the benefits of this approach can be captured in the model's decoder, enabling more efficient and flexible model architectures. This aligns with other recent research on optimizing the training and deployment of large language models.

Critical Analysis

The paper presents a compelling demonstration of the benefits of retrieval augmentation for pretraining large language models. However, there are a few potential limitations and areas for further research:

Database Quality and Curation: The performance gains of Retro 48B likely depend heavily on the quality and relevance of the 1.2 trillion token database used for retrieval. The authors do not provide detailed information about how this database was curated and filtered, which could be an important factor.
Computational Cost: While the authors claim the Retro pretraining only incurred a 2.58% increase in GPU hours, the total computational resources required to train a 48 billion parameter model are still substantial. The scalability and accessibility of this approach for most researchers and organizations may be limited.
Alignment and Safety: As language models continue to grow in size and capability, there are increasing concerns about their potential misuse and the need for robust safety and alignment mechanisms. The authors do not address these important considerations in this work.
Generalization Beyond Benchmarks: The paper focuses on evaluating Retro 48B and InstructRetro on a variety of academic benchmarks. More research is needed to understand how these models perform on real-world, open-ended tasks and their ability to generalize to novel scenarios.

Overall, this research represents an exciting step forward in the development of large, capable language models. However, it will be important for future work to address the potential limitations and broader societal implications of these powerful AI systems.

Conclusion

The Retro 48B model introduced in this paper demonstrates the significant performance gains that can be achieved by pretraining large language models with retrieval augmentation. By leveraging a large external database during the pretraining process, the authors were able to create a 48 billion parameter model that outperforms a standard 43 billion parameter GPT model on a variety of zero-shot tasks.

This research highlights the potential of scaling up retrieval-augmented language models to unlock new levels of language understanding and reasoning. The finding that the model's decoder can be used as a standalone component is particularly intriguing, as it suggests that the pretraining process has imbued the model with broadly applicable capabilities.

As language models continue to grow in size and complexity, it will be crucial to address the potential risks and limitations of these systems, such as concerns around safety, alignment, and real-world generalization. However, this work from the Retro 48B team represents an important step forward in building even larger and more capable language models that can leverage external information sources to enhance their performance and capabilities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🤿

InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining

Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro

Pretraining auto-regressive large language models~(LLMs) with retrieval demonstrates better perplexity and factual accuracy by leveraging external databases. However, the size of existing pretrained retrieval-augmented LLM is still limited (e.g., Retro has 7.5B parameters), which limits the effectiveness of instruction tuning and zero-shot generalization. In this work, we introduce Retro 48B, the largest LLM pretrained with retrieval. Specifically, we continue to pretrain a 43B GPT model on additional 100 billion tokens using the Retro augmentation method by retrieving from 1.2 trillion tokens. Notably, the obtained foundation model, Retro 48B, largely outperforms the counterpart GPT 43B trained on 1.2T tokens in terms of perplexity with only 2.58% additional GPU hours, demonstrating the significant scaling potential of the method. After instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction tuned GPT on a wide range of zero-shot tasks. Specifically, the average improvement of InstructRetro is 7% over its GPT counterpart across 8 short-form QA and reading comprehension tasks, 10% over GPT across 4 challenging long-form QA tasks, and 16% over GPT across 3 summarization tasks. Surprisingly, we find that one can ablate the encoder from InstructRetro architecture and directly use its decoder backbone, while achieving comparable results. Our results highlight the promising direction to obtain a better GPT decoder through continued pretraining with retrieval before instruction tuning. Our code and checkpoints are publicly available at: https://huggingface.co/nvidia/retro-48b-instruct-4k.

5/30/2024

GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning

Aleksander Ficek, Jiaqi Zeng, Oleksii Kuchaiev

Parameter-Efficient Fine-Tuning (PEFT) and Retrieval-Augmented Generation (RAG) have become popular methods for adapting large language models while minimizing compute requirements. In this paper, we apply PEFT methods (P-tuning, Adapters, and LoRA) to a modified Retrieval-Enhanced Transformer (RETRO) and a baseline GPT model across several sizes, ranging from 823 million to 48 billion parameters. We show that RETRO models outperform GPT models in zero-shot settings due to their unique pre-training process but GPT models have higher performance potential with PEFT. Additionally, our study indicates that 8B parameter models strike an optimal balance between cost and performance and P-tuning lags behind other PEFT techniques. We further provide a comparative analysis of between applying PEFT to an Instruction-tuned RETRO model and base RETRO model. This work presents the first comprehensive comparison of various PEFT methods integrated with RAG, applied to both GPT and RETRO models, highlighting their relative performance.

7/8/2024

Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval

Ohad Rubin, Jonathan Berant

Retrieval-augmented language models (LMs) have received much attention recently. However, typically the retriever is not trained jointly as a native component of the LM, but added post-hoc to an already-pretrained LM, which limits the ability of the LM and the retriever to adapt to one another. In this work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch and apply it to the task of modeling long texts. Given a recently generated text chunk in a long document, the LM computes query representations, which are then used to retrieve earlier chunks in the document, located potentially tens of thousands of tokens before. Information from retrieved chunks is fused into the LM representations to predict the next target chunk. We train the retriever component with a semantic objective, where the goal is to retrieve chunks that increase the probability of the next chunk, according to a reference LM. We evaluate RPT on four long-range language modeling tasks, spanning books, code, and mathematical writing, and demonstrate that RPT improves retrieval quality and subsequently perplexity across the board compared to strong baselines.

7/23/2024

Retrieval-augmented code completion for local projects using large language models

Marko Hostnik, Marko Robnik-v{S}ikonja

The use of large language models (LLMs) is becoming increasingly widespread among software developers. However, privacy and computational requirements are problematic with commercial solutions and the use of LLMs. In this work, we focus on using LLMs with around 160 million parameters that are suitable for local execution and augmentation with retrieval from local projects. We train two models based on the transformer architecture, the generative model GPT-2 and the retrieval-adapted RETRO model, on open-source Python files, and empirically evaluate and compare them, confirming the benefits of vector embedding based retrieval. Further, we improve our models' performance with In-context retrieval-augmented generation, which retrieves code snippets based on the Jaccard similarity of tokens. We evaluate In-context retrieval-augmented generation on larger models and conclude that, despite its simplicity, the approach is more suitable than using the RETRO architecture. We highlight the key role of proper tokenization in achieving the full potential of LLMs in code completion.

8/12/2024