MAmmoTH2: Scaling Instructions from the Web

2405.03548

Published 5/24/2024 by Xiang Yue, Tuney Zheng, Ge Zhang, Wenhu Chen

Abstract

Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B's (Mistral) performance increases from 11% to 36.7% on MATH and from 36% to 68.4% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance on several reasoning and chatbot benchmarks. Our work demonstrates how to harvest large-scale, high-quality instruction data without costly human annotation or GPT-4 distillation, providing a new paradigm for building better instruction tuning data.

Overview

This paper explores a new approach to improve the reasoning abilities of large language models (LLMs) through instruction tuning.
The key factors for successful instruction tuning are data quality and scalability.
The researchers propose a method to efficiently harvest 10 million naturally occurring instruction-response pairs from the pre-training web corpus, without relying on costly human annotation or GPT-4 distillation.
The resulting MAmmoTH2 models show significant gains on reasoning benchmarks over their base models; for example, the Mistral-based MAmmoTH2-7B improves from 11% to 36.7% on MATH.
Further training on public instruction tuning datasets yields the MAmmoTH2-Plus model, which achieves state-of-the-art results on various reasoning and chatbot tasks.

Plain English Explanation

The paper describes a new way to make large language models better at reasoning and problem-solving. The key is instruction tuning, which means training the models on a large number of examples of instructions and the corresponding responses.

The researchers found that the quality and quantity of the instruction data are crucial for this process. Most previous approaches relied on either crowd-sourced human annotation or distillation from the powerful GPT-4 model. However, both of these methods can be expensive and time-consuming.

Instead, the researchers developed a clever way to automatically extract high-quality instruction-response pairs from the internet data used to train the original language models. This involved recalling relevant documents, extracting the pairs, and then refining them using open-source language models.

The resulting MAmmoTH2 models showed a big boost in their ability to reason and solve problems, without training on any data from the benchmarks' own domains. For example, the Mistral-based MAmmoTH2-7B improved from 11% to 36.7% on MATH, a competition-style math benchmark, and from 36% to 68.4% on GSM8K, a benchmark of grade-school math word problems.

By training the MAmmoTH2 models further on public instruction datasets, the researchers were able to create the even more capable MAmmoTH2-Plus model, which sets new state-of-the-art results on several reasoning and chatbot benchmarks.

Overall, this work demonstrates a new, more efficient way to build high-quality instruction tuning data for improving the reasoning abilities of large language models, without the need for expensive human labeling or distillation from the most advanced models.

Technical Explanation

The key steps of the researchers' approach are:

Recalling relevant documents: The researchers first retrieve a large set of potentially relevant documents from the pre-training web corpus, using an information retrieval system.
Extracting instruction-response pairs: They then extract pairs of instructions and the corresponding responses from the recalled documents, using heuristics and open-source language models to identify the relevant text.
Refining the extracted pairs: Finally, the extracted pairs are further refined using additional filtering and scoring with open-source LLMs, to ensure high quality.

This process allows the researchers to efficiently harvest a dataset of 10 million naturally occurring instruction-response pairs, without the need for costly human annotation or distillation from GPT-4.
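The three stages above can be sketched as a small pipeline. This is only a toy illustration of the recall, extract, and refine structure, not the authors' implementation: the keyword set, the `Question:`/`Answer:` regular expression, the length threshold, and every function name here are hypothetical stand-ins for the retrieval system and open-source LLMs described in the paper.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword list standing in for a learned recall model.
SEED_KEYWORDS = {"solve", "question", "answer", "proof", "calculate"}

# Hypothetical pattern standing in for LLM-based extraction.
QA_PATTERN = re.compile(
    r"Question:\s*(.+?)\s*Answer:\s*(.+?)(?=Question:|\Z)", re.DOTALL
)


@dataclass
class QAPair:
    instruction: str
    response: str


def recall(documents, min_hits=2):
    """Stage 1: keep documents that look like they contain instruction data."""
    kept = []
    for doc in documents:
        words = set(re.findall(r"[a-z]+", doc.lower()))
        if len(words & SEED_KEYWORDS) >= min_hits:
            kept.append(doc)
    return kept


def extract(doc):
    """Stage 2: pull (instruction, response) pairs out of one document."""
    return [QAPair(q.strip(), a.strip()) for q, a in QA_PATTERN.findall(doc)]


def refine(pairs, min_response_chars=10):
    """Stage 3: drop very short responses and duplicate instructions.

    A crude stand-in for the LLM-based refinement step.
    """
    seen = set()
    refined = []
    for pair in pairs:
        key = pair.instruction.lower()
        if len(pair.response) < min_response_chars or key in seen:
            continue
        seen.add(key)
        refined.append(pair)
    return refined


def harvest(documents):
    """Run recall, extract, and refine in sequence."""
    pairs = []
    for doc in recall(documents):
        pairs.extend(extract(doc))
    return refine(pairs)


docs = [
    "Question: Solve 2x + 3 = 7. Answer: Subtract 3 to get 2x = 4, so x = 2. "
    "Question: What is 5 squared? Answer: 25",
    "Buy cheap shoes online today.",
]
pairs = harvest(docs)
for pair in pairs:
    print(pair.instruction, "->", pair.response)
```

In this toy run the advertising text is discarded at the recall stage, and the second extracted pair is discarded at the refine stage because its response is too short to count as an explanation. A real system would replace each stage with a far stronger component, but the data flow between stages would be similar.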

The researchers then fine-tune base LLMs on this dataset, creating the MAmmoTH2 model family. Experiments show that the MAmmoTH2 models significantly outperform previous instruction tuning approaches on a range of reasoning benchmarks, such as MATH and GSM8K.
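To put the reported gains in perspective, the short calculation below derives the absolute and relative improvements from the before and after scores quoted in the abstract for MAmmoTH2-7B (Mistral). The function name `gains` and the dictionary layout are illustrative choices; only the four scores come from the paper.

```python
# Scores in percent, as quoted in the abstract for MAmmoTH2-7B (Mistral).
reported = {
    "MATH": (11.0, 36.7),
    "GSM8K": (36.0, 68.4),
}


def gains(before, after):
    """Return (absolute gain in points, ratio of after to before)."""
    absolute = round(after - before, 1)
    ratio = round(after / before, 2)
    return absolute, ratio


for benchmark, (before, after) in reported.items():
    absolute, ratio = gains(before, after)
    print(f"{benchmark}: +{absolute} points, {ratio}x the base score")
```

On these figures the MATH score rises by 25.7 points (about 3.3 times the base score) and the GSM8K score by 32.4 points (1.9 times the base score).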

Further training the MAmmoTH2 models on public instruction tuning datasets yields the MAmmoTH2-Plus models, which achieve state-of-the-art performance on several reasoning and chatbot benchmarks.

Critical Analysis

The researchers acknowledge that their approach has some limitations. For example, the quality of the extracted instruction-response pairs is still dependent on the accuracy of the retrieval and filtering heuristics, and could be further improved.

Additionally, the researchers did not directly compare the performance of the MAmmoTH2 models to those trained on human-annotated or GPT-4 distilled data. It would be interesting to see how the different data sources and approaches compare in terms of both model performance and the diversity/quality of the resulting instruction-following capabilities.

Further research could also explore ways to enhance the generalization abilities of the MAmmoTH2 models, beyond just the reasoning benchmarks considered in this paper. Investigating the model's robustness, common sense understanding, and ability to follow open-ended instructions would provide a more comprehensive evaluation.

Overall, this work presents a promising new approach to instruction tuning that could significantly improve the reasoning and problem-solving abilities of large language models, without the need for costly data collection and curation processes.

Conclusion

This paper introduces a novel paradigm for efficiently harvesting high-quality instruction-response data from the web, to enhance the reasoning capabilities of large language models through instruction tuning.

By automatically extracting and refining 10 million natural instruction examples, the researchers were able to create the MAmmoTH2 models, which outperformed previous instruction tuning approaches on a range of reasoning benchmarks. Further training on public instruction datasets led to the state-of-the-art MAmmoTH2-Plus model.

This work demonstrates the potential for scalable, high-quality instruction data to be a key driver for improving the general intelligence of large language models, without the need for costly human annotation or distillation from the most advanced models. The insights from this research could pave the way for more efficient and effective instruction tuning approaches in the future.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Towards Robust Instruction Tuning on Multimodal Large Language Models

Wei Han, Hui Chen, Soujanya Poria

Fine-tuning large language models (LLMs) on multi-task instruction-following data has been proven to be a powerful learning paradigm for improving their zero-shot capabilities on new tasks. Recent works on high-quality instruction-following data generation and selection require considerable human labor to conceive model-understandable instructions for the given tasks and carefully filter the LLM-generated data. In this work, we introduce an automatic instruction augmentation method named INSTRAUG in multimodal tasks. It starts from a handful of basic and straightforward meta instructions but can expand an instruction-following dataset by 30 times. Results on two popular multimodal instruction-following benchmarks, MULTIINSTRUCT and InstructBLIP, show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks, which is even equivalent to the benefits of scaling up training data multiple times.

6/17/2024

cs.CL cs.AI

Instruction Tuning With Loss Over Instructions

Zhengyan Shi, Adam X. Yang, Bin Wu, Laurence Aitchison, Emine Yilmaz, Aldo Lipani

Instruction tuning plays a crucial role in shaping the outputs of language models (LMs) to desired styles. In this work, we propose a simple yet effective method, Instruction Modelling (IM), which trains LMs by applying a loss function to the instruction and prompt part rather than solely to the output part. Through experiments across 21 diverse benchmarks, we show that, in many scenarios, IM can effectively improve the LM performance on both NLP tasks (e.g., MMLU, TruthfulQA, and HumanEval) and open-ended generation benchmarks (e.g., MT-Bench and AlpacaEval). Remarkably, in the most advantageous case, IM boosts model performance on AlpacaEval 1.0 by over 100%. We identify two key factors influencing the effectiveness of IM: (1) The ratio between instruction length and output length in the training data; and (2) The number of training examples. We observe that IM is especially beneficial when trained on datasets with lengthy instructions paired with brief outputs, or under the Superficial Alignment Hypothesis (SAH) where a small amount of training examples are used for instruction tuning. Further analysis substantiates our hypothesis that the improvement can be attributed to reduced overfitting to instruction tuning datasets. Our work provides practical guidance for instruction tuning LMs, especially in low-resource scenarios.

5/24/2024

cs.CL cs.AI

BioInstruct: Instruction Tuning of Large Language Models for Biomedical Natural Language Processing

Hieu Tran, Zhichao Yang, Zonghai Yao, Hong Yu

This work aims to enhance the performance of large language models (LLMs) in biomedical natural language processing (BioNLP) by introducing a domain-specific instruction dataset and examining its impact when combined with multi-task learning principles. We created BioInstruct, comprising 25,005 instructions, to instruction-tune LLMs (LLaMA 1 & 2, 7B & 13B versions). The instructions were created by prompting the GPT-4 language model with three seed samples randomly drawn from 80 human-curated instructions. We employed Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. We then evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into three major categories: question answering (QA), information extraction (IE), and text generation (GEN). We also examined whether categories (e.g., QA, IE, and generation) of instructions impact model performance. Compared with LLMs without instruction tuning, our instruction-tuned LLMs demonstrated marked performance gains: 17.3% in QA, 5.7% in IE, and 96% in generation tasks. Our 7B-parameter instruction-tuned LLaMA 1 model was competitive with or even surpassed other LLMs in the biomedical domain that were also fine-tuned from LLaMA 1 with vast domain-specific data or a variety of tasks. Our results also show that the performance gain is significantly higher when instruction fine-tuning is conducted with closely related tasks. Our findings align with the observations of multi-task learning, suggesting the synergies between two tasks. The BioInstruct dataset serves as a valuable resource, and instruction-tuned LLMs lead to the best-performing BioNLP applications.

6/10/2024

cs.CL cs.AI

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang

Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses open-source previous state-of-the-art Text-centric MLLMs and sets a new standard on OCRBench(62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: the exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.

4/22/2024

cs.CV cs.LG