Adapting Large Language Models via Reading Comprehension

Read original: arXiv:2309.09530 - Published 7/26/2024 by Daixuan Cheng, Shaohan Huang, Furu Wei

Overview

The researchers explore how continued pre-training on domain-specific corpora affects large language models.
They find that while pre-training on raw domain-specific data provides the model with relevant knowledge, it can significantly hurt its ability to answer questions based on that knowledge.
Inspired by how humans learn through reading comprehension, the researchers propose a method to transform raw corpora into reading comprehension texts, which enhances model performance across various tasks in different domains.
Their approach is highly scalable and applicable to any pre-training corpora.
The researchers demonstrate that their domain-specific reading comprehension texts can also improve a model's performance on general benchmarks, suggesting the potential to develop a general model across multiple domains.

Plain English Explanation

The researchers wanted to understand how training large language models on domain-specific data, such as texts about medicine or finance, would affect the models' performance. They found that while this pre-training gave the models a lot of knowledge about the specific domain, it actually made it harder for them to answer questions based on that knowledge.

To address this, the researchers took inspiration from how humans learn. When people read something, they often improve their ability to answer questions about it if they also practice comprehension activities related to the content. So the researchers developed a way to transform raw domain-specific texts into reading comprehension exercises, with questions and other tasks to help the language model better learn and apply the information.

This approach consistently improved the model's performance on various tasks in different domains, like medicine, finance, and law. Interestingly, the researchers also found that using these domain-specific reading comprehension texts could boost the model's performance on general benchmarks, suggesting the potential to develop a single language model that works well across many different areas.

The researchers have made their model, code, and data available online for others to use and build upon.

Technical Explanation

The researchers explored the impact of continued pre-training on domain-specific corpora for large language models. They found that while pre-training on raw domain-specific data endows the model with relevant knowledge, it can drastically hurt its ability to answer questions based on that knowledge.

To address this, they were inspired by how humans learn through reading comprehension - practicing questions and activities after reading improves one's ability to apply the learned knowledge. The researchers proposed a method to transform raw corpora into reading comprehension texts, where each text is enriched with a series of tasks related to its content. This approach is highly scalable and applicable to any pre-training corpora.
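To make the idea of "enriching a raw text with tasks" concrete, here is a minimal sketch in Python. It is not the authors' implementation: the two task types (a title question and a passage-completion question), the "Question:/Answer:" layout, and every function name below are assumptions chosen for illustration. The only point it carries over from the summary is the shape of the output: the raw text comes first, followed by tasks derived from its own content.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    """One comprehension exercise derived from a raw text (hypothetical structure)."""
    instruction: str
    answer: str


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def make_title_task(title: str) -> Optional[Task]:
    """Ask for a title; only possible when the raw document has one."""
    if not title:
        return None
    return Task("What is a suitable title for the passage above?", title)


def make_completion_task(sentences: List[str]) -> Optional[Task]:
    """Ask the model to continue the passage from its second-to-last sentence."""
    if len(sentences) < 2:
        return None
    return Task(
        "How would you continue the passage after: " + sentences[-2],
        sentences[-1],
    )


def to_reading_comprehension(title: str, text: str) -> str:
    """Append the derived tasks to the raw text, separated by blank lines."""
    sentences = split_sentences(text)
    candidates = [make_title_task(title), make_completion_task(sentences)]
    blocks = [text.strip()]
    for task in candidates:
        if task is not None:
            blocks.append(f"Question: {task.instruction}\nAnswer: {task.answer}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(to_reading_comprehension(
        "Aspirin",
        "Aspirin reduces fever. It also thins the blood.",
    ))
```

Because each task is built from the text itself with simple rules, no human annotation is needed per document. That is one plausible reading of why a method of this kind could scale to large corpora.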

The researchers' method consistently enhanced performance across various tasks in three different domains: biomedicine, finance, and law. Notably, their 7B language model achieved competitive performance with domain-specific models of much larger scales, such as BloombergGPT-50B.

Furthermore, the researchers demonstrated that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, suggesting the potential to develop a general model across even more domains.

Critical Analysis

The researchers' approach of transforming raw corpora into reading comprehension texts is a promising solution to the challenge of endowing language models with domain-specific knowledge while maintaining their ability to apply that knowledge effectively. However, the paper does not provide a detailed analysis of the limitations of this method.

One potential concern is the scalability of generating high-quality reading comprehension tasks for large-scale corpora. The researchers mention that their approach is highly scalable, but the process of creating appropriate questions and activities for each text may become increasingly challenging as the corpus size grows.

Additionally, the paper does not explore the potential biases or representational issues that may arise from the specific reading comprehension tasks used. The choice of tasks and the way they are designed could inadvertently introduce biases or skew the model's understanding of the domain.

Further research could investigate the robustness of this approach across a wider range of domains, as well as the long-term impacts on the model's generalization abilities. Exploring the trade-offs between domain-specific and general performance would also be an important area for future work.

Conclusion

The researchers have proposed a novel approach to address the challenge of endowing large language models with domain-specific knowledge while maintaining their ability to apply that knowledge effectively. By transforming raw corpora into reading comprehension texts, their method consistently enhances performance across various tasks in different domains, including biomedicine, finance, and law.

Notably, the researchers have demonstrated that their approach can enable a smaller language model to achieve competitive performance with much larger, domain-specific models. This suggests the potential to develop a general language model that performs well across a wide range of domains, which could have significant implications for the field of natural language processing and its applications in various industries.

The researchers have made their model, code, and data publicly available, allowing others to build upon their work and explore the further potential of this approach. As the field of large language models continues to evolve, this research represents an important step towards developing more versatile and effective models that can be applied to a diverse range of real-world problems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Adapting Large Language Models via Reading Comprehension

Daixuan Cheng, Shaohan Huang, Furu Wei

We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge, but drastically hurts its prompting ability for question answering. Taking inspiration from human learning via reading comprehension--practice after reading improves the ability to answer questions based on the learned knowledge--we propose a simple method for transforming raw corpora into reading comprehension texts. Each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpora, consistently enhances performance across various tasks in three different domains: biomedicine, finance, and law. Notably, our 7B language model achieves competitive performance with domain-specific models of much larger scales, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, showing the potential to develop a general model across even more domains. Our model, code, and data are available at https://github.com/microsoft/LMOps.

7/26/2024

Pretraining and Updating Language- and Domain-specific Large Language Model: A Case Study in Japanese Business Domain

Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Tatsuya Ishigaki

Several previous studies have considered language- and domain-specific large language models (LLMs) as separate topics. This study explores the combination of a non-English language and a high-demand industry domain, focusing on a Japanese business-specific LLM. This type of a model requires expertise in the business domain, strong language skills, and regular updates of its knowledge. We trained a 13-billion-parameter LLM from scratch using a new dataset of business texts and patents, and continually pretrained it with the latest business documents. Further we propose a new benchmark for Japanese business domain question answering (QA) and evaluate our models on it. The results show that our pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining enhances adaptation to new information. Our pretrained model and business domain benchmark are publicly available.

4/17/2024

Reformulating Domain Adaptation of Large Language Models as Adapt-Retrieve-Revise: A Case Study on Chinese Legal Domain

Zhen Wan, Yating Zhang, Yexiang Wang, Fei Cheng, Sadao Kurohashi

While large language models (LLMs) like GPT-4 have recently demonstrated astonishing zero-shot capabilities in general domain tasks, they often generate content with hallucinations in specific domains such as Chinese law, hindering their application in these areas. This is typically due to the absence of training data that encompasses such a specific domain, preventing GPT-4 from acquiring in-domain knowledge. A pressing challenge is that it's not plausible to continue training LLMs of such scale on in-domain data. This paper introduces a simple and effective domain adaptation framework for GPT-4 by reformulating generation as an adapt-retrieve-revise process. The initial step is to adapt an affordable 7B LLM to the target domain by continuing learning on in-domain data. When solving a task, we leverage the adapted LLM to generate a draft answer given a task query. Then, the draft answer will be used to retrieve supporting evidence candidates from an external in-domain knowledge base. Finally, the draft answer and retrieved evidence are concatenated into a whole prompt to let GPT-4 assess the evidence and revise the draft answer to generate the final answer. Our proposal combines the advantages of the efficiency of adapting a smaller 7B model with the evidence-assessing capability of GPT-4 and effectively prevents GPT-4 from generating hallucinatory content. In the zero-shot setting of four Chinese legal tasks, our method improves accuracy by 33.3% compared to the direct generation by GPT-4. When compared to two stronger retrieval-based baselines, our method outperforms them by 15.4% and 23.9%. Our code will be released.

8/27/2024

M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering

Anand Subramanian, Viktor Schlegel, Abhinav Ramesh Kashyap, Thanh-Tung Nguyen, Vijay Prakash Dwivedi, Stefan Winkler

There is vivid research on adapting Large Language Models (LLMs) to perform a variety of tasks in high-stakes domains such as healthcare. Despite their popularity, there is a lack of understanding of the extent and contributing factors that allow LLMs to recall relevant knowledge and combine it with presented information in the clinical and biomedical domain: a fundamental pre-requisite for success on down-stream tasks. Addressing this gap, we use Multiple Choice and Abstractive Question Answering to conduct a large-scale empirical study on 22 datasets in three generalist and three specialist biomedical sub-domains. Our multifaceted analysis of the performance of 15 LLMs, further broken down by sub-domain, source of knowledge and model architecture, uncovers success factors such as instruction tuning that lead to improved recall and comprehension. We further show that while recently proposed domain-adapted models may lack adequate knowledge, directly fine-tuning on our collected medical knowledge datasets shows encouraging results, even generalising to unseen specialist sub-domains. We complement the quantitative results with a skill-oriented manual error analysis, which reveals a significant gap between the models' capabilities to simply recall necessary knowledge and to integrate it with the presented context. To foster research and collaboration in this field we share M-QALM, our resources, standardised methodology, and evaluation results, with the research community to facilitate further advancements in clinical knowledge representation learning within language models.

6/7/2024